Hadoop is a powerful distributed processing technology, but it's hard to describe to the C-suite. So vendors came up with an easy-to-grasp metaphor: They want organizations to dive into the data lake, an architectural approach that positions Hadoop as a central repository for the diverse streams of data flowing into systems -- relegating the enterprise data warehouse (EDW) to the IT backwaters.
The buzz: Hadoop clusters based on commodity computers are a relatively inexpensive destination for data. And their waters can hold a variety of structured, unstructured and semi-structured information, including the hallmarks of big data applications -- log files, Web clickstreams, sensor data and social media posts. Data stored in Hadoop also doesn't have to be cleansed and consolidated up front, as in an EDW; it can be harbored in raw form and schematized as needed for different analytics uses.
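To make "schematized as needed" concrete, here is a minimal sketch of the idea in plain Python rather than Hadoop itself. The sample records, field names and the read_with_schema helper are all hypothetical, invented for illustration; the point is only that the same raw lines can be read through different schemas depending on the analytics use.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no up-front cleansing).
RAW_EVENTS = [
    '{"ts": "2015-03-01T10:00:00", "user": "u1", "page": "/home", "ms": "120"}',
    '{"ts": "2015-03-01T10:00:05", "user": "u2", "page": "/cart"}',
    'not valid json',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep only the named fields, cast each
    one, and skip lines that cannot be parsed as a JSON object."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines stay in storage but are ignored here
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two hypothetical analytics uses, each imposing its own view on the same data.
CLICKSTREAM_SCHEMA = {"user": str, "page": str}
LATENCY_SCHEMA = {"page": str, "ms": int}

clicks = read_with_schema(RAW_EVENTS, CLICKSTREAM_SCHEMA)
latency = read_with_schema(RAW_EVENTS, LATENCY_SCHEMA)

print(clicks)
print(latency)
```

In an EDW-style approach, by contrast, the malformed line and the record missing a field would typically have to be dealt with before loading; here that work is deferred to each reader.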
The reality: As a term, data lake invites sarcastic variations; data swamp, data marshland and data puddle are examples from the #datalake Twitter stream. More substantively, many organizations are just getting their feet wet with Hadoop and aren't ready to plunge in. Also, a reservoir of raw Hadoop data eventually needs to be refined to make it fit for consumption by business users. And Hadoop systems don't exist on an island: Traditional data warehouses likely will still play a big role in combination with them, leaving IT teams with new development and integration challenges to navigate.