Managing Hadoop projects: What you need to know to succeed
In many organizations, the growing volume and increasing complexity of data are straining performance and highlighting the limits of the traditional data warehouse. IT and data management professionals can respond by tweaking and tuning existing system implementations, but the rush to incorporate a variety of unstructured information into the data warehouse environment may call for new technologies that help power big data analytics.
In particular, Hadoop systems and related big data technologies are popping up alongside data warehouses to manage the flow of unstructured and semi-structured data, including Web server and other system and network log files, text data, sensor readings and social network activity logs. Hadoop and cohorts such as MapReduce and NoSQL databases can complement data warehouse systems in such cases, creating what analysts describe as a logical or hybrid data warehouse architecture that puts processing workloads on the platforms best able to handle them.
The available building blocks for high-performance business intelligence (BI) and data warehouse environments also include a selection of other technologies that can fill specific roles -- for example, data warehouse appliances, columnar databases and in-memory databases. Used together, the various tools can boost warehousing speed, but they also challenge an organization's data architecture and integration skills.
"Architecture is becoming increasingly important. You thread everything together with architecture," said William McKnight, president of McKnight Consulting Group in Plano, Texas. Companies need to think about pushing some data warehouse workloads out to technologies that can better handle them, especially when unstructured or semi-structured data is involved, he said.
Most companies do see a need for warehousing speed, according to a report published by The Data Warehousing Institute (TDWI) in October 2012. Sixty-six percent of 278 IT professionals, business users and consultants surveyed by TDWI said getting high levels of performance from data warehouses and related platforms was "extremely important." Only 6% said performance wasn't a pressing issue for them.
Topping the list of processes that respondents thought would benefit the most from high-performance data warehousing were advanced analytics, cited by 62% of the people surveyed, and the use of big data for analytics, chosen by 40%.
Going to extremes with Hadoop, big data
Report author Philip Russom, research director for data management at TDWI in Renton, Wash., wrote that Hadoop has come into prominence in no small part due to its ability to manage and process the extremes of big data. Massively parallel Hadoop clusters can scale out to meet the demands of ever-larger workloads, Russom said, adding that what he described as Hadoop's "data-type-agnostic file system" makes it a better fit for unstructured and semi-structured data than relational databases are.
Yet Hadoop big data systems should be viewed as part of a larger picture, he asserted. "Hadoop is a wonderful complement to the data warehouse, but no one that has worked their way through it would see it as a replacement for the data warehouse," Russom wrote in the report.
"Until Hadoop came along there really wasn't a good way to handle unstructured data," said Wayne Eckerson, director of the BI Leadership Research unit at TechTarget Inc., the Newton, Mass., parent company of SearchDataManagement.com. Organizations had to use text mining tools to parse the data into rows and columns and then load that into fields in a data warehouse, Eckerson said. But, he added, "it was a two-step process, and a lot of people just didn't use it."
Hadoop, MapReduce and related tools enable developers to automate the data parsing process, according to Eckerson and other consultants. In addition, a variety of Hadoop, data warehouse and data integration vendors have released software connectors that make it easier to transfer data between Hadoop and data warehouse systems.
Ben Harden, a managing director at consultancy CapTech Ventures Inc. in Richmond, Va., sees Web server logs as a good example of data that's best channeled to Hadoop to offload processing from conventional systems and improve the overall performance of a data warehouse environment.
Side-by-side on big data
Instead of being loaded directly into a data warehouse, Web logs can be stored on a Hadoop system and crunched there, Harden said. Aggregated results then can be fed into a relational model in the data warehouse for analysis by business users, he said. Again, that scenario places upstart Hadoop alongside the venerable data warehouse. "The relational database doesn't go away," Harden said, adding that the "hardcore processing" of BI and analytics data still has to be done there.
"Everyone is suddenly very log-happy. That's where Hadoop comes in: We need a place to put this stuff, then we have to make sense of it," said Joe Caserta, president of Caserta Concepts LLC, a New York-based data warehouse consulting and training company. He is also co-author -- with BI and data warehousing consultant Ralph Kimball -- of The Data Warehouse ETL Toolkit.
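The pattern Harden describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from the consultants quoted here: it assumes Web logs in the Common Log Format, runs the map and reduce steps in a single process rather than as a real Hadoop job, and uses an in-memory SQLite database as a stand-in for the relational data warehouse. The function names and the page_hits table are hypothetical.

```python
import re
import sqlite3
from collections import Counter
from datetime import datetime

# Assumed input: Common Log Format lines, for example
# 127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def map_line(line):
    """Map step: emit ((day, path), 1) for a parsable line, nothing otherwise."""
    match = LOG_PATTERN.match(line)
    if not match:
        return []  # skip malformed lines instead of failing the whole run
    stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return [((stamp.date().isoformat(), match.group("path")), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (day, path) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def aggregate(lines):
    """Run the map and reduce steps over an iterable of raw log lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


def load_into_warehouse(conn, counts):
    """Load only the aggregated rows into a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits ("
        "day TEXT, path TEXT, hits INTEGER, PRIMARY KEY (day, path))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO page_hits (day, path, hits) VALUES (?, ?, ?)",
        [(day, path, hits) for (day, path), hits in sorted(counts.items())],
    )
    conn.commit()


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '10.0.0.7 - - [10/Oct/2012:14:02:11 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '10.0.0.7 - - [11/Oct/2012:09:15:00 -0700] "GET /about.html HTTP/1.0" 404 512',
        "this line is not a valid log entry",
    ]
    warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    load_into_warehouse(warehouse, aggregate(sample))
    for row in warehouse.execute("SELECT day, path, hits FROM page_hits ORDER BY day, path"):
        print(row)
```

The point of the split is that the raw log lines never reach the relational side. Only the much smaller aggregated rows do, which is where business users would then query them.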
Caserta and other consultants caution that there are still barriers to wider Hadoop use in data warehouse architectures. The open source technology requires advanced programming skills and can benefit from the addition of custom-built tools and functionality, they said. Moreover, Hadoop is a batch-oriented technology that doesn't intrinsically lend itself to real-time processing of big data. That has led to the use of a variety of advanced messaging and event-oriented technologies to help Hadoop systems keep up with the rapid velocity of data updates, Caserta said.
Overall, though, the pieces are available to extend a data warehouse environment to deal with big data, said Colin White, president and founder of consultancy BI Research in Ashland, Ore. Nowadays, "I don't think it's practical to put everything in the data warehouse," he said. "The key will be to make all the different pieces work together."