Big data applications: Real-world strategies for managing big data
Richard Winter, president of consulting firm Winter Corp. in Cambridge, Mass., says there are two principal development trends taking place in connection with “big data” and big-data technologies. First, traditional data warehouse vendors are investing in scalability improvements to better accommodate growing volumes of transaction data. And second, new open source technologies such as Hadoop, MapReduce and NoSQL databases are emerging for use primarily as data warehouse alternatives in tackling other forms of big data -- Web activity logs and sensor data, for example.
“When you have very large volumes of data to manage and analyze, a data warehouse can look like a very expensive solution,” Winter said. That isn’t necessarily a valid perception when it comes to transaction data: He noted that in general, data warehouse technology has demonstrated a strong return on investment for uses in which data is highly structured, tightly managed and used widely within an organization on an ongoing basis.
But the economics of a Hadoop-style approach to big-data management can be better in certain kinds of use cases, Winter said. For example, scientific research processes can produce enormous volumes of data -- about 15 petabytes of raw sensor data annually in the case of the Large Hadron Collider, located near Geneva and designed for use in high-energy physics experiments. Dealing with such information is the kind of challenge for which Hadoop is well suited, Winter said.
Hadoop is a framework that enables large data sets to be processed in a distributed fashion across clustered systems; its MapReduce component is the programming model used to write Hadoop-based applications. Forrester Research Inc. analyst James Kobielus agrees with Winter that the Hadoop technologies have an important role to play in managing big data. And in a June 2011 blog post, he wrote that Hadoop-related inquiries from Forrester clients “have moved well beyond the ‘What exactly is Hadoop?’ phase to the stage where the dominant query is ‘Which vendors offer robust Hadoop solutions?’”
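To make the MapReduce programming model concrete, the following is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API. It counts page hits in Web activity logs, one of the big-data forms mentioned above. The log format (a timestamp followed by a URL) and the function names map_phase, shuffle, reduce_phase and run_job are assumptions made for illustration only. In a real cluster, the framework would distribute the map and reduce steps across many machines and handle the grouping between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (url, 1) pair for each well-formed log line.

    Assumes a hypothetical 'timestamp url' log format; lines with
    fewer than two fields are skipped.
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the framework's shuffle step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one URL."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample_log = [
        "2011-06-01T10:00:00 /home",
        "2011-06-01T10:00:05 /cart",
        "2011-06-01T10:00:09 /home",
    ]
    print(run_job(sample_log))
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across machines without the steps needing to coordinate, which is the property that lets this style of processing scale to very large data sets.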
Hadoop: There but not all there
Kobielus thinks Hadoop is real but still immature. He said in the blog post that Hadoop has already been successfully adopted by many companies to support “extremely scalable” analytic applications. On the other hand, he added that Hadoop won’t be ready for broader enterprise uses until more data warehouse vendors embrace the technology and early adopters coalesce around a core technology stack.
Of the vendors mentioned in a Forrester Wave report on data warehouse platforms that was released earlier this year, only two had incorporated Hadoop into their core product offerings, according to Kobielus. Other vendors “interface with Hadoop only partially and only at arm’s length,” he wrote, although he went on to say that he expects most of the leading vendors to “embrace Hadoop more fully in the coming year,” potentially through acquisitions.
On the user side, Kobielus said MapReduce appears to be the only common element in the Hadoop installations of companies he has interviewed. “We can’t say Hadoop is ready-to-serve soup until we all agree to swirl some common ingredients into the bubbling broth of every deployment,” he wrote. In addition, he said, the Apache open source community, which manages Hadoop and related projects, should submit the technology to a formal standardization process to ensure cross-platform interoperability.
For now, Hadoop deployments are often done by application developers without the involvement of an organization’s IT and data warehousing managers, according to analysts. But in the long term, Hadoop likely will become more integrated in the mainstream data warehousing process, said Lyndsay Wise, president and founder of Toronto-based consulting firm WiseAnalytics Inc.
Big-data technologies to come in from the cold?
“A few years ago, the concepts of master data management and data governance existed mostly outside the data warehouse space, but now organizations are focusing on those issues within the context of their data warehouses,” Wise said. Similarly, she predicts, data warehousing teams will increasingly end up responsible for managing Hadoop and MapReduce implementations as the data in those systems “grows more complex.” Organizations, she said, will realize that managing that data more effectively as part of their data warehouse strategies could help them derive more business value from it.
Still, Wayne Eckerson, research director for TechTarget Inc.’s business applications and architecture media group, said that organizations weighing investments in big-data technologies should be wary of market hype, both against traditional data warehouses and in favor of Hadoop and other newer technologies.
Eckerson noted that Hadoop isn’t necessarily a free lunch, despite its open source nature. In addition to hardware and other technology costs, there’s the issue of internal resources: “No matter the technology, you still need to hire people -- and some of them are pretty rare talents,” he said.
Hadoop could also foster classic garbage in, garbage out scenarios, Eckerson cautioned. “I think the Hadoop people need to figure out if the information they’re processing is junk, and if so they need to clean it up or spend time compensating for the fact that it isn’t good,” he said. It isn’t a question of whether Hadoop can be useful, Eckerson added; it’s a question of how Hadoop is actually being used within a particular organization.
ABOUT THE AUTHOR
Alan R. Earls is a Boston-area freelance writer focused on business and technology.