At last week's Strata + Hadoop World conference in New York, enterprise users recounted their experiences in combining Hadoop with a variety of other big data frameworks to create powerful data processing platforms -- even as some conference speakers pointed to a possible future where Hadoop itself moves out of the big data spotlight and becomes part of the processing fabric in organizations.
Along with Hadoop, big data tools and technologies such as Spark, SPARQL and Hive were on display at the event, in both commercial and open source versions. For data management teams tasked with putting them together to support big data applications, some of the experience is familiar and some of it is new.
"Five or 10 years ago, we had a big data challenge. We just didn't have big data technology to address it," said Peter Ferns, CTO for compliance technology at New York-based financial services firm Goldman Sachs. "Now these frameworks have come of age where you can distribute your processing across a cluster."
According to Ferns, Goldman Sachs has always been a company that leaned toward the build side of the build-or-buy equation for software tools. What's new is the open source nature of the big data tools and the reduced cost of data storage, he said. "It's exciting. The standards have come of age, too."
In the past, IT vendors wouldn't necessarily apply open standards, Ferns said. Now, such standards have become more ubiquitous and easier to leverage, he added during a presentation about a project to build a homegrown graph database that provides a unified view of data for compliance and customer relationship management uses. To feed the database, Goldman Sachs is using a Hadoop system along with two technologies standardized by the World Wide Web Consortium: the Resource Description Framework (RDF) data interchange model and SPARQL, a query language for RDF data.
A cavalcade of open source tools
The tsunami of big data framework projects surrounding Hadoop can be hard to track. But some data pros welcome the profusion of open source options.
"Open source technologies are appearing at lightning speed. Now we don't have just our own developers working on advances," said Kevin Murray, vice president of information management infrastructure and integration at American Express Co. in New York.
At Strata + Hadoop World, Murray and a colleague described a Web recommendation engine for embedding in partner websites that was built with a trove of Hadoop elements and add-ons: Hive, Pig, Mahout, HBase, MapReduce and the Hadoop Distributed File System. Murray said the combination is intended as a repeatable big data platform that can eventually be used enterprisewide in a variety of applications. And other technologies may be applied as well. "We will advance the platform with new tools as they come to market," he said.
He contrasted the present plethora of tools with the traditional situation in IT. "Before, when you were building a data warehouse," he said, "you had perhaps four choices. Now you have hundreds."
Still, Murray noted, it remains necessary to decide whether "to Hadoop or not to Hadoop." Use cases must be selected carefully: Unstructured data tends to be Hadoop's sweet spot, he said, emphasizing that the new efforts don't supplant Amex's data warehouses, which "are not going away."
Withering away of the state of Hadoop?
Hadoop was often in the mix as users discussed alternatives to traditional enterprise data warehouses at the conference, co-sponsored by O'Reilly Media Inc. and Hadoop vendor Cloudera Inc. But it was overshadowed at times by Apache Spark, one of the new kids on the big data frameworks block.
Designed as a faster replacement for the batch-oriented MapReduce processing engine that Hadoop 1.0 was tied to, Spark is being touted particularly for use in iterative machine learning applications, due partly to its in-memory processing architecture.
Various Hadoop distribution providers are among those that have taken an interest in Spark, and in recent weeks they have released new products and integration plans that incorporate the technology. Cloudera released Cloudera Enterprise 5.2 with enhancements to its Spark component; MapR Technologies announced an initiative to integrate Spark with Apache Drill, a SQL-on-Hadoop tool; Pivotal Software included a Spark bundle in its HD 2.0.1 distro; and Hortonworks did the same in its HDP 2.2 release.
Mike Olson, Cloudera's founder and chief strategy officer, foresees a slowdown in Hadoop hoopla, though not in adoption and usage of the technology.
"We are going to see Hadoop disappear," Olson told a conference keynote crowd. What he meant is that it will fade into the big data background: He described Hadoop as a complicated, foundational piece of technology that will become truly mainstream when business users can employ it without even realizing that it underlies their analytics applications.
"Whether or not Hadoop will disappear, the definition of Hadoop is definitely evolving," said Matthew Aslett, an enterprise software analyst at The 451 Group. "It's becoming an integral part of a new big data platform, for want of a better phrase."
Ron Kasabian, vice president and general manager of big data solutions at Intel Corp., also said to look for more such evolution, with more technologies joining the core components of Hadoop to power big data systems. "You will continue to see the definition of Hadoop expand," he said. "The definition is not done growing."