Hadoop co-creator Doug Cutting said yesterday the world of data management has changed significantly in the 10 years since the distributed processing framework first appeared on the scene. And he thinks the continuing changes it helped unleash likely will result in a diminished role for the Hadoop core technology itself in future big data applications.
In his opening keynote speech at Strata + Hadoop World 2016 in San Jose, Calif., Cutting said he expects the vast ecosystem of big data technologies that has been built up around Hadoop to increasingly steal the spotlight away from what officially constitutes the Hadoop framework. For example, he pointed to the Spark data processing engine -- an unproven emerging technology two years ago that now is becoming "a key part of the canon" for users of big data tools.
In particular, Cutting said he sees Spark as a replacement for MapReduce, the batch processing engine that he and fellow co-creator Mike Cafarella paired with the Hadoop Distributed File System in the initial incarnation of Hadoop. Spark "is a better execution engine," Cutting said. But he added that in the open source environment of Hadoop and related technologies, Spark's rise "doesn't threaten Hadoop users. It complements, and expands, and improves things for everyone who's using the platform."
Hadoop's "biggest legacy," Cutting contended, may end up being not the data management and analytics capabilities it ushered in, but the new enterprise software development ethos and processes it helped establish. The open source approach, he said, is "a better way of building an ecosystem and a platform" -- one in which the development agenda is no longer controlled by software vendors, and new technologies such as Spark can take the place of existing ones without causing big disruptions for users.
Ten years in the books for Hadoop
Cutting, who now is chief architect at Hadoop distribution vendor Cloudera Inc., expanded on his thoughts during a joint session with Cafarella at the Strata conference. The two looked both back at and ahead to the development of Hadoop, which, by some measures, turns 10 this year. For example, it became a separate Apache subproject and acquired its name in January 2006, and the first Hadoop code release was issued on April 2 of that year.
Cutting reiterated that the relative importance of the original Hadoop core components to the big data ecosystem "is going to shrink" in the years ahead -- a point that was seconded by Cafarella, an assistant professor of computer science and engineering at the University of Michigan, and co-founder and CEO of unstructured data analytics startup vendor Lattice Data Inc.
Asked by one attendee about the proliferation of technology options available to users looking to build big data platforms, Cutting said the new reality is that no single tool is the right answer for every data need -- a change from the pre-Hadoop days, when relational databases were essentially synonymous with data management. "It is a little confusing," he acknowledged. "But I think it's generally a good thing for users and the ecosystem to have some choices available at each level."
Hadoop management gets short shrift
Cafarella, though, noted that Spark and other data engines get far more attention from big data vendors and open source developers than do metadata and systems management tools. He said more investment is needed on the management side, and in guaranteeing data integrity and reliability in Hadoop clusters and other big data systems.
The original Hadoop core design was focused on "cheap storage and processing that could be a little sloppy," because full transactional integrity wasn't required to support the analytics applications being eyed at the time, Cafarella said. Now, he added, the big data community needs to work to implement some of the processing guarantees of relational databases "to make data safe to query."
Brian Hopkins, an analyst at Forrester Research, agreed that more comprehensive capabilities are needed for managing big data systems and the distributed workloads they run. And the management process could get more complex, as organizations supplement core Hadoop with a wider variety of big data technologies. "If you think about Hadoop purely as a file store and MapReduce, its life is sort of limited," Hopkins said. "You're going to see a much more amorphous environment [going forward]."
Nonetheless, the emergence of Hadoop, Spark and other open source technologies is changing the way enterprise software is developed and deployed, according to Hopkins. Now, he said, "you throw some stuff out there, and let the strong mutations survive." In addition, the rapid pace of development and new releases on open source projects is putting pressure on large IT vendors, Hopkins said, noting that he expects more and more of them to switch to a "continuous release process," with product updates every three months or so.