What exactly is Hadoop? Answering that question is becoming more complicated as new technologies are introduced to augment -- or in some cases replace -- the core Hadoop components that originally defined the open source big data framework. That trend was taken to a new level last week at Strata + Hadoop World 2015 in New York, where Hadoop distribution vendors Cloudera and Hortonworks rolled out offerings that, for some uses, look beyond the centerpiece Hadoop Distributed File System (HDFS) as a data store.
Cloudera, the leading Hadoop distro provider, introduced a new columnar data store called Kudu as an alternative to HDFS in real-time analytics applications involving streaming data or other information updated frequently; meanwhile, competitor Hortonworks announced software for managing data flows between systems that can work either with or without HDFS. Seen in the light of the addition of a new cluster resource manager called YARN in late 2013 -- and the recent rise of Apache Spark as a development and data processing alternative to the MapReduce compute engine that initially was tightly paired with HDFS -- the moves peg Hadoop as a moving target of increasing diversity.
Mike Olson, Cloudera's co-founder and chief strategy officer, told Strata + Hadoop World attendees that Hadoop has come to mean "many ways to store data and many ways to analyze data." The fact that Hadoop users need more ways to support analytical data, in particular, has been driven home as Cloudera has gained more experience in the market, according to Olson.
"HDFS was the original Google-style file system -- a way to do large data ingestion, to suck up the entire Internet and land it on disk," Olson said during a keynote session. "But if you wanted to update some of that data, or if you wanted to pull out some records at random, it really didn't perform well. It wasn't designed for that."
There's a gap, he said, in Hadoop's ability to handle certain workloads -- to do "fast analytics on fast data." Cloudera hopes Kudu will fill that gap. The new data store is intended to complement HDFS and Apache HBase, a NoSQL database that is also one of the common components of Hadoop clusters. Its objective is to enable users to apply real-time analytics to fast-changing data via rapid data inserts and updates, a combination that isn't a strong point of either HDFS or HBase.
In addition, Cloudera is looking to expand the use of Impala, a SQL-on-Hadoop query engine it offers, by connecting that technology to Kudu. The company released a public beta version of Kudu last week under an Apache Software Foundation open source license, and it said the data store will become an Apache Incubator project in the future.
Kudos for Kudu?
Cisco Systems Inc.'s WebEx unit uses a Cloudera-based Hadoop cluster to collect and store data on user interactions with its Web and video conferencing service, for analyzing performance and troubleshooting technical issues. Joe Hsy, director of WebEx's cloud services platform and tools, said he hadn't taken a deep look at Kudu but could see possible uses for the new technology in accelerating SQL-based analytics applications.
"We have more and more use cases geared to using SQL with Hadoop," Hsy said. "Optimization for SQL workloads -- maybe Kudu could help with that." For example, he pointed to the possibility of putting Apache Hive, one of the many SQL-on-Hadoop query tools vying for attention from users, on top of Kudu instead of HDFS to boost data analysis throughput.
YP LLC, a Tucker, Ga., company that produces the Yellow Pages telephone listings, is another Cloudera user -- its Hadoop cluster captures 3 billion data points related to the listings on a daily basis. William Theisinger, YP's vice president of engineering, said during a session at the Strata conference that his team is in the late stages of evaluating SQL-on-Hadoop tools, including Impala, Hive and software that Hewlett-Packard sells for use with its Vertica analytical database.
But Theisinger added after the session that he isn't in any rush to try out Kudu. "I'll wait until somebody else guinea-pigs it before I dive into it," he said.
Hortonworks gets into the data flow
Jobs beyond HDFS are also in the picture for the new Hortonworks DataFlow technology, or HDF. A separate product from Hortonworks Data Platform (HDP), the vendor's Hadoop distribution, HDF is designed to help users manage the movement of large data streams between different systems. The new offering doesn't require the use of HDFS, and customers can deploy it without also implementing HDP.
"We're not going to force customers to buy both subscriptions," said Tim Hall, vice president of product management at Hortonworks. "There definitely will be customers who buy an HDF subscription standalone, and others who will buy it along with HDP."
Mainstream relational databases, SAP HANA and the Cassandra NoSQL database are some of the other platforms that Hall said could be supported by HDF, which is based on Apache NiFi -- an open source data-in-motion technology created by a small vendor named Onyara that Hortonworks agreed to acquire in August.
According to Hall, Hortonworks is also getting some user requests for a standalone version of the Spark processing engine, which it currently supports as part of HDP. "That's something we're evaluating and talking to customers about," he said. "And if it makes sense, that might be something to add to HDF down the road."
Nonetheless, Hall vowed that Hortonworks remains "very wedded" to Hadoop. "I would look at this as an adjacency," he said, referring to the addition of HDF. "I think it's a sign of maturation and identifying opportunities that customers are asking us to help them with." HDF, he added, could be "an on-ramp product" in some cases, attracting new customers who later might add HDP-based Hadoop clusters.
More Hadoop components to consider
Doug Henschen, an analyst at Constellation Research Inc., said Cloudera isn't looking to fully replace HDFS and HBase with Kudu. Instead, he thinks the new technology can be viewed as "a third musketeer" among the Hadoop data storage options supported by the vendor.
"The idea is fast analytics and fast scans -- these are capabilities hard to get either in HDFS or HBase," Henschen said. "What has been happening is that a lot of people have been copying a lot of data, trying to force-fit those systems to try [to] get an analytic edge. The new software will support Impala and some streaming scenarios."
Further changes to the Hadoop ecosystem are likely going forward, with vendors seeking to add functionality and fill holes in the big data framework. "Hadoop is a place where a thousand flowers are blooming," Henschen said. But he added that as the new technologies arise, existing Hadoop components will fade slowly, if at all.