The focus on Hadoop's commercial prospects has become more intense since Hortonworks, the company started by the open source processing framework's Yahoo originators, went public last December -- making it the first of the top three independent Hadoop distributors to do so.
And leading up to last week's Hadoop Summit 2015, an annual conference hosted by Hortonworks and Yahoo, the scrutiny got even closer, given recent Gartner survey results showing the Hadoop platform still stymied at the early adopter stage. At the same time, new open source and proprietary tools continue to arrive to help organizations leap from proof-of-concept prototypes to full production applications supporting many users.
During a keynote session at the event, held in San Jose, Calif., Hortonworks CEO Rob Bearden didn't directly counter Gartner's data. Instead, he urged the audience "to look at it in context. We're just starting into an industry that is going to be very transformative." Bearden argued that Hadoop is seeing faster uptake than the relational database was at a roughly analogous point in its development 25 to 30 years ago.
Hadoop cup: Half full or half empty?
What Gartner reported was that 54% of the 284 IT and business leaders who responded to the survey have no plans to invest in Hadoop in the next two years. Some see the study as a case of the glass being half empty; others see it as half full. The survey, conducted earlier this year, found that only 26% of respondents were actually deploying, piloting or experimenting with Hadoop projects at this stage. But Bearden asked the crowd to focus on data showing that an additional 18% of respondents expect to join those Hadoop user ranks over the next 24 months.
Gartner's study and others suggest a skills gap is one of the impediments that highly hyped Hadoop faces as it looks to get out of the early adopter corner and grow into a broader enterprise platform. The gap spans skills from the early programming and configuration stages right through to back-end analytics efforts.
In particular, the analytics phase, where Hadoop's heaping helpings of unstructured and semi-structured data are turned into business assets, continues to be a hotbed of product introductions, usually taking the form of SQL-on-Hadoop offerings. Such tools open up Hadoop data to the legions of SQL-savvy workers in companies.
"People already have a huge, wonderful infrastructure to run their business. SQL is where they go for answers," said Mike Hoskins, CTO at data management and analytics technology vendor Actian. The company released its own survey at the Hadoop Summit, with results that somewhat aligned with Gartner's assessments, and it cited SQL integration as a gating factor to Hadoop adoption.
Bumper tools crop seeds SQL in Hadoop
In some respects, the cornucopia of SQL-on-Hadoop products has been almost too bountiful. In a keynote at the conference, Forrester Research analyst Mike Gualtieri pointed to all of the SQL tools that have become available for Hadoop users. "Fortunately, there are at least 13," he joked.
Since Facebook began work in 2007 on what became the open source Apache Hive data warehouse software, the group of SQL-on-Hadoop tools has come to include Actian Vortex, Pivotal Hawq, Cloudera Impala, JethroData's eponymous SQL engine and others. New releases of some of those technologies were on view at the Hadoop Summit. For example, Hadoop distribution provider MapR demonstrated updates to Apache Drill, an open source tool released in a 1.0 version in May.
"Drill helps address the concern that data is there, but it is difficult to get at," said Jack Norris, MapR's chief marketing officer. Norris said flat budgets in central IT are mandating well-founded proofs of concept before Hadoop implementations can be ramped up and moved into production. "It's a chicken-egg thing," he said. "That's where Drill and SQL-based data discovery come in."
Also at the conference, Teradata said it would contribute to Presto, an open source SQL-on-Hadoop query engine project that Facebook initiated as a follow-up to its Hive efforts. The Presto engine, which can also work with data stores other than the Hadoop Distributed File System, uses pipelining and other computing techniques to improve on Hive performance.
Teradata's contributions are likely to center on ODBC and JDBC drivers, integration with Hadoop's YARN resource management software, installers, monitoring tools and documentation -- all things that would make Presto more like commercial software.
Hadoop platform still a moving target
Presto, Drill and the slew of other add-ons to the Hadoop ecosystem sometimes obscure what exactly the Hadoop platform is and what it will become. The Spark processing engine, seen in part as a replacement for Hadoop's original MapReduce engine, is another major case in point.
While he emphasized that it isn't clear which portions of the original Hadoop stack will continue to be employed in the long term, analyst Curt Monash sees significance in the ongoing development around the Hadoop architecture.
"The whole thing is very real," said Monash, president of Monash Research. "We're at the point where there is significant innovation in analytics that is based on new data management, data movement and data analysis stacks." But for now, he continued, areas in which data workloads will be shifted from traditional systems to the new stacks are still somewhat limited. He discussed the topic further in a blog post last week, noting that in general, "Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones."
The impediments to faster Hadoop growth also include an overall change in organizational mindset needed to exploit the streams of data coming into companies. That can be true even for an original big data shop like Schlumberger Ltd., an oil and gas technology and services supplier based in Houston.
The challenge lies in thinking about data organizationally, or in "how you frame what the journey is," according to Anil Varma, Schlumberger's vice president of data and analytics. "Data has become a strong foundation of how any modern company needs to operate," Varma said during a panel discussion at the Hadoop Summit. "But I don't think organizational structures have caught up to that yet."