When telling the story of big data analytics, we may have sung the praises of the mythical DevOps Genius one too many times. I refer to the apocryphal figure who works at a legendary Web concern -- let's call it TwittleBookOhBoy.com. We take you to the company's lab somewhere on the West Coast.
This person can write a Java machine learning classifier, configure a Hadoop cluster -- break for some bike repairs and an incredibly complex cappuccino -- and then return to fix a broken Python query and generate a report on Latvian keyword trends, all before calling it a day.
But a look at the history of Hadoop shows us that, while advanced Java skills for MapReduce programming can take an organization part of the way to data processing efficiency, they don't get it all the way there. Developers, even ones west of Berkeley, have been busy building tools that work at a higher level of abstraction to address this issue.
Waiting for the SQL
This came to mind while I was talking recently about SQL-on-Hadoop technologies with Michael Fabacher, vice president of data architecture and database development at Cardlytics in Atlanta, which slices and dices point-of-sale data to create targeted retail and restaurant promotions aimed at credit and debit card users.
"We are 'East Coast,' and have a lot of people well-versed in SQL," Fabacher chided -- a bit of a dig, perhaps, at the West Coast home of Hadoop and its frequent claims of DevOps Genius sightings. His group is using Hadoop, supplied by MapR, along with MapR's Drill software for SQL-on-Hadoop applications. Drill is among a multitude of new SQL-style tools following a trail blazed partly by Hive, a technology that originated at Facebook. The irony that top-notch West Coast-based developers at Facebook decided to create software that enabled them and their comrades to use SQL to work with Hadoop data is not lost on Fabacher.
"As talented as the Facebook engineers are, there is a better use of their time than to write MapReduce jobs," he said. He is right. Maybe the painting of the DevOps and Hadoop hotshot needs retouching.
Dremel, Impala, Drill and Presto …
For many, the evolution of SQL-on-Hadoop starts with Hive. Its journey began at Facebook in 2007; it went on to become an Apache open source project. Hive converted some SQL commands to MapReduce jobs, cutting out a layer of complex programming. The software opened up Hadoop to broader use inside Facebook and elsewhere, and it's now available as part of all the major Hadoop distributions. Like MapReduce, its lineage is in batch processing jobs on data in the Hadoop Distributed File System (HDFS).
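To see what Hive spares its users, consider what a simple GROUP BY aggregation looks like when expressed as the map, shuffle and reduce phases Hive generates under the hood. The sketch below is a toy, single-process illustration of that pattern in Python -- the query, function names and sample data are hypothetical, and a real Hadoop job would distribute each phase across a cluster:

```python
from collections import defaultdict

# A Hive query such as
#   SELECT word, COUNT(*) FROM logs GROUP BY word
# compiles down, roughly, to the three phases below.

def map_phase(records):
    # Map: emit one (key, 1) pair per input record.
    for word in records:
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key
    # (the Hadoop framework does this between map and reduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["riga", "vilnius", "riga", "riga", "tallinn"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'riga': 3, 'vilnius': 1, 'tallinn': 1}
```

Writing even this much boilerplate by hand in Java -- with serialization, job configuration and cluster submission on top -- is exactly the chore a one-line SQL statement replaces.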
"If you look at the origination of Hive, it was intended to enable people with skills in writing SQL queries to bring that to bear on HDFS. It was about bringing the SQL skill set to Hadoop," said Matthew Aslett, an analyst at 451 Research. Since then, Hive has been updated considerably. But a new class of tools has emerged, too.
Those tools include Impala, Drill and Presto, each of which has a corporate sponsor (respectively, Cloudera, MapR and Teradata) but is also available under Apache Software Foundation open source licenses. The tools take a page out of MapReduce originator Google's playbook for Dremel, Aslett said, referring to an interactive, SQL-style query tool described in a noted Google technical paper published in 2010.
Right tool for the job
Each of these tools aims not at long-running batch jobs but at high-performance interactivity for big data analytics. Interestingly, Hive creator Facebook is also the originator of Presto, which the company first used in-house in 2012, and which it has invited Teradata to turn into a product.
Google created Dremel as a complement to, not a replacement for, MapReduce to enable at-scale interactive analysis of crawled Web documents, tracking of install data for applications on the Android Market site, crash reporting for Google products, spam analysis and much more.
Now, the industry push to create SQL-on-Hadoop alternatives to raw MapReduce analytics might meet with suspicion. Some observers may ask if the refining tools are really needed, or if they live up to the hype. Still, it is interesting to look at the tools' origins -- many of them coming from a part of the world that seemingly has rock star programmers to spare, but which, in fact, believes in the adage about the right tool for the right job. The tools arose from real needs. And, even at TwittleBookOhBoy.com, it's not just one DevOps hotshot that has the magic potion; it is a broad team now, one with complementary talents.