This content is part of the Essential Guide: Managing Hadoop projects: What you need to know to succeed
Evaluate Weigh the pros and cons of technologies, products and projects you are considering.

SQL-on-Hadoop tools help users navigate enterprise Hadoop course

Hadoop may be a technology in waiting, unless SQL-on-Hadoop tools turn it into an enterprise mainstay.

Hadoop can handle large amounts, and all kinds, of data, but its core Java APIs are often complex to use – same goes for the ones in MapReduce, its original programming environment and compute engine. As a result, the open source distributed processing framework runs the risk of remaining the king of big data proof-of-concept projects. But SQL-on-Hadoop technology opens up a possible path to wider adoption.

In a June 2014 blog post, Ovum analyst Tony Baer wrote that SQL could serve as the "gateway drug" to Hadoop use in enterprise applications. The reason: It enables organizations to tap the armies of workers adept in SQL, the standard programming language for relational databases, to build and run analytical queries against Hadoop data.

Assorted new tools provide ways to bring SQL into Hadoop applications -- and they're starting to find a place in IT environments.

"The workloads I now see are 80% to 90% SQL on top of Hadoop," said Mac Noland, a solutions architect at phData, which offers Hadoop consulting services to organizations in the Minneapolis-St. Paul, Minn., area. "That's because there are a lot of skilled SQL people out there. It's a very natural fit."

What the data doctor ordered

For ZirMed Inc., a vendor of software as a service applications for healthcare administration uses based in Louisville, Ky., SQL querying support was a crucial factor in making the deployment of a Hadoop cluster feasible. "If we didn't have a SQL level on top of Hadoop, we would not have implemented Hadoop," said Chris Schremser, ZirMed's CTO.

One pixel Sherman: Hadoop needs enterprise business

Schremser said that programming Java in jobs in MapReduce wouldn't have been "conducive to developer productivity" in his organization. Instead, ZirMed is using Hortonworks' Hadoop distribution along with Apache Hive, open source software that lets SQL-savvy developers and business users at the company query data stored in the Hadoop Distributed File System (HDFS).

ZirMed installed the Hadoop cluster in the fall of 2014, with 29 compute nodes and 1.2 PB of raw storage capacity. The system is being used to collect data on medical payments and insurance claims from the company's customers for analysis. The information is straightforward transaction data, not the unstructured or semi-structured forms of data often found in HDFS, and the jobs that Schremser's team runs against it with Hive aren't highly interactive. But he called them "big queries" -- for example, answering questions on insurance eligibility that require churning through more than two years of patient records and transactions.

"We don't have an unstructured [data] problem today," Schremser said. "Our issue is quantity. We had so much data that querying it could be a problem."

Previously, ZirMed ran some queries on a data warehouse appliance and others directly on its transaction processing system because of storage limitations in the data warehouse. In the latter case, query jobs that took hours -- or multiple days -- to complete against the transaction database have been reduced to a handful of minutes on the Hadoop cluster, according to Schremser.

ZirMed also put Hadoop and Hive into production with an eye toward reduced costs going forward, he continued. According to a case study write-up posted on the Hortonworks website, the hardware for the Hadoop cluster cost $235,000, about 30% of what Schremser said the company had spent on the data warehouse system. And for that price, the cluster contains five times more usable storage space than the warehouse provided.

Drill-ing into Hadoop data

Michael Fabacher is another SQL-on-Hadoop user. Fabacher is vice president of data architecture and database development at Cardlytics, an Atlanta-based company that analyzes credit and debit card purchase data to help retailers and restaurants send targeted promotional offers to consumers through card-issuing banks. Cardlytics began working with the Hadoop distribution from MapR Technologies in September 2014, and it has also deployed Apache Drill, an open source SQL-on-Hadoop tool whose development is being spearheaded by MapR.

As at ZirMed, using the Hadoop cluster to process and analyze pressing volumes of structured transaction data is the first order of business for Cardlytics. Fabacher's team did a bake-off between four SQL-on-Hadoop tools before selecting Drill to aid with the analytics. "None of them could support everything we wanted," he said, pointing to the relative newness of the technologies. But he gave credit to MapR for working with Cardlytics to add specific SQL functionality that the company required, including window functions for gathering values from a set of rows in a data store, automatic partitioning of data to boost query performance and new types of joins.

"Drill is not fully matured, but we think it will be," Fabacher said. The tool just became available in a 1.0.0 version in May, followed by a 1.1.0 release earlier this month that incorporated the window functions and auto partitioning sought by Cardlytics along with other new features.

While the large ranks of SQL-skilled workers should give some comfort to organizations embarking on Hadoop journeys, the growing ranks of SQL-on-Hadoop options could be unsettling. Since the first days of Apache Hive, the field has become increasingly crowded. In a presentation at the 2015 Pacific Northwest BI Summit in Grants Pass, Ore., this month, Gartner analyst Merv Adrian listed 14 different tools -- and that's not a full count of what's available. In addition to Hive and Drill, other prominent examples include Impala, developed by Hadoop vendor Cloudera; IBM's Big SQL; Pivotal Software's Hawq; Spark SQL; and Presto, an open source tool that like Hive originated at Facebook.

One gateway, many paths

According to Matt Aslett, an analyst at 451 Research, the sources of commercial SQL-on-Hadoop tools can be broadly broken down into three categories: independent Hadoop distribution providers, incumbent relational database and data warehouse providers that are "looking to get into the act," and independent startups specifically focused on the technology.

Aslett said Hive has matured significantly, but its use is still oriented largely toward long-running MapReduce batch jobs. He added that he expects interest to continue to grow in newer tools like Drill, Impala and Presto that take inspiration from Google's Dremel distributed query tool and focus on applications supporting higher degrees of interactivity than is common on Hive queries.

Overall, though, it's still early going for SQL on Hadoop, and Aslett said some patience is required when looking to put the tools into production use. "They are definitely a work in progress," he noted.

Jack Vaughan is SearchDataManagement's news and site editor. Email him at, and follow us on Twitter: @sDataManagement.

Executive editor Craig Stedman contributed to this story.

Next Steps

Find out how Western Union linked Hadoop to the enterprise

Check out advances in machine learning for data preparation

Learn whether organizations are catching up to Hadoop

Dig Deeper on Hadoop framework

Join the conversation


Send me notifications when other members comment.

Please create a username to comment.

Is SQL the gateway to Hadoop? Are there other means?
Hadoop does tend to require a java app to run, and that does raise the bar - but the power of Hadoop is in processing unstructured data. SQL is a structured query language. For it to work, you'd need to make the data in hadoop look like tables. Googling, I'm not sure how this could work, and I'm not finding answers. Basically, I worry you'd lose the unstructured processing power of hadoop. (Think about querying google like it was SQL - sure, you could reduce the internet to key/value pairs, but what do you lose?)
The unstructured route seemed the natural route to me too. Funny, tho, have lately (in this article, actually) seen more than a few people who found value in running structured data throught Hadoop. The sweeter spot may be jobs on the RDBMS that are long running - that run appreciably faster on Hadoop. But I could be wrong. As they say "your mileage may vary"
Thanks for this article. I am currently tracking over 20 SQL on Hadoop initiatives at present with significant differences between them. Many are only SQL 1992 compliant which is 23 years ago and a level of SQL that does not have much in the way of analytic functions beyond MIN, MAX, COUNT, SUM and AVG. SQL 1992 is not enough for analytics in my humble opinion. Also people should expect consolidation and acquisitions to occur here. Another issue is how good the optimisation is given that not all data resides in Hadoop and many organisations want to join Hadoop data with non-Hadoop data e.g. Data in data warehouses or MDM systems. Not all SQL on Hadoop technologies can join to data outside Hadoop. An alternative is to use a data virtualisation server if you can't do this with the SQL on Hadoop technology you have chosen. Many data virtualisation products connect to Hadoop and non-Hadoop data sources. Also navigating complex data such as schema variant JSON and nested JSON also varies across SQL on Hadoop initiatives - many require a schema to be created first whereas vendors like MapR with Drill can interpret a JSON schema on the flyd. Also some vendors are focussed on SQL on HBASE only. It is a bit of a minefield at the moment. One thing worth considering is multiple SQL engines on Hadoop data (something RDBMSs don't support).
Thanks, Mike. These are all useful criteria to consider in this most interesting time for data integration. Your points are well taken.