The present pace of SQL additions to the open source Hadoop data framework is torrid. The reason is simple.
Although the Hadoop Distributed File System (HDFS) brings parallel processing on commodity clusters to big data, Hadoop fares better in the enterprise when it offers SQL-style interactive querying.
Early Hadoop applications tended to require specialized data scientists to administer queries. The queries ran long -- think "get a cup of coffee" long -- and were far from interactive. The Apache Tez framework, the Stinger interactive query accelerator for the Hive data warehouse and the Spark analytical engine are just some of the tools beginning to fill the query void.
"It is very important for us to ensure we are giving users interactive query access. With Tez we are able to provide that capability to the business," said Anu Jain, group manager of strategy and architecture at retailer Target Corp. Jain spoke as part of a user panel at last week's Hortonworks- and Yahoo-sponsored Hadoop Summit 2014 in San Jose, California.
Introducing tools that let business users exploit parallel Hadoop applications as part of existing workflows is a key step in justifying Hadoop initiatives, she said.
"Our goal is to make sure we are providing the right data on the right platform and at the right cost," Jain said of Target's Hadoop program.
Swimming in the mainstream
SQL-on-Hadoop technology is the most typical addition to Hadoop efforts this year, according to Gartner Inc. analyst Merv Adrian. "After people first get familiar with the Hadoop style of batch processing, the next thing they want to do is interactive analytics," he said.
Gartner has mapped out the varieties of SQL-on-Hadoop products. A survey conducted earlier this year of 164 Gartner webinar participants who use Hadoop found that the leading approach is to use vendor-provided interfaces to HDFS/HBase (32%). Self-created SQL queries via Hive (27%) and Hadoop distribution-specific SQL tools (23%), such as Cloudera Impala, Pivotal HAWQ and others, followed as the next most popular SQL-on-Hadoop approaches, Adrian said.
A new stage has begun. Most early adopters weren't worrying much about SQL integration, but as Hadoop moves into the mainstream, that is changing, Adrian said. In fact, the rise of SQL-on-Hadoop may play to strong points of established vendors that are quite familiar with SQL, according to Adrian.
Beyond the SQL tinge
Hadoop may take on a SQL tinge, but a contrarian view holds that SQL could instead be "Hadoopicized" as a result of programming shifts. Such a view is supported by Hadoop Summit user panelist John Williams, who sees potential swings in data development methods "as data gets really massive."
Williams, senior vice president for platform operations at TrueCar, which provides an online car-buying platform, maintains there is a lot of development overhead around SQL, and there may be benefits to working in a different programming environment.
"SQL execution time on a large data set is slow. Meanwhile, Hadoop-on-SQL is getting faster with things like YARN and Tez," said TrueCar's Williams.
But speed of execution may not be the only measure. Williams is concerned about time to market as well.
"For us, there is an even greater metric around SQL being slow, and that is the time it takes for SQL development," he said.
Williams said a lot of the time required to develop SQL software goes to "pure SQL wizardry." What he describes as SQL overhead includes studying data, conceiving a schema, normalization, index creation and query creation. The time required to rework established programs may be the bigger issue, he insisted.
"If anything in the application changes, you have to redo all that work," he said, suggesting that development techniques centered more on the Java or Python languages be used where possible for unstructured data. Still, he acknowledged, TrueCar is also working with Hive, Tez and other SQL-on-Hadoop technologies.
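The trade-off Williams describes can be sketched in a few lines of Python. This is an illustration only, not TrueCar's code: the records, field names and function names are hypothetical, and an in-memory SQLite database stands in for a SQL engine so the example runs on a single machine without a Hadoop cluster. The first function follows the schema-first path (table, index, load, query); the second reads the same records directly and aggregates them in code.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical records; the second one carries a field ("trim") the others lack.
RAW_LINES = [
    '{"make": "Honda", "price": 21000}',
    '{"make": "Honda", "price": 23000, "trim": "EX"}',
    '{"make": "Ford", "price": 30000}',
]


def average_price_sql(lines):
    """Schema-first path: define a table and an index, load rows, then query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (make TEXT NOT NULL, price REAL NOT NULL)")
    conn.execute("CREATE INDEX idx_make ON listings (make)")
    # Only the columns named in the schema are loaded; "trim" is dropped here.
    rows = [(r["make"], r["price"]) for r in map(json.loads, lines)]
    conn.executemany("INSERT INTO listings (make, price) VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT make, AVG(price) FROM listings GROUP BY make"))
    conn.close()
    return result


def average_price_code(lines):
    """Schema-on-read path: parse each record and aggregate in plain Python."""
    totals = defaultdict(lambda: [0.0, 0])  # make -> [running total, count]
    for record in map(json.loads, lines):
        bucket = totals[record["make"]]
        bucket[0] += record["price"]
        bucket[1] += 1
    return {make: total / count for make, (total, count) in totals.items()}


if __name__ == "__main__":
    print(average_price_sql(RAW_LINES))
    print(average_price_code(RAW_LINES))
```

Both functions return the same averages. The difference shows up when the data changes: using the new "trim" field in the SQL path means altering the table and the load step, while the code path can read it from each record as it arrives. The SQL path, in turn, gives business users a query language they already know, which is the case the SQL-on-Hadoop tools are making.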
Different types of organizations may take different paths to SQL-on-Hadoop. But SQL additions to Hadoop could, in many cases, be a great enabler for a mainstream Hadoop movement. Ovum Ltd. analyst Tony Baer made that point when he recently asked if SQL is not "a gateway drug" to Hadoop. In any case, the SQL-on-Hadoop area does seem to be where the action will be for some time to come.