The flowering of the Hadoop ecosystem is both a blessing and a curse for prospective users. The numerous technologies revolving around the distributed processing framework augment the functionality found in Hadoop itself. But there are so many to choose from that evaluating them and finding the right one can be difficult. That's particularly true in the emerging SQL-on-Hadoop space, where tools such as Drill, Hawq, Hive, Impala and Presto vie for attention.
To get a better view of them, SearchDataManagement recently turned to Tripp Smith, CTO at Clarity Solution Group LLC, a Chicago-based data management and analytics consultancy that works with user organizations on Hadoop deployments and other big data projects. In an interview, Smith said the path to selecting among the surge of SQL-on-Hadoop tools begins with understanding use cases.
Hadoop has been around for a while, but in terms of going mainstream, it still seems very new to a lot of people. And when they seek to tame Hadoop to gain business benefits from big data, it often turns into a multiyear effort.
Tripp Smith: I think SQL interfaces to Hadoop are helping to bridge that gap. They also enhance portability for business logic from legacy applications, both to Hadoop and to different execution engines that now run within the Hadoop platform. We saw it start with the introduction of Hive. A lot of very smart folks at Facebook introduced that to the Hadoop ecosystem, and now the concept has expanded in a lot of different directions, not the least of which are Spark SQL, Impala and Presto, the latter also [coming] out of Facebook.
What SQL is doing for Hadoop is bringing a kind of common language to the average business user working on legacy analytics platforms, as well as to seasoned engineers and data scientists. It's easier now to trade off information and data processing between different components when you have Agile data teams using SQL on Hadoop.
By most counts, there are even more Hadoop tools than we've just talked about. What parameters do you look at when trying to evaluate products in this wide group of tools?
Smith: What you find is that the decision you make on SQL-on-Hadoop tools should be based on the use cases that you have. We look at Hadoop through the lens of what we call MESH -- that's a strategic architecture framework for 'mature, enterprise-strength Hadoop.' It looks at data management and analytical capabilities, as well as data governance capabilities and platform components.
Tool selection and approaches vary depending on the nuance of the problem you're trying to solve -- whether you're doing extract, transform and load or extract, load and transform data integration, a real-time data integration use case, or interactive queries. Each of the tools has a specialization. But that is where there's still a lot that needs to be fleshed out.
What are the steps people take as they walk through the process of choosing between these new technologies?
Smith: Most of the people we work with are not 'greenfield' -- they're into managing these tools without arbitrarily increasing their portfolio diversity. Admittedly, that may be a buzzword-full answer. But usually, they have an idea of how to judge how their workloads fit with the different SQL-on-Hadoop tools.
They will find that some of these tools have a limited type of [SQL] grammar for the things they want to do. I would throw Impala, as it first emerged, into that group. It was leading the pack on performance but maybe providing a limited subset of capabilities. Hive has been around the longest and is relatively mature for the Hadoop ecosystem -- it is probably more focused on your batch data integration workloads.
In each case, there is a bit of discovery required around your business use cases, what your infrastructure is today [and] where the new Hadoop components would fit within the context of managing an IT portfolio. You have to have a process to introduce new components for your analytical workloads.