Numerous SQL-on-Hadoop engines are available for accessing data stored in HDFS using the familiar SQL language. They all look promising, they all support a rich SQL dialect, but which ones is the fastest? Performance is important, especially when business users interactively use BI tools to access big data via these SQL-on-Hadoop engines.
So, which one is the fastest for an interactive, ad-hoc, and OLAP-like workload? Until now, there wasn’t much information available on this topic. That is, until AtScale published benchmark results across three SQL-on-Hadoop engines: Apache Hive, Cloudera Impala, and Spark SQL. Of course, we have the TPC-H and the TPC-DS benchmarks, but these two don’t represent interactive, ad-hoc, OLAP-like workloads.
They developed a benchmark that represents an interactive, ad-hoc query, OLAP-like workload. The benchmark is defined on the publicly available Star Schema Benchmark data set. AtScale completed this by defining a set of typical OLAP queries. These queries can be classified in three groups: quick metric queries (compute a particular metric value for a period of time), product insight queries (compute metrics aggregated against a set of product and date based dimensions) and customer insight queries (compute metrics aggregated against a set of product, customer, and date-based dimensions). Together, all these queries represent the types of queries appearing in real life BI environments in which users use tools such as Business Objects, Tableau, Excel, and Qlikview.
The performance results that have come out of this benchmark are intriguing, although they may not be what some people expect. One clear result is that not one SQL-on-Hadoop is the fastest for all of the queries. For some queries Apache Hive is the fastest, and for others it’s Spark SQL or Cloudera Impala.
Life would be easy if one of the engines would always be the fastest. Because that would mean that when an organization wants to select the fastest, they can pick just that one. This benchmark clearly shows that this is not the case. By itself, this is quite interesting, because some specialists have a favorite SQL-on-Hadoop engine and they really think their favorite is always the fastest. This is not confirmed by this benchmark.
Important to understand is that these three engines can access the same HDFS files and the same table descriptions documented in HCatalog. This means that any solution like AtScale and those that generate SQL code for SQL-on-Hadoop engines, such as some ETL tools, should support all three SQL-on-Hadoop engines to access data in HDFS files. They must be smart enough to know which one is the best to use for a particular SQL query. In fact, all data virtualization tools and BI on Hadoop tools that generate SQL code for the SQL-on-Hadoop engines have to be aware of the strengths and weaknesses of these engines.
I am interested to see how this is going to evolve the coming years. But we have to thank AtScale for doing this benchmark. It has given us some more information on the performance aspects of SQL-on-Hadoop engines. I strongly recommend to read the benchmark results. One thing we definitely learned from this benchmark is that we can’t answer the question (yet) which SQL-on-Hadoop is the fastest.