Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are principally stored in Hadoop files. It is commonly deployed alongside other compatible tools in the software ecosystem built on the Hadoop framework for handling large data sets in a distributed computing environment.
Like Hadoop, Hive has roots in batch processing techniques. It originated in 2007 with developers at Facebook who sought to provide SQL access to Hadoop data for analytics users, and it was developed to address the need to handle petabytes of data accumulating via web activity. Release 1.0 became available in February 2015.
How Apache Hive works
Initially, Hadoop processing relied solely on the MapReduce framework, which required users to understand advanced Java programming in order to query data. The motivation behind Apache Hive was to simplify query development and, in turn, to open up unstructured data held in Hadoop to a wider group of users in organizations.
Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language that, in its first incarnation, automatically translated SQL-style queries into MapReduce jobs executed on the Hadoop platform. In addition, HiveQL allowed custom MapReduce scripts to be plugged into queries.
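To make the translation idea concrete, the following is a minimal sketch in Python of how a SQL-style aggregation such as SELECT page, COUNT(*) FROM page_views GROUP BY page can be broken into map, shuffle and reduce steps. It is an illustration only, not Hive's actual generated code; the table name, the sample rows and all function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table named page_views.
ROWS = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Collect emitted values under their key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order for the GROUP BY query sketched above."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

In a real cluster the map and reduce steps would run in parallel across many nodes; here they run in a single process so that the data flow is easy to follow.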
When SQL queries are submitted to Hive, for example through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) interfaces, they are initially received by a driver component that creates session handles and forwards requests to a compiler, which subsequently forwards jobs for execution. Hive enables data serialization and deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore.
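The role of a system catalog in serialization and deserialization can be sketched as follows. This is a simplified Python illustration, not the Hive Metastore or its SerDe interface: the METASTORE dictionary, the page_views table and both functions are hypothetical. The point it shows is that the schema lives in the catalog and is applied to raw text only when data is read or written.

```python
# Hypothetical in-memory catalog: table name -> column schema and field delimiter.
METASTORE = {
    "page_views": {
        "columns": [("page", str), ("hits", int)],
        "delimiter": "\t",
    }
}


def deserialize(table, raw_line):
    """Turn one raw text line into a typed row using the catalog's schema."""
    meta = METASTORE[table]
    fields = raw_line.rstrip("\n").split(meta["delimiter"])
    return {
        name: cast(value)
        for (name, cast), value in zip(meta["columns"], fields)
    }


def serialize(table, row):
    """Turn a typed row back into a delimited text line."""
    meta = METASTORE[table]
    return meta["delimiter"].join(str(row[name]) for name, _ in meta["columns"])


if __name__ == "__main__":
    row = deserialize("page_views", "/home\t42\n")
    print(row)
    print(serialize("page_views", row))
```

Because the files themselves carry no schema in this sketch, changing the catalog entry changes how the same bytes are interpreted, which is the flexibility in schema design that the paragraph above refers to.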
How Hive has evolved
Like Hadoop, Hive has evolved to encompass more than just MapReduce. Inclusion of the YARN resource manager in Hadoop 2.0 made it easier for developers to expand the use of Hive, as it did for other Hadoop ecosystem components. Over time, HiveQL has gained support from the Apache Spark SQL engine as well as the Hive engine, and both HiveQL and the Hive engine have added support for distributed process execution via Apache Tez and Spark.
Early Hive file support comprised text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and Record Columnar Files (RCFiles), which store the columns of a table in the manner of a columnar database. Hive columnar storage support has come to include Optimized Row Columnar (ORC) files and Parquet files.
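The difference between row-oriented and columnar layouts can be illustrated with a small Python sketch. It does not reproduce the RCFile, ORC or Parquet formats; the sample data and function names are assumptions for illustration. It only shows why an aggregate over one column needs to touch less data when values are stored column by column.

```python
# Hypothetical table contents: (page, hits, country) per row.
COLUMNS = ("page", "hits", "country")
ROWS = [
    ("/home", 120, "US"),
    ("/docs", 45, "DE"),
    ("/blog", 80, "US"),
]


def to_columnar(rows, columns):
    """Pivot a list of row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def total_hits_row_layout(rows):
    """Row layout: every full row is visited to read a single field."""
    return sum(row[COLUMNS.index("hits")] for row in rows)


def total_hits_columnar(table):
    """Columnar layout: only the 'hits' column is read."""
    return sum(table["hits"])


if __name__ == "__main__":
    table = to_columnar(ROWS, COLUMNS)
    print(total_hits_row_layout(ROWS), total_hits_columnar(table))
```

Both functions return the same total; on disk, the columnar version would be able to skip the page and country values entirely, which is the main appeal of columnar formats for analytic queries.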
Hive execution speed and interactivity were topics of attention nearly from its inception, because query performance lagged behind that of more familiar SQL engines. In 2013, to boost performance, Apache Hive committers began work on the Stinger project, which brought Apache Tez and directed acyclic graph processing to the warehouse system.
Also accompanying Stinger were new approaches that improved performance by adding a cost-based optimizer, in-memory hash joins, a vectorized query engine and other enhancements. Query throughput of 100,000 queries per hour and analytics processing of 100 million rows per second per node have been reported for recent versions of Hive.
Additions accompanying release 2.3 in 2017 and release 3.0 in 2018 furthered Apache Hive's development. Among the highlights were support for Live Long and Process (LLAP) functionality, which allows prefetching and caching of columnar data, and support for atomicity, consistency, isolation and durability (ACID) operations including INSERT, UPDATE and DELETE. Work also began on materialized views and automatic query rewriting capabilities familiar to traditional data warehouse users.
Hive supporters and alternatives
Committers to the Apache Hive community project have included individuals from Cloudera, Hortonworks, Facebook, Intel, LinkedIn, Databricks and others. Hive is supported in Hadoop distributions. As with the HBase NoSQL database, it is very commonly implemented as part of Hadoop distributed data processing applications. Hive is available by download from the Apache Software Foundation, as well as from Hadoop distribution providers Cloudera, MapR and Hortonworks, and as a part of AWS Elastic MapReduce. The latter implementation supports analysis of data sets residing in Amazon Simple Storage Service (S3) object storage.
Apache Hive was among the very first attempts to bring SQL querying capabilities to the Hadoop ecosystem. Among a host of other SQL-on-Hadoop alternatives that have arisen are BigSQL, Drill, Hadapt, Impala and Presto. Also, Apache Pig has emerged as an alternative language to HiveQL for Hadoop-oriented data warehousing.