Definition

Apache Hive

Contributor(s): Delaney Rebernik
This definition is part of our Essential Guide: Managing Hadoop projects: What you need to know to succeed

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.

Hive has three main functions: data summarization, query and analysis.  It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom MapReduce scripts to be plugged into queries. Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called Hive-Metastore.

According to the Apache Hive wiki, "Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs)." 

Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files which store columns of a table in a columnar database way.)

This was last updated in November 2012

Continue Reading About Apache Hive

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.

-ADS BY GOOGLE

File Extensions and File Formats

SearchBusinessAnalytics

SearchAWS

SearchContentManagement

SearchOracle

SearchSAP

SearchSQLServer

Close