Investigating Hadoop distributions: Which is right for you?
A collection of articles that takes you from defining technology needs to purchasing options
Hadoop is an open source technology that today is the data management platform most commonly associated with big data applications. The distributed processing framework was created in 2006, primarily at Yahoo and based partly on ideas outlined by Google in a pair of technical papers; soon, other Internet companies such as Facebook, LinkedIn and Twitter adopted the technology and began contributing to its development. In the past few years, Hadoop has evolved into a complex ecosystem of infrastructure components and related tools, which are packaged together by various vendors in commercial Hadoop distributions.
Running on clusters of commodity servers, Hadoop offers a high-performance, low-cost approach to establishing a big data management architecture for supporting advanced analytics initiatives. As awareness of its capabilities has increased, Hadoop's use has spread to other industries, for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data. This includes Web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other devices on the Internet of Things.
This is the first article in a four-part series on Hadoop distributions for big data management. This article lays the groundwork, while article No. 2 examines specific use cases for buying a Hadoop distribution. Article No. 3 helps you determine your must-have features. The concluding article examines Hadoop distributions from the leading vendors, comparing and contrasting their features.
What is Hadoop?
The Hadoop framework encompasses a large number of open source software components: a set of core modules for capturing, processing, managing and analyzing massive volumes of data, surrounded by a variety of supporting technologies. The core components include:
- The Hadoop Distributed File System (HDFS), which supports a conventional hierarchical directory and file system that distributes files across the storage nodes (i.e., DataNodes) in a Hadoop cluster.
- MapReduce, a programming model and execution framework for parallel processing of batch applications.
- YARN (short for the good-humored Yet Another Resource Negotiator), which manages job scheduling and allocates cluster resources to running applications, arbitrating among them when there's contention for the available resources. It also tracks and monitors the progress of processing jobs.
- Hadoop Common, a set of libraries and utilities used by the different components.
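To make the MapReduce programming model more concrete, here is a minimal sketch in plain Python of the classic word-count job. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and everything runs in a single process rather than in parallel across a cluster. It only shows the shape of the model: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big data"]))
```

In a real cluster, the map and reduce steps would run as separate tasks on many nodes, with the framework handling the shuffle between them.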
In Hadoop clusters, those core pieces and other software modules are layered on top of a collection of computing and data storage hardware nodes. The nodes are connected via a high-speed internal network to form a high-performance parallel and distributed processing system.
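The idea of a file system that distributes files across storage nodes can be illustrated with a toy sketch. The Python below splits a file into fixed-size blocks and assigns each block to several DataNodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement and the `place_blocks` function are simplifying assumptions made for illustration; they are not how HDFS actually chooses nodes.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # a common HDFS default replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using simple round-robin placement.

    Hypothetical helper: real HDFS placement also considers rack topology
    and node load, which this sketch ignores.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Each replica goes to the next node in the list, wrapping around.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    for block, replicas in place_blocks(300 * 1024 * 1024, nodes).items():
        print(block, replicas)
```

Because every block lives on more than one node, the loss of a single server does not make the file unreadable, which is part of what lets Hadoop run on commodity hardware.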
As a collection of open source technologies, Hadoop isn't controlled by any single vendor; rather, its development is managed by the Apache Software Foundation. Apache offers Hadoop under a license that basically grants users a no-charge, royalty-free right to use the software. Developers can download it directly from the Apache website and build a Hadoop environment on their own. However, Hadoop vendors provide prebuilt "community" versions with basic functionality that can also be downloaded at no charge and installed on a variety of hardware platforms. They also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services.
In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management, or data integration with external platforms. These commercial offerings make Hadoop more attainable for companies of all sizes. They are especially valuable when the vendor's support services team can jump-start the design and development of a company's Hadoop infrastructure, and guide the selection of tools and the integration of advanced capabilities, so that high-performance analytical applications can be deployed quickly to meet emerging business needs.
The components of a typical Hadoop software stack
What do you actually get when you obtain a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions will include -- but aren't limited to -- the following:
- Alternative data processing and application execution managers such as Tez or Spark, which can run on top of or alongside YARN to provide cluster management, cached data management and other means of improving processing performance.
- Apache HBase, a column-oriented database management system modeled after Google's Bigtable project that runs on top of HDFS.
- SQL-on-Hadoop tools such as Hive, Impala, Stinger, Drill and Spark SQL, which provide varying degrees of compliance with the SQL standard for direct querying of data stored in HDFS.
- Development tools such as Pig that help developers build MapReduce programs.
- Configuration and management tools such as ZooKeeper or Ambari, which can be used for monitoring and administration.
- Analytics environments such as Mahout that supply analytical models for machine learning, data mining and predictive analytics.
Because the software is open source, you don't purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it has contributed to the community as part of its Hadoop distribution.
Who manages the Hadoop big data management environment?
It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance to ensure peak performance. Those IT teams will typically include:
- Requirements analysts to assess the system performance requirements based on the types of applications that will be run in the Hadoop environment.
- System architects to evaluate performance requirements and design hardware configurations.
- System engineers to install, configure and tune the Hadoop software stack.
- Application developers to design and implement applications.
- Data management professionals to do data integration, create data layouts and perform other management tasks.
- System managers to do operational management and maintenance.
- Project managers to oversee the implementation of the various levels of the stack and application development work.
- A program manager to oversee the implementation of the Hadoop environment and prioritization, development and deployment of applications.
The Hadoop software platform market
The evolution of Hadoop into a viable large-scale data management ecosystem has also created a new software market that's transforming the business intelligence and analytics industry. That market has expanded both the kinds of analytics applications that user organizations can run and the types of data that can be collected and analyzed as part of those applications. It includes three independent vendors that specialize in Hadoop -- Cloudera Inc., Hortonworks Inc. and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include Pivotal Software Inc., IBM, Amazon Web Services and Microsoft.
Evaluating vendors that provide Hadoop distributions requires understanding the similarities and differences between two aspects of the product offerings. First is the technology itself: What's included in the different distributions; what platforms are they supported on; and, most important, what specific components are championed by the individual vendors? Second is the service and support model: What types of support and SLAs are provided within each subscription level, and how much do different subscriptions cost?
Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship. The next article in this series will examine several business use cases for a Hadoop big data management platform so you can determine your organization's needs and requirements.