Buyer's Handbook: Investigate Hadoop distributions for your organization Article 2 of 4


Explore Hadoop distributions to manage big data

Discover the uses of Hadoop distributions and the first steps in evaluating these products, as well as how the merger of rivals Cloudera and Hortonworks affects the market.

By David Loshin, Knowledge Integrity Inc.

Published: 25 Feb 2019

Hadoop is an open source technology that is the data management platform most commonly associated with big data applications today. Its creators designed the original distributed processing framework in 2006 and based it partly on ideas that Google outlined in a pair of technical papers.

Yahoo became the first production user of Hadoop that year. Soon, other internet companies, such as Facebook, LinkedIn and Twitter, adopted the technology and began contributing to its development. Hadoop eventually evolved into a complex ecosystem of infrastructure components and related tools that several vendors package together in commercial Hadoop distributions.

Running on clusters of commodity servers, Hadoop offers users a high-performance, low-cost approach to establishing a big data management architecture to support advanced analytics initiatives.

As awareness of Hadoop's capabilities has increased, its use has spread to other industries for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data. This includes web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other internet of things devices.

What is Hadoop?

The Hadoop framework encompasses a large number of open source software components: a set of core modules that capture, process, manage and analyze massive volumes of data, surrounded by a variety of supporting technologies. The core components include:

The Hadoop Distributed File System (HDFS): Supports a conventional hierarchical directory and file system that distributes files across the storage nodes -- i.e., DataNodes -- in a Hadoop cluster.
YARN (short for the good-humored Yet Another Resource Negotiator): Manages job scheduling and allocates cluster resources to running applications, arbitrating among them when there's contention for the available resources. It also tracks and monitors the progress of processing jobs.
MapReduce: A programming model and execution framework for parallel processing of batch applications.
Hadoop Common: A set of libraries and utilities that the other components utilize.
Hadoop Ozone and Hadoop Submarine: Newer technologies that offer users an object store and a machine learning engine, respectively.
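
To make the MapReduce programming model in the list above more concrete, here is a minimal sketch of a word count, the customary introductory example. It runs in a single Python process and does not use Hadoop's own APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions, not part of any Hadoop library. In a real cluster, the framework would run many map and reduce tasks in parallel across nodes and perform the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory collection of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines read from files in HDFS.
    sample = ["big data big clusters", "data nodes store data"]
    print(word_count(sample))
```

Because each map call sees only one line and each reduce call sees only one key, the work can be divided among many machines without the functions needing to know about one another, which is what makes the model suitable for batch processing on a cluster.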

In Hadoop clusters, those core pieces and other software modules layer on top of a collection of computing and data storage hardware nodes. The nodes connect via a high-speed internal network to form a high-performance parallel and distributed processing system.
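
As a rough illustration of how HDFS spreads a file across the storage nodes in such a cluster, the following sketch splits a file into fixed-size blocks and assigns each block to several DataNodes. The 128 MB block size and replication factor of 3 are the HDFS defaults in recent releases; everything else is an assumption made for illustration. In particular, the round-robin placement and the node names are hypothetical stand-ins, since HDFS itself uses a rack-aware placement policy that this sketch does not attempt to reproduce.

```python
from typing import Dict, List

BLOCK_SIZE_MB = 128      # HDFS default block size in recent releases
REPLICATION_FACTOR = 3   # HDFS default replication factor


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block a file of the given size occupies."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(
    num_blocks: int,
    datanodes: List[str],
    replication: int = REPLICATION_FACTOR,
) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order.

    Simplification: real HDFS placement is rack-aware; this is not.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300)  # a hypothetical 300 MB file
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    print(blocks)
    print(place_replicas(len(blocks), nodes))
```

Keeping multiple copies of every block on different nodes is what lets a cluster built from commodity servers tolerate the loss of an individual machine.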

Because Hadoop is a collection of open source technologies, no single vendor controls it; rather, the Apache Software Foundation manages its development. Apache offers Hadoop under a license that grants users a no-charge, royalty-free right to use the software.

Developers and other users can download the software directly from the Apache website and build Hadoop environments on their own. However, Hadoop vendors provide prebuilt, community versions with basic functionality that users can also download at no charge and install on a variety of hardware platforms. The vendors also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services.

In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management or data integration with external platforms. These commercial offerings make Hadoop more attainable for companies of all sizes.

This is especially valuable when the commercial vendor's support services team can jump-start a company's design and development of its Hadoop infrastructure. The team can also help guide the selection of tools and the integration of advanced capabilities needed to deploy high-performance analytical systems that meet emerging business needs.

The components of a typical Hadoop software stack

What do you actually get when you use a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions will include -- but aren't limited to -- the following:

Alternative data processing and application execution managers, such as Spark, Kafka, Flink, Storm or Tez, that can run on top of or alongside YARN to provide cluster management, cached data management and other means of improving processing performance.
Apache HBase: A column-oriented database management system modeled after Google's Bigtable project that runs on top of HDFS.
SQL-on-Hadoop tools, such as Hive, Impala, Presto, Drill and Spark SQL, that provide varying degrees of compliance with the SQL standard for direct querying of data stored in HDFS.
Development tools, such as Pig, that help developers build MapReduce programs.
Configuration and management tools, such as ZooKeeper or Ambari, that are useful for monitoring and administration.
Analytics environments, such as Mahout, that supply analytical models for machine learning, data mining and predictive analytics.

Because the software is open source, companies don't have to purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it contributes to the community as part of its Hadoop distribution.

Who manages the Hadoop big data management environment?

It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance to ensure peak performance. Those IT teams typically include:

requirements analysts to assess the system performance requirements based on the types of applications that will run in the Hadoop environment;
system architects to evaluate performance requirements and design hardware configurations;
system engineers to install, configure and tune the Hadoop software stack;
application developers to design and implement applications;
data management professionals to prepare and run data integration jobs, create data layouts and perform other management tasks;
system managers to ensure operational management and maintenance;
project managers to oversee the implementation of the various levels of the stack and application development work; and
a program manager who oversees the implementation of the Hadoop environment and the prioritization, development and deployment of applications.

The Hadoop software platform market

The evolution of Hadoop as a viable, large-scale data management ecosystem has also created a new software market that's transforming the business intelligence and analytics industry. This has expanded both the kinds of analytics applications that user organizations can run and the types of data that the companies can collect and analyze as part of those applications.

The market now includes two major independent vendors that specialize in Hadoop: Cloudera Inc., the new company that Cloudera and Hortonworks formed through a merger announced in October 2018, and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include cloud platform market leaders AWS, Google and Microsoft, the last of which uses the Hortonworks distribution as part of a managed big data service.


Over the years, the Hadoop market has matured -- and consolidated -- significantly. IBM, Intel and Pivotal Software all dropped out of the market, but the combination of Cloudera and Hortonworks is the biggest change for users to date. The merger of the former rivals gives the new Cloudera a larger share of the market and could enable it to compete more effectively in the cloud.

In fact, Cloudera's new messaging is that it will deliver "the industry's first enterprise data cloud" -- an indication of its desire to compete with the AWS, Microsoft Azure and Google clouds.

Cloudera plans to develop a unified offering called the Cloudera Data Platform, although it hasn't said when it will become available. In the meantime, the company will continue to develop the existing Cloudera and Hortonworks platforms and support them until at least January 2022.

Although the new Cloudera may be more competitive, a potential downside to the merger is that Hadoop users now have fewer options. That's why it's even more critical to evaluate the vendors that provide Hadoop distributions and understand how their product offerings compare in two primary aspects.

First is the technology itself: what's included in the different distributions, what platforms they're supported on and, most importantly, what specific components the individual vendors support.

Second is the service and support model: what types of support and SLAs do vendors provide within each subscription level, and how much do different subscriptions cost?

Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship.

Linda Rosencrance contributed to this report.
