Investigating Hadoop distributions: Which is right for you?
A collection of articles that takes you from defining technology needs to purchasing options
Although the software components that constitute the Hadoop ecosystem stack are open source technologies, there are numerous benefits to paying a vendor for a subscription to use its commercial Hadoop platform. For example, a subscription provides technical support and training, as well as access to enterprise features not available to the open source community. While the enterprise editions of vendor Hadoop distributions all provide the core components of the Hadoop ecosystem stack, the key differentiators are what these vendors offer beyond the openly accessible functionality.
Recent changes in the market have thinned the ranks of Hadoop vendors. Just this month, for example, Pivotal Software pulled the plug on its own Hadoop distribution and said it would start reselling Hortonworks' instead. But there's still a diverse group of suppliers to consider, including independent Hadoop specialists, cloud providers and two of the largest IT vendors.
To help you determine which Hadoop provider is right for your organization, this article distinguishes the top Hadoop distributions based on several key characteristics; these include deployment models, enterprise-class features, security and data protection features, and support services.
Note that while the Hadoop big data management ecosystem is engineered to support scalable data storage and high-performance distributed computing, your actual performance may vary for several reasons, including the software implementation. But many performance issues are dependent on the planned applications themselves. To address this, we'll further examine how the Hadoop product distributions are targeted to meet the business needs of user organizations.
1. Hadoop deployment models
Most of the Hadoop vendors support a mix of deployment methods, but Hadoop offerings from Microsoft and Amazon Web Services are deployed solely in cloud environments. Microsoft leverages its Azure cloud infrastructure for HDInsight, a managed service based on the Hortonworks Data Platform (HDP) -- the same Hadoop distribution that Pivotal is now reselling. AWS uses its Amazon Elastic Compute Cloud (EC2) platform and S3 data store to underpin Amazon Elastic MapReduce (EMR), which bundles its Hadoop distribution with various other tools and technologies. In addition, Amazon EMR provides the option of using MapR's Hadoop distribution instead of the Amazon one.
The cloud deployment model provides a rapid yet low-effort means of provisioning a Hadoop cluster, and both Microsoft and AWS enable users to resize their environments on demand to handle dynamic computing and storage capacity needs. This elasticity is desirable for organizations with computational and storage needs that may vary over time.
While the other major Hadoop vendors -- Cloudera, Hortonworks, IBM and MapR -- all offer cloud-based deployments, they aren't limited to that model. They allow users to download distributions that can be deployed on-premises or in private clouds on a variety of servers, including Linux and Windows systems. In addition, Cloudera and MapR also provide sandbox versions that can be run in a virtual environment such as VMware.
The bottom line: Consider whether your organization prefers to manage its big data environment in-house or use a hosted service. In-house management implies oversight and maintenance of the software environment and continuous monitoring of the system, whether that environment is a physical platform on premises or housed using a cloud-based service. The on-premises option may be preferable if you have experienced staff and know the proper system sizing characteristics, or if security concerns warrant managing the system behind a trusted firewall.
The alternative is to use a vendor with a hosted services platform that will help configure, launch, manage and monitor your operations. This may be preferable if you aren't sure what size system you will need or expect that the system size will grow based on increasing demand. The benefit of working with a cloud or hosted service is that it will provide the necessary elasticity for both storage and processing resources.
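As a rough illustration of the sizing question raised above, the following Python sketch estimates how many data nodes an on-premises cluster might need. It assumes HDFS's default replication factor of 3; the disk size per node, the headroom fraction and the function name are hypothetical values chosen for the example, not vendor guidance.

```python
import math


def estimate_data_nodes(raw_tb, replication=3, headroom=0.25, disk_per_node_tb=24.0):
    """Estimate the number of data nodes needed to store raw_tb terabytes.

    replication      -- copies kept of each block (3 is the HDFS default)
    headroom         -- fraction of each node's disk kept free (assumed value)
    disk_per_node_tb -- raw disk capacity per node in TB (assumed value)
    """
    if raw_tb < 0:
        raise ValueError("raw_tb must be non-negative")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    stored_tb = raw_tb * replication                      # data after replication
    usable_per_node = disk_per_node_tb * (1 - headroom)   # capacity left for data
    return math.ceil(stored_tb / usable_per_node)


if __name__ == "__main__":
    # Compare the node count today with the count if the data doubles.
    print(estimate_data_nodes(100), estimate_data_nodes(200))
```

If the estimate changes sharply under plausible growth assumptions, that is one signal that an elastic hosted service may be the safer starting point.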
2. Enterprise-class features of the top Hadoop distributions
There are some notable differences in the development approaches of the three independent Hadoop vendors. Cloudera often augments the Hadoop core with internally developed add-on technologies -- for example, its Impala SQL-on-Hadoop query engine; Cloudera Manager administration tools; and Kudu, an alternative data store to the Hadoop Distributed File System (HDFS) for use in real-time analytics applications. Typically, the company now open sources such technologies after doing the initial development work itself. Hortonworks, on the other hand, says that it's "innovating 100% of its software in the Apache Hadoop community, and there are no proprietary extensions." Add-on technologies that it drives, such as the Ambari provisioning and management software, are launched as open source projects from the outset. In addition, Hortonworks has banded together with IBM and other companies to form the Open Data Platform Initiative (ODPi), an organization devoted to creating a common set of core technical specifications for Hadoop platforms. ODPi members claim that this will improve interoperability and minimize vendor lock-in.
MapR has taken a third path by developing its own file system, MapR-FS, instead of using HDFS, as well as its own NoSQL database, MapR-DB, and other foundational technologies in an effort to support deployments of large clusters with enterprise-class performance needs. MapR also is increasingly focusing on real-time and stream processing applications. In late 2015, the company rebranded its product as the MapR Converged Data Platform, which combines Hadoop and the MapR file system and database with the Apache Spark processing engine and a new event streaming technology called MapR Streams in order to handle both batch and real-time jobs.
From a features standpoint, the enterprise version of the Cloudera CDH distribution provides tools for operational management and reporting and for supporting business continuity. This includes such items as configuration history and rollbacks, rolling updates and service restarts, and automated disaster recovery. MapR's enterprise offering provides tools to better manage and ensure the resiliency and reliability of data in Hadoop clusters, as well as multi-tenancy and high availability capabilities. Hortonworks provides proactive monitoring and maintenance with its HDP support subscriptions.
IBM, meanwhile, has adopted an analytics-oriented strategy on its BigInsights for Apache Hadoop distribution, in keeping with its broader focus on selling business intelligence and advanced analytics tools. IBM offers different value-add modules with enterprise-grade features as part of BigInsights, including separate Analyst and Data Scientist modules. Its Analyst module provides Big SQL for federated SQL access to Hadoop and other data sources. BigSheets, which is part of the Analyst module, allows users to explore, transform and perform visualizations on large data sets stored in Hadoop, using an intuitive spreadsheet-like interface. The BigInsights Data Scientist Module includes a version of the R language, text analytics and a machine learning library called SystemML that has been contributed to the open source community.
Although the cloud platform itself is the primary calling card for Amazon EMR, AWS also offers tools for monitoring and managing clusters and for enabling application and cluster interoperability as part of the Hadoop service.
Amazon EMR collects metrics that are used to track progress and measure the health of a cluster. Cluster health metrics can be accessed through the command line interface, software developer kits or APIs and can be viewed through the EMR management console. Additionally, Amazon's CloudWatch monitoring service can be used along with its implementation of the Apache Ganglia performance monitoring component to check the cluster and set alarms on events triggered by these metrics.
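The idea of setting alarms on cluster metrics can be sketched in a few lines of Python. This is a generic illustration only: it does not call the CloudWatch or EMR APIs, and the metric names, thresholds and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    name: str                          # label reported when the alarm fires
    metric: str                        # key to look up in a metrics sample
    breached: Callable[[float], bool]  # returns True when the value is unhealthy


# Hypothetical alarm definitions; real thresholds depend on the workload.
ALARMS = [
    Alarm("hdfs-nearly-full", "hdfs_utilization_pct", lambda v: v > 85.0),
    Alarm("missing-blocks", "missing_blocks", lambda v: v > 0),
    Alarm("too-few-core-nodes", "core_nodes_running", lambda v: v < 3),
]


def evaluate(alarms: List[Alarm], sample: Dict[str, float]) -> List[str]:
    """Return the names of alarms whose metric is present and breached."""
    return [
        a.name
        for a in alarms
        if a.metric in sample and a.breached(sample[a.metric])
    ]


if __name__ == "__main__":
    # One made-up metrics sample for a cluster.
    sample = {"hdfs_utilization_pct": 91.0, "missing_blocks": 0, "core_nodes_running": 4}
    print(evaluate(ALARMS, sample))
```

In a managed service, the monitoring tool performs this evaluation and handles notification; the sketch only shows the underlying threshold logic.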
The bottom line: Choosing a vendor that provides value-add components as part of its enterprise subscription may mean committing to a long-term relationship -- especially if these components are tightly integrated with its standard stack distribution. If you're concerned about vendor lock-in, consider those vendors that are participating in the ODPi.
3. Security and protection offerings from the Hadoop vendors
Despite the expanding use of open source software for enterprise-class applications, there remain suspicions about its suitability for production use from a security and protection perspective. Several Hadoop vendors have taken steps to alleviate some of this anxiety.
For example, Hortonworks has teamed up with other vendors and customers to launch a Data Governance Initiative for Hadoop, with an initial focus on a new Apache project called Atlas for managing shared metadata, data classification, auditing, and security and policy management for data protection. It's also working to integrate Atlas with Ranger, an open source security tool for enforcing data access policies. Cloudera provides tools that enable users to manage data security and governance for the CDH platform, supporting an organization's need to meet compliance and regulatory requirements.
In addition, Hortonworks, Cloudera, MapR and IBM all provide data encryption. Both Hortonworks and Cloudera support encryption of data at rest. MapR provides encryption of data transmitted to, from and within a cluster. IBM offers InfoSphere Guardium, a product that enforces data privacy and provides encryption and masking of confidential data.
The bottom line: The Hadoop vendors provide different approaches to authentication, role-based access control, security policy management and data encryption. Carefully specify your security and protection requirements and review how each vendor addresses those needs.
4. Support subscriptions for the top Hadoop distributions
The fundamental value proposition for the open source software model is the bundling and simplification of system deployment with support and services. One alternative for deploying Hadoop involves downloading the source code for each component from the open source repository and then building and integrating all the parts together. This takes both skill and effort, and is likely to be an iterative process. Open source vendors have already done the heavy lifting, providing preconfigured distributions and maintaining an up-to-date integrated stack.
What differentiates the vendors to a large degree is their support models. Hortonworks provides several models, ranging from its Jumpstart edition with Web-based support during business hours and one-day response time to its Enterprise edition with 24/7 support and much shorter response times depending on the severity of the issue. Cloudera offers a support subscription with one-hour and 24/7 support options for enterprise license holders. It also offers premium support, which includes a 15-minute response time for critical issues, for organizations with Flex or Data Hub edition licenses.
All AWS accounts include basic support, which provides 24/7 customer service, access to community forums and documentation, and access to the AWS Trusted Advisor application. Developer support includes one-hour response for severe issues -- with 12- or 24-hour response times for most issues. Business-level support provides 24/7 email access to cloud support engineers as well as shortened response times based on severity. Enterprise-level support adds a response time of less than 15 minutes for critical issues and a dedicated technical account manager, plus additional launch and operation support benefits.
MapR offers a Premium support service that adds Web and email support, custom portal, training, urgent bug fixes, follow-the-sun support and 24/7 phone support for priority issues. The company's Premium+ Support adds priority queuing of tickets and single point of contact support, and offers options for onsite or remote dedicated support. IBM provides support for organizations that purchase the licensed components -- also referred to as their value-add modules -- that extend their Open Platform with Apache Hadoop.
The bottom line: If support services are the source of added value from the vendor, the costs for the different support subscriptions should be aligned with customer expectations. Subscriptions providing one-hour or even 15-minute response times on a 24/7 basis with dedicated support staff will cost a lot more than 24-hour response time from a Web-based interface during business hours.
Hadoop has transformed the business intelligence and analytics industry during the past 10 years. But, as we've examined, the open source Hadoop framework offers only so much, and companies that need more robust performance and functionality capabilities as well as maintenance and support are turning to commercial Hadoop software distributions. Hopefully, this information will help you make a more informed choice when purchasing a Hadoop distribution.