This content is part of the Buyer's Guide: Investigating Hadoop distributions: Which is right for you?

A look at Amazon Elastic MapReduce cloud-based Hadoop

The Amazon Elastic MapReduce Web service offers a managed Hadoop framework that enables users to distribute and process big data across dynamically scalable Amazon EC2 instances.

Amazon Elastic MapReduce provides users access to a cloud-based Hadoop implementation for analyzing and processing large amounts of data. Built on top of Amazon's cloud services, EMR leverages Amazon's Elastic Compute Cloud and Simple Storage services, enabling users to provision a Hadoop cluster quickly.

Amazon's cloud elasticity and setup tools also give users a way to temporarily scale up a cloud-based Hadoop cluster for short-term increased computing capacity. Amazon EMR lets users focus on the design of their workflow without the distractions of configuring a Hadoop cluster. As with other Amazon cloud services, users pay for only what they use.

Amazon Elastic MapReduce features

The current version of Amazon EMR, 4.3.0, bundles several open source applications, a set of components for users to monitor and manage cluster resources, and components that enable application and cluster interoperability with other services.

The following open source applications come bundled as part of Amazon EMR:

  • Apache Hadoop 2.7.1
  • Apache Hive 1.0.0
  • Apache Mahout 0.11.0
  • Apache Pig 0.14.0
  • Apache Spark
  • Hue
  • Ganglia 3.7.2

AWS Elastic MapReduce also provides users with the option of using MapR's Hadoop distribution in place of Apache Hadoop.

The EMR Web service supports several file system storage options for data processing. These include Hadoop Distributed File System (HDFS) for local and remote file systems, and Amazon S3 buckets accessed through EMR File System (EMRFS). Amazon EMR also integrates with several other Amazon data services, including Amazon DynamoDB, a fast NoSQL database; Amazon Relational Database Service; Amazon Glacier; Amazon Redshift, a petabyte-scale data warehouse service; and AWS Data Pipeline, a service used to move data between AWS services.

Other AWS Elastic MapReduce features enable users to perform the following tasks:

Provision an EMR cluster. An EMR management console helps users quickly navigate through the process of spinning up and autoconfiguring an EMR cluster. Through the console, users select the applications from the EMR bundle to install, the types of server instances to use for the cluster nodes, and the security access policies and controls for the cluster.
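The console choices described above (applications, instance types, security settings) correspond to parameters of the EMR RunJobFlow API. The following is a minimal sketch in Python, assuming the request would be submitted with the boto3 SDK. The cluster name, instance type, node counts and key pair name are hypothetical placeholders, and the two role names are the EMR default roles, which are assumed to already exist in the account. The function only builds the request, so it runs without AWS credentials.

```python
def build_cluster_request(name, applications, core_nodes=2,
                          instance_type="m3.xlarge", key_name=None):
    """Build keyword arguments for the EMR RunJobFlow API call."""
    instances = {
        "InstanceGroups": [
            # One master node coordinates the cluster.
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": instance_type, "InstanceCount": 1},
            # Core nodes run tasks and hold HDFS data.
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": instance_type, "InstanceCount": core_nodes},
        ],
        # Keep the cluster running after its steps finish.
        "KeepJobFlowAliveWhenNoSteps": True,
    }
    if key_name:
        # Optional EC2 key pair for SSH access (hypothetical name).
        instances["Ec2KeyName"] = key_name
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Applications": [{"Name": app} for app in applications],
        "Instances": instances,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance role
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


request = build_cluster_request("analytics-demo", ["Hive", "Pig", "Spark"])

# To launch for real (requires boto3 and AWS credentials):
#   import boto3
#   cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Keeping the request as a plain dictionary makes it easy to review or version the cluster definition before anything is launched.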

Load data into the cluster. Users with typical data volumes can transfer data to an Amazon S3 bucket, where it is available to the cluster for processing. Users with petabyte-scale needs may opt to use AWS Snowball, a secure, high-speed appliance that's shipped to the user, or AWS Direct Connect, a dedicated high-speed network connection between AWS and the user's data center.
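A rough calculation shows why data volume drives the choice of transfer method. The sketch below estimates how long an upload would take over a network link. The 1 Gbps link speed is a hypothetical figure chosen for illustration, and the estimate ignores protocol overhead and contention, so real transfers would likely be slower.

```python
def transfer_days(data_bytes, link_bits_per_second, utilization=1.0):
    """Estimate the days needed to move data_bytes over a network link."""
    seconds = data_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / 86400  # seconds per day


TB = 10**12    # one terabyte, in bytes
PB = 10**15    # one petabyte, in bytes
GBPS = 10**9   # one gigabit per second (hypothetical link speed)

print(round(transfer_days(1 * TB, 1 * GBPS), 2))  # 0.09 days, a little over two hours
print(round(transfer_days(1 * PB, 1 * GBPS), 1))  # 92.6 days
```

Under these assumptions a terabyte uploads in a couple of hours, while a petabyte would occupy the same link for about three months, which is the scale at which a shipped appliance or a dedicated connection becomes worth considering.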

Monitor and manage. Amazon EMR collects metrics that are used to track progress and measure the health of a cluster. These metrics can be accessed through the command-line interface, software development kits or APIs, and they can also be viewed through the EMR management console. Additionally, Amazon CloudWatch can be used along with Ganglia to monitor the cluster and set alarms on events triggered by these metrics.
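As an illustration of setting an alarm on a cluster metric, the sketch below builds the parameters for a CloudWatch alarm that fires when a cluster has been idle for a while. It assumes the boto3 SDK would be used to create the alarm. The cluster ID, the notification topic ARN and the 30-minute idle window are hypothetical values, and the function itself only builds the request, so it runs without AWS credentials.

```python
def build_idle_alarm(cluster_id, sns_topic_arn, idle_minutes=30):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    period_seconds = 300  # evaluate the metric in five-minute periods
    return {
        "AlarmName": f"emr-{cluster_id}-idle",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",  # 1 when the cluster is doing no work
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        # Number of consecutive periods that must breach the threshold.
        "EvaluationPeriods": max(1, idle_minutes * 60 // period_seconds),
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # where the notification is sent
    }


alarm = build_idle_alarm(
    "j-EXAMPLE12345",                                  # hypothetical cluster ID
    "arn:aws:sns:us-east-1:123456789012:emr-alerts",   # hypothetical topic ARN
)

# To create the alarm for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

An idle alarm is a natural first choice on a pay-per-use service, since it flags clusters that are running up charges without doing any work.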

AWS Elastic MapReduce pricing

Amazon's EMR pricing model follows the company's approach to pricing its other Web services: users pay according to how long a cluster runs and the types of server instances it uses. Spot instances can also be used for some or all of the nodes in a cluster, giving users a way to adjust capacity and cost as their computing needs change.
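The pricing model can be illustrated with a small worked calculation. The sketch below assumes each node is charged an hourly EC2 rate plus an hourly EMR rate. All rates, node counts and the three-hour run time are hypothetical figures chosen for easy arithmetic, not actual AWS prices; the Spot rate in particular would vary with market conditions.

```python
def cluster_cost(hours, groups):
    """Sum the cost of every instance group over the cluster's run time."""
    total = 0.0
    for group in groups:
        hourly_rate = group["ec2_rate"] + group["emr_rate"]
        total += hours * group["count"] * hourly_rate
    return total


# Hypothetical hourly rates in US dollars per instance.
groups = [
    {"name": "master",    "count": 1,  "ec2_rate": 0.25, "emr_rate": 0.05},
    {"name": "core",      "count": 4,  "ec2_rate": 0.25, "emr_rate": 0.05},
    # Task nodes bought as Spot instances at an assumed lower EC2 rate.
    {"name": "task-spot", "count": 10, "ec2_rate": 0.05, "emr_rate": 0.05},
]

print(f"${cluster_cost(3, groups):.2f}")  # $7.50
```

In this example the ten Spot task nodes cost less in total than the four on-demand core nodes, which shows how Spot capacity can add short-term computing power at a modest price.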

Amazon provides developers with a wide range of online technical documentation, guides, tutorials and sample code.  
