NEW YORK -- The combination of big data and cloud computing is becoming a veritable elephant in the room for Hadoop vendors and other big data technology companies -- and some are responding with new approaches that could make it easier and less expensive for users to deploy Hadoop cloud systems.
For example, Cloudera Inc. this week added usage-metering capabilities to its Cloudera Director tool for managing clusters in the cloud built on the company's Hadoop distribution. That lets Cloudera users adopt a usage-based pricing model instead of having to pay for cloud clusters on a per-node basis, making it more feasible for them to run transient systems that are set up, used for a specific purpose and then taken out of service to avoid incurring an ongoing cost.
Users can also now deploy clusters in multiple regions and availability zones within a big data cloud environment from a single instance of Cloudera Director. In addition, a new release of Cloudera Enterprise, the vendor's Hadoop-based big data platform, lets the Apache Impala SQL-on-Hadoop query engine run directly against Amazon Simple Storage Service (S3) data stores. That eliminates the need to move data to the Hadoop Distributed File System for querying, another step in enabling transient cluster deployments by users in the Amazon Web Services (AWS) cloud.
The pay-as-you-go pricing and Impala-on-S3 support are both welcome developments for Narasimhan Sampath, a systems architect at Choice Hotels International Inc., based in Rockville, Md. The hospitality company runs Cloudera-based clusters in the AWS cloud, along with technologies such as the Spark data-processing engine and Kafka message-queueing system, to support a variety of self-service analytics applications.
Bring your own cluster to the cloud
During a session at Strata + Hadoop World 2016 here, Sampath said Choice follows a BYOC approach -- for bring your own cluster -- that enables on-demand computing for business operations in the cloud environment. For example, a cluster for the marketing department "can just come up, do a job and be shut down when it's done," he said. Similarly, a development cluster for the IT team is set to run for 12 hours daily and then be taken offline overnight to save on the company's AWS bill.
Cloudera's metered pricing fits nicely with that approach, Sampath said after the session. "I don't have to buy 500 [Cloudera] licenses if I don't need them all the time. It's the same model as Amazon."
He added that Choice worked closely with Cloudera over the past six months on its linking of S3 and Impala, which was originally developed by Cloudera before being released as open source software. Choice uses S3 as a data store, and Sampath said the new querying support in Impala provides added flexibility for the BYOC strategy. "Otherwise, you'd need a central place to put the data."
David Tishgart, director of cloud product marketing at Cloudera, said the Palo Alto, Calif., vendor has seen growing interest among customers in using the cloud over the past 18 months. But until now, "we didn't have a great solution for transient, spin-up, spin-down workloads," he acknowledged. As a result, most Cloudera users who did go to the cloud have had to run persistent clusters there.
Keeping up in the Hadoop cloud
As more users look to the cloud, that limitation could have made it harder for Cloudera -- the top on-premises Hadoop vendor -- to compete against Amazon Elastic MapReduce (EMR), the Hadoop cloud platform offered by AWS. Cloudera could also have found itself at a disadvantage against Microsoft's Azure HDInsight big data cloud service, which is based on Hortonworks Inc.'s Hadoop distribution.
Already, EMR has made AWS the largest Hadoop vendor overall from the standpoint of number of users, according to Gartner analyst Merv Adrian. AWS initially lagged "way behind" its Hadoop rivals in implementing new releases of various Apache big data tools as part of EMR, Adrian said at the 2016 Pacific Northwest BI Summit in July. But that changed two years ago, and he said AWS now has more Hadoop users "than all the other vendors in the market combined."
Hortonworks also focused on expanded Hadoop cloud capabilities at the Strata conference, saying HDInsight is now running on version 2.5 of its Hortonworks Data Platform (HDP) distribution, which was released at the end of August. In addition, Hortonworks now supports integration between Microsoft's Azure Active Directory service and Apache Ranger, a framework for managing data security and user-access privileges in Hadoop systems.
Despite its close ties to Microsoft in the cloud, Hortonworks is also offering a technical-preview version of HDP that lets AWS users set up transient Hadoop clusters with Spark and Apache Hive, a SQL-on-Hadoop query engine. "We understand the need to facilitate workloads on all clouds," said Matt Morgan, senior vice president of global marketing at the Santa Clara, Calif., company. He declined to disclose a general availability schedule for the AWS platform, which was quietly released for trial use in late June.
Paxata Inc. got into the cloud act, too. The vendor of self-service data preparation software is offering a new tool, called Paxata Connect, for pulling together data from clusters running different Hadoop distributions, including ones in separate cloud platforms. A lot of Hadoop workloads are moving to the cloud, said Nenshad Bardoliwalla, Paxata's chief product officer. And a big part of the cloud's allure, he added, is the ability to create "ephemeral" clusters that run a particular job and then "go away."