
Databricks Cloud platform aims to open up Spark data processing

Databricks Cloud is a big data platform that's based on the Apache Spark processing engine, with added features designed to simplify big data management and analytics for users.

Product: Databricks Cloud

Release date: In limited availability since June 2014 -- general availability plans not yet announced

What it does

Databricks Cloud is a cloud-based big data platform, developed by Databricks Inc., with the open source Apache Spark processing engine at its core. Spark supports both in-memory and disk-based processing, and proponents claim it can run batch jobs on Hadoop data up to 100 times faster than MapReduce can. In addition, Spark's ability to iteratively process data via a mini-batch approach lets it run machine learning, stream processing and other non-batch applications that MapReduce can't handle. Databricks was co-founded by Spark creator Matei Zaharia, and the company continues to be among the chief contributors to the Spark project within The Apache Software Foundation. The Berkeley, Calif., company said Databricks Cloud adds automated management capabilities to Spark and surrounds the engine with application templates designed around common workflows, an interactive data-exploration workspace and a production pipeline scheduler called Jobs that was added in a March 2015 update.

Why it matters

Databricks Cloud holds interest partly because the technical team at Databricks includes Zaharia and other individuals who had a major hand in Spark's original development as an academic project at the University of California at Berkeley's AMPLab. The company also organizes the Spark Summit conferences focused on the processing engine, and it was one of the first vendors to announce a full Spark-based offering -- though for now, Databricks Cloud remains in a limited availability release open to prospective users who register on the Databricks website.

The product offering also marks an effort to simplify the big data management and analytics process. Databricks describes Databricks Cloud as a "zero-management" platform designed to enable users to quickly process and analyze sets of big data in distributed computer clusters. For example, the platform supports interactive "notebooks" that Databricks said can ease development and management of Spark applications. The notebooks provide interfaces that developers work with, first to program Spark jobs in Python, Scala or SQL and then to schedule them. In effect, the notebooks become programs that can be repeatedly run as automatically executing production jobs, according to the company.

Databricks Cloud initially runs on the Amazon Web Services cloud, but Databricks said it's looking to add support for other cloud hosting platforms going forward. The company currently supports Version 1.2 of Spark, which was released last December, and Version 1.3, which became available in March and was updated with a 1.3.1 release this month. Besides bug fixes, Spark 1.2 brought performance improvements to Spark's core engine and added a new machine learning API, updates to Spark SQL, infrastructure changes that boost the reliability of Spark Streaming applications and an enhanced Python interface that makes Spark-style processing available to a wider range of software developers. Spark 1.3 added a DataFrames API developed by Databricks that's designed to improve operations with large data sets, plus further improvements to Spark SQL.

What users say

Automatic Labs makes devices that collect driving and engine data from cars and send it via wireless connections to cloud-based servers; the San Francisco company uses Databricks Cloud on AWS to analyze the data and create reports that are sent to drivers' smartphones after they've completed trips. Robert Ferguson, director of engineering at Automatic, said he has found particular value in the additional Python support provided by Spark 1.2.

"DataBricks now gives us a very familiar Python notebook experience," Ferguson said. "We may have new developers that have experience with Python but not with big data, and they can work effectively with the notebooks." He added that bringing Python developers into the loop speeds development time versus Java-centric MapReduce implementations.

Another element of Databricks Cloud that Ferguson pointed to is the user interface, which also offers benefits for SQL programmers, he said. "If you run a query using Spark SQL, you get a visualization of the data that is tremendously valuable."

Sharethrough Inc., an online advertising platform provider that's also based in San Francisco, is another Databricks Cloud user -- it runs Internet clickstream and ad-visibility data through the Spark Streaming module to power machine learning applications that analyze ad performance for clients. Russell Cardullo, the technical lead on the implementation, said the combination of Spark and Databricks Cloud enables Sharethrough to easily convert the stream processing jobs into batch programs for creating reports, saving the company considerable development time.

Sharethrough was one of the earliest users of the Databricks platform, and Cardullo said the Spark technology had some instability prior to Version 1.0 of the engine becoming available last June. "It was a little painful to upgrade each time they upgraded," he said. But Cardullo added that the software has been stable since the Spark 1.0 release.

Drilldown

* Runs Apache Spark on cloud-based clusters with automated management capabilities and support for importing data from Hadoop, AWS S3, relational databases and NoSQL technologies such as Cassandra and MongoDB.

* Offers an interactive and collaborative notebook format for writing Spark commands in Python, Scala or SQL, plus built-in data visualizations and dashboard development tools.

* Provides tools for scheduling workflows and setting up production pipelines that encompass data imports, ETL integration, processing and data exports.

* Supports the standard Spark API, as well as ODBC and JDBC connectivity and a native REST API for integrating third-party applications.

Pricing

Databricks hasn't released detailed pricing information for Databricks Cloud, but it said there are various subscription tiers defined according to usage capacity, support model and feature set. The tiered pricing starts at a rate of "several hundred dollars per month," the vendor said.

Jack Vaughan is SearchDataManagement's news and site editor. Email him at jvaughan@techtarget.com, and follow us on Twitter: @sDataManagement.

Executive editor Craig Stedman contributed to this story.

Next Steps

Learn about Spark's use in a data preparation product

Read about machine learning tools that support Spark

Find out how Spark is being handled by the big data vendors

Examine how cloud complicates the data processing pipeline

Learn about when Spark cloud deployments make sense

This was last published in April 2015



Join the conversation

What's your take on Spark? Do you see it as an add-on to Hadoop, or a standalone phenomenon?
I think Spark is definitely its own thing at this point. It doesn't have its own file system, which is why people use it with Hadoop, but Hadoop isn't the only tool that can provide that. And when you consider that Spark is much faster than Hadoop and works with other third-party tools, it's easy to make the case that Spark is a standalone tool that can work with Hadoop or with other tools.
