Databricks Cloud platform aims to open up Spark data processing

Databricks Cloud is a big data platform that's based on the Apache Spark processing engine, with added features designed to simplify big data management and analytics for users.

Product: Databricks Cloud

Release date: In limited availability since June 2014 -- general availability plans not yet announced

What it does

Databricks Cloud is a cloud-based big data platform, developed by Databricks Inc., with the open source Apache Spark processing engine at its core. Spark supports both in-memory and disk-based processing, and proponents claim it can run batch jobs on Hadoop data up to 100 times faster than MapReduce can. In addition, Spark's ability to iteratively process data via a mini-batch approach lets it run machine learning, stream processing and other non-batch applications that MapReduce can't handle. Databricks, which was co-founded by Spark creator Matei Zaharia, continues to be among the chief contributors to the Spark project within The Apache Software Foundation. The Berkeley, Calif., company said Databricks Cloud adds automated management capabilities to Spark and surrounds the engine with application templates designed around common workflows, an interactive data-exploration workspace and a production pipeline scheduler called Jobs that was added in a March 2015 update.
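To make the mini-batch idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API; the function names (`mini_batches`, `batch_sum`, `running_totals`) are hypothetical, and the batches are cut by record count for simplicity, whereas Spark Streaming groups records by time interval. The point is only that a stream can be handled as a series of small batch jobs whose results are carried forward.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Group an incoming stream of records into fixed-size mini-batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def batch_sum(batch: List[float]) -> float:
    """An ordinary batch computation; here it is just a sum."""
    return sum(batch)


def running_totals(stream: Iterable[float], size: int) -> List[float]:
    """Apply the batch computation to each mini-batch and keep a running total."""
    totals = []
    total = 0.0
    for batch in mini_batches(stream, size):
        total += batch_sum(batch)
        totals.append(total)
    return totals


readings = [1.0, 2.0, 3.0, 4.0, 5.0]
print(running_totals(readings, size=2))  # prints [3.0, 10.0, 15.0]
```

Each call to `batch_sum` sees only a small slice of the data, so results are available long before the whole input has arrived.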

Why it matters

Databricks Cloud holds interest partly because the technical team at Databricks includes Zaharia and other individuals who had a major hand in Spark's original development as an academic project at the University of California at Berkeley's AMPLab. The company also organizes the Spark Summit conferences focused on the processing engine, and it was one of the first vendors to announce a full Spark-based offering -- though for now, Databricks Cloud remains in a limited availability release open to prospective users who register on the Databricks website.

The product offering also marks an effort to simplify the big data management and analytics process. Databricks describes Databricks Cloud as a "zero-management" platform designed to enable users to quickly process and analyze sets of big data in distributed computer clusters. For example, the platform supports interactive "notebooks" that Databricks said can ease development and management of Spark applications. The notebooks provide interfaces that developers work with, first to program Spark jobs in Python, Scala or SQL and then to schedule them. In effect, the notebooks become programs that can be repeatedly run as automatically executing production jobs, according to the company.

Databricks Cloud initially runs on the Amazon Web Services cloud, but Databricks said it's looking to add support for other cloud hosting platforms. The company currently supports Version 1.2 of Spark, which was released last December, and Version 1.3, which became available in March and was updated with a 1.3.1 release this month. Besides bug fixes, Spark 1.2 brought performance improvements to Spark's core engine and added a new machine learning API, updates to Spark SQL, infrastructure changes that boost the reliability of Spark Streaming applications and an enhanced Python interface that makes Spark-style processing available to a wider band of software developers. Spark 1.3 added a DataFrames API developed by Databricks that's designed to improve operations with large data sets, plus further improvements to Spark SQL.

What users say

Automatic Labs makes devices that collect driving and engine data from cars and send it via wireless connections to cloud-based servers; the San Francisco company uses Databricks Cloud on AWS to analyze the data and create reports that are sent to drivers' smartphones after they've completed trips. Robert Ferguson, director of engineering at Automatic, said he particularly likes the additional Python support provided by Spark 1.2.

"Databricks now gives us a very familiar Python notebook experience," Ferguson said. "We may have new developers that have experience with Python but not with big data, and they can work effectively with the notebooks." He added that bringing Python developers into the loop shortens development time compared with Java-centric MapReduce implementations.

Another element of Databricks Cloud that Ferguson pointed to is the user interface, which also offers benefits for SQL programmers, he said. "If you run a query using Spark SQL, you get a visualization of the data that is tremendously valuable."

Sharethrough Inc., an online advertising platform provider that's also based in San Francisco, is another Databricks Cloud user -- it runs Internet clickstream and ad-visibility data through the Spark Streaming module to power machine learning applications that analyze ad performance for clients. Russell Cardullo, the technical lead on the implementation, said the combination of Spark and Databricks Cloud enables Sharethrough to easily convert the stream processing jobs into batch programs for creating reports, saving the company considerable development time.
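The reuse Cardullo describes can be illustrated with a small plain-Python sketch. This is not Sharethrough's code and it uses no Spark API; the event format and every function name below are hypothetical. It shows one counting step shared by a streaming path, which folds it over mini-batches, and a batch path, which runs it once over the full data set.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical event format: (ad_id, event_type), with event_type "view" or "click".
Event = Tuple[str, str]


def count_events(events: Iterable[Event]) -> Counter:
    """Shared logic: count (ad_id, event_type) pairs in any collection of events."""
    return Counter(events)


def click_through_rates(counts: Counter) -> Dict[str, float]:
    """Turn accumulated counts into clicks per view for each ad that has views."""
    rates = {}
    for (ad_id, event_type), views in counts.items():
        if event_type == "view" and views > 0:
            rates[ad_id] = counts[(ad_id, "click")] / views
    return rates


def streaming_rates(batches: Iterable[List[Event]]) -> Dict[str, float]:
    """Streaming path: fold the shared counting step over each mini-batch."""
    running = Counter()
    for batch in batches:
        running.update(count_events(batch))
    return click_through_rates(running)


def batch_rates(events: List[Event]) -> Dict[str, float]:
    """Batch path: run the same counting step once over the full data set."""
    return click_through_rates(count_events(events))


events = [("ad1", "view"), ("ad1", "click"), ("ad1", "view"), ("ad2", "view")]
print(batch_rates(events))                            # prints {'ad1': 0.5, 'ad2': 0.0}
print(streaming_rates([events[:2], events[2:]]))      # prints {'ad1': 0.5, 'ad2': 0.0}
```

Because both paths call the same `count_events` and `click_through_rates` functions, moving a job from streaming to batch reporting changes only how the input is fed in, not the analysis itself.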

Sharethrough was one of the earliest users of the Databricks platform, and Cardullo said the Spark technology had some instability prior to Version 1.0 of the engine becoming available last June. "It was a little painful to upgrade each time they upgraded," he said. But Cardullo added that the software has been stable since the Spark 1.0 release.


*Runs Apache Spark on cloud-based clusters with automated management capabilities and support for importing data from Hadoop, AWS S3, relational databases and NoSQL technologies such as Cassandra and MongoDB.

*Offers an interactive and collaborative notebook format for writing Spark commands in Python, Scala or SQL, plus built-in data visualizations and dashboard development tools.

*Provides tools for scheduling workflows and setting up production pipelines that encompass data imports, ETL integration, processing and data exports.

*Supports the standard Spark API, as well as ODBC and JDBC connectivity and a native REST API for integrating third-party applications.


Databricks hasn't released detailed pricing information for Databricks Cloud, but it said there are various subscription tiers defined according to usage capacity, support model and feature set. The tiered pricing starts at a rate of "several hundred dollars per month," the vendor said.

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.

Executive editor Craig Stedman contributed to this story.
