The Apache Spark processing engine has become a core component of many big data architectures, and Databricks Inc. is the prime mover behind the open source technology.
Matei Zaharia, who created Spark in 2009 when he was a graduate student at the University of California, Berkeley, co-founded Databricks four years later and is its chief technologist. The company, based in San Francisco, continues to lead development of the Spark software; it also markets a cloud-based Spark platform that became generally available in mid-2015 and competes with Spark implementations from various other vendors.
Spark can be used both alongside and apart from Hadoop, running against data in the Hadoop Distributed File System or alternative data stores, such as the Amazon Simple Storage Service (S3) in the Amazon Web Services cloud. Big data users initially leaned toward on-premises systems, but use of the cloud is increasing. Gartner says Amazon is now the largest Hadoop vendor based on number of users. And in a 2016 survey conducted by Forrester Research, 40% of 2,094 respondents said public cloud services were part of their big data plans, according to a report published last month.
In an interview with SearchDataManagement, Databricks CEO Ali Ghodsi discussed the adoption of big data systems in the cloud and other issues, including the rapid pace of Spark updates and the different technologies developed for doing stream processing with Spark. Excerpts from the interview follow.
Where do things stand on the use of Spark, Hadoop and other big data technologies in the cloud?
Ali Ghodsi: Especially starting in 2016, we've just seen an explosive increase in the rate of adoption. Virtually every company we talk to is looking at the cloud. The question for many is how fast to move over and do analytics in the cloud. Some are already there -- and we're talking about large, Fortune 500 companies. Others are still thinking about it. But that's much different than it was two years ago.
Why do you think that's happening?
Ghodsi: No. 1, the prices [of cloud services] have become extremely competitive. Amazon alone has made 40 price cuts over the past three years -- storing data in S3 on Amazon is much cheaper than doing it in a Hadoop system in your own data center. The second thing is that, two years ago, we used to hear a lot of questions [from prospective customers] about the security of their data. That has changed, and now we hear that people want to move into the cloud to improve their security.
The third thing is that the rate of innovation is faster in the cloud. You can stay at the forefront [on technologies] much more easily. If you're on premises, you're maybe updating your systems once or twice a year; in the cloud, it's a matter of months or weeks [between updates]. Databricks does new releases in the cloud every week, and there's nothing special about Databricks [in that regard]. It's just the nature of the distribution cycle in the cloud.
Is there any possibility that you'll release an on-premises version of your Spark processing platform?
Ghodsi: That's a question we revisit every year, but right now, we're seeing so much demand in the cloud that we're just trying to keep up with it there.
Ghodsi: We would slow down [on development] if the adoption pace by customers fell off. I think it's a good thing overall -- it shows that there's a lot of innovation going on in Spark. For us at Databricks, we release Spark updates in the cloud a lot more often than that, and we don't see any problems with adoption.
On-premises deployment is slower -- a lot of on-premises users are still on Spark 1.5 or 1.6, as opposed to Spark 2.1. But I don't think there will be major changes [in the development process].
As an open source technology, is it good for Spark to have Databricks driving so much of its development?
Ghodsi: Many other companies contribute, so we're not the only one. It is true that we're the main driving force behind probably all of the major initiatives -- they're initiated by Databricks employees, but then with a lot of other contributors.
On the flip side, if you have two equally strong vendors involved in an open source project, that can lead to a lot of politics and stalemate. We have close to half the committers on Spark, but it's far from being a single-vendor thing. It's a healthy project.
Several stream processing technologies have been introduced for Spark: Spark Streaming, Structured Streaming, now Drizzle. How do they all fit together in the Spark processing framework?
Ghodsi: You don't have to make any changes to use Drizzle -- it's an optimizer that just makes Spark's streaming engines faster. However, in the Spark 2 series, Structured Streaming supersedes Spark Streaming. In Spark 2.0, we went in and modified the core engine of Spark and made it a streaming engine out of the box [via Structured Streaming]. It's the same fundamental engine now for batch jobs and streaming. There are still some improvements happening for Spark Streaming, but all of the exciting new developments are in Structured Streaming. We're spending very little energy on Spark Streaming.
Do you still use microbatching for data streaming in the new engine, as in Spark Streaming?
Ghodsi: In Structured Streaming, that's completely removed -- there's no microbatching anymore in any of the APIs. Under the hood, there is some, but the high-level message is that microbatching is gone from the APIs. Spark now works exactly the same on streams and batch data. That's a big change.
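The idea Ghodsi describes, one query definition that runs unchanged on batch and streaming data while any microbatching stays hidden in the engine, can be illustrated with a toy sketch. This is plain Python, not Spark's API; every name here (`count_by_key`, `run_batch`, `run_stream`, `batch_size`) is hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List

# Toy illustration only: hypothetical names, not Spark's API.


def count_by_key(records: Iterable[str], state: Counter) -> Counter:
    """The 'query': count occurrences of each record, updating running state."""
    state.update(records)
    return state


def run_batch(records: List[str]) -> Counter:
    """Batch mode: the whole bounded dataset is processed in one pass."""
    return count_by_key(records, Counter())


def run_stream(source: Iterator[str], batch_size: int = 2) -> Iterator[Counter]:
    """Streaming mode: the runner chunks input into microbatches internally
    and yields a snapshot of the result after each chunk. The query function
    itself is the same one used in batch mode."""
    state: Counter = Counter()
    chunk: List[str] = []
    for record in source:
        chunk.append(record)
        if len(chunk) == batch_size:
            yield Counter(count_by_key(chunk, state))  # copy as a snapshot
            chunk = []
    if chunk:  # flush any trailing partial chunk
        yield Counter(count_by_key(chunk, state))


events = ["click", "view", "click", "buy", "view"]
batch_result = run_batch(events)
stream_results = list(run_stream(iter(events)))
```

In this sketch the caller writes `count_by_key` once; whether it is driven over a complete list or over an iterator in small chunks is a detail of the runner, and the final streaming snapshot matches the batch result.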