With Spark 3.0 and its new query capabilities, Databricks boasts its most powerful release yet.
Speaking at the Spark + AI Summit 2020 on June 24, Matei Zaharia, CTO of Databricks and creator of Apache Spark, outlined the evolution of Spark over its decade of existence and highlighted the innovations in Spark 3.0. He noted that it is Databricks' largest release, with more than 3,000 patches from the community.
The biggest change in Spark 3.0 is the new Adaptive Query Execution (AQE) feature in the Spark SQL query engine, Zaharia said. With AQE, Spark's SQL engine can now update the execution plan for computation at runtime, based on the observed properties of the data.
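As a rough illustration of what turning this on can look like, the sketch below collects a few AQE-related Spark SQL settings and renders them as `spark-submit` arguments. The configuration keys are Spark SQL properties; the `AQE_SETTINGS` name and the `to_spark_submit_args` helper are hypothetical and exist only for this example. It assumes a job launched through `spark-submit` and is not a recommended production configuration.

```python
# Spark SQL properties related to Adaptive Query Execution (AQE).
# Assumption: AQE is enabled explicitly here, since the Spark 3.0
# release ships with spark.sql.adaptive.enabled set to false.
AQE_SETTINGS = {
    # Master switch: re-optimize the query plan at runtime.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions based on observed sizes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split heavily skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_spark_submit_args(settings):
    """Turn a settings dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted after `spark-submit`.
    print(" ".join(to_spark_submit_args(AQE_SETTINGS)))
```

The same keys could instead be set on a session or in a cluster configuration; the helper only shows one way to pass them along.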
"This [AQE] makes it much easier to run Spark because you don't need to configure these things in advance, so it will actually adapt and optimize based on your data and also leads to better performance in many cases," Zaharia said.
Spark 3.0 and Delta Engine
Spark 3.0, released on June 18, is also the basis for the new Delta Engine, which Reynold Xin, co-founder and chief architect at Databricks, detailed in a keynote at the June 23-24 virtual conference.
Delta Engine is a high-performance query engine for Delta Lake. Delta Engine takes Spark 3.0 and integrates additional capabilities for Delta Lake workloads, including a caching layer and a query optimizer.
With Spark 3.0, Delta Engine and Delta Lake, Databricks is trying to better enable data teams, Databricks CEO and co-founder Ali Ghodsi said in his keynote.
"Every company wants to be a data company," Ghodsi said. "If you think about it what that actually means -- it requires a new way of empowering people working with data, enabling them to organize around the data they need to collaborate and get to the answers they need more quickly."
Brewing data on Delta Lake
Databricks introduced Delta Lake in 2019. It has already been adopted by large organizations, including Starbucks.
In a conference session on June 24, Vish Subramanian, director of data and analytics engineering at the Seattle-based coffee giant, outlined how Starbucks uses Delta Lake and Spark to enable data-driven decisions. The company draws on both real-time and historical transactional data to feed its reporting applications.
Starbucks built its own data analytics platform, called BrewKit, on a foundation of Microsoft Azure and Databricks Delta Lake.
"Delta Lake has now helped us build out our historical data and live data aggregations together, to make sure we are now giving our store partners real-time insights on data based on history and on current time," Subramanian said.
Starbucks now has petabytes of data on Delta Lake, with hundreds of data pipelines built on Spark to enable business insights.
"Overall our strategic view has been to commoditize data ingestion to such an extent so that the teams can focus on business problems up the value chain rather than focusing on how to move data from point A to point B," Subramanian said.
Improving data quality with Delta Lake at Cerner Corp.
During another session, Madhav Agni, lead software engineer at electronic health record vendor Cerner, outlined how the organization has benefited from Delta Lake. Based in Kansas City, Mo., Cerner is one of the biggest EHR vendors.
Cerner pulls data from many different sources into a data lake and needed to ensure data quality, as well as to analyze and effectively use the data, Agni said.
Delta Lake is an open source data layer that brings ACID transactions to Apache Spark workloads, which Cerner has used to enable integrated data analysis from data stored in its data lakes.
Another key attribute of Delta Lake that has helped Cerner improve data quality is a feature called "time travel," a data versioning system that enables users to see what data looked like at a point in time.
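The idea behind time travel can be sketched with a toy, in-memory table that keeps an append-only log of commits and can be read as of any earlier version. This is a conceptual illustration only, not how Delta Lake is implemented; the `VersionedTable` class, its methods and the sample rows are all hypothetical. In Delta Lake itself, an earlier snapshot is requested through read options such as `versionAsOf` or `timestampAsOf`.

```python
import copy


class VersionedTable:
    """Toy append-only commit log; each commit creates a new readable version."""

    def __init__(self):
        self._log = []  # one entry per committed batch of rows

    def commit(self, rows):
        """Append a batch of rows as one unit and return the new version number."""
        self._log.append(copy.deepcopy(list(rows)))
        return len(self._log) - 1

    def read(self, version_as_of=None):
        """Return the rows visible at the given version (latest if None)."""
        if not self._log:
            return []
        latest = len(self._log) - 1
        version = latest if version_as_of is None else version_as_of
        if not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        snapshot = []
        for batch in self._log[: version + 1]:
            snapshot.extend(batch)
        # Copy so callers cannot alter committed history.
        return copy.deepcopy(snapshot)


if __name__ == "__main__":
    table = VersionedTable()
    v0 = table.commit([{"id": 1, "value": "a"}])
    v1 = table.commit([{"id": 2, "value": "b"}])
    print(table.read(version_as_of=v0))  # rows as they stood at the first version
    print(table.read())                  # rows at the latest version
```

Because earlier commits are never rewritten, a reader can compare the current data with any past version, which is what makes this kind of versioning useful for auditing data quality.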