Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.
Spark became a top-level project of the Apache Software Foundation in February 2014, and version 1.0 of Apache Spark was released in May 2014. Spark version 2.0 was released in July 2016.
The technology was initially designed in 2009 by researchers at the University of California, Berkeley as a way to speed up processing jobs in Hadoop systems.
Spark Core, the heart of the project, provides distributed task dispatching, scheduling and I/O functionality. It gives programmers a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied. Spark's developers say it can run jobs up to 100 times faster than MapReduce when data is processed in memory, and up to 10 times faster on disk.
How Apache Spark works
Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and SQL-based data stores such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.
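The fallback from memory to disk can be pictured with a toy model. The sketch below is plain Python, not Spark code: `SpillingStore`, its `max_items_in_memory` budget and its `put`, `get` and `close` methods are hypothetical names invented for illustration, and Spark's actual memory management is considerably more sophisticated.

```python
import os
import pickle
import tempfile


class SpillingStore:
    """Hypothetical block store: holds blocks in memory until a budget is
    reached, then writes further blocks to disk. Not part of Spark."""

    def __init__(self, max_items_in_memory):
        self.max_items = max_items_in_memory
        self.items_in_memory = 0
        self.memory = {}   # block id -> list held in RAM
        self.spilled = {}  # block id -> path of the pickled block on disk
        self.tmpdir = tempfile.TemporaryDirectory()

    def put(self, block_id, block):
        block = list(block)
        if self.items_in_memory + len(block) <= self.max_items:
            # The block fits within the memory budget.
            self.memory[block_id] = block
            self.items_in_memory += len(block)
        else:
            # The block does not fit, so fall back to disk-based storage.
            path = os.path.join(self.tmpdir.name, f"block-{block_id}.pkl")
            with open(path, "wb") as f:
                pickle.dump(block, f)
            self.spilled[block_id] = path

    def get(self, block_id):
        # Callers read a block the same way wherever it is stored.
        if block_id in self.memory:
            return self.memory[block_id]
        with open(self.spilled[block_id], "rb") as f:
            return pickle.load(f)

    def close(self):
        # Remove the temporary directory and any spilled files.
        self.tmpdir.cleanup()


store = SpillingStore(max_items_in_memory=5)
store.put("a", [1, 2, 3])     # fits in memory
store.put("b", [4, 5, 6, 7])  # would exceed the budget, so it goes to disk
total = sum(store.get("a")) + sum(store.get("b"))
print(sorted(store.memory), sorted(store.spilled), total)
store.close()
```

The point of the sketch is that the caller uses `get` the same way for both blocks; only the access cost differs, which is why jobs that fit in memory tend to run faster.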
The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or which computational resources are used to store or retrieve files.
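To make the idea of a partitioned collection more concrete, here is a minimal sketch in plain Python. It is a toy model and not the Spark API: the `ToyDataset` class and its `from_items`, `map`, `filter` and `reduce` methods are hypothetical names chosen for illustration, and a thread pool stands in for the worker nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyDataset:
    """Toy stand-in for an RDD: a collection split into partitions.

    Hypothetical class for illustration only; it is not part of Spark.
    """

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions=4):
        items = list(items)
        # Round-robin split; the caller never chooses where an item lands.
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each partition is transformed independently, as a worker would do.
        with ThreadPoolExecutor() as pool:
            parts = pool.map(lambda part: [fn(x) for x in part], self.partitions)
            return ToyDataset(parts)

    def filter(self, predicate):
        return ToyDataset(
            [x for x in part if predicate(x)] for part in self.partitions
        )

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partial
        # results. fn should be associative and commutative (e.g. addition).
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)


squares = ToyDataset.from_items(range(1, 11), num_partitions=3).map(lambda x: x * x)
even_total = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
print(even_total)  # prints 220
```

The calling code only describes what to compute (`map`, `filter`, `reduce`) and never mentions a partition or a worker, which is the kind of complexity the RDD abstraction is meant to hide.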
In addition, Spark can handle more than the batch processing applications that MapReduce is limited to running.
The Spark Core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data. Aside from the Spark Core processing engine, the Apache Spark API environment comes packaged with several code libraries for use in data analytics applications. These libraries include:
- Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language.
- Spark Streaming -- This library enables users to build applications that analyze and present data in real time.
- MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
- GraphX -- A built-in library of algorithms for graph-parallel computation.
Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with APIs for Java and Python. Java is not considered an optimal language for data engineering or data science, so many users rely on Python, which is simpler and more geared toward data analysis.
There is also an R package, SparkR, that users can run with Spark. This enables users to run the popular desktop data science language on larger distributed data sets in Spark and to use it to build applications that leverage machine learning algorithms.
Apache Spark use cases
The wide range of Spark libraries and its ability to compute data from many different types of data stores mean Spark can be applied to many different problems in many industries. Digital advertising companies use it to maintain databases of web activity and design campaigns tailored to specific consumers. Financial companies use it to ingest financial data and run models to guide investing activity. Consumer goods companies use it to aggregate customer data and forecast trends to guide inventory decisions and spot new market opportunities.
Large enterprises that work with big data applications use Spark because of its speed and its ability to tie together multiple types of databases and to run different kinds of analytics applications. As of this writing, Spark has the largest open source community in big data, with over 1,000 contributors from over 250 organizations.