For data analytics and management, Spark was the word of the year. In 2015, the open source Apache analytical engine joined and, in some ways, even surpassed its big brother Hadoop at the top of the charts for big data trends. So, it isn't surprising that Spark and its impact on areas like machine learning, MapReduce, data streaming and the Hadoop file system is the central theme addressed by editors in this end-of-year edition of the Talking Data podcast.
Only a little more than five years ago, Spark was just a project idea at a University of California, Berkeley computer science lab; today, it is offered, typically along with MapReduce and Hive, by all the Hadoop distribution providers. In 2015, mainline database companies got Spark fever, too. Oracle made Spark a central element of its Cloud Platform for Big Data, Microsoft showed off Apache Spark on Azure HDInsight, and IBM said that Spark was central to its analytics strategy and that it would invest $1 billion in Spark-related development.
A clear target of the Spark movement is MapReduce, the programming framework at the heart of the original Hadoop 1.0 system. When Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator) as a resource manager, it opened a door to new data architectures, and Spark has been the most prominent of them. The podcast looks at the potential for Spark disruption to go much further than MapReduce.
Spark's utility for machine learning applications is a big part of its appeal, according to Talking Data podcast participants. They discussed, for example, NBC Universal's use of the Spark MLlib component to automate some aspects of video file management. While development of such apps is still far from being the province of typical technical mortals, Spark does seem to show promise as a means to move machine learning out of the specialized realm of the elite data scientist.
The editors do sound some notes of caution. Spark is still a somewhat raw technology, and former data analytics words of the year, such as Hadoop and MapReduce, won't quietly recede into memory, they suggest. But, in the year just past, the echo of Spark in the corridors of big data was truly resonant.