Hadoop and the Spark data processing engine now share the spotlight at Strata + Hadoop World, which focuses on big data management and analytics technologies. And the Strata + Hadoop World 2016 conferences held in the U.S., one in the spring and one in the fall, took place at a noteworthy time in the evolution of both Hadoop and Spark.
Hadoop turns 10 this year, at least by some measures. For example, The Apache Software Foundation created a separate open source subproject for managing Hadoop development in January 2006, and the first public release of Hadoop code followed that April. Now the distributed processing framework is at something of a crossroads, as its original core components get augmented -- and perhaps supplanted -- by other technologies. In particular, MapReduce, initially Hadoop's all-in-one cluster resource manager and programming and processing environment, is being shunted aside by a combination of Spark, SQL-on-Hadoop query engines and the YARN resource management platform underpinning Hadoop 2. Users also have alternatives to the Hadoop Distributed File System (HDFS) for storing data -- for example, the Kudu columnar data store that Hadoop distribution market leader Cloudera Inc. introduced at the 2015 Strata conference in New York for use in real-time analytics applications involving streaming data.
Meanwhile, a 2.0 version of the Apache Spark open source software was released in July 2016 with updates to its stream processing, machine learning and Spark SQL modules, plus a promised performance boost. Hadoop and Spark are often paired together in deployments, with the latter being used to accelerate the processing of data stored in the former. But Spark can also run on its own against data in other platforms, such as NoSQL database systems or cloud-based Amazon Simple Storage Service implementations. Spark proponents and some industry analysts foresee a possible future in which today's Hadoop ecosystem is joined by one surrounding Spark.
New Spark and Hadoop developments, and the relationship between the two technologies, were prominent discussion topics at Strata + Hadoop World 2016 in both San Jose, Calif., in March and New York in September. Numerous presentations on big data trends and best practices for managing deployments were also in the spotlight at the conferences, which were jointly organized by Cloudera and O'Reilly Media Inc. In the sections below, you'll find our coverage of the two events, plus stories from last year's Strata conferences and other content on Hadoop, Spark and related technologies for managing and analyzing pools of big data.
1. Strata conference stories
Reporting from Strata + Hadoop World 2016 and 2015
Strata + Hadoop World includes a mix of sessions featuring IT managers, data scientists and data engineers from user organizations, as well as CTOs, software developers and other representatives from big data vendors. This section compiles news, trend and feature stories based on presentations and interviews at the 2015 and 2016 Strata conferences.
Comcast Corp. and other companies are turning to real-time processing and analytics technologies like Apache Kudu to help find useful information in streams of big data.
As users continue to shift big data management and analytics apps to the cloud, vendors are working to ease the process -- and lower the price -- of migrating Hadoop operations.
Hadoop-based applications are increasingly including Spark Streaming, Kafka and other components that enable real-time streaming analytics capabilities. Find out why.
Will Spark replace MapReduce and diminish the role of the Hadoop framework in the enterprise? Hadoop co-creator Doug Cutting weighed in at Strata + Hadoop World 2016.
Initially a critical cog in Hadoop clusters, MapReduce is being reduced in stature by newer technologies -- a sense that was palpable at Strata + Hadoop World 2015 in New York.
We asked attendees at the 2015 Strata conference in New York whether they see the Spark processing engine more as a complement to Hadoop or a possible alternative to it.
Hadoop distribution vendors Cloudera and Hortonworks announced new technologies that look beyond the Hadoop Distributed File System as a data store for some applications.
In a Strata session, the CTO of retail giant Walmart's Walmart.com unit detailed its use of a Hadoop-based repository to drive several applications mixing online and in-store data.
The Strata event usually focuses on what users can do with big data technologies. But one speaker in New York recommended disconnecting at times to do some creative thinking.
A plan by some Hadoop vendors to create an Open Data Platform initiative sparked competing claims between them and nonparticipants at Strata + Hadoop World 2015 in San Jose.
Strata attendees said Hadoop, Spark and other big data analytics and management tools are helping to ease processing limitations that have held back machine learning applications.
Some companies are involving corporate lawyers in analytics efforts to ensure that sensitive customer data isn't abused. But simply locking down data can hamstring data analysts.
At Strata 2015 in San Jose, analytics exec Pamela Peele said one of the key people on her team is a former journalist hired to help communicate analytical findings to business managers.
2. Big data trends
New developments in Hadoop, Spark and related technologies
Big data platforms are evolving rapidly, with some open source technologies getting four or more releases annually. Even Hadoop, now 10 years on from its initial development, is far from settled as a technology. This section includes stories on recent news, trends and user deployments involving Hadoop and Spark, as well as other big data technologies.
Online marketing services provider Sellpoints Inc. is using Spark, including the technology's SQL programming module, to prepare incoming streams of Web activity data for analysis.
Consultant Thomas W. Dinsmore sees Spark reaching a new level of maturity, one that includes more realistic assessments of the performance improvements it can provide.
Hadoop vendor Hortonworks is creating separate release streams for the big data framework's core components and fast-evolving technologies such as Spark, Hive and HBase.
Several users who spoke at Spark Summit East 2016 in New York discussed their reasons for deploying Spark, including its ability to outpace MapReduce on many Hadoop batch jobs.
At this year's Spark Summit East event, Spark creator Matei Zaharia detailed what's coming in the next version of the technology, including enhancements to its Spark Streaming module.
Emerging SQL-on-Hadoop query engines open up data in Hadoop to the legions of programmers with SQL skills, potentially enabling increased Hadoop adoption by organizations.
Spark is primarily known for pairing up with Hadoop, but connectors that tie it to NoSQL databases are also being tapped by users looking to analyze operational data in real time.
In a Q&A as Hadoop reached an initial 10-year milestone, co-creator Doug Cutting discussed user adoption, development priorities and what things will look like in another five years.
3. Big data definitions
Terms you'll hear at Strata + Hadoop World 2016
Read the definitions included in this section to learn more about big data technologies, techniques and processes.