
Big data pipeline drives change in Hadoop architecture, development

A move to fast data analytics and real-time processing could bring change to the big data pipeline. Microservices are edging into a mostly monolithic Hadoop domain.

Open source, Hadoop-style data development has gained considerable attention over the years, but it has also taken a long time to find mainstream enterprise adoption. It has great value as a way to handle two of the big 'Vs' of big data -- that is, massive volumes and highly varied data -- but it also has plenty of complexities, being, in effect, a major change in the way data processing systems are built.

The complexity became no less challenging as a third big 'V' became a more familiar part of the equation. That's velocity, in the form of fast-arriving data that companies want to act on quickly. Velocity is the driver behind the wide interest in a big data pipeline and the move from batch-only to real-time processing and operational data analytics.

As in earlier eras, much of the history of computing over the last 10 years has been about dealing imaginatively with complexity. Within the developer community, containers, the Docker platform and microservices have arisen to provide a loosely coupled and lightweight service-oriented architecture for that purpose. Some of those efforts are now turning to the data side of development, meaning Hadoop-style programming may be in for a bit of a shakeup, maybe even on a greater scale than was encountered when Hadoop 2 opened the doors to a wider set of distributed Hadoop add-ons.

Container excitement

Today, we are seeing a general onslaught of real-time analytics applications that combine different framework pieces. Spark, Kafka and Cassandra are among the most common, but many more make up the streaming big data pipeline. Containers and microservices are finding favor over monolithic architectures for numerous reasons, not the least of which is that it is a complex task to get these pieces working together, and then to change and update them once they are working.

Signs of containers and microservices for big data purposes are becoming more prevalent. BlueData recently created enhancements for managing Docker containers used to create Hadoop and Spark clusters on its EPIC platform. MapR disclosed a reference architecture that applies microservices techniques to streaming and real-time analytics applications. And Pachyderm, which produces a containerized processing engine for versioned data, took first place among contenders in the recent Strata + Hadoop World 2016 Startup Showcase.

Among the closely watched players is Mesosphere, which has built an operating system that runs containers and data services. The company's product is built on Apache Mesos, an open source distributed system kernel or cluster manager. Like Spark, Mesos came out of research at the University of California, Berkeley. It has found use at Airbnb, Apple, Twitter and elsewhere.

Data points

Perhaps it shouldn't be surprising that Spark and Mesos container services are showing up together, as Spark originator Matei Zaharia was also among the Berkeley bunch connected with the inception of what would become Mesos. In fact, a study of nearly 500 Mesos users commissioned by Mesosphere this year found that 43% of those surveyed were using Spark on Mesos. The survey also found that 32% of respondents used Kafka and 24% used Cassandra on the Mesos platform.

Another young firm intently involved with data analytics and microservices is Lightbend, a company whose founders include the creators of Scala, the language behind both Spark and Akka. Akka is a toolkit and runtime that is central to distributed communications in Apache Flink, which is another emerging element in the big data pipeline.

Lightbend sees a strong link between the use of containers and frameworks for fast, real-time data streaming applications. A survey the company conducted earlier this year of 2,100 developers running applications on the Java Virtual Machine showed that 34% of respondents did most of their data processing in real time, while 22% were doing equal amounts of batch and real-time processing.

Among respondents running microservices in production, 30% were using Kafka and 21% were running Spark Streaming, again confirming the association between the new data processing frameworks and service and container approaches. In the survey, both of these frameworks trailed yet another entry, Akka Streams, which was used by 35% of respondents.

As in the first days of Hadoop, there is a caveat here: It's early. For data microservices, much of the new infrastructure still needs to be built out. That will involve a bit of trial and error, as teams find out both what works and what doesn't.

In addition, the trademark traits of data management -- "stateful" sessions and data persistence -- are still new in the microservices realm. For the Hadoop development style, which looks to supplant proprietary data warehouses with more dynamic open source systems, there is one more river to cross.

