Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up processing pipelines for integrating, preparing and analyzing large data sets, such as those found in Web analytics or big data analytics applications.
The Cloud Dataflow software expands on earlier Google parallel processing projects, including MapReduce, which originated at the company. Cloud Dataflow is designed to bring to entire analytics pipelines the kind of fast parallel execution that MapReduce brought to a single type of computation in batch processing jobs. It's based partly on MillWheel and FlumeJava, two Google-developed software frameworks aimed at large-scale data ingestion and low-latency processing.
Google Cloud Dataflow overlaps with competing software frameworks and services such as Amazon Kinesis, Apache Storm, Apache Spark and Facebook Flux. A preview of the technology was shown at the Google I/O developer conference in June 2014; at the same time, Cloud Dataflow was made available on a limited basis as part of a controlled beta program. The first version is supported by a Java software development kit (SDK), with other language support to follow.
Cloud Dataflow can take data in publish-and-subscribe mode from Google Cloud Pub/Sub middleware feeds or, in batch mode, from any database or file system. It handles data of varying sizes and structures through an abstraction called PCollections, which is short for "parallel collections." The Google Cloud Dataflow service also includes a library of parallel transforms, or PTransforms, which allow high-level programming of often-repeated tasks using basic templates; in addition, it supports developer customization of data transformations. The service optimizes processing tasks -- for example, by reducing multiple tasks into single execution passes. And it supports SQL queries via Google BigQuery, a cloud-based analytics service.
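To make the idea of collections, transforms and single-pass execution more concrete, here is a minimal single-process sketch in plain Python. It is not the Cloud Dataflow SDK and does not reproduce its API: the names ToyCollection, ToyTransform, CountingSource, map_each and keep_if are hypothetical, invented for this illustration, and the real service distributes work across many machines instead of looping in one process. The sketch only shows how a chain of transforms can be recorded first and then executed in one pass over the input.

```python
class ToyTransform:
    """Hypothetical stand-in for a PTransform: a named per-element step."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # maps one element to a list of zero or more outputs


def map_each(name, f):
    """Build a transform that emits exactly one output per input."""
    return ToyTransform(name, lambda x: [f(x)])


def keep_if(name, predicate):
    """Build a transform that passes an element through or drops it."""
    return ToyTransform(name, lambda x: [x] if predicate(x) else [])


class CountingSource:
    """Input that counts element reads, to show how often it is scanned."""

    def __init__(self, items):
        self.items = list(items)
        self.reads = 0

    def __iter__(self):
        for item in self.items:
            self.reads += 1
            yield item


class ToyCollection:
    """Hypothetical stand-in for a PCollection: immutable and deferred."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def apply(self, transform):
        # Nothing runs here; the step is only recorded in a new collection.
        return ToyCollection(self._source, self._steps + (transform,))

    def run(self):
        # All recorded steps are applied to each element in one pass over
        # the source, rather than one full scan per step.
        output = []
        for element in self._source:
            batch = [element]
            for step in self._steps:
                batch = [y for x in batch for y in step.fn(x)]
            output.extend(batch)
        return output


# Made-up sample records in the form "<page> <latency_ms>".
source = CountingSource(["/home 120", "/cart 45", "bad-line", "/home 30"])

slow_pages = (
    ToyCollection(source)
    .apply(keep_if("DropMalformed", lambda line: len(line.split()) == 2))
    .apply(map_each("Parse", lambda line: (line.split()[0], int(line.split()[1]))))
    .apply(keep_if("SlowOnly", lambda pair: pair[1] >= 40))
)

result = slow_pages.run()
print(result)
print(source.reads)
```

Although three transforms are chained, each of the four input records is read only once, which is the effect the optimization described above aims for. A naive executor would instead materialize an intermediate data set after every step and scan the data three times.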