Definition

Google Cloud Dataflow

Contributor(s): Jack Vaughan

Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up processing pipelines for integrating, preparing and analyzing large data sets, such as those found in Web analytics or big data analytics applications.

The Cloud Dataflow software builds on earlier Google parallel processing projects, including MapReduce, which originated at the company. MapReduce brought fast parallel execution to one kind of computation, batch processing jobs; Cloud Dataflow is designed to bring that style of execution to entire analytics pipelines. It is based partly on MillWheel and FlumeJava, two Google-developed software frameworks aimed at large-scale data ingestion and low-latency processing.

Google Cloud Dataflow overlaps with competing software frameworks and services such as Amazon Kinesis, Apache Storm, Apache Spark and Facebook Flux. A preview of the technology was shown at the Google I/O developer conference in June 2014; at the same time, Cloud Dataflow was made available on a limited basis as part of a controlled beta program. The first version is supported by a Java software development kit (SDK), with support for other languages to follow.

Cloud Dataflow can take data in publish-and-subscribe mode from Google Cloud Pub/Sub middleware feeds or, in batch mode, from any database or file system. It represents data of varying sizes and structures as PCollections, which is short for "parallel collections." The Google Cloud Dataflow service also includes a library of parallel transforms, or PTransforms, which allow high-level programming of often-repeated tasks using basic templates; in addition, it supports developer customization of data transformations. The service optimizes processing tasks, for example by combining multiple tasks into a single execution pass. It also supports SQL queries via Google BigQuery, a cloud-based analytics service.
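To make the idea of chaining transforms and combining them into one execution pass more concrete, here is a minimal sketch in plain Python. It does not use the Cloud Dataflow SDK, and the names ToyPipeline, map, filter, run and word_lengths are hypothetical, chosen only for illustration. The sketch assumes element-wise transforms and shows how several recorded steps can be applied to each input element in a single pass, without building an intermediate collection after every step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator, List


# Hypothetical toy model for illustration only; this is NOT the Dataflow SDK.
@dataclass
class ToyPipeline:
    """Records element-wise steps, then runs them in one fused pass."""

    steps: List[Callable[[Any], Iterable[Any]]] = field(default_factory=list)

    def map(self, fn: Callable[[Any], Any]) -> "ToyPipeline":
        # A map step emits exactly one output per input element.
        self.steps.append(lambda x: [fn(x)])
        return self

    def filter(self, pred: Callable[[Any], bool]) -> "ToyPipeline":
        # A filter step emits the element unchanged, or nothing at all.
        self.steps.append(lambda x: [x] if pred(x) else [])
        return self

    def run(self, source: Iterable[Any]) -> Iterator[Any]:
        """Push each element through every step before reading the next one."""

        def apply_from(i: int, element: Any) -> Iterator[Any]:
            if i == len(self.steps):
                yield element
                return
            for out in self.steps[i](element):
                yield from apply_from(i + 1, out)

        for element in source:
            yield from apply_from(0, element)


def word_lengths(lines: Iterable[str]) -> List[int]:
    """Strip whitespace, drop empty lines, and return the remaining lengths."""
    pipeline = ToyPipeline().map(str.strip).filter(bool).map(len)
    return list(pipeline.run(lines))
```

Calling word_lengths on a small list of strings reads the input once, even though three steps are declared. A real service would also distribute the work across machines, which this sketch does not attempt.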

This was last updated in September 2014
