In the big data era, the task of preparing data for analytics often falls to the data scientists looking to analyze it. But IT and analytics teams are increasingly moving to reduce that burden by doing at least some data preparation work upfront as the data is ingested into systems or shortly thereafter.
That typically requires a mixed approach of applying schemas and formatting to data sets for some users while leaving the raw data available for others who want to fully prep it themselves. Balancing those different needs complicates the data management process. But a flexible strategy on data preparation for analytics can help broaden the use of big data environments beyond top-level data scientists, according to IT and analytics managers who have implemented such strategies.
For example, the big data platform team at Discover Financial Services Inc. initially left credit card transaction data, customer records and other types of information as is when pulling them into a Hortonworks-based Hadoop cluster for analysis.
"It was sort of a salad bar concept before," said Santosh Bardwaj, vice president of advanced analytics and decision platforms at the Riverwoods, Ill., company. But it was hard for Discover's analysts to fully avail themselves of all the different data ingredients they were presented with, Bardwaj added during a session at DataWorks Summit 2017 in San Jose, Calif. "We realized that we had to put in some sort of standard schema so people can consume the data."
The raw data is still stored in its original state, but it's made available with a set of schemas and light data modeling applied to make it easier to query. Bardwaj's team also provides a more-enriched version of the data with business logic and metadata built in to further streamline data preparation work for some users. Those steps have opened up a lot more of the available data for analysis, he noted.
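The idea of keeping raw data untouched while exposing a typed, queryable view of it can be illustrated in a few lines. The following is a minimal sketch, not Discover's implementation: the record fields, the `SCHEMA` mapping and the `apply_schema` helper are all hypothetical names chosen for illustration.

```python
from datetime import date
from decimal import Decimal

# Hypothetical raw records, kept exactly as ingested (every value is a string).
RAW_TRANSACTIONS = [
    {"txn_id": "1001", "amount": "42.50", "txn_date": "2017-06-13"},
    {"txn_id": "1002", "amount": "7.25", "txn_date": "2017-06-14"},
]

# Hypothetical schema: column name -> function that parses the raw string.
SCHEMA = {
    "txn_id": int,
    "amount": Decimal,
    "txn_date": date.fromisoformat,
}


def apply_schema(record, schema):
    """Return a typed copy of one raw record; the raw record is not modified."""
    return {column: parse(record[column]) for column, parse in schema.items()}


# The typed view is derived on read; RAW_TRANSACTIONS stays in its original state.
typed_view = [apply_schema(record, SCHEMA) for record in RAW_TRANSACTIONS]
total = sum(row["amount"] for row in typed_view)
```

Because the schema is applied to a copy, users who want the original strings can still read `RAW_TRANSACTIONS` directly, while others query `typed_view`.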
Going with the flow on data prep
Additionally, Discover is working to deploy a flow-based setup to automate extract, transform and load (ETL) processes on incoming data via the Apache Spark processing engine, enabling analysts to build their own data pipelines. Currently, that's done by "a handful of data engineers who are very well-versed in Spark" and can hand-code ETL jobs, Bardwaj said. "But we don't think hand coding is a way to scale things."
GoPro Inc.'s big data architecture team has set up a similar automated process that uses a data definition language (DDL) syntax to add a table-based schema on the fly to data streaming in from the company's wearable cameras, as well as other internal and external data being collected in a cloud-based Hadoop and Spark system.
The customized "dynamic DDL" approach makes analytics-ready data available to GoPro's data scientists within minutes, or even seconds, of it being ingested, said Hao Zou, a software engineer at the San Mateo, Calif., company. Zou added that the data scientists don't want to do what he described as the "tedious job" of data preparation for analytics themselves.
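GoPro's code isn't described in detail, but the general technique of generating DDL from incoming records can be sketched as follows. This is an assumption-laden illustration only: the `TYPE_NAMES` mapping, the `infer_columns` and `build_ddl` helpers, and the sample device fields are hypothetical, and the generated statements follow Hive-style `CREATE TABLE` and `ALTER TABLE ... ADD COLUMNS` syntax without any identifier escaping.

```python
import json

# Hypothetical mapping from Python types to Hive-style column type names.
TYPE_NAMES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def infer_columns(record):
    """Map each field of a decoded JSON object to a column type name."""
    return {
        name: TYPE_NAMES.get(type(value), "STRING")
        for name, value in record.items()
    }


def build_ddl(table, known_columns, record):
    """Return (statement, columns) so the table can hold the record.

    The statement is None when the record introduces no new columns.
    """
    incoming = infer_columns(record)
    if not known_columns:
        cols = ", ".join(f"{name} {kind}" for name, kind in incoming.items())
        return f"CREATE TABLE {table} ({cols})", incoming
    new = {n: k for n, k in incoming.items() if n not in known_columns}
    if not new:
        return None, known_columns
    cols = ", ".join(f"{name} {kind}" for name, kind in new.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})", {**known_columns, **new}


# Two hypothetical streamed events; the second one carries a new field.
first = json.loads('{"device_id": "cam-1", "battery": 87}')
stmt1, columns = build_ddl("events", {}, first)
second = json.loads('{"device_id": "cam-1", "battery": 85, "temp_c": 31.5}')
stmt2, columns = build_ddl("events", columns, second)
```

In this sketch the schema grows as new fields appear in the stream, which is one way data could become queryable shortly after ingestion without a hand-written table definition.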
Biotechnology company CSL Behring is deploying a Hadoop-based platform to pull together manufacturing data from plants in the U.S., Australia, Germany and Switzerland for analysis. Mark Baker, a senior business systems architect who's in charge of the big data infrastructure, said he's doing some upfront work to harmonize the data as it comes in -- for example, removing umlauts and other language-specific marks and characters to avoid data consistency problems.
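One common way to remove umlauts and similar marks is Unicode decomposition followed by dropping the combining characters. The sketch below uses Python's standard `unicodedata` module; the `strip_marks` helper and the sample strings are hypothetical, and Baker's actual tooling is not described. Characters that have no decomposition, such as the German "ß", pass through unchanged and would need their own mapping.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (umlauts, accents) after Unicode decomposition."""
    # NFD splits a character such as "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks, which are dropped.
    kept = [ch for ch in decomposed if not unicodedata.combining(ch)]
    return unicodedata.normalize("NFC", "".join(kept))


# Hypothetical sample values as they might appear in plant data.
cleaned = [strip_marks(value) for value in ["Zürich", "Genève", "Straße"]]
```

A different convention, transliterating "ü" to "ue," is also widely used for German text; which one fits depends on how the downstream systems match values.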
Beyond that, the data is left in its raw state during the ingestion process; some "very high-end" data scientists want to work with the raw data, Baker explained. But he then runs ETL jobs to prepare data sets for use by other analysts at CSL Behring, based in King of Prussia, Pa. "What they tell me is how they want their data, and then it's my job to put it in the form they want to see it in," said Baker, who uses Spark and other tools to process the data and load it into tables in either the Apache Hive or HBase repositories.
Prepared data at your command
The big data and data architecture teams at Land O'Lakes Inc. are also taking a proactive approach to data preparation for analytics applications. A Hadoop-based data lake is initially being used to feed website clickstream, internet search and social media data to a "digital command center" system for the agricultural cooperative's marketing department. But the raw data is pulled together under a common schema in Hive tables to support marketing analytics and campaign management.
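Pulling differently shaped sources under one schema usually comes down to a per-source field mapping. The following is a minimal sketch under assumed names: the `FIELD_MAPS` table, the `to_common` helper, and every field name are hypothetical and are not taken from the Land O'Lakes system.

```python
# Hypothetical per-source mappings: common column -> field name in that source.
FIELD_MAPS = {
    "clickstream": {"event_time": "ts", "user": "visitor_id", "detail": "page_url"},
    "search": {"event_time": "searched_at", "user": "uid", "detail": "query"},
    "social": {"event_time": "posted_at", "user": "handle", "detail": "text"},
}


def to_common(source, record):
    """Reshape one raw record into the shared column layout."""
    row = {"source": source}
    for column, field in FIELD_MAPS[source].items():
        # Missing fields become None instead of failing the whole load.
        row[column] = record.get(field)
    return row


# A hypothetical raw search record reshaped into the common layout.
row = to_common(
    "search",
    {"searched_at": "2017-06-13T10:00:00", "uid": "u1", "query": "butter"},
)
```

Adding a new source, such as shipping data, would then mean adding one more mapping rather than changing the queries that read the common columns.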
Chakra Sankaraiah, senior manager of big data and advanced analytics at Land O'Lakes, said the Arden Hills, Minn., company's marketers use data from numerous systems to plan online marketing campaigns. "You can't just leave it at the raw level," he said. "You have to build something where you can make it accessible."
The data preparation plan will be tailored for other analytics needs as the data lake expands, starting with the addition of distribution and shipping data. Currently, "we shape the data to answer questions for digital marketing analysis," said Dwayne Beberg, who manages data architecture as the company's director of business intelligence. "But the next questions may not be concerned with that."