Big data makes data preparation steps more complicated to navigate

One-size-fits-all approach may not work in preparing data

Fundamental data management processes can get shoved to the side in these go-go days of big data systems and self-service analytics. Cleansing data, preparing data, governing data -- those tasks all seem a bit quaint when the talk turns to predictive analytics and machine learning. But ignoring them, or not planning sufficiently for them, is a perilous path.

On the other hand, data preparation has to walk a fine line to meet the different needs of individual data scientists and analysts -- some of whom may want to dig into raw data, while others just want to see fully prepared data sets.

Effective data management is more important than ever when companies treat data as a corporate asset and competitive tool, says Bill Schmarzo, CTO for Dell EMC's big data consulting services unit. "Data is still crap; it still has holes in it," Schmarzo notes. "There's a lot of data cleansing, alignment and organizing that needs to be done." His general advice, though, is to leave the final stages of prepping data and building data models to data scientists, in what he describes as a "schema on query" approach.

However, that process isn't always so cut and dried for companies. Medical insurer Health Care Service Corp. (HCSC) does have applications that involve analysts preparing data themselves -- for example, actuaries tailoring claims data for their own analytical uses. But Andy Ashta, executive director of information architecture and data management at HCSC, says his team also takes upfront data preparation steps to create ready-to-analyze "business products" in the insurer's Hadoop cluster.

This handbook offers more insight and advice on how to manage the data preparation process so that it can, as Ashta puts it, enable "multiple users to do different sorts of analysis."