NEW YORK -- In the rush to capitalize on deployments of big data platforms, organizations shouldn't neglect data quality measures that can ensure the info used in analytics applications is clean and trustworthy, experienced IT managers said at the 2017 Strata Data Conference here last week.
Several speakers pointed to data quality as a big challenge in their big data environments -- one that required new processes and tools to get a handle on quality issues as more data flows into corporate data lakes and more data scientists and other analysts put it to use.
"The more of the data you produce is used, the more important it becomes, and the more important data quality becomes," said Michelle Ufford, manager of core innovation for data engineering and analytics at Netflix Inc. "But it's very, very difficult to do it well -- and when you do it well, it takes a lot of time."
Over the past 12 months, Ufford's team worked to streamline the Los Gatos, Calif., company's data quality measures as part of a broader effort to boost data engineering efficiency based on a "simplify and automate" mantra, she said during a Strata session.
A starting point for the data-quality-upgrade effort was "acknowledging that not all data sets are created equal," she noted. In general, ones with high levels of usage get more data quality checks than lightly used ones do, according to Ufford, but trying to stay on top of that "puts a lot of cognitive overhead on data engineers." In addition, it's hard to spot problems just by looking at the metadata and data-profiling statistics that Netflix captures in an internal data catalog, she said.
Calling for help on data quality
To ease those burdens, Netflix developed a custom data quality tool, called Quinto, and a Python library, called Jumpstarter, which are used together to generate recommendations on quality coverage and to set automated rules for assessing data sets. When data engineers run Spark-based extract, transform and load (ETL) jobs to pull in data on use of the company's streaming media service for analysis, transient object tables are created in separate partitions from the production tables, Ufford said. Calls are then made from the temporary tables to Quinto to do quality checks before the ETL process is completed.
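The sequence Ufford described (load into a temporary table, run quality checks, then finish the job) can be illustrated with a minimal Python sketch. It is not Netflix's code and does not use the Quinto or Jumpstarter interfaces, which the talk did not detail; the `Table` class, the `run_etl` function and the two sample checks are hypothetical names chosen for illustration, and plain in-memory lists stand in for Spark tables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


@dataclass
class Table:
    """Hypothetical stand-in for a partitioned production table."""
    partitions: Dict[str, List[Row]] = field(default_factory=dict)


def not_empty(rows: List[Row]) -> bool:
    """Sample check: the load produced at least one row."""
    return len(rows) > 0


def no_null_user_ids(rows: List[Row]) -> bool:
    """Sample check: every row carries a user_id value."""
    return all(row.get("user_id") is not None for row in rows)


def run_etl(production: Table, partition_key: str,
            extracted_rows: List[Row], checks: List[Check]) -> List[str]:
    """Stage the rows, audit them, and publish only if every check passes.

    Returns the names of the failed checks; an empty list means the
    partition was published to the production table.
    """
    # Step 1: write to a temporary copy that is separate from production.
    staging = list(extracted_rows)

    # Step 2: run the quality checks against the temporary copy.
    failed = [check.__name__ for check in checks if not check(staging)]
    if failed:
        # Production is left untouched when any check fails.
        return failed

    # Step 3: publish the audited rows as a production partition.
    production.partitions[partition_key] = staging
    return []
```

Calling `run_etl` with a row whose `user_id` is missing returns `["no_null_user_ids"]` and leaves the production table unchanged, which is the point of checking quality before the ETL process is completed.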
In the future, Netflix plans to expand the statistics it tracks when profiling data and implement more robust anomaly detection capabilities that can better pinpoint "what is problematic or wrong" in data sets, Ufford added. The ultimate goal, she said, is making sure data engineering isn't a bottleneck for the analytics work done by Netflix's BI and data science teams and its business units.
Improving data consistency was one of the goals of a cloud-based data lake deployment at Financial Industry Regulatory Authority Inc., an organization in Washington, D.C., that creates and enforces rules for financial markets. Before the big data platform was set up, fragmented data sets in siloed systems made it hard for data scientists and analysts to do their jobs effectively, said John Hitchingham, director of performance engineering at the not-for-profit regulator, more commonly known as FINRA.
A homegrown data catalog, called herd, was "a real key piece for making this all work," Hitchingham said in a presentation at the conference. FINRA collects metadata and data lineage info in the catalog; it also lists processing jobs and related data sets there, and it uses the catalog to track schemas and different versions of data in the big data architecture, which runs in the Amazon Web Services (AWS) cloud.
To help ensure the data is clean and consistent, Hitchingham's team runs validation routines after the data is ingested into Amazon Simple Storage Service (S3) and registered in the catalog. The validated data is then written back to S3, completing a process that he said also reduces the amount of ETL processing required to normalize and enrich data sets before they're made available for analysis.
Data quality takes a business turn
The analytics team at Ivy Tech Community College in Indianapolis also does validation checks as data is ingested into its AWS-based big data system -- but only to make sure the data matches what's in the source systems from which it's coming. The bulk of the school's data quality measures are now carried out by individual departments in their own systems, said Brendan Aldrich, Ivy Tech's chief data officer.
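A check that ingested data matches its source can be as simple as comparing row counts and content fingerprints on both sides. The Python sketch below shows one way to do that; it is an assumption about how such a check might look, not a description of Ivy Tech's implementation, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Row = Dict[str, object]


def row_digest(row: Row) -> str:
    """Hash one row in a canonical form so that key order does not matter."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def table_fingerprint(rows: List[Row]) -> Tuple[int, str]:
    """Return the row count and an order-independent hash of all rows."""
    digests = sorted(row_digest(row) for row in rows)
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined


def matches_source(source_rows: List[Row], ingested_rows: List[Row]) -> bool:
    """True when the ingested copy has the same rows as the source system."""
    return table_fingerprint(source_rows) == table_fingerprint(ingested_rows)
```

A check of this kind only confirms that nothing was lost or altered in transit. It says nothing about whether the source values are correct, which is why the school leaves that cleansing to the departments that own the front-end systems.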
"Data cleansing is a never-ending process," Aldrich said in an interview before speaking at the conference. "Our goal was, rather than getting on that treadmill, why not engage users and get them involved in cleansing the data where it should be done, in the front-end systems?"
That process started taking shape when Ivy Tech, which operates 45 campuses and satellite locations across Indiana, deployed the cloud platform and Hitachi Vantara's Pentaho data integration and BI software in late 2013 to give its business users self-service analytics capabilities. And it was cemented in July 2016 when the college hired a new president who mandated that business decisions be based on data, Aldrich said.
The central role data plays in decision-making gives departments a big incentive to ensure information is accurate before it goes into the analytics system, he added. As a result, data quality problems are being found and fixed more quickly now, according to Aldrich. "Even if you're cleansing data centrally, you usually don't find [an issue] until someone notices it and points it out," he said. "In this case, we're cleansing it faster than we were before."