Data quality processes have become more prominent in organizations, often as part of data governance programs. For many companies, the growing interest in quality is commensurate with an increased need to ensure that analytics data is trustworthy. That's especially true with data quality for big data; more data usually means more data problems -- including data usability issues that should be part of the quality discussion.
One of the main challenges of effective data quality management is articulating what quality really means to a company. What are commonly referred to as the dimensions of data quality include accuracy, consistency, timeliness and conformity. But there are many different lists of dimensions, and even some common terms have different meanings from list to list. As a result, solely relying on a particular list without having an underlying foundation for what you're looking to accomplish is a simplistic approach to data quality.
This challenge becomes more acute with big data. In Hadoop clusters and other big data systems, data volumes are exploding and data variety is increasing. An organization might accumulate data from numerous sources for analysis -- for example, transaction data from different internal systems, clickstream logs from e-commerce sites and streams of data from social networks.
Time to rethink things
In addition, the design of big data platforms exacerbates the potential problems. A company might create data in on-premises servers, syndicate it to cloud databases and distribute filtered data sets to systems at remote sites. This new world of processing data creates issues that aren't covered in conventional lists of data quality dimensions.
To compensate, we need to re-examine what is meant by quality in the context of a big data analytics environment. Too often, we equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values and objects that aren't accurate or up to date.
But the task of managing data quality for big data is also likely to include measures designed to help data scientists and other analysts figure out how to effectively use what we have. In other words, we must transition from simply generating a black-and-white specification of good versus bad data to supporting a spectrum of data usability.
The usability side of data quality
A focus on usability is aimed at boosting the degree to which validated data can contribute to actionable analytics results while reducing the risk of the data being misinterpreted or used improperly. Let's look at some aspects of efforts to improve big data usability that could be incorporated into data quality initiatives.
Findability. No matter how clean data sets stored in a data lake are, they won't be usable if analysts don't know they exist and can't access them. The ability to find relevant information can be increased if there's a well-organized data catalog that includes qualitative descriptions of what the data sets contain, where they're located and how to access the data.
Conformance to reference models. A reference data model provides a logical description of data entities, such as customers, products and suppliers. But unlike a conventional relational model, it abstracts the attributes and properties of the entity and its relationships, allowing for mappings among different physical representations in SQL, extensible markup language, JSON and other formats. When accumulating, processing and disseminating entity data among a cohort of source and target systems, conforming to a reference model helps ensure a consistent representation and semantic interpretation of data.
Data synchronization. In many big data environments, data sets are likely to be replicated between different platforms. As opposed to simply generating data extracts for particular users, replication synchronizes data among all the replicas. When a change is made to one copy, the modification is propagated to the others. An automated workflow of that sort helps enforce consistency and uniformity on shared data, thereby increasing its usability.
Making data identifiable. The precision of big data analytics applications can be enhanced by making it easier to accurately identify entity data so relevant data sets from different sources can be linked together for analysis. Therefore, ensuring that data is identifiable across its entire lifecycle -- from creation or capture to ingestion, integration, processing and production use -- should be a core facet of data quality for big data.
Reconsidering how data quality is framed in a big data environment to incorporate data usability measures broadens the scope of data quality processes and priorities. Alongside traditional data profiling and cleansing procedures, a usability-driven focus on data preparation, cataloging and curation offers a bigger-picture view of big data quality for organizations that are expanding their repositories -- and their use -- of analytics data.
More from David Loshin: Five steps to improved data quality processes
How big data systems change things for data management teams
Real-world examples of strategies for managing and governing data lakes