Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. The quality of data is determined by factors such as accuracy, completeness, reliability, relevance and how up to date it is. As data has become more intricately linked with the operations of organizations, data quality has drawn greater attention.
Why data quality is important
Poor-quality data is often pegged as the source of inaccurate reporting and ill-conceived strategies in a variety of companies, and some have attempted to quantify the damage done. Economic damage due to data quality problems can range from added shipping expenses when packages go to wrong addresses, all the way to steep regulatory compliance fines for improper financial reporting.
An oft-cited estimate originating from IBM suggests the yearly cost of data quality issues in the U.S. during 2016 alone was about $3.1 trillion. Business managers' lack of trust in data quality is also commonly cited as a chief impediment to decision-making.
The problem of poor data quality was particularly common in the early days of corporate computing, when most data was entered manually. Even as more automation took hold, data quality issues remained prominent. For a number of years, deficient data quality was epitomized by stories of meetings at which department heads sorted through conflicting spreadsheet numbers that ostensibly described the same activity.
Determining data quality
Aspects, or dimensions, important to data quality include: accuracy, or correctness; completeness, which determines if data is missing or unusable; conformity, or adherence to a standard format; consistency, or lack of conflict with other data values; and duplication, or repeated records.
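Several of these dimensions can be measured programmatically. The following is a minimal sketch, assuming a hypothetical set of customer records and an invented phone-number format standard, of how completeness, conformity and duplication rates might be computed:

```python
import re

# Hypothetical customer records; "phone" is expected to follow a
# NNN-NNNN format standard (an assumed convention for this example).
records = [
    {"id": 1, "name": "Ana", "phone": "555-0101", "age": 34},
    {"id": 2, "name": "Ben", "phone": "5550102",  "age": 41},   # conformity violation
    {"id": 3, "name": "",    "phone": "555-0103", "age": 29},   # completeness violation
    {"id": 1, "name": "Ana", "phone": "555-0101", "age": 34},   # duplicate of record 1
]

PHONE_FORMAT = re.compile(r"^\d{3}-\d{4}$")

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def conformity(rows, field, pattern):
    """Fraction of rows whose field matches the expected format."""
    return sum(1 for r in rows if pattern.match(str(r.get(field, "")))) / len(rows)

def duplication(rows, key):
    """Fraction of rows that repeat an earlier row's key."""
    seen, dupes = set(), 0
    for r in rows:
        if r[key] in seen:
            dupes += 1
        seen.add(r[key])
    return dupes / len(rows)
```

Real data quality tools implement far richer versions of such metrics, but the per-dimension score-as-a-fraction pattern is common.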
As a first step toward data quality, organizations typically perform data asset inventories in which the relative value, uniqueness and validity of data can undergo baseline studies. Established baseline ratings for known good data sets are then used for comparison against data in the organization going forward.
Methodologies for such data quality projects include the Data Quality Assessment Framework (DQAF), which was created by the International Monetary Fund (IMF) to provide a common method for assessing data quality. The DQAF provides guidelines for measuring data dimensions that include timeliness, in which actual times of data delivery are compared to anticipated data delivery schedules.
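A timeliness measurement of the kind the DQAF describes reduces to comparing actual delivery timestamps against a schedule. A simple sketch, assuming a hypothetical data feed that is expected daily by 06:00:

```python
from datetime import datetime, timedelta

# Hypothetical schedule: the feed is expected by 06:00 on May 1, 2024.
expected = datetime(2024, 5, 1, 6, 0)

# Hypothetical actual delivery times observed for two runs of the feed.
actual_deliveries = [
    datetime(2024, 5, 1, 5, 40),   # delivered early
    datetime(2024, 5, 1, 6, 25),   # delivered late
]

def delay_minutes(actual, scheduled):
    """Minutes late relative to schedule; zero or negative means on time."""
    return (actual - scheduled) / timedelta(minutes=1)
```

Aggregating such delays over time (for example, average lateness per feed) gives the kind of timeliness metric that can be tracked against a target.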
Data quality management
Several steps typically mark data quality efforts. In a data quality management cycle identified by data expert David Loshin, data quality management begins with identifying and measuring the effect poor data quality has on business outcomes. Rules are defined, performance targets are set, and quality improvement methods as well as specific data cleansing, or data scrubbing, and enhancement processes are put in place. Results are then monitored as part of ongoing measurement of the use of the data in the organization. This virtuous cycle of data quality management is intended to ensure that overall data quality keeps improving after initial data quality efforts are completed.
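The define-rules, cleanse, then monitor portion of such a cycle can be sketched in a few lines. This is an illustrative toy, not Loshin's methodology itself; the rule, the records and the blunt drop-invalid-rows cleansing step are all assumptions for the example:

```python
# Hypothetical rule: an age must be a whole number between 1 and 119.
def rule_valid_age(record):
    return isinstance(record.get("age"), int) and 0 < record["age"] < 120

def cleanse(rows, rule):
    """Drop records that violate the rule -- a blunt form of data scrubbing.
    Real cleansing would more often correct or quarantine records instead."""
    return [r for r in rows if rule(r)]

def conformance_rate(rows, rule):
    """Ongoing monitoring metric: the share of records passing the rule."""
    return sum(1 for r in rows if rule(r)) / len(rows)

# Hypothetical records: one valid, two violating the rule.
people = [{"age": 34}, {"age": -5}, {"age": 200}]
scrubbed = cleanse(people, rule_valid_age)
```

Measuring the conformance rate before and after cleansing, and again on each new batch of data, is what turns a one-off cleanup into the ongoing cycle described above.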
Software tools specialized for data quality management match records, delete duplicates, establish remediation policies and identify personally identifiable data. Management consoles for data quality support creating rules for data handling to maintain data integrity, discovering data relationships and automating the data transforms that may be part of quality control efforts.
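At their core, record matching and duplicate deletion rest on normalizing records into a comparable key. A minimal sketch, assuming hypothetical customer records and a simple name-plus-phone matching key (commercial tools use far more sophisticated fuzzy matching):

```python
def normalize(record):
    """Canonical matching key: lowercased name plus digits-only phone."""
    name = record["name"].strip().lower()
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)

def deduplicate(records):
    """Keep the first record for each matching key; drop later duplicates."""
    seen, kept = set(), []
    for r in records:
        key = normalize(r)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

# Hypothetical input: the first two rows describe the same person,
# differing only in capitalization and phone formatting.
customers = [
    {"name": "Ana Diaz", "phone": "555-0101"},
    {"name": "ana diaz", "phone": "(555) 0101"},
    {"name": "Ben Ito",  "phone": "555-0102"},
]
```

Normalizing before comparing is what lets the two differently formatted "Ana Diaz" rows collapse into one.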
Collaborative views and workflow enablement tools have become more common, giving data stewards, who are charged with maintaining data quality, views into corporate data repositories. These tools and related processes are often closely linked with master data management (MDM) systems that have become part of many data governance efforts.
Data quality management tools include IBM InfoSphere Information Server for Data Quality, Informatica Data Quality, Oracle Enterprise Data Quality, Pitney Bowes Spectrum Technology Platform, SAP Data Quality Management and Data Services, SAS DataFlux and others.
Emerging data quality challenges
For a long time, data quality efforts centered on the governance of relational data in organizations, but that began to change as web and cloud computing architectures rose to prominence.
Unstructured data, text, natural language processing and object data became part of the data quality mission. The variety of data was such that data experts began to assign different degrees of trust to various data sets, forgoing approaches that took a single, monolithic view of data quality.
Also, the classic issues of garbage in/garbage out that drove data quality efforts in early computing resurfaced with artificial intelligence (AI) and machine learning applications, in which data preparation often became the most demanding of data teams' resources.
The higher volume and speed of arrival of new data also became a greater challenge for the data quality steward.
Expansion of data's use in digital commerce, along with ubiquitous online activity, has only intensified data quality concerns. While errors from rekeying data are largely a thing of the past, dirty data remains a common nuisance.
Protecting the privacy of individuals' data became a mild concern for data quality teams beginning in the 1970s, growing to become a major issue with the spread of data acquired via social media in the 2010s. With the formal implementation of the General Data Protection Regulation (GDPR) in the European Union (EU) in 2018, the demands for data quality expertise were expanded yet again.
Fixing data quality issues
With GDPR and the risks of data breaches, many companies now find they must fix data quality issues.
The first step toward fixing data quality requires identifying all the problem data. Software can be used to perform a data quality assessment that verifies data sources are accurate, measures how much data there is and gauges the potential impact of a data breach. From there, companies can build a data quality program with the help of data stewards, data protection officers or other data management professionals. These data management experts help implement business processes that ensure future data collection and use meets regulatory guidelines and provides the value that businesses expect from the data they collect.
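The assessment step above amounts to profiling each data source: how many records it holds and how complete its critical fields are. A minimal sketch, assuming a hypothetical source of records with invented `email` and `consent` fields:

```python
def profile(rows, required_fields):
    """Basic data quality assessment: volume plus per-field completeness."""
    report = {"row_count": len(rows)}
    for field in required_fields:
        filled = sum(1 for r in rows if r.get(field) not in (None, ""))
        report[field + "_complete"] = filled / len(rows) if rows else 0.0
    return report

# Hypothetical source data with gaps in both required fields.
source = [
    {"email": "a@example.com", "consent": "yes"},
    {"email": "",              "consent": "yes"},
    {"email": "b@example.com", "consent": ""},
]
```

A report like this, run per source, gives a data quality program its baseline and highlights which fields need remediation first.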