Data quality processes have become more prominent in organizations, often as part of data governance programs.
For many companies, the growing interest in quality is commensurate with an increased need to ensure that analytics data is trustworthy. That's especially true with data quality for big data; more data usually means more data problems.
One of the main challenges of effective data quality management is articulating what quality really means to a company. What are commonly referred to as the dimensions of data quality include accuracy, consistency, timeliness and conformity. But there are many different lists of dimensions, and even some common terms have different meanings from list to list. As a result, solely relying on a particular list without having an underlying foundation for what you're looking to accomplish is a simplistic approach to data quality.
This challenge becomes more acute with big data. In Hadoop clusters and other big data systems, data volumes are exploding, and data variety is increasing. An organization might accumulate data from numerous sources for analysis -- for example, transaction data from different internal systems, clickstream logs from e-commerce sites and streams of data from social networks.
Time to rethink things
Additionally, the design of big data platforms exacerbates the potential problems. A company might create data in on-premises servers, syndicate it to cloud databases and distribute filtered data sets to systems at remote sites. This new world creates issues that aren't covered in conventional lists of data quality dimensions.
To compensate, we need to re-examine what is meant by quality in the context of a big data analytics environment. Too often, we equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values or objects that aren't accurate or up to date.
But managing data quality for big data is also likely to include measures designed to help data scientists and other analysts figure out how to effectively use what we have. In other words, we must transition from simply generating a black-and-white specification of good versus bad data to supporting a spectrum of data usability.
The usability side of data quality
A focus on usability is aimed at boosting the degree to which validated data can contribute to actionable analytics results while reducing the risk of the data being misinterpreted or used improperly. Here are some aspects of efforts to improve big data usability that could be incorporated into data quality initiatives.
Findability. No matter how clean data sets stored in a data lake are, they won't be usable if analysts don't know they exist and can't access them. The ability to find relevant information can be increased if there's a well-organized data catalog that contains qualitative descriptions of what's contained in the data sets, where they're located and how to access the data.
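A minimal sketch of what such catalog entries might look like, assuming a simple in-memory catalog; the entry fields, data set names and the `find` helper are invented for illustration, not any particular catalog product's API:

```python
# Hypothetical, minimal data catalog: each entry pairs a qualitative
# description of a data set with its location and access details.
catalog = [
    {
        "name": "clickstream_2023",
        "description": "Raw clickstream logs from the e-commerce site",
        "location": "s3://datalake/clickstream/2023/",
        "access": "read via Spark; request grants from the data team",
        "tags": ["clickstream", "web", "raw"],
    },
    {
        "name": "orders_cleansed",
        "description": "Cleansed transaction data from internal order systems",
        "location": "s3://datalake/orders/cleansed/",
        "access": "read-only SQL endpoint",
        "tags": ["transactions", "cleansed"],
    },
]

def find(keyword):
    """Return names of catalog entries whose description or tags mention the keyword."""
    kw = keyword.lower()
    return [e["name"] for e in catalog
            if kw in e["description"].lower()
            or kw in (t.lower() for t in e["tags"])]

print(find("clickstream"))  # -> ['clickstream_2023']
```

Even a sketch this small illustrates the point: analysts search descriptions and tags, not file paths, so relevant data sets surface regardless of where they physically live.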
Conformance to reference models. A reference data model provides a logical description of data entities, such as customers, products and suppliers. But unlike a conventional relational model, it abstracts the attributes and properties of the entity and its relationships, allowing for mappings among different physical representations in SQL, XML, JSON and other formats. When accumulating, processing and disseminating entity data among a cohort of source and target systems, conforming to a reference model helps ensure a consistent representation and semantic interpretation of data.
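To make the idea concrete, here is a hedged sketch of one logical reference entity rendered into two physical formats; the `customer` attributes and mapping functions are illustrative assumptions, not a standard:

```python
import json

# Hypothetical logical reference model for a "customer" entity: abstract
# attribute names and types, independent of any storage format.
CUSTOMER_MODEL = {"customer_id": str, "name": str, "email": str}

def validate(record):
    """Check that a record conforms to the reference model."""
    return (set(record) == set(CUSTOMER_MODEL)
            and all(isinstance(record[k], t) for k, t in CUSTOMER_MODEL.items()))

def to_json(record):
    """One physical representation: a JSON document."""
    return json.dumps(record, sort_keys=True)

def to_sql_insert(record):
    """Another physical representation: a SQL INSERT statement."""
    cols = ", ".join(sorted(record))
    vals = ", ".join(f"'{record[k]}'" for k in sorted(record))
    return f"INSERT INTO customer ({cols}) VALUES ({vals});"

rec = {"customer_id": "C042", "name": "Ada Lovelace", "email": "ada@example.com"}
assert validate(rec)  # the record conforms to the logical model
print(to_json(rec))
print(to_sql_insert(rec))
```

Because both mappings derive from the same logical model, a record that validates once can be emitted consistently to every target format.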
Data synchronization. In many big data environments, data sets are likely to be replicated between different platforms. As opposed to simply generating data extracts for particular users, replication keeps data synchronized among all the replicas. When a change is made to one copy, the modification is propagated to the others. Such synchronization helps enforce consistency and uniformity on shared data, thereby increasing its usability.
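The propagation mechanism can be sketched in a few lines; this is a deliberately simplified, single-process illustration (real replication protocols must also handle ordering, failures and conflicts), and the class and method names are invented:

```python
# Hypothetical sketch: replicas are registered with a primary copy, and a
# change applied to the primary is propagated to every replica so all
# copies stay consistent.
class ReplicatedDataset:
    def __init__(self, data):
        self.data = dict(data)   # the primary copy
        self.replicas = []

    def add_replica(self):
        """Create a new replica seeded with a full copy of the primary."""
        replica = dict(self.data)
        self.replicas.append(replica)
        return replica

    def update(self, key, value):
        """Apply a change to the primary, then propagate it to all replicas."""
        self.data[key] = value
        for replica in self.replicas:
            replica[key] = value

primary = ReplicatedDataset({"status": "active"})
cloud_copy = primary.add_replica()    # e.g., a cloud database
remote_copy = primary.add_replica()   # e.g., a remote-site system
primary.update("status", "archived")
assert cloud_copy["status"] == remote_copy["status"] == "archived"
```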
Making data identifiable. The precision of big data analytics applications can be enhanced by making it easier to accurately identify entity data, so relevant data sets from different sources can be linked together for analysis. Therefore, ensuring that data is identifiable across its entire lifecycle -- from creation or capture to ingestion, integration, processing and production use -- should be a core facet of data quality for big data.
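One common way to make entity data identifiable across sources is to derive a stable identifier from normalized attributes. The sketch below assumes name and email suffice to identify a customer, which is an illustrative simplification; real identity resolution usually needs richer matching rules:

```python
import hashlib

# Hypothetical sketch: derive a stable identifier from normalized entity
# attributes so records from different source systems can be linked.
def entity_id(name, email):
    """Normalize attributes, then hash them into a short stable ID."""
    normalized = f"{name.strip().lower()}|{email.strip().lower()}"
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

# Records from two source systems that refer to the same customer,
# with differing whitespace and capitalization:
crm_record = {"name": "Ada Lovelace ", "email": "ADA@example.com"}
web_record = {"name": "ada lovelace", "email": "ada@example.com"}

# Both records resolve to the same identifier and can be linked.
assert entity_id(**crm_record) == entity_id(**web_record)
```

Applying the same derivation at creation, ingestion and processing time keeps the identifier consistent across the data's entire lifecycle.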
Reconsidering how data quality is framed in a big data environment to incorporate usability measures broadens the scope of data quality processes. Alongside traditional data profiling and cleansing procedures, a usability-driven focus on data preparation, cataloging and curation offers a bigger-picture view of big data quality for organizations that are expanding their repositories -- and use -- of analytics data.
More from David Loshin:
- Five steps to improved data quality processes
- How big data systems change things for data management teams
- Real-world examples of strategies for managing and governing data lakes