There are many ways an enterprise can benefit from adopting a data catalog. At their core, data catalogs provide a centralized way to organize information about data sources across the enterprise. This is growing in importance as enterprises look for ways to make sense of data from a variety of sources to understand the business, create new analytics and develop AI applications.
What is an enterprise data catalog?
The underlying goal of a data catalog is to capture and store metadata, which is data about data. This can include where it came from, what it describes, how it has been used and its quality and reliability. Data catalogs index this information to make the data easier to discover.
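As a rough illustration of what capturing and indexing metadata can look like, the sketch below stores a few metadata fields per data set and builds a simple keyword index for discovery. The `CatalogEntry` class, its field names and the sample data sets are all hypothetical; real catalog products use far richer models.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data set (all field names are illustrative)."""
    name: str
    source: str            # where the data came from
    description: str       # what the data describes
    quality_score: float   # hypothetical 0.0-1.0 reliability rating
    tags: list[str] = field(default_factory=list)


def build_index(entries):
    """Map each lowercase keyword to the names of the entries that mention it."""
    index = defaultdict(set)
    for entry in entries:
        words = entry.description.lower().split() + [t.lower() for t in entry.tags]
        for word in words:
            index[word].add(entry.name)
    return index


# Hypothetical sample entries, not taken from any real system.
entries = [
    CatalogEntry("crm_customers", "CRM export", "customer contact records", 0.9, ["pii"]),
    CatalogEntry("web_orders", "e-commerce database", "customer order history", 0.8),
]
index = build_index(entries)
matches = sorted(index["customer"])  # names of data sets described with "customer"
```

Searching the index by keyword returns the matching data set names, which is the basic discovery step a catalog automates at much larger scale.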
In many ways, data catalogs haven't changed much in 20 years, said Thomas LaRock, head geek at SolarWinds, an IT service management tools provider. Although the core idea is the same, LaRock has seen changes in data discovery techniques that can support new data catalog use cases, including the ability to identify and classify potentially sensitive data.
Other improvements have been in tagging data with information about data custodians, which can improve data catalog use cases involving collaboration across teams.
Why a data catalog is important
All modern BI tools, cloud platforms and data discovery applications include some type of data cataloging capability that provides basic visibility within their own environments.
"But rarely are all of your data assets stored and managed in a single environment or repository," said Chandra Papudesu, vice president of product management, catalog and lineage at Collibra, a data intelligence company.
A centralized data catalog can break down data silos and serve as a system of record for data across the enterprise. It can also add a layer of governance on top of these data sources to improve security and compliance with privacy mandates such as GDPR and the California Consumer Privacy Act (CCPA).
"Governed data catalogs that combine easy access to trusted data with compliance can drive widespread and confident adoption within an enterprise without the fear of repercussions," Papudesu said.
This makes it easier for everyone from business users to data scientists to discover, evaluate, trust and access data of all types across the enterprise.
Here are some of the most interesting data catalog use cases.
1. Personalized medicine
Healthcare systems are awash in data relating to patients from a variety of systems, including diagnostic equipment, doctors' notes, billing systems and -- increasingly -- wearable devices, which are all collected and managed differently.
"A data catalog empowers data scientists to provide new services to the hospital and shows how it supports the implementation of new processes to comply with data privacy and security regulations," said Fernando Velez, vice president and chief data technologist at Persistent Systems, an IT and services consultancy.
Velez is working with various medical providers to develop personalized medicine data catalog use cases. One project is improving the detection of a patient's risk of breast cancer. In this project, a data catalog provides a single point of reference across the hospital for existing patient data, as well as new data sets. The resulting risk prediction data set is also cataloged, classified and given data lineage.
2. Data lake modernization
Many organizations store data from numerous sources across the enterprise in raw form in a data lake with only the minimum level of metadata required for data governance. Papudesu said this can hinder adoption of the data across the enterprise because it may be difficult for users to find, understand and access the data from the data lake.
By adding a governed data catalog on top of the data, business analysts and data scientists can easily access data when they need it. They can also see where it came from and how it transforms as it flows across different applications. This can boost the use of the data lake, reduce duplicate data sets and reduce compliance risks.
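Seeing where data came from amounts to walking a lineage graph. The following is a minimal sketch, assuming lineage is recorded as a mapping from each data set to the data sets it was derived from; the data set names and the `upstream_sources` helper are hypothetical.

```python
# Hypothetical lineage graph: each data set maps to the data sets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_clean"],
    "sales_clean": ["sales_raw", "currency_rates"],
    "sales_raw": [],
    "currency_rates": [],
}


def upstream_sources(dataset, lineage):
    """Return every data set that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return seen


origins = upstream_sources("sales_dashboard", LINEAGE)
```

Here `origins` contains the cleaned table plus both raw inputs, which is the kind of answer an analyst needs before trusting a dashboard figure.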
3. Eliminate duplicate data spending
Many large organizations constantly buy large volumes of third-party data for advertising, marketing and credit risk management purposes.
"However, different lines of business end up purchasing the same data due to organizational silos and decentralized data procurement processes," Papudesu said.
A data catalog can provide a central repository and a standardized acquisition process for third-party data, which makes it easier to compare all external data sets and identify redundancies. It can also help data managers encode and automatically enforce data sharing policies and agreements for this data.
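One way to picture the redundancy check: once every purchase is registered centrally, finding duplicates is a grouping exercise. The sketch below assumes each purchase record carries a vendor, a data set name and the buying business unit; the records and the `find_redundant_purchases` function are invented for illustration.

```python
from collections import defaultdict

# Hypothetical purchase records registered in a central catalog.
purchases = [
    {"business_unit": "marketing", "vendor": "Acme Data", "dataset": "consumer demographics"},
    {"business_unit": "credit risk", "vendor": "Acme Data", "dataset": "Consumer Demographics "},
    {"business_unit": "advertising", "vendor": "Beta Feeds", "dataset": "ad impressions"},
]


def find_redundant_purchases(records):
    """Group purchases by normalized (vendor, dataset) and keep those bought more than once."""
    buyers = defaultdict(set)
    for record in records:
        # Normalize case and whitespace so trivially different spellings still match.
        key = (record["vendor"].strip().lower(), record["dataset"].strip().lower())
        buyers[key].add(record["business_unit"])
    return {key: sorted(units) for key, units in buyers.items() if len(units) > 1}


redundant = find_redundant_purchases(purchases)
```

In this sample, the same vendor data set turns up under both marketing and credit risk, so it is flagged as a candidate for a single shared license.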
4. Cloud modernization
Enterprises are accelerating their cloud migration in the wake of COVID-19. One challenge is that many cloud services come with their own metadata management tools optimized for each cloud or service. Another challenge is that enterprises need to be mindful of how and where certain types of sensitive data sets are physically stored.
Papudesu said he and his team are starting to see many enterprises turn to data catalogs to improve the visibility of data across on-premises, cloud and hybrid environments. This data catalog use case also makes it easier to identify high-value data sets that should be prioritized for speed and reuse based on usage and lineage. A data catalog can also help track the technical lineage of data to ensure it is intact and secure and ensure no data is lost during a move.
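A simple way to check that nothing was lost in a move is to compare a fingerprint of the data before and after migration. This is only a sketch under the assumption that both copies fit in memory as lists of rows; the `fingerprint` helper and the sample rows are hypothetical, and production migrations typically rely on tooling built for this purpose.

```python
import hashlib


def fingerprint(rows):
    """Return (row count, order-independent SHA-256 digest) for a list of rows."""
    digest = hashlib.sha256()
    # Sort the serialized rows so the result does not depend on storage order.
    for serialized in sorted(repr(tuple(row)) for row in rows):
        digest.update(serialized.encode("utf-8"))
        digest.update(b"\n")  # delimiter between rows
    return len(rows), digest.hexdigest()


# Hypothetical rows before and after a migration.
on_premises = [(1, "alice"), (2, "bob"), (3, "carol")]
cloud_copy = [(3, "carol"), (1, "alice"), (2, "bob")]   # same rows, different order
truncated_copy = [(1, "alice"), (2, "bob")]             # one row missing

migration_ok = fingerprint(on_premises) == fingerprint(cloud_copy)
migration_lost_data = fingerprint(on_premises) != fingerprint(truncated_copy)
```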
5. Self-service analytics
One important data catalog use case is democratizing data across the enterprise. In many enterprises, data is spread across departments and stored in various systems. As a result, organizations struggle to organize, maintain and use their data effectively and efficiently.
The data catalog can provide a central portal for finding and accessing data across these data silos. This makes it easy for users to understand what data is available, where it comes from, how it is used and whether it is trustworthy.
A data catalog can also enable users to find the trustworthy, predefined and preapproved data they need to do their jobs without waiting in the IT queue. This can increase productivity and accelerate time to insights because users spend less time searching for data and more time working on analytics and sharing findings, Papudesu said.
6. Discovering sensitive data
LaRock said the most interesting data catalog use case is discovering sensitive data that a business didn't know existed. Customer details, payment information, even passwords stored in plain text are sometimes discovered in systems that people have forgotten.
"The last thing you want is to be slapped with a GDPR fine because you had no idea what data you were storing," LaRock said.
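Sensitive data discovery often starts with pattern matching over sampled column values. The sketch below is a deliberately simplified example: the two regular expressions, the `classify_columns` function and the sample table are assumptions made for illustration, and real classifiers combine many more signals and validation checks.

```python
import re

# Simplified, illustrative patterns; they will miss cases and raise false positives.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def classify_columns(table):
    """Given {column name: sample string values}, label columns that match a pattern."""
    findings = {}
    for column, values in table.items():
        labels = {
            label
            for label, pattern in PATTERNS.items()
            for value in values
            if pattern.search(value)
        }
        if labels:
            findings[column] = sorted(labels)
    return findings


# Hypothetical sample pulled from a forgotten system.
sample = {
    "notes": ["call back Tuesday", "reach me at jo@example.com"],
    "payment_ref": ["4111 1111 1111 1111"],
    "city": ["Oslo", "Lima"],
}
findings = classify_columns(sample)
```

The free-text `notes` column is flagged for containing an email address, which is exactly the sort of overlooked field that tends to surprise data owners.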
7. Cloud spend management
As enterprises move to the cloud, IT departments are struggling to analyze and understand usage patterns and trends of cloud services.
This information can come from dozens to hundreds of data sources, which makes it incredibly hard to identify and piece together for someone to consume, said Travis Rehl, vice president of product at CloudCheckr, a cloud management tools provider.
A data catalog can make it easier to stitch this information together so it can be accessed and analyzed for business cost analysis. A data catalog can also set up a translation layer for the source data so users can make appropriate comparisons of costs across different cloud providers and services.
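A translation layer of this kind can be as simple as a per-provider field mapping into one common schema. In the sketch below, the provider names, the billing field names and the `normalize` function are all hypothetical and do not reflect any vendor's actual export format.

```python
# Hypothetical mapping from each provider's billing fields to a common schema.
FIELD_MAPS = {
    "provider_a": {"service": "product_name", "cost": "unblended_cost", "currency": "currency_code"},
    "provider_b": {"service": "meter_category", "cost": "pre_tax_cost", "currency": "billing_currency"},
}


def normalize(provider, record):
    """Translate one provider-specific billing record into the common schema."""
    mapping = FIELD_MAPS[provider]
    return {
        "provider": provider,
        "service": record[mapping["service"]],
        "cost": float(record[mapping["cost"]]),
        "currency": record[mapping["currency"]],
    }


# Invented sample records in two different source formats.
raw = [
    ("provider_a", {"product_name": "object storage", "unblended_cost": "12.50", "currency_code": "USD"}),
    ("provider_b", {"meter_category": "object storage", "pre_tax_cost": 9.75, "billing_currency": "USD"}),
]
normalized = [normalize(provider, record) for provider, record in raw]
cost_by_provider = {row["provider"]: row["cost"] for row in normalized}
```

Once both records share the same fields, comparing the cost of equivalent services across providers becomes a straightforward lookup.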