A majority of companies struggle with wrangling and organizing their data. More than 60% of respondents surveyed in O'Reilly's State of Data Quality 2020 said they suffered from too many data sources and inconsistent data, making that the most common data quality issue cited. The second was disorganization in data stores and a lack of metadata, which nearly half of respondents cited.
A data catalog could help resolve such problems. A data catalog uses metadata to create an inventory of data assets held by an organization. Users can search a data catalog for the data they need, organize it, manage it and understand its lineage. As such, it helps support not only data discovery, but data governance as well.
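To make the idea concrete, here is a minimal sketch of a metadata-driven inventory with search and lineage lookup. It does not reflect how any particular product works. The class names, fields and sample datasets (`DatasetEntry`, `DataCatalog`, `raw_orders` and so on) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one data asset (hypothetical fields)."""
    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)  # names of source datasets


class DataCatalog:
    """In-memory inventory of data assets, keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of datasets whose name, description or tags match."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )

    def lineage(self, name: str) -> list[str]:
        """Return every upstream ancestor of a dataset, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].upstream) if name in self._entries else []
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


# Hypothetical sample inventory.
catalog = DataCatalog()
catalog.register(DatasetEntry("raw_orders", "Unprocessed order events", "ingest-team", {"sales"}))
catalog.register(DatasetEntry("customers", "Customer master records", "crm-team", {"crm"}))
catalog.register(DatasetEntry("clean_orders", "Deduplicated order events", "data-eng", {"sales"}, ["raw_orders"]))
catalog.register(DatasetEntry("sales_report", "Monthly revenue summary", "finance", {"finance"}, ["clean_orders", "customers"]))

print(catalog.search("orders"))
print(catalog.lineage("sales_report"))
```

Production catalogs typically harvest this metadata automatically from source systems and persist it in a database. The sketch only shows the basic shape of the inventory, the search and the lineage lookup.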
Despite the ever-increasing value of data and the importance of good data governance programs, many organizations don't have data catalogs, but that has been changing. Mordor Intelligence predicted that the data catalog market will grow at a compound annual rate of 25.7% between 2021 and 2026.
Enterprise data and technology executives have an expanding list of providers selling data catalogs, and they can also pick from multiple open source data catalogs now available.
Although the number of open source data catalog options continues to grow, experts advise enterprise leaders to carefully weigh the benefits and challenges of opting for one. Experts also recommend that enterprise leaders consider what business reasons are driving their decisions on which data catalog to use.
But open source data catalog options aren't for everyone.
"The data catalogs, when fully implemented and integrated, help to form the foundation of identifying, classifying and further understanding data assets," said Matt McGivern, managing director in the IT consulting group at Protiviti.
Benefits of open source data catalogs
Like other open source software, open source data catalogs offer a number of benefits to organizations that decide to use them.
McGivern noted that open source data catalogs offer potential benefits when they're used with open data and in other environments with easy data integration. But those benefits depend on the sources you're looking to ingest into the data catalog.
"Enterprises can look at their existing technology stack and align those with open source catalog products to see if these are synergies for use," he said. Additionally, he said, organizations may want to consider open source if they haven't already invested in other technologies.
"Open source [data] catalogs are offering lower upfront costs for organizations as they initially launch or undertake proof of values for these areas," McGivern said.
A lower entry cost is only one potential selling point. Other benefits with open source data catalogs include their agility, flexibility, scalability and transparency, all of which mirror the benefits of open source software in general. Open source data catalogs can offer organizations the ability to fully craft solutions to fit their exact enterprise needs and strategic objectives.
Organizations with unique or complicated architectures will likely find that open source data catalogs give them the flexibility and extensive customization they need to build an effective solution, said Andy Neill, chief enterprise architect and senior research director for data and analytics at Info-Tech Research Group.
Other organizations opt for open source because they're looking to develop a product based on their data catalog and need an application they can truly make their own, Neill said.
"If you're doing something very specific in the data catalog world -- perhaps selling it on the marketplace -- you may need an open source data catalog," Neill said.
An open source data catalog can also be beneficial if you have an elastic search engine across your data catalog or if you need to share data with users who enrich it or share it further, he said.
Research entities or biotech firms tend to be good candidates for open source data catalogs given the intensity of their data use, he added.
Challenges of open source data catalogs
Organizations also contend with a number of challenges using open source data catalogs. Those challenges include the following:
- the absence of ready-to-use capabilities, features and functions;
- the need for higher levels of technical expertise on staff to build, customize, integrate, maintain and support the end product;
- the need to commit time to contribute back to the open source community;
- limited documentation and information; and
- highly limited to nonexistent support.
Consider all of these challenges before adopting an open source data catalog. What you save in upfront costs, you may later spend on training if your team members don't have the necessary expertise to manage it.
"Open source is cheap, but it requires a level of skill most organizations don't have," said Sanjeev Mohan, analyst at Gartner.
Evaluating open source options
Evaluation starts with knowing what you want to achieve and the capabilities required to support those objectives, Mohan said.
"You have to establish why you even need a data catalog before you talk about what data catalog to select," he explained.
If an organization decides to go with an open source data catalog after weighing the benefits and challenges, it needs to consider several things to select the best choice.
Consider which option provides a level of completeness that aligns with the organization's business needs and objectives, as well as with its staff's ability to customize, integrate, maintain and support the software. Neill said it's important to consider whether the product does what you need it to do without too much coding.
Determine whether the open source data catalog has the foundation to provide the capabilities and functionalities the organization needs. Check that the data catalog has a community that can provide support if required, and consider whether that level of support will be adequate given the enterprise's in-house skill levels.
"Keys to keep in mind would be examining their current architecture stack and available internal resources to ensure they have skill sets to configure and operate the tools," McGivern said. "Vendors in this space have invested heavily in trying to simplify functions for configuration and management, so you'll need good internal skills during that configuration and use."
According to McGivern, other considerations include an examination of tools already in your stack as many vendors provide additional data catalog capabilities. He also recommends enterprises consider data integration requirements for other platforms, such as data classifications and loss prevention modules, to help manage rule sets.
Open source vs. proprietary data catalogs
Organizations can opt for proprietary data catalogs, including options from Alation, Atlan, Collibra, Informatica and Talend.
As is the case with other types of commercial applications when compared to open source versions, proprietary data catalog products come with available modules and tools that are ready-made for organizations to use -- an attractive option for many organizations that don't need the level of customization an open source data catalog offers.
"They have all of these components that work together for your data platform," Neill said.
Mohan said Gartner is tracking more than 60 data catalog products. Although most of them are commercial products, he said there are a dozen or so open source options, and that number is growing. Open source options include Amundsen, Apache Atlas, CKAN, Magda and Truedat.
As a result, open source data catalogs may become a more attractive option to more organizations in the near future, Mohan said. According to him, we'll see more maturity in open source data catalog options.