This article originally appeared on the BeyeNETWORK.
In any business intelligence (BI) program, there is a virtual subtext regarding the “repurposing” of data sets for uses for which they were never intended. As old data sets are put to new uses, the conventional wisdom regarding information quality is giving way to modern approaches to understanding the essential characteristics of your information. Rather than simply correcting “bad” data, the focus of data quality initiatives has moved from standardizing and matching names and addresses toward more general approaches: exposing potential anomalies and then defining corresponding metrics to assess quality.
To address this, a number of companies have developed applications intended to provide a much broader approach to understanding and assessing how well data meets an organization’s needs. These kinds of tools, routinely referred to as “data profiling,” were originally designed as analysis applications to support the growing data warehousing market. Today, the embedded methods have evolved from simplistic sampling and frequency analysis to more complex cross-column dependency and embedded structure analysis techniques.
These days it has become de rigueur to incorporate data profiling into the BI environment, to the point where traditional data quality tool vendors, ETL tool vendors, and even DBMS vendors are all getting into the act, either by providing internally developed profiling capabilities or by outright acquisition of data profiling tool vendors. Yet a lingering question remains about the viability of data profiling as a stand-alone application, and much of this has to do with the functionality that data profiling tools actually provide.
The Party Line – Get to Know Your Data
Companies that provide data profiling functionality typically market their wares based on a simple concept – “get to know your data.” But what do these tools actually do?
Data profiling essentially consists of three aspects of analysis:
- Column analysis, which evaluates the frequency distribution of values within a specific table or file column, providing analysis of ranges, sparseness, value cardinality, percent populated, data type, size, and value frequency. Column analysis is relatively straightforward: for each column in a table, count the number of occurrences of each value (including system nulls) and sort the results, either by count or by value. The more complicated part involves analyzing how the frequency distribution indicates which values are expected, which ones are unexpected, and whether any of the values don’t belong.
- Dependency analysis, in which relationships between columns across rows are evaluated to identify candidate keys, to look for absolute dependencies along with other relevant dependencies or business rules whose violations may indicate a data quality problem. Dependency analysis is most useful for discovering instances of denormalized data and embedded structure, and consequently most valuable for ETL and re-engineering projects.
- Redundancy analysis, where values in columns in multiple tables are compared to look for overlapping columns, foreign keys, orphaned records, syntactic format differences, complex business rules, differently named columns holding the same values, or same-named columns with different values. Redundancy analysis is straightforward: the values in a pair of columns selected from different tables are evaluated for set intersection. The insight comes from the analyst’s review of the results: depending on the size of the sets and the degree to which the sets overlap, two columns might be tagged as foreign keys (if the value set is large and mostly disparate) or perhaps as both taking their values from the same reference domain (if the value set is small).
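To make the first and third techniques concrete, here is a rough sketch in Python using only the standard library; the column names, sample values, and returned metrics are illustrative assumptions, not a reference implementation:

```python
from collections import Counter

def column_profile(values):
    """Basic column analysis: frequency distribution plus summary metrics."""
    freq = Counter(values)                      # value -> occurrence count
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "cardinality": len(set(non_null)),      # distinct non-null values
        "percent_populated": len(non_null) / len(values) if values else 0.0,
        "frequency": freq.most_common(),        # sorted by descending count
    }

def redundancy_profile(col_a, col_b):
    """Basic redundancy analysis: set intersection between two columns."""
    a, b = set(col_a) - {None}, set(col_b) - {None}
    overlap = a & b
    return {
        "overlap_count": len(overlap),
        "pct_of_a_in_b": len(overlap) / len(a) if a else 0.0,
        "pct_of_b_in_a": len(overlap) / len(b) if b else 0.0,
    }

# Example: profile a 'status' column, then check a candidate foreign-key
# pair -- an orders column referencing a customers column.
status = ["A", "A", "B", None, "A", "C"]
profile = column_profile(status)

orders_cust_id = [101, 102, 103, 999]   # 999 has no match: a possible orphan
customers_id = [101, 102, 103, 104]
fk_check = redundancy_profile(orders_cust_id, customers_id)
```

As the article notes, the mechanical counting is the easy part; deciding which frequencies and overlap percentages actually signal a problem remains the analyst’s job.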
Profiling is often used to highlight potential issues that are embedded within the data set, with the intention of providing the promised insight. Examples include:
- Exposure of potential anomalies, such as a significant occurrence of null values in a required field;
- Revelation of potential outlier values, such as numeric values falling outside a reasonable range; and
- Data quality business rules that rely on conditional relationships between column values.
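The first two kinds of checks can be sketched directly; in this Python fragment, the field names and the “reasonable range” bounds are illustrative assumptions:

```python
def null_rate(values):
    """Fraction of missing values; a high rate in a required field is an anomaly."""
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values) if values else 0.0

def range_outliers(values, low, high):
    """Numeric values falling outside the expected [low, high] range."""
    return [v for v in values if v is not None and not (low <= v <= high)]

# Hypothetical data: a required 'ssn' field and an 'age' column.
ssn = ["123-45-6789", "", None, "987-65-4321"]
ages = [34, 29, 151, 42, -3]          # 151 and -3 fall outside a plausible range

ssn_null_rate = null_rate(ssn)        # half the required field is unpopulated
age_outliers = range_outliers(ages, 0, 120)
```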
These are valuable issues to identify. But if you only use the tool for anomaly analysis, you may not be fully exploiting the potential value of the technique. Conversely, it may be possible to reap some of the benefits of data profiling even without a tool.
Tool or Technique?
So is data profiling a tool or a technique? Truthfully, some of the aspects of column profiling can be mimicked using relatively simple SQL queries. Generating a value frequency distribution may provide a large amount of profiling information, especially when it comes to cardinality, sparseness, or outlier identification. And while cross-column analysis may be computationally complex, cross-table redundancy analysis can be easily programmed as well.
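For example, a value frequency distribution of the kind described above is a single GROUP BY query. This sketch uses Python’s built-in sqlite3 module; the table, column, and sample values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (state TEXT)")
conn.executemany(
    "INSERT INTO customers (state) VALUES (?)",
    [("NY",), ("CA",), ("NY",), (None,), ("TX",), ("NY",)],
)

# Frequency distribution: each value (including NULL) with its count,
# sorted by descending frequency -- the core of column analysis.
rows = conn.execute(
    """
    SELECT state, COUNT(*) AS freq
    FROM customers
    GROUP BY state
    ORDER BY freq DESC
    """
).fetchall()
```

From this one result set an analyst can read off cardinality (number of rows returned), sparseness (the NULL group’s count), and candidate outliers (values with unexpectedly low frequency).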
But while some of the functionality can be deployed internally, the vendor tools provide some specific benefits that make them worth the cost. First, the interface that summarizes the profiling functions applied across all three aspects gives the end-client a visual representation of the potential data quality issues. Second, most of the tools provide some kind of drill-down mechanism allowing the analyst to review the records that are potentially flawed. Third, many tools can work both with data extracted from its original source and through native database access.
Many tools provide alternate analysis dimensions by reviewing value patterns (e.g., characters vs. digits vs. special characters) and reporting their corresponding frequency distributions. In addition, dependency analysis, which exposes functional dependencies within a single table, is a bit more complicated to implement and is a capability that contributes significantly to different kinds of applications. Because of its complexity, this capability distinguishes more advanced products from the more parochial ones.
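Pattern-based profiling of this sort is easy to sketch: reduce each value to a character-class mask and tabulate the masks. The mask alphabet here (“A” for letters, “9” for digits) is a common convention but an assumption on my part, as are the sample values:

```python
from collections import Counter

def value_pattern(value: str) -> str:
    """Reduce a value to a character-class mask: letters -> 'A',
    digits -> '9'; punctuation and spaces are kept as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

def pattern_frequencies(values):
    """Frequency distribution of value patterns across a column."""
    return Counter(value_pattern(v) for v in values)

phones = ["212-555-1212", "646-555-0100", "555 1212", "unknown"]
masks = pattern_frequencies(phones)
# The dominant mask (999-999-9999) suggests the expected format;
# the rarer masks point at potential anomalies worth drilling into.
```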
Valuable Data Profiling Applications
Looking at profiling as a set of knowledge discovery techniques allows the analyst to use the technology for numerous value-added applications, fully exploiting the opportunity to “know your data.” Consider these different applications, each of which uses the profiling techniques to target a specific aspect of an information quality program:
- Anomaly Analysis – Empirically investigating a data set to look for unexpected information behavior as an initial baseline review. The resulting profile report can be used for future comparisons to track improvement and flag regressions.
- Data Reverse Engineering – Reviewing a data set for which there is little or no existing metadata, or for which existing metadata is suspect, for the purpose of discovering and documenting the actual current state of its metadata.
- Information Quality Rule Discovery – Examining a data set to identify and extract embedded business rules that may be intentional but undocumented, or unintentional.
- Auditing and Monitoring – Periodic review of data providing metrics to ensure that defined expectations are being met, as well as pointing out opportunities for deeper exploration.
- Metadata Compliance – Ensuring that the data set actually conforms to its associated documented metadata.
- Data Model Integrity – Ensuring that the cardinality of entity relationships is respected and that referential integrity is enforced properly in a database.
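As one example from the list above, metadata compliance can be approximated by checking a column’s observed contents against its documented expectations. The spec dictionary, its rule names, and the sample data below are all invented for illustration:

```python
def check_compliance(values, spec):
    """Compare observed column contents against documented metadata.
    spec holds 'nullable' (bool), 'max_length' (int), and 'domain'
    (a set of allowed values, or None if unconstrained)."""
    violations = []
    for i, v in enumerate(values):
        if v is None:
            if not spec["nullable"]:
                violations.append((i, "null in non-nullable column"))
            continue
        if len(str(v)) > spec["max_length"]:
            violations.append((i, "exceeds documented length"))
        if spec["domain"] is not None and v not in spec["domain"]:
            violations.append((i, "value outside documented domain"))
    return violations

# Documented metadata for a 'country_code' column, plus observed data.
spec = {"nullable": False, "max_length": 2, "domain": {"US", "CA", "MX"}}
observed = ["US", "CA", None, "USA", "FR"]
issues = check_compliance(observed, spec)
```

Run periodically, the same check doubles as the auditing and monitoring application: the violation counts become the metrics tracked over time.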
Data profiling is a valuable set of techniques to expose possible data quality business rule violations, to measure business rule compliance, and to track data quality issues over time. When used as a repeatable process, profiling is a useful analytical process with a relatively quick learning curve that helps both technical and business clients understand the most significant data quality issues. And by integrating the results of periodic profiling sessions, the insight profiling provides helps both in initial data quality assessments and in ongoing monitoring of data compliance with discovered business rules.