data profiling

Data profiling is the process of examining, analyzing and reviewing data to collect statistics surrounding the quality and hygiene of the dataset. Data quality refers to the accuracy, consistency, validity and completeness of data. Data profiling may also be known as data archeology, data assessment, data discovery or data quality analysis.

The first step of data profiling is gathering one or more data sources and their metadata for analysis. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships and find any anomalies. Once the data is clean, different data profiling tools will return various statistics to describe the dataset. These could include the mean, minimum/maximum value, frequency, recurring patterns, dependencies or data quality risks.
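As a rough illustration of the kind of statistics a profiling tool might return for a single column, here is a minimal sketch using only the Python standard library. The function name `profile_column` and the sample `ages` values are assumptions made up for this example and do not come from any particular product.

```python
from collections import Counter
from statistics import mean


def profile_column(values):
    """Return basic profiling statistics for one column (a list of values)."""
    # Treat None as a missing value.
    present = [v for v in values if v is not None]
    # Only real numbers take part in min/max/mean (bool is excluded on purpose).
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    counts = Counter(present)
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical sample column with one missing value and one repeated value.
ages = [34, 29, None, 41, 29, 52]
stats = profile_column(ages)
print(stats)
```

For the sample column, the sketch reports one null, four distinct values, a minimum of 29, a maximum of 52 and a mean of 37, with 29 as the most frequent value.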

For example, by examining the frequency distribution of different values for each column in a table, an analyst could gain insight into the type and use of each column. Cross-column analysis can be used to expose embedded value dependencies, and inter-table analysis allows the analyst to discover overlapping value sets that represent foreign key relationships between entities.
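The three analyses mentioned above can be sketched in a few lines of Python. This is only an illustration under assumed data: the `orders` and `customers` tables, their column names and the helper function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample tables, represented as lists of dictionaries.
orders = [
    {"order_id": 1, "customer_id": 10, "zip": "02139", "city": "Cambridge"},
    {"order_id": 2, "customer_id": 11, "zip": "10001", "city": "New York"},
    {"order_id": 3, "customer_id": 10, "zip": "02139", "city": "Cambridge"},
    {"order_id": 4, "customer_id": 99, "zip": "10001", "city": "NYC"},
]
customers = [{"customer_id": 10}, {"customer_id": 11}, {"customer_id": 12}]


def value_frequencies(rows, column):
    """Column analysis: how often each value occurs in one column."""
    return Counter(row[column] for row in rows)


def dependency_violations(rows, determinant, dependent):
    """Cross-column analysis: determinant values that map to more than
    one dependent value, i.e. places where the dependency does not hold."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def orphan_keys(child_rows, parent_rows, key):
    """Inter-table analysis: key values in the child table that have
    no match in the parent table (candidate foreign key violations)."""
    parent_keys = {row[key] for row in parent_rows}
    return {row[key] for row in child_rows} - parent_keys


zip_counts = value_frequencies(orders, "zip")
violations = dependency_violations(orders, "zip", "city")
orphans = orphan_keys(orders, customers, "customer_id")
```

In this sample, each ZIP code appears twice, ZIP code "10001" maps to both "New York" and "NYC" (so city does not depend cleanly on ZIP code), and customer ID 99 appears in `orders` but not in `customers`.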

Organizations can use data profiling at the beginning of a project to determine if enough data has been gathered, if any data can be reused or if the project is worth pursuing. The process of data profiling itself can be based on specific business rules that will uncover how the dataset aligns with business standards and goals.
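One way to base profiling on business rules is to express each rule as a small check and measure how many records satisfy it. The sketch below assumes two invented rules and four invented records; real rules would come from the organization's own standards.

```python
# Hypothetical business rules: each maps a description to a check
# that returns True when a record satisfies the rule.
rules = {
    "age is between 0 and 120": (
        lambda r: r["age"] is not None and 0 <= r["age"] <= 120
    ),
    "email contains @": lambda r: "@" in (r["email"] or ""),
}

# Hypothetical sample records.
records = [
    {"age": 34, "email": "a@example.com"},
    {"age": 150, "email": "b@example.com"},
    {"age": None, "email": "no-at-sign"},
    {"age": 28, "email": "c@example.com"},
]


def rule_pass_rates(rows, rules):
    """Return the fraction of rows that satisfy each rule."""
    if not rows:
        return {name: None for name in rules}
    return {
        name: sum(1 for row in rows if check(row)) / len(rows)
        for name, check in rules.items()
    }


pass_rates = rule_pass_rates(records, rules)
for name, rate in pass_rates.items():
    print(f"{name}: {rate:.0%} of records pass")
```

Here half of the records pass the age rule and three quarters pass the email rule, which gives a simple measure of how closely the dataset aligns with each standard.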

Profiling tools evaluate the actual content, structure and quality of the data by exploring relationships that exist between value collections both within and across data sets. Vendors that offer software and tools that can automate the data profiling process include Informatica, Oracle and SAS.

Types of data profiling

While all applications of data profiling involve collecting and organizing information about a database, the work generally falls into three specific types.

  1. Structure discovery: This focuses on the formatting of the data, making sure everything is uniform and consistent. It also uses basic statistical analysis to return information about the validity of the data.
  2. Content discovery: This process assesses the quality of individual pieces of data. For example, ambiguous, incomplete and null values are identified.
  3. Relationship discovery: This detects connections, similarities, differences and associations between data sources.
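The three types can each be illustrated with one small check. The following is a minimal sketch, not a definitive implementation: the helper names, the placeholder list and all sample values are assumptions chosen for the example.

```python
import re
from collections import Counter


def pattern_mask(value):
    """Structure discovery: reduce a string to its shape,
    e.g. '617-555-0100' becomes '999-999-9999'."""
    mask = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", mask)


# Assumed list of strings that stand in for "no real value".
PLACEHOLDERS = {"", "n/a", "unknown", "tbd"}


def suspect_values(values):
    """Content discovery: null, blank or placeholder entries."""
    return [
        v for v in values
        if v is None or v.strip().lower() in PLACEHOLDERS
    ]


def overlap_ratio(source_a, source_b):
    """Relationship discovery: shared values as a fraction of all
    distinct values found in either source (Jaccard similarity)."""
    a, b = set(source_a), set(source_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample data.
phones = ["617-555-0100", "212-555-0199", "5550123", "617-555-0142"]
cities = ["Boston", None, "N/A", " ", "Austin"]
crm_emails = ["a@x.com", "b@x.com", "c@x.com"]
billing_emails = ["b@x.com", "c@x.com", "d@x.com"]

phone_shapes = Counter(pattern_mask(p) for p in phones)
bad_cities = suspect_values(cities)
email_overlap = overlap_ratio(crm_emails, billing_emails)
```

In this sample, three phone numbers share the shape "999-999-9999" and one does not, three of the five city entries are null or placeholders, and the two email lists share half of their distinct values.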

Benefits of data profiling

Data profiling returns a high-level overview of data that can result in the following benefits:

  • Leads to higher quality, more credible data.
  • Helps with more accurate predictive analytics and decision making.
  • Makes better sense of the relationships between different datasets and sources.
  • Keeps company information centralized and organized.
  • Helps eliminate costly errors, such as those caused by missing values or outliers.
  • Highlights areas within a system that experience the most data quality issues, such as data corruption or user input errors.
  • Produces insights surrounding risks, opportunities and trends.

Examples of data profiling applications

Data profiling can be implemented in a variety of use cases where data quality is important. For example, projects that involve data warehousing or business intelligence may require gathering data from multiple disparate systems or databases for one report or analysis. Applying the data profiling process to these projects can help identify potential issues and corrections that need to be made in ETL processing before moving forward.

Additionally, data profiling is crucial in data conversion or data migration initiatives that involve moving data from one system to another. Data profiling can help identify data quality issues that may get lost in translation, as well as adaptations that must be made to the new system prior to migration.
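As one example of a pre-migration check, source values can be profiled against the limits of the target system. The sketch below assumes a hypothetical target schema expressed as maximum column lengths; the column names, limits and rows are invented for illustration.

```python
# Hypothetical target schema: maximum text length per column.
target_max_lengths = {"name": 20, "country_code": 2}

# Hypothetical rows from the source system.
source_rows = [
    {"name": "Ada Lovelace", "country_code": "GB"},
    {"name": "Grace Hopper", "country_code": "USA"},
    {"name": None, "country_code": "DE"},
]


def oversized_values(rows, max_lengths):
    """Return, per column, the source values that would not fit
    in the target column. Null values are skipped."""
    problems = {}
    for column, limit in max_lengths.items():
        too_long = [
            row[column] for row in rows
            if row[column] is not None and len(row[column]) > limit
        ]
        if too_long:
            problems[column] = too_long
    return problems


problems = oversized_values(source_rows, target_max_lengths)
print(problems)  # {'country_code': ['USA']}
```

A result like this points to a choice that has to be made before migration: either widen the target column or convert the source values to a two-letter code.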

This was last updated in April 2019
