News Stay informed about the latest enterprise technology news and product updates.

Normalizing Reference Data

Many databases are created in third, or a higher, normal form only to be populated with reference data that is chaotic and self-evidently nonstandard. This can have a profound negative impact that offsets the main advantage of a normalized database design.


This article originally appeared on the BeyeNETWORK

It is fair to say that the rules for normalizing database design are reasonably well understood in the data management profession – at least to third normal form. Yet many databases are created in third, or a higher, normal form only to be populated with reference data that is chaotic and self-evidently nonstandard. This can have a profound negative impact that offsets the main advantage of a normalized database design, which is that the database design should only change if the business area it supports changes. A strong case can be made that sets of reference data values need to be “normalized” in a way appropriate to reference data if a database is to provide maximum return on investment.

“Normalization” in general means standardization, and usually involves conformance to one or more rules to achieve standardization.

The term reference data means different things to different people. The definition used here is: Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

This definition means that reference data tables are what are commonly called “lookup tables” or “domain values”. These are typically tables with a key column containing a code, and another column containing a description. Sometimes, reference data tables can have additional columns, which may include foreign keys from other reference data tables. Examples of reference data tables include Country, Currency, Customer Type, andProduct Category.

Why is Reference Data Important?

Reference data has an importance that approaches metadata in terms of the significance of individual data values. Codes have definitions, not just corresponding descriptions. These definitions may not be formally stated, but they need to be understood to use the reference data tables. For instance, reference data is used to control business rules: if a business rule includes an actual data value, this will nearly always be reference data (i.e., a code). The semantics of the code value control the logic of the business rule. Codes are also important in reporting hierarchies, where users must understand what is meant by each of the categories appearing in a report.

Reference data, therefore, acts like metadata in certain respects, but it is not treated like metadata. IT professionals seem to view defining a database as an IT task, but populating it with data as the job of the business users. From this viewpoint, reference data values become the responsibility of the users. In reality, however, reference data tables have to be populated long before an application goes live – even prior to testing – so this task is often partially done by IT staff. Overall, little attention is paid to the population of reference data values. However, reference data tables typically make up anywhere from 20% - 50% of a database in terms of numbers of tables. Therefore, a database whose design is in third normal form can still have up to half of its tables populated in a very chaotic manner.

There are a number of simple rules for normalizing reference data values, which can briefly be described as follows.

Definition of All Required Reference Data Tables

It is not uncommon to find database tables such as Product or Customer that contain code columns which should be foreign keys, but for which no parent reference data table exists. The code values, therefore, exist only in the database table where they are found. If a new code has to be added, it can only be added as part of a new record. If a list of the existing set of codes is required, this can only be done by extracting them from the table where they are found. Since no parent table exists, no descriptions are available for the codes, and no metadata is available for them (e.g., who added or updated the codes and when this was done).

Therefore, it is important to define individual reference data tables to support all reference data used in a database. There should not be any “orphan” code columns in database tables that have no parent reference data tables.

Correct Level of Abstraction of Reference Data

Over-abstraction of reference data occurs where a reference data table is defined but contains data values that should really belong to different reference data tables, or which represent different subtypes. These reference data tables typically lack any formal definition and have names like Product Type, Customer Category, or even Record Type, and can represent just about anything. The result is that the tables contain sets of data that should really be broken out into many different reference data tables.

Over-abstraction can also be solved by having subtypes in the reference data table. These can either be traditional formal subtypes or hidden subtypes where records are identified as belonging to a particular subtype by foreign keys from other reference data tables. A reference data table that has hidden subtypes will typically be surrounded by a cluster of other reference data tables with relationships to it.

Over-abstraction of reference data can sometimes amount to no more than a database table having contents that do not match its definition, a situation known as “semantic disintegrity”. What is special about reference data tables is that they are so often defined in such vague (i.e., overabstracted) terms, that almost any data value in them can be justified. This problem can be avoided by a more strict analysis of reference data tables, and a readiness to introduce new reference data tables when necessary. Every effort must be made to make the contents of a reference data table match a reasonable level ofabstraction that is represented by the definition of the table.

Resolution of Overlapping Reference Data Values

Reference data differs from other types of data in that individual values typically require definitions. Unfortunately, these are rarely present, and one of the problems that can arise is that data values overlap. For instance, a Country Code table may contain one value for “People’s Republic of China” and another for “Hong Kong”. Politically, the People’s Republic of China is united with Hong Kong, so there should only be one record, unless the enterprise that uses this table wishes to have separate reporting for its activities in Hong Kong. If this is the case, then the definition associated with the record for “People’s Republic of China” should state that for the enterprise, this excludes all activities in Hong Kong.

Elimination Gaps in Reference Data Tables

Reference data values are often called “domains”, meaning all the possible data values that can occur for the data element that the reference data table in question is supposed to define. For reference data tables that follow industry, national, or international standards, the tables usually contain all the values that exist. However, for “domains” that are specific to a particular enterprise, there are often some values that are missing.

The traditional way by which this problem is mitigated is to have one or more values in a reference data table called something like “Other”, “Unspecified”, “Miscellaneous”, or “Unknown”. All the instances that do not fit with the well-defined values in the table are then put in one of these categories. In reality, such categories are just an escape from the governance of reference data and typically grow larger over time, eventually requiring a complete revision of the reference data table.

Elimination of Different Levels of Reference Data

When the contents of a reference data table are examined, it is sometimes quite obvious that some records represent a lower level of detail than others. For instance, a Country Code table that represents real countries will have a single record for “United Kingdom”. However, a Country Code table that is used to represent tax jurisdictions may have “United Kingdom”, “Isle of Man”, “Jersey”, “Sark”, and “Alderney”. The latter four of these values represent islands that are governed by the United Kingdom but which have their own special tax laws. If a single Country Code table that includes tax havens is also used for purposes that require only “real” countries, such as reporting on business activities, then all kinds of problems can arise.
Again, vigilance against this specific problem is the best way to mitigate it. This should happen at the time of initial data population, and reference data tables should be periodically reviewed for data at different levels of detail.

Avoiding Serious Problems

If reference data values are not normalized according to the rules described herein, then even if a database design is in third normal form, the database will have serious problems in terms of the applications it supports and the usability of its data. By contrast, if these rules are followed, then the database as a whole will provide a much higher return on investment, with a corresponding reduction in risk that could arise from data quality problems associated with reference data.

Dig Deeper on Data quality techniques and best practices

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.