Normalization is Impractical for Master Data

Master data tables seem to be particularly sensitive to the limits of normalization, and Malcolm Chisholm believes that a reflexive attachment to normalization can be counterproductive.


This article originally appeared on the BeyeNETWORK

Normalization is a technique that achieves a lot of good in database design. However, there are limits to its effectiveness, and master data tables seem to be particularly sensitive to these limits. Master data tables like Product and Customertend to have hundreds or even thousands of columns. They incorporate many different types of Product or Customer, and each of these types often has a few attributes that are unique to it, but not shared with the other types. The result is that when we examine a table such as Product, we see that certain columns only have values for certain types of Product, and never have values for other kinds of Product.

In my personal experience, I have always had issues when I have pointed these facts out to more rigorous data modelers. They have invariably told me that I have modeled the situation incorrectly, and instead of a single Product table, I should have a set of subtypes. Some of them have told me that it might be acceptable to have a single Product table in the physical implementation, but the logical data model should show the subtypes.

These attitudes have baffled me. Creating the subtypes adds complexity that creates confusion when trying to understand the data model. Having the logical data model significantly different to the corresponding database adds to this confusion. I think that a reflexive attachment to normalization is counterproductive in this case. Let me try to prove this point with an example.

The Subtypes of Project

Suppose we have a company that is active in many areas. It delivers services and builds one-off specialty products. It treats all of its business activities as projects, and consequently has a master data entity called Project. This entity has two basic attributes:Project Number, and Project Title. Project Number is the primary key. There are additional attributes for Project, as follows:

  • Project Security Level. This is level of confidentiality with which the project must be treated. There are 4 levels: Top Secret, Classified, Confidential, and Public Domain. A Top Secret Project must have a Security Officer assigned to it. A Classified Project must be registered with the company’s general council within 30 days of its inception. A Confidential Project must have a serially numbered non-disclosure agreement attached to it. A Public Domain Project must be published on its own web page with a unique URL.

  • Project Status. This describes the life cycle of the project. The project begins at the Project Inception Date, and is considered to be in a design phase until it is approved for development. The Project Development Start Date is when the development phase begins, which continues until the Project Prototype Date, which is when the prototype phase begins. The next phase is the production phase which begins on the Production Start Date and goes on to the Production End Date. From this last date until the Project Obsolete Date, the project is in a post-production support phase. On and after the Project Obsolete Date, the project is in the obsolete phase.

  • Project Sector. This is the social or economic sector to which the project pertains. If the project is for the military sector, then the armed service that is the beneficiary must be identified. If it is for the communications sector, then the relevant type of communication must be identified. If the project is for the environment, then we need to record if it is for pollution abatement, habitat preservation, or biodiversity. These categories are termed “environment goals.” Finally, if the project is for the financial sector, we need to identify the financial regulator that will have jurisdiction.

All of this information is summarized in Table 1, which shows the attribute, its possible values, a code that represents each value, and ancillary attributes that exist only for the specific value.

 Attribute Value   Code  Ancillary Attribute
 Project Security Level  Top Secret  1  Security Officer
   Classified  2  General Council Registration Date
   Confidential  3  NDA Number
   Public Domain  4  Web Page URL
 Project Status  Design  A  Project Inception Date
   Development  B   Project Development Start Date
   Prototype  C  Project Prototype Date
   Production  D  Production Start Date
   Support  E  Production End Date
   Obsolete  F  Project Obsolete Date
 Project Sector  Military  ML  Armed Service
   Communications  CM  Communication Type
   Environment  EV  Environment Goal
   Financial  FN  Financial Regulator

Table 1: Summary of Three Attributes of Project Entity

How to Model It

If we follow the rules of normalization, we will begin by creating a single entity calledProject, with five attributes, as shown in Figure 1. 


Figure 1: Basic Project Entity

Now we are confronted with a problem. According to normalization, we need to create subtypes for Project Security Level, Project Status, and Project Sector. We need one subtype per type of Project Security Level – and Table 1 shows that there are 4 of these. But we also need one subtype per type of Project Status – and Table 1 shows there are 6 of these. SinceProject Security Level and Project Status are independent of each other, they have to combine in 4 x 6 = 24 possible ways. If we add Project Sector into the mix, we end up with 4 x 6 x 4 = 96 possible combinations.

So, apparently we need 96 different subtypes in addition to the parent Project table. This is a number that it is not easy for a programmer or analyst to keep track of. In reality, it is even more complex.

Figure 2 illustrates this further complexity. It shows 3 of the possible 96 subtypes. Each of these subtypes has the attribute Project Inception Date, because each represents theProject Status of “Design.” In fact, 16 of the possible 96 subtypes will have the attributeProject Status Date. Suppose a programmer is tasked to produce a report of Project byProject Inception Date. He or she will have to figure out which 16 of the 96 tables to access, and then pull out the dates. 16-table joins (or unions) are neither easy nor graceful.


Figure 2: Example of Three Subtypes


The “subtype” tables are not actually subtypes. The “Design” subtype will be implemented in 16 of the 96 tables as we have just seen. In other words, if I want to fully understand theProjects that are in the “Design” phase of Project Status, I have to consult 16 tables. None of the individual tables map to one subtype – they are all unique intersections between the 14 subtypes listed in Table 1. We might call these tables subtypes, but they are not.

Having 97 tables for the Project entity (the parent plus 96 subtypes) will overload any human trying to make sense of the data model. Persuading a DBA to implement such a structure, and a programmer to build logic around it, are almost certain to be impossible. We are forced to fall back on the single Project table with the subtypes folded into it. I would submit that it is easier to figure out the situation of each record in such a table than it is to figure out which subtype table applies to a given record in the normalized design.

We still have to manage the subtypes, be they inside the Project table or scattered across the 96 tables. The principle of locality of reference tells us that it is easier to manage related elements if they are physically close to each other. The single Project table is a better design from this perspective. Clearly, therefore, there are limits to what normalization can give us, and in master data entities, there are design alternatives which are simpler and easier to maintain.

Dig Deeper on Data quality techniques and best practices