This article originally appeared on the BeyeNETWORK
There seem to be are a number of questions that constantly recur in the world of data modeling, and to which no generally accepted satisfactory answer has ever been provided. These questions tend to excite the emotions of otherwise calm and analytical data modelers, sometimes with negative results. Why this should be so is something that I will comment on in a later article – but first let us review my personal list of ten data modeling debates that seem to go on and on and on. Here they are in reverse sequence.
Debate 10: Crossing and Buried Lines
Data models have their own aesthetics, and crossing lines, or having them pass under entities, is generally frowned upon. Unfortunately, complex data models are going to have such inconveniences. Some people try to reduce them with startling measures such as removing code tables. Others tend to live with the problem and get disparaged for so doing. This debate often features arguments about understandability versus detail, but tends to focus on how modelers themselves apprehend models, rather than what models are used for.
Debate 9: Derived Attributes
This debate seems to have lessened in recent years, but it still persists. It revolves around the justification for any derived or computed attribute in a data model. Some modelers do not like them at all, which would compel us to calculate balances on the fly from all applicable transactions. On the other side, we sometimes find databases littered with difficult to justify “temporary tables” that hold “intermediate calculations.” Evidently there is some kind of balance here that is difficult to achieve.
Debate 8: Design or Analysis
This is a relatively recent debate, at least for me. If one hundred data modelers are presented with an identical problem, should they all come up with the same model? Is there just one answer? Or, are there design aspects to data modeling so that a modeler can justify his or her way of solving a problem in terms of requirements. I have seen tempers flare over this, with both sides stubbornly entrenched in their views.
Debate 7: Normalization versus Denormalization
As old as modeling itself, this debate tends to pit data modelers against non-data modelers like programmers. One of the things it took me a long time to understand was programmers complaining about lack of performance due to database designs. Changing designs to improve performance is something that should be done after tinkering with hardware, operating software, indexing, and query execution. Nevertheless, there can be grounds for denormalization, as we see with dimensional modeling, and this debate will continue.
Debate 6: Level of Abstraction
If you ask business users about a party, they will probably want to know if alcohol is going to be served. For data modelers, it means something different. Parties and the like are the ultimate entities that stop an infinite regression of supertypes on the entities we define in our data models. Data modelers sometimes joke that only “thing” is above entities like “party.” More generalized data models are advertised to have the property of not changing as the business changes. This alchemy is performed at the expense of ejecting the complexity surrounding change to the layers of the physical database and program code. The data modeler is buffered from change – the enterprise is not. Yet there are still valid arguments for entities like party (e.g., to get 360 degree views for cash flows, etc.). This debate is not going away.
Debate 5: Why Doesn’t the Business Care More?
Why not indeed? Data modeling can be a very valuable activity. However, everything of value needs to be marketed and sold, and value has to be delivered constantly in a demonstrable and perceptible way. Because data models are typically created in specialized tools and require special skills to appreciate, they are always going to be a tough sell to the rest of the enterprise. They are not impossible, but, by contrast, use cases, swimlanes and RACI diagrams are a lot more intuitive to business users. Also, be careful what you wish for – if the business did care more about data models, there might be considerable alarm about how they are being managed.
Debate 4: Abbreviations
Technically, abbreviations are probably part of naming conventions (Debate 3), but I would ask for indulgence in raising them to the level of their own debate. That grown persons can spend valuable corporate resources in an attempt to get “consistent” abbreviations seems odd. It is not an activity I would care to justify, but it is one in which many divergent opinions come to the fore. The most usual reason (or excuse) given for doing it is that database platforms only permit a certain number of characters for physical names of data objects. This is somehow combined with a vision of a “consistent” way of forming abbreviations that usually has tenuous links to requirements and no identified stakeholders. Yet it can generate a lot of passion.
Debate 3: Naming Standards
Many data modelers are very concerned about how naming standards should be formed. On a number of occasions, I have asked what the requirements for naming conventions are. For physical naming standards, I have been told that they would help programmers and DBAs, although I have yet to meet a DBA or programmer who has been consulted on this matter. Logical naming standards are another matter. Since business terms are a vocabulary that is already fixed, data modelers are not really free to modify them, so these standards tend to be of data modelers by data modelers for data modelers.
Oddly, advocates of naming conventions are usually the first to be outraged by any use of intelligent keys, but that is precisely what naming conventions lead to. Those who doubt that should ask themselves why we do not identify entities, attributes, tables, and columns by surrogate key values like random numbers.
Debate 2: Kimball versus Inmon
On balance, the great religious conflicts of history may have created more strife than the two schools founded by these illustrious gentlemen. But it is a very close run. There have been several occasions when I have been vigorously upbraided by glassy-eyed acolytes of one side or the other for some minor remark that could have been construed to be supportive of the opposition. Such experiences have taught me caution, and I really wish there was some easy way of figuring out which side people are on before questions of faith are put to me. What is even odder, however, is that over time, Messrs. Inmon and Kimball have changed their views of things. Maybe there will be further fracturing of the followers into sects of early, middle and late Kimballites and Inmonians.
Debate 1: Natural versus Surrogate Keys
Which is “better,” a natural or a surrogate key? Natural keys are actual characteristics of entity instances, whereas surrogate keys are made up identifiers that really identify records. I think that over the years, surrogate keys have made some headway – particularly in the realm of data marts. Nevertheless, you can often still get a good food fight going with data modelers over this question. The “meta-question” here is why does there have to be a decision about whether the user either of them? The usual response is that there can only be one primary key. So, of course, the primary key has to be the “right” column(s). It is true that in a logical data model, there is one primary key, but there can be alternates. In a database, you can have uniqueness enforced on as many columns as you want (even if programmatically), and you can choose to index any of these columns. Let a million keys blossom!