Faced with an existing database implementation, is there a way to determine if it is in fact-dimension style or a normalized style? Is it shades of grey, or is there a unique features of one or the other? What is the difference between a dimension table and a fact table?
There is a great deal of overlap between these two questions, so we'll answer them together.
Let's start with normalized data. This is how we usually structure data for transactional systems such as an order entry system, an HR system and so on - in other words, traditional, operational databases.
In essence normalized data is held in a very simple structure. The data is stored in tables. Each table has a primary key and should contain data relating to one entity – so a normalized customer table contains only data about customers. We need to make logical connections between the entities (for example, this customer placed these orders). To do this we use a foreign key in the orders table to point to the primary key in the customer table.
To get a little more technical, there are three levels of normalization – first, second and third normal form and we can test a table to see if it meets these criteria. (To get even more precise, there are more that three levels but mostly we are only interested in the first three).
One advantage of normalized data is that there is very little data redundancy – each piece of data is stored once and once only.
Now in a dimensional design (fact-dimensional style) all of these rules go out of the window. The fact table holds the measures (usually numerical) that we want to analyze and the dimension tables contain data/information about the ways in which we want to analyze the data.
The dimension tables in particular are often highly de-normalized. For example, a customer table might store the zip code of the customer, their town and state. If you have fifty customers in Walla Walla, then the customer dimension table will store the fact that Walla Walla in is Washington a total of fifty times. So there is often massive data duplication. Take this Time dimension table for example (sorry, it's in UK date format, but the sense remains the same):
It stores the information that calendar month eight (August) is in quarter three a total of 31 times for every year that is held in the table. But that is fine because these structures are not built to manage transactions they are built to help us to analyze the data as easily as possible.
Dimension tables have a column containing unique values (in this case CalDate). The data in this column is used to link the dimension table to the fact table (see below).
Each of the remaining columns is there because the users have said that they want to analyze by that attribute. So, in this case, the users wanted to analyze by Year, Quarter, Month and Day. Dimension tables effectively spell out the analysis that the users want to perform.
Fact tables, on the other hand, contain the numerical measures that the users want to analyze. So (a good question at this point is "Why aren't they called measure tables?") And the answer to that is that fact tables contain more than just measures, they contain measures placed in some kind of context.
Think about it this way - a measure is simply a numerical value. A fact is that same value with context applied to it. We could also say that a fact is a measure anchored in multidimensional space, which sounds wonderfully exotic but is in essence perfectly true.
So, if I tell you we sold three items; that is simply a measure. If I tell you we sold three copies of a CD called "Escape to Wapping" on the 23/08/2008 to Lilly Murray, you now have some context around that measure; so what you now have is a fact.
A fact table might look something like this:
|Item No||Customer ID||Employee Code||Order Date||Dispatch Date||Quantity|
We can hook both OrderDate and DispatchDate back to the time dimension so that we can analyze these facts over time. So, for example, if we wanted to know how many items had been sold during August 2008 we could scan down the time dimension table looking for the value eight in Cal month and the value 2008 in Cal year. That would return 31 rows with 31 unique date values. We could then query the OrderDate column in the fact table, identify all the appropriate rows, and sum the values in the Quantity column.
This may sound a little long-winded. You might argue that we could simply look for the dates in the fact table and find the appropriate ones. There are two reasons why we tend not to do this.
The first is the fact tables can hold quite literally billions of rows; which makes them slow to query. The second is that we would potentially have to perform a very large number of date function calls in order to do this. Of course, the intelligent use of indexing would mean we would not have to perform a full table scan on the fact table and billions of function calls; nevertheless it would be a relatively slow operation.
You'll notice that another way of thinking about the date dimension table is that it is a materialized store of the results of data function calls. This table is also a tiny compared with the fact table -- 10 years worth of data results in a table of only approximately 3650 rows.
And, going back to illustrating how the fact table works, if the Customer dimension table looks like this:
|Customer ID||Title||First Name||Last Name||Gender|
You can see that it was, indeed, Lily Murray who bought item 31. And if the Product dimension table is like this:
|Item Num||Classification||CD ROM Title|
|23||Cars||Bentley, the man and his cars|
|25||Other||Welcome to the party|
|26||Travel||Seven go placid in Bangor|
|28||Work||Are you following me?|
|29||Other||A chip on the shoulder|
|31||Travel||Escape to Wapping|
|33||Work||Shipping for profit|
|34||Other||An orchard in May|
|36||Computing||Database vol. 3|
|37||Hobby||Dentistry for laypersons|
You can see that she did indeed buy the CD called "Escape to Wapping".
As to why anyone would want to buy such a CD, sadly history does not relate.
Dig Deeper on Database management system (DBMS) architecture, design and strategy
Related Q&A from Mark Whitehorn
Here's a guide to primary, super, foreign and candidate keys, what they're used for in relational database management systems and the differences ... Continue Reading
The unstructured data types common in big data systems are often better managed by a NoSQL database than relational software, Mark Whitehorn says. Continue Reading
Analytics expert Mark Whitehorn explains the strengths of R and how to determine if the open source programming language fits your analytics purposes. Continue Reading