This article originally appeared on the BeyeNETWORK
The record as a thing of interest in data management has been somewhat overshadowed by other data objects since the introduction of relational databases. Yet there are a number of reasons why we should be interested in records and try to manage them. This is becoming more important as master data management (MDM) highlights special administrative needs of certain classes of data. One approach to meeting these needs is to utilize metadata, which forces enterprise information management (EIM) professionals to think about the structure of data as it is held in databases. Such considerations inevitably lead to a realization that it is necessary to deal with records.
The concept of the record predates the introduction of relational databases. During this early period data was organized into files, and a record was an addressable, composite data item in a file. It was composed of fields. There was no real concept of processing sets of records. Records had to be processed individually.
The Decline of the Record
Relational databases changed all that. Firstly, the term tuple superseded the term record, but that name did not stick. Instead, the term row became widely adopted. A relational table is visualized as a set of contiguous columns with rows containing the data values. The implementation of relational databases was quickly followed by the introduction of SQL which permitted set-at-a-time processing of rows. This was an enormous advance. Rows (records) could now be read and updated in groups. For those of us who can remember, there seemed to be a fashion against dealing with, or even mentioning, records in the early days of SQL. It was as if records belonged to an earlier phase of technologies and methodologies.
The Return of the Record
Unfortunately, it turned out that anyone who tried to use set-at-a-time SQL for absolutely all physical data manipulation ran into difficulties.
For instance, a problem arose if all the records in a table had to be inspected using a sequential chain of business rules with branching logic depending on the result for each individual rule. The following pseudo code provides an example:
You could figure out ways to do this with set-at-a-time SQL alone, but it became obvious that record-at-a-time processing was easier and more maintainable. Some would blame this deficiency on the procedural nature of programming languages, but the problem still existed. Thus, we saw the introduction of record sets and their equivalents in programming languages and even “cursors” (current set of rows) into SQL and the RDBMS stored procedure languages. Records were back.
The Rise of the Record
Nevertheless, in the family of data objects, records (rows) were still stepchildren compared to the importance attached to tables, columns, primary keys, and foreign key relationships. The record has no logical equivalent that can, for instance, be placed and manipulated in a logical data model. Furthermore, managing records means managing physical data, and as has been noted previously in this channel, IT, including data management, has never had much of an appetite for physical data management.
Unfortunately for legacy data management, data is physical, and eventually the problems of physical data have to be confronted. This has certainly happened in the integration required for data warehouses and marts. Maintaining historical data is part of the reason for having a data warehouse. If history is needed, then change in physical data values has to be detected. There is no logical-level component that can manage history for you, unlike a database design which can be built at the logical level and then materialized as physical data objects.
Historical data management in the context of a data warehouse typically requires “changed data capture”. That is, one must compare a column value in a record at one point in time with a column value in the same record at a different point in time to figure out what has changed. To do this, it is necessary to have a copy of the record in question at a prior point in time for the comparison. The record copy from the prior point in time can be compared to the record at the current point in time and the differences in column values found.
To do this requires being able to match the two records. There must be a unique identifier for the each record that has the same value in the prior and current version of the record. A “natural” key is suspect in this regard, since it identifies the thing that the record represents (e.g., the CUSIP of an equity in a security master). This is problematic because, for instance, somebody may have entered the natural key data incorrectly and then corrected it. There are other reasons a natural key can change too; but this is beside the point. What we are doing in changed data capture is comparing records as records, not different states of the entity that the record represents. In other words, it is the record as such, and really only the record, that interests us in changed data capture.
Records and Master Data
Changed data capture is just one example of dealing with records as records. There is also an increasing need for managing and tracking records for governance reasons. In certain sensitive types of data, it is necessary to know who did what, when, why, and with what authority to a record.
Probably everyone will be familiar with standard metadata columns that pertain to record management, such as the date and time when the record was created and last updated and the identity of the user who created and last updated the record. These columns have nothing to do with the entity the record represents. They describe the record itself as a record. In master data management (MDM), these record-related metadata columns are starting to grow in number. For instance, it may be necessary to know the production state of a record, which might be “proposed”, “verified” or “production”. It might be necessary to know when the record went into effect in terms of transaction processing, which is a different concept to when it was in effect in business terms (something called a “bitemporal design”). There seems to be no objective limit to the number of columns that can be added to a master data record to govern it as a record.
At this point, the need to identify a record as a record comes into focus. If we are required to manage records, then every record must have a unique identifier within an enterprise. Everyone in data management is aware of the arguments that exist over how to assign primary keys to entities, and thus tables. However, we are not discussing entities here, but records. A record is a thing of interest to enterprise information management (EIM). It has existence, a life cycle, and rules that govern its management. Hence, it must be identified uniquely as a record. The conclusion is, therefore, that every record in an enterprise must have a globally unique identifier that also remains globally unique across time.
In this respect, records are a kind of master data, perhaps better described as meta-master data. They exist as concepts at a different level of abstraction to the business entities they represent, but still have real management needs as records. Managing records represents a major challenge for EIM, but it is one that must be faced.