This article originally appeared on the BeyeNETWORK.
Organizations spend large amounts of money, time and human resources implementing data movement processes. However, after it is all said and done, the customers of the new data structures frequently are disappointed – if not frustrated – with the quality of the information. To make matters worse, when the organization sets out to correct the defects, they find that they lack the information needed to locate the root causes. Defect correction in data movement processes is frequently a trial-and-error, time-consuming and error-prone process. Some data movement processes are so complex and difficult to maintain that many organizations resort to "after-the-fact" data correction techniques, adding to the overall complexity of the environment.
Data movement is highly undesirable: it adds cost, reduces semantic value, increases information float and introduces opportunity for quality defects, just to mention a few of its downfalls. Yet, moving data is one of the most common practices in information management today. Data is moved to enable Internet solutions, to populate data warehouses, to convert old systems to new, to interface operational applications, to deployERP solutions, to enable business partners to service your customers or your employees, and more. Data is moved using custom batch processes, ETL tools, messaging tools, etc.
In this article, I discuss an approach to implement the "Fourth Rule of Quality by Design: Capture Information about Defects and Events." This approach enables capturing details on processes, events and defects such that defect correction is not only possible, but highly effective. The effectiveness will result from the ability to know which defects are the most significant based on their frequency and where exactly in the process the defects are detected (or missed). For this technique to be effective, it must be supported by the other three rules of quality by design described briefly in this article.
First Rule of Quality by Design: Don't
The first rule of quality by design in data movement is DON'T DO IT unless you really have to. The question is: How do you know that you really need it? The answer is best given by Larry P. English1. He advises that the goal of information management is to avoid and eliminate unnecessary data movement by providing well defined enterprise information models and well implemented enterprise-strength databases that meet the information, accessibility and performance needs of all stakeholders. However, data movement is necessary to:
- Bring information from, or send information to, outside data producers or consumers.
- Move from the official record-of-origin to the official record-of-reference data store.
- Move from the official record-of-reference data store to an application software package database and back.
- Move from an enterprise record-of-reference data store to the analytic or strategic (operational data store or data warehouse) database.
- Move data as a one-time conversion from a legacy data store to the newly architected data store.
Second Rule of Quality by Design: Begin with the End in Mind
I frequently encounter teams working on data movement projects having serious difficulties trying to determine the best route to complete their design, construction and testing. When I ask about their approach, my first question is: May I see the design of the target database (or file)? Almost always, the response is "Well, the data model is under construction," or "We don't have it yet, but we are starting," or "The vendor will/has not provided it." The explanations can be diverse, but the one answer I am looking for is seldom given.
The best approach to design high quality data movement is always to start from the end product and work your way back to the sources. In some cases, it will be necessary to study the sources to learn about the information; this is part of the data design process. As the sources are studied, it is essential that the findings are well documented so that they can be reused when designing the data movement processes.
My experience indicates that you cannot develop a quality design for moving data to a complex structure without knowing, as John Zachman says, "…its specifications at the lowest level of detail." This means the target design must include, in addition to an entity-relationship diagram, all the specifications of all the tables, columns, foreign keys (relationships) and data integrity rules.
Without a complete specification of the target, you may be able to create a data movement process, it may even run fast, but it is very likely that is will fail to meet the customer needs and expectations.
Third Rule of Quality by Design: Design the Entire Data Movement
Many data movement efforts are deployed using a piecemeal approach; that is, the team sets out to design a data flow or other components of the overall data movement often due to the urgency to get to code something" or to "show progress." It is very seldom that the effort follows a holistic approach, and often the team runs into conflicts when all the pieces are brought together.
The process to design a quality data movement begins after the information needs have been translated to a target, fully attributed and completely specified data model – and right at the time the physical database design process begins.
The first step is to develop a general system design starting from the target database or tables back to the sources. The idea is to drive the design from the end state; however, this is an iterative process where you'll find yourself going back-and-forth. Keep in mind that the decisions and discussions must always be framed from the end-state point of view to avoid getting lost in the myriad of details. An example of a data movement general design is shown in Figure 1.
Figure 1: Data Movement General Design
As the general design takes shape, as a designer you will identify critical issues, such as missing sources, incompatible key structures, conflicting timeliness of data updates, etc. These and other issues will continue to be resolved as the detail design is developed. The resolution of these issues and challenges will help identify the placement of proper controls in the flow. The key principle in data movement is "accept no defect, create no defect, and pass no defect." This means that all incoming information will be validated for business rule compliance and that all data transformations are performed in accordance with the customers' expectations and specifications. An example of a complete data flow (from all sources to the destination tables) is shown in Figure 2.
Figure 2: Detail Data Flow Design
As the detail design emerges, the focus now turns to the specification of each of the processes in the detail design. It is here, at the process specification level that the designer can insert the proper controls to ensure process, event and defect information collection. An example of a context diagram (the first step in a process specification) is shown in Figure 3.
Figure 3: Process Specification Context Diagram
As the processes are detailed, changes are identified that affect the general design, the detail design of the data flow or other data flows, the context diagram of the process or the specifications of the process. The designer must be careful in always assessing the impact of each change to ensure the overall design maintains the proper level of quality. The final specifications must include the identification of each target data element, all the necessary source data elements, the transformation rules, and the validation and disposition rules. The designer must develop one complete specification for each process based on each output from the process. An example of a fragment specification is shown in Figure 4. The specification begins (1) with the output (the "Planned Transfer" table) of the process; followed by the (2) target columns; then (3) identifies the source files or tables; the (4) source columns or fields; then, (5) the data movement and transformation logic in declarative language; and finally, the (6) validation and disposition.
Figure 4: Data Movement Process Specifications
The number in parenthesis in the validation column (circled in Figure 4) is the event registration number (in this case it is a defect) associated with this condition. This is where the data movement design converges with the fourth rule of quality by design. You may ask how this leads to knowing the processes, event and especially the defects. Let's now look at the design approach for the "processes and events management" application and database. This approach can lead to a very successful and effective means of collecting and reporting processes, events and defects.
Fourth Rule of Quality by Design: Capture Information about Defects and Events
Process, event and defect are business entities just like customer and product, except that it is usually less obvious, complex and voluminous. However, the information consumers of the process, event and defect information are usually the people in the business areas that use this information to determine if the business information (e.g., customer or product) they are receiving is meeting their quality expectations.
For this reason, the database that will house this data must be designed in the same fashion as customer and product databases and the application that collects and reports the information needs to be developed with the proper controls to ensure quality of information. Following is a description of the approach used to deploy this mechanism.
The Information Model
The data model in Figure 5 represents the minimum necessary pieces of information identified to deploy an effective database for process and event information. This database design was originally conceived by one of my clients2 as a result of his frustration with the inability to identify and effectively correct defects. The original design was expanded and implemented as part of the deployment of a very large data mart. See the model in Figure 5; the areas in the model not discussed in this article are colored in gray.
Figure 5: Process and Event Management Database
The reader is reminded that this method can be applied to all kinds of data movement, not just data warehousing solutions. Most of these techniques were developed by the author when performing the conversion of a 1.5 million customer billing system from a file (batch environment) to a database (transactional and batch environment). This conversion was the first time in the history of the organization that the conversion of a system went live without one single conversion defect.
The core of event management is the Event Registry and the Event Log; however, the database also contains tables to register owners, operators, data containers and processes (programs, modules, applications, etc.). The "owner" enables system access and control. These tables are, in turn, supported by several reference tables. These reference tables enable the classification of events as follows:
- The types of events, such as invalid value, key not found, process begins, process ends and process aborts.
- The event notification identifies the person or distribution list that will be notified in case an event of this type takes place.
- The action to be taken when an application encounters the registered event. Examples include: information only (no action), corrected to default, rejected input, aborted process.
- The severity identifies the impact to the process by the registered event. Examples include: information only, warning (e.g., error detected and corrected to default or rejected), and critical (e.g., a processing errors occurred and the process was stopped or aborted).
Event Registry: As part of the development process, when an application or program identifies the need for an event, the event is registered in the Event Registry. For an event to be properly registered, the associated process, data container and data element must be already registered. Then the designer must record the message description. This description is what the operator of the process and the business analyst will see when the actual event is issued. This description should enable the reader to understand the condition and enable the reader to take corrective action.
Event Log: This is a very small footprint table that contains very little data, but when combined with the rest of the tables in the database, provides a wealth of information. The small footprint is to increase throughput and to reduce space requirements (this is the highest volume table for this database). In addition, in order for the system to support an application from its initial development through system test and production, the "Event Log" was created in three different tables and each table was placed in a different physical location. The event log is then presented to the application via views that use the production name to shield the applications from changes as they move from one environment to the next. In this way, the different environments will not negatively impact each other. For example, due to a programming error, an application in development may generate a very large number of event log entries, exhausting the available space for the log table and impacting all applications in its environment but not in the other environments. The most important environment to protect from this kind of condition is the production environment; this physical separation prevents these kinds of failures.
The Application Design
The overall design of the application showing all the components is presented in Figure 6. The DBMS catalog export enables the collection of data container and data element details for the database to reduce the need for manual entry. For processes using the process control (called "control customers"), there are several mechanisms for them to access and update controls (such as "last processed date"). For processes using the event management support (called "event customers"), there are several mechanisms to enable the issuance of an event (e.g., a SQL custom process can issue an "insert" to add a new row with the event).
Figure 6: Event & Process Management System Design
The Information Defects Alerts are voice or text messages sent to designated persons as soon as the registered event is inserted to the Event Log. They enable timely corrective action for information defects that, if undetected, can cause serious impact to the target database.
Figure 7: Web Application Front Page
The reports (XYZ tool, ad hoc and web-based) enable in-depth review of the process execution by presenting, in an actionable form, details of the information defects encountered.
The web application enables the manual collection and verification of information such as event registrations. Figure 7 shows a sample ¡§front page¡¨ of the web application.
Some tools, such as DataStage, are not able to output multiple rows for multiple events associated with one input row. DataStage offers some capabilities to overcome this constraint; the following is an example of how this was addressed3. In this environment, a DataStage job calls a reusable "exception handler." While validating the input row, the process records each event encountered into a set of internal variables. Once the input row ends processing, the job calls the handler to transform the variables into one XML record that is written to a DataStage internal file. See Figure 8 for an example.
Figure 8: Sample DataStage Job Calling the Exception Handler
Once all the processes complete for the data movement, a final DataStage job is executed to read the XML records from the work file, break out each record into multiple event records (performed by the "input exceptions" step) and record them into the event log. See Figure 9.
Figure 9: Event Log Writer
Analysis & Corrective Action
Once this is in place and the customer application is executed, analysis can be performed on the resulting event log. The following are but a few examples of the insight we were able to obtain using this method.
Figure 10 shows a small sample of the data produced at the end of the weekend process.
Figure 10: Sample End-of-Week Process Report
This report is only the beginning of the analysis of the process. Here we could see that a large number of rejects are taking place due to invalid dates followed by the hours of production line availability. This gave the team a starting point to begin the correction of defects. Because each defect identifies its process, and provides a description of the problem, the team was able to locate and correct these issues very fast and with a high degree of confidence.
Figure 11 shows the trend of events (defects) found over the initial deployment of a newer component for the data mart. The numbers on the horizontal axis of the chart indicate the deployment week (from the initial deployment), and the vertical axis shows the number of events generated over the weekend process.
Figure 11: Events Per Week (Initial Deployment)
At first the event log indicated that millions of events were generated, as expected. Some of the events were caused by erroneous assumptions on the quality of the information to be received. Some were caused by misinterpretation of the business rules by either the business analysts using the information, the systems analysis receiving the requirements, or the application programmer implementing the data integrity checks.
Having a record of each event, its frequency and its location within the code, enables the team to quickly identify and correct defects as they emerge. For instance, events that occurred in the millions during the first three weeks were removed or reduced to less than a million and a half by week four and to less than 350,000 by week 16 (see Figure 11). Once a defect is identified as an issue, test cases and scenarios can be identified from the event details (see the "Event Value ID"); see Figure 12 for an example.
Figure 12: Sample Event Detail Report
The value ID is the information necessary to locate the specific record or row that has failed the validation. Once a sample is selected, the business analysis can trace back to the sources to determine the proper action plan to either correct the information, correct the validation, etc.
Figure 13: Summary of Events Per Week – After High Volume Corrections
At the same time that the high volume (and high impact) defects were removed, the team also removed high impact with lower volume. For example, lacking the line availability impeded proper analysis in the correlation of planned versus actual production. Figure 14 shows the trend once the team corrected the sources of the line availability defects. The source system was not creating the proper values; therefore, the data mart was unable to use the information. It was reduced from 400-500 range per week to one and later on to zero. This is significant because this is a very small reference table; one defect impacts millions of facts and impairs analysis.
Figure 14: Line Availability Defects Per Week
Another example was the correction of the mismatching forecasts; the system was to merge forecasts from two separate planning departments. Both departments provided their specifications and were involved in the specification of the matching process.
Figure 15: Mismatched Forecasts Weekly Events
However, there were assumptions made by both teams that proved to be incorrect. As defects continued to show up, both areas came back together to revisit the matching process. The chart shows that even though efforts were made to remove the defects, they continued to show; and in one occasion, they actually increased after a change was made. Once the situation was identified and corrected, the volume of rejects was reduced to the expected level. See Figure 15 for the trend analysis.
Measuring a process is necessary to continually improve it. However, measuring does not stop with measuring defects. Other measures may include performance, breakdowns, etc. In this case, part of the original requirements was that this process (78 programs in 4 main streams) was to run within the weekend time frame (elapsed time not to exceed 24 hours). However, when the first weekend run ended (several days after Sunday!), the overall time was more than 60 hours. This was somewhat disheartening; one of our developers said, "This process is never going to run in 24 hours." The team focused on the performance issues at the same time as it addressed the quality issues. Using the parallel extender capability of the ETL tool, they were able to decrease the processing time to well below the required window (see Figure 16). These efforts continued past the 22nd week, and today this data movement process runs in well below 8 hours.
Figure 16: Running Time Per Week
Having this information readily available (first thing Monday morning) and having special alerts sent to key personnel enabled the business and IT areas to stay in front of the issues and increased the business areas' trust in the information. They knew the defects and their volume and were able to handle the situations before applying the data. Even though the launch of this data mart was not without glitches, the issues were known, controlled and removed effectively.
This responsive environment fostered a team spirit that has endured to this day. The data mart is stable and working very effectively. The assurance that when something goes wrong it will be noticed before Monday morning and that the group will come together to address the problem has created a strong sense of commitment on the part of the business to using the data mart, and this translates into a great working atmosphere for the systems people who support them.
Applying the four rules of "Quality By Design" to your data movement will lead to a more effective, less costly and more durable information environment with a higher level of information quality and customer satisfaction. Data movement may be undesirable due to all of its shortfalls, but by applying these rules, you will be able to overcome many of these shortcomings and increase your customers' satisfaction. Next time you embark on a data movement endeavor, take a moment to plan it using these rules to ensure a successful and satisfying experience.
I hope that you find this article useful. What do you think? Please let me know at firstname.lastname@example.org.
- Larry P. English, Improving Data Warehouse & Business Information Quality, p. 116.
- Many thanks to Mr. Bryan Smith, Data Warehouse Architect, for developing the initial draft of this excellent method and data design for event capture.
- Many thanks to Mr. Kevin Worrell, DataStage Specialist, for developing the DataStage part of the solution described in this article.