This article originally appeared on the BeyeNETWORK
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging undertaking. For one thing, very little is written about business intelligence QA. Practitioners within the business intelligence (BI) community appear to be more interested in discussing data quality issues and data cleansing solutions. However, data quality does not make for BI quality assurance, and practitioners within the software QA discipline focus almost exclusively on application development efforts. They do not seem to appreciate the unique aspects of quality assurance in a data warehouse/business intelligence environment.
What is Software Quality Assurance?
Quality assurance means different things to different individuals. To some, QA means testing, but quality cannot be tested at the end of a project. It must be built in as the solution is conceived, evolves and is developed. To some, QA resources are the “process police” – nitpickers insisting on 100% compliance with a defined development process methodology. Rather, it is important to implement processes and controls that will really benefit the project.
Quality assurance consists of a planned and systematic pattern of the activities necessary to provide confidence that a solution conforms to established requirements. Testing is just one of those activities.
In the typical software QA methodology, the key tasks are:
- Articulate the development methodology for all to know
- Rigorously define and inspect the requirements
- Ensure that the requirements are testable
- Prioritize based on risk
- Create test plans
- Set up the test environment and data
- Execute test cases
- Document and manage defects and test results
- Gather metrics for management decisions
- Assess readiness to implement
Quality Assurance for Business Intelligence is Unique
It is in the second point above – rigorously define and inspect the requirements – where we run into the first problem with QA in a BI setting. QA practitioners frequently approach the requirements definition task with a behavioral point of view. “If I do A, B, C, D and E, then money will come out of the ATM machine.” In the typical application development effort, the QA analysts define hundreds of operational, “if X…then Y” requirements.
But business intelligence is not like this. It is an enabling technology, allowing non-technical business end users to perform exploratory data analysis. The requirement in a DW/BI setting is, “The data loaded into the target must balance to the data from the data source.” If we focus on the operational action and response, what we wind up doing is spending most of our time testing the BI tool, rather than validating the data population process. This is a waste of time and effort.
The BI tool must be viewed as a pre-validated component of the data warehouse architecture. If you were implementing a budgeting application, you would not test the features of Excel. Rather, you would test and validate the population of the spreadsheet. The same is true of a data warehouse. We need to test the population process, not the user interface provided by the tool vendor.
Another unique aspect of a data warehouse effort is the presence of slowly changing dimensions. It is not sufficient to simply test the data load process. Dimension attributes can change over time. Test plans must be created and test cases executed to ensure that the data warehouse remains viable as the dimension attributes change. Each of these unique aspects will be addressed in the following sections.
Validate the Target to the Data Source
Data population of the target data warehouse structure is typically conducted by highly trained data integration development specialists using a specialized extraction, transformation and load (ETL) software tool. Typically, the QA analysts do not need to use or be familiar with the ETL tool. Rather, the QA resources need to be fully competent at accessing and analyzing a variety of types of source data. This includes reading mainframe files, flat files, and writing and running SQL queries. The requirement here is to replicate the transformation logic that was used in the ETL process so that the target data warehouse can be validated against the original data source. In more sophisticated warehouses, SQL alone does not have enough functionality “to replicate the transformations” in the ETL process. In those cases, other techniques must be used.
Since the data warehouse contains historical data beyond the scope of the operational databases that feed the data warehouse, the validation process needs to be time-bounded based upon the data source. Typically, the validation routines are limited to the verification of the prior day’s results. This is done by creation of SQL queries that aggregate key measures in the data warehouse and compare them to aggregates taken from the data sources. When an out-of-balance situation is detected, warnings need to be generated and sent to the data warehouse data steward.
By embedding these SQL scripts in the nightly data warehouse population and update process, we implement a process that is self-monitoring. The initial data warehouse population is validated, and the update process cycle itself is validated on a nightly basis. In our experience, features like this have been instrumental in creation of data warehouses that have withstood the scrutiny of external audits, including SEC audits.
Slowly Changing Dimensions
The attributes of dimension tables will tend to change occasionally over time. For example, Wisconsin Electric Power Company changed its name to WE Energies. This is an example of something that normally is treated as a Type 1 change. In this case, we normally update the dimension table entry for the entity, but other changes need to be tracked historically. In this case, we use Type 2 slowly changing dimension (SCD) logic. In these cases, we insert new dimension table rows.
The QA tasks related to SCD logic fall into two classes. First, during the design and data modeling process, the QA analyst should review and evaluate the appropriateness of the SCD definitions that are included within the data warehouse design. The QA analyst needs to be fully immersed in the process followed in acquiring the business objectives and requirements, in order to provide an extra set of eyes validating SCD design decisions.
During the data warehouse development process, the QA resources need to create SCD test plans and then execute those test plans. The execution can be a difficult undertaking. It may require the QA resource to manipulate the contents of both the target data warehouse and the development instance of the data source. During integration and user acceptance testing, test plans should be created spanning multiple data warehouse update cycles in which the business users apply changes that can be tested end-to-end.
Consequently, this environment is demanding in the SQL skills and other technology skills required for an effective QA resource. Specialized training and methods are required to ensure that the QA function addresses the unique aspects of the DW/BI development effort.
QA Challenges from Vast Amounts of Source Data
An additional challenge is provided by the vast amounts of source data that are stored in many current data warehouses. Business users will not “believe” what they see in the data warehouse until they feel comfortable that the data is complete and accurate. It is extremely important to test each ETL process thoroughly. There are several strategies that can be used to deal with this issue that will be addressed in a future article.
Characteristics of an Effective Software QA Culture for Business Intelligence and Data Warehousing
An effective software QA culture should be ingrained within each DW/BI project. It should have the following characteristics:
- QA goals and objectives should be defined from the outset of the project.
- The role of QA should be clearly defined within the project organization.
- The QA role needs to be staffed with talented resources, well trained in the techniques needed to evaluate the data in the types of sources that will be used.
- QA processes should be embedded to provide a self-monitoring update cycle.
- QA activities are needed in the requirements, design, mapping and development project phases.