Yes, real-time data integration, or even near-real-time integration, is a powerful way to expand the capabilities of a data warehouse environment. Numerous business analysis capabilities are feasible only with low-latency data delivery. We see a lot of data warehouse systems move into the center of a company’s business operations once they’ve implemented a near-real-time extract, transform and load (ETL) process.
I don’t believe that data cleansing and other data quality functions are any more of a problem with a near-real-time environment than they are with a traditional higher-latency change data capture environment. The challenge in cleansing and correcting data really relates to the details that are available at the time of data creation or capture.
If your data quality problems involve measurable or quantitative errors in well-defined content (such as addresses, product descriptions and location IDs), or a need to standardize records, there are numerous technologies that can be applied to address those issues while supporting a near-real-time process. Obviously, for tools supporting automated data improvement or correction, you need to define rules and logic to correct the information.
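As a rough illustration of what such rules can look like, here is a minimal Python sketch that standardizes an address and a location ID as each record passes through the pipeline. The field names, the abbreviation table and the `LOC-0042` ID format are assumptions made for the example, not features of any particular data quality product.

```python
import re

# Hypothetical lookup table; a real deployment would use a fuller reference list.
STREET_ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def standardize_address(address: str) -> str:
    """Collapse extra whitespace and expand a trailing street-type abbreviation."""
    words = address.split()
    if words:
        key = words[-1].rstrip(".").lower()
        words[-1] = STREET_ABBREVIATIONS.get(key, words[-1])
    return " ".join(words)


def standardize_location_id(location_id: str) -> str:
    """Normalize IDs such as ' loc-42 ' to 'LOC-0042' (an assumed target format)."""
    match = re.fullmatch(r"\s*loc[-_ ]?(\d+)\s*", location_id, flags=re.IGNORECASE)
    if match is None:
        # No rule applies: surface the record rather than guess at a correction.
        raise ValueError(f"unrecognized location ID: {location_id!r}")
    return f"LOC-{int(match.group(1)):04d}"


def clean_record(record: dict) -> dict:
    """Apply the rules to one record and return a corrected copy."""
    cleaned = dict(record)
    cleaned["address"] = standardize_address(record["address"])
    cleaned["location_id"] = standardize_location_id(record["location_id"])
    return cleaned
```

Because each rule works on a single record and needs no human decision, rules of this kind can run inline without adding meaningful latency.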
Qualitative or accuracy problems in data content are harder to address. In most companies, this can’t be automated through a set of rules; it requires a manual review of the data. And the moment that an IT worker must step into the ETL workflow, the opportunity for near-real-time delivery goes out the window.
Companies that want to support business processes requiring near-real-time data typically address data quality in a series of incremental steps:
- Limit near-real-time data to business processes that don’t require perfect data. Most real-time data needs aren’t specific to an individual transaction but require access to information to support aggregate details (for example, average call duration or quantity of calls in a corporate call center).
- Review the ability of your operational systems to correct data content. While it may not be possible to address data quality issues during the data capture or creation process, it’s fairly common to correct errors prior to the extract activity.
- Deliver data to the analytics platform in near-real-time and flag it as “not inspected”. A post-load data qualification or acceptance process can occur when practical, and the flag can be modified to “inspected” at the completion of that process.
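The flagging step in the last item, combined with the aggregate use case in the first, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `CallRecord` type, the in-memory list standing in for the warehouse, and the "duration must not be negative" acceptance check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NOT_INSPECTED = "not inspected"
INSPECTED = "inspected"


@dataclass
class CallRecord:
    call_id: int
    duration_seconds: float
    inspection_status: str = NOT_INSPECTED  # every row arrives unverified


def load(warehouse: list, incoming: list) -> None:
    """Near-real-time load: append rows immediately, flagged as not inspected."""
    for call_id, duration in incoming:
        warehouse.append(CallRecord(call_id, duration))


def qualify(warehouse: list) -> int:
    """Post-load acceptance pass, run when practical.

    Rows that pass the (hypothetical) check are marked inspected;
    rows that fail keep their original flag for later review.
    """
    accepted = 0
    for row in warehouse:
        if row.inspection_status == NOT_INSPECTED and row.duration_seconds >= 0:
            row.inspection_status = INSPECTED
            accepted += 1
    return accepted


def average_duration(warehouse: list, inspected_only: bool = False) -> Optional[float]:
    """Aggregate metric; callers choose whether unverified rows are acceptable."""
    durations = [
        row.duration_seconds
        for row in warehouse
        if not inspected_only or row.inspection_status == INSPECTED
    ]
    return sum(durations) / len(durations) if durations else None
```

A dashboard that tolerates imperfect data can call `average_duration(warehouse)` as soon as rows land, while a report that needs accepted data passes `inspected_only=True` and sees only rows that have cleared the acceptance pass.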
This was first published in September 2010