This article originally appeared on the BeyeNETWORK.
I have discussed in previous articles how organizations are becoming increasingly interested in using business intelligence, not only for strategic planning and tactical analysis, but also for driving intra-day business decision making. I use the term operational right-time business intelligence to describe the business intelligence used for this intra-day decision making. The word operational signifies that the business intelligence is being used to optimize everyday operational business processes, and the word right-time denotes that this optimization occurs at a frequency that matches business needs. Right-time may vary, for example, from right now (i.e., real-time) to several minutes or hours.
There are three types of right-time business intelligence processing: right-time data integration, right-time reporting and performance management, and right-time decisions and actions. The first type is concerned with gathering data in a timely fashion for analysis, whereas the latter two types are about analyzing the data, making business decisions, and taking actions in a timely manner. In this article, I talk about the first type, i.e., right-time data integration and the technologies that can be used to support it.
Right-Time Data Consolidation
Traditional data warehousing involves running regular, usually batch, ETL processes that extract data from operational data sources and transform and load the extracted data into a data warehouse. ETL processes can be thought of as doing data consolidation. Data replication is another technology that can be used for data consolidation. A third approach is enterprise content management (ECM), which consolidates unstructured and semi-structured content (documents, rich media, Web data) into a content repository. All three types of data consolidation move or copy data from a data source to a data target. The objective of data consolidation is to provide a shareable, clean, consistent, integrated and managed view of data for business users.
One approach to doing right-time data integration is to consolidate data on a timelier basis. Many companies, for example, have introduced the operational data store (ODS) and operational data marts into their data warehouse architecture for handling integrated right-time operational data. These operational data stores are usually maintained by event-driven and right-time ETL (RT-ETL) tools.
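As a rough illustration of how a right-time ETL job might keep an operational data store current, the sketch below copies only the rows that have changed since the last run. It is a minimal example rather than a description of any particular RT-ETL product. The table names (orders, ods_orders, sync_state), the change_seq column and the sync_changes function are all hypothetical, and it assumes the source system stamps every change with an increasing sequence number.

```python
import sqlite3


def make_source():
    # Stand-in for an operational system (hypothetical schema).
    src = sqlite3.connect(":memory:")
    src.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, status TEXT, change_seq INTEGER)"
    )
    return src


def make_ods():
    # Stand-in for the operational data store plus its sync bookmark.
    ods = sqlite3.connect(":memory:")
    ods.execute(
        "CREATE TABLE ods_orders (order_id INTEGER PRIMARY KEY, status TEXT)"
    )
    ods.execute("CREATE TABLE sync_state (last_seq INTEGER NOT NULL)")
    ods.execute("INSERT INTO sync_state VALUES (0)")
    return ods


def sync_changes(src, ods):
    """Copy rows changed since the last run; return how many were applied."""
    (last_seq,) = ods.execute("SELECT last_seq FROM sync_state").fetchone()
    rows = src.execute(
        "SELECT order_id, status, change_seq FROM orders "
        "WHERE change_seq > ? ORDER BY change_seq",
        (last_seq,),
    ).fetchall()
    for order_id, status, seq in rows:
        # Insert new orders and overwrite ones that already exist in the ODS.
        ods.execute(
            "INSERT OR REPLACE INTO ods_orders (order_id, status) VALUES (?, ?)",
            (order_id, status),
        )
        last_seq = seq
    # Remember how far we got so the next run starts from here.
    ods.execute("UPDATE sync_state SET last_seq = ?", (last_seq,))
    ods.commit()
    return len(rows)
```

Calling sync_changes every few seconds, or whenever the source signals a change event, sets the latency of the data store. The shorter the interval, the closer the store gets to real time and the more sustained load the consolidation tooling has to carry.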
The use of operational data stores in data warehousing projects has accelerated over the last few years. This has increased the complexity of data integration projects, and often results in the copying of large amounts of operational data, which in turn leads to very large data stores. It also requires data consolidation products that are capable of supporting sustained and high-volume right-time processing. This type of right-time environment can be expensive to deploy and maintain.
Right-Time Data Federation
Another approach to right-time data integration is to use a federated data approach, which is often easier and more cost effective than data consolidation for certain types of applications. Data federation provides the ability to present a single logical view of dispersed data to an application, without the need to physically copy or move the data into a consolidated data store.
In its basic form, access to federated data involves breaking down a federated query into subcomponents, and sending each subcomponent for processing to the location where the required data resides. The federated query server then combines the results and sends a reply to the application that issued the query. Data federation is provided by enterprise information integration (EII) software. EII products vary considerably in the features they offer. One key area where products differ, for example, is performance and query optimization. Other areas include support for unstructured data, and for business data in application packages from vendors such as SAP and Siebel.
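The basic mechanism described above can be sketched in a few lines. This is a toy illustration, not how any EII product is implemented. Two in-memory SQLite databases stand in for a CRM system and an order system, and the schemas, sample rows and the revenue_by_region function are all assumptions made for the example.

```python
import sqlite3
from collections import defaultdict


def build_sources():
    # Source 1: a hypothetical CRM system holding customer attributes.
    crm = sqlite3.connect(":memory:")
    crm.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
    )
    crm.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
    )
    # Source 2: a hypothetical order system holding transactions.
    orders = sqlite3.connect(":memory:")
    orders.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    orders.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 100.0), (11, 2, 50.0), (12, 3, 25.0), (13, 1, 10.0)],
    )
    return crm, orders


def revenue_by_region(crm, orders):
    """Answer 'revenue by region' without copying either source."""
    # Subquery 1 goes to the CRM system.
    region_of = dict(crm.execute("SELECT customer_id, region FROM customers"))
    # Subquery 2 goes to the order system, which aggregates locally so that
    # only one row per customer travels back to the federation layer.
    totals = orders.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    # The federation layer joins and combines the partial results.
    combined = defaultdict(float)
    for customer_id, total in totals:
        region = region_of.get(customer_id)
        if region is not None:
            combined[region] += total
    return dict(combined)


crm, orders = build_sources()
print(revenue_by_region(crm, orders))
```

Even in this small sketch, the choice of which work to do at the source and which to do in the federation layer determines how much data moves across the network, which is why query optimization is such a differentiator between products.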
The objective of EII is to enable business users to see all of the information they need as though it resided in a single database. EII shields business users and applications from the complexities of retrieving data from multiple locations, where the data may differ in semantics and format and may be accessed through different APIs.
It is important to emphasize that data federation cannot replace the traditional data consolidation approach used for data warehousing. A fully federated, or virtual, data warehouse is not recommended for reasons of performance and data consistency. Data federation should be used instead to extend and enhance a data warehousing environment to address specific business needs.
Data federation is a powerful approach for solving certain types of data access problems, but it is essential to understand the trade-offs of using federated data. One issue is that federated queries may need access to an operational business transaction system. Complex EII query processing against such a system can impact the performance of the operational applications running on that system. With the federated data approach, this impact can be reduced by sending only simple and specific queries to the operational system.
Another potential problem with data federation is how to logically and correctly relate data warehousing information to the data in operational and remote systems. This is similar to the problem that must be addressed when designing the ETL processes for building a data warehouse. The same detailed analysis and understanding of the data sources and their relationships to the targets is required. Sometimes, it will be clear that a data relationship is too complex, or the source data quality too poor, to allow federated access. Data federation does not, in any way, reduce the need for detailed modeling and analysis. It may in fact require more rigor in the design process, because of the right-time nature of data transformation and cleanup in a federated environment.
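One way to keep the operational query simple and specific is to do the heavy filtering in the data warehouse first and then send the operational system only an indexed lookup for the keys of interest. The sketch below assumes this pattern. The tables (customer_history, open_orders), the sample data and the open_orders_for_top_customers function are hypothetical, and in-memory SQLite databases stand in for both systems.

```python
import sqlite3


def build_demo():
    # Stand-in for the data warehouse (historical information).
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE customer_history (customer_id INTEGER, lifetime_value REAL)"
    )
    warehouse.executemany(
        "INSERT INTO customer_history VALUES (?, ?)",
        [(1, 9000.0), (2, 150.0), (3, 12000.0)],
    )
    # Stand-in for the live operational transaction system.
    operational = sqlite3.connect(":memory:")
    operational.execute(
        "CREATE TABLE open_orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)"
    )
    operational.execute(
        "CREATE INDEX idx_open_orders_customer ON open_orders (customer_id)"
    )
    operational.executemany(
        "INSERT INTO open_orders VALUES (?, ?, ?)",
        [(100, 1, "picking"), (101, 2, "packed"), (102, 3, "picking")],
    )
    return warehouse, operational


def open_orders_for_top_customers(warehouse, operational, min_value):
    """Combine warehouse history with live order status."""
    # The analytical part of the query runs against the warehouse.
    ids = [
        row[0]
        for row in warehouse.execute(
            "SELECT customer_id FROM customer_history "
            "WHERE lifetime_value >= ? ORDER BY customer_id",
            (min_value,),
        )
    ]
    if not ids:
        return []
    # The operational system only sees a narrow, indexed key lookup.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT order_id, customer_id, status FROM open_orders "
        f"WHERE customer_id IN ({placeholders}) ORDER BY order_id"
    )
    return operational.execute(sql, ids).fetchall()
```

The operational system never has to scan or aggregate anything here. It answers a lookup of the kind its own applications already issue, which is the sort of request it is tuned for.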
When to Use Data Federation
The following is a list of circumstances when data federation would be an appropriate approach to consider:
- Right-time access to rapidly changing data. Making copies of rapidly changing data in operational systems can be costly, and there will always be some latency in the process. Data federation can be used to directly access the live operational source data. The performance and security aspects of accessing the live data, however, must be considered carefully. A federated query can also be used to access both live data and historical data warehouse information in the same query.
- Direct write access to the original data. Working on a data copy is generally not advisable when there is a need to modify the data, as data integrity issues between the original data and the copy can occur. Even if a two-way data consolidation tool is available, complex two-phase locking schemes are required.
- It is difficult to copy the original source data. When users require access to widely heterogeneous data and content, it may be difficult to bring all the structured and unstructured data together in a single consolidated copy.
- The cost of copying the data exceeds that of accessing it remotely. The performance impact and network costs associated with querying remote data must be compared with the network, storage and maintenance costs of consolidating data in a single store. In some cases, there will be a clear case for a federated data approach, for example when source data volumes are too large to justify copying them, or when only a small percentage of the copied data is ever used.
- It is forbidden to make copies of the source data. Creating a copy of source data that is controlled by another organization may be impractical for security, privacy or licensing reasons.
- The needs of users are not known in advance. Allowing users immediate and self-service access to data is an obvious argument in favor of data federation. Caution is required here, however, because of the potential for users to create queries that give poor response times and negatively impact both source system and network performance. In addition, because of semantic inconsistencies across data stores within organizations, there is a risk that such queries could return incorrect answers.
When to Use Data Consolidation
The arguments in favor of data consolidation are the opposite of those for data federation:
- Read-only access to reasonably stable data is required. Creating regular copies of a data source isolates business users from the ongoing changes to source data, and enables data to be fixed at a certain moment in time to allow detailed analyses to be done.
- Users need historical or trend data. Historical and trend data is seldom available in operational data sources, but this data can be built up over time through the data consolidation process.
- Data access performance and availability are key requirements. Users want fast access to local data for complex query processing and analysis.
- User needs are repeatable and can be predicted in advance. When queries are well-defined, repeated and require access to only a known subset of the source data, it makes sense to create a copy of the data in a consolidated data store for access and use. This is particularly the case when a group of users may need a view of the data that differs substantially from the way the data is stored in the source systems.
- Data transformation and cleanup is complex. It is inadvisable for performance reasons to have complex processing done, in-line, as a part of a federated data query.
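Two of the points in the list above, fixing data at a moment in time and building up history that the source does not keep, can be illustrated with a simple periodic snapshot. This is a minimal sketch under assumed names. The inventory and inventory_history tables and the take_snapshot and trend functions are hypothetical, and the source is assumed to hold only current values.

```python
import sqlite3


def make_stores():
    # Operational source: holds only the current stock level per SKU.
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INTEGER)"
    )
    # Consolidated store: one row per SKU per snapshot date.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE inventory_history ("
        "snapshot_date TEXT, sku TEXT, on_hand INTEGER, "
        "PRIMARY KEY (snapshot_date, sku))"
    )
    return source, warehouse


def take_snapshot(source, warehouse, snapshot_date):
    """Copy the current source state into the history table."""
    rows = source.execute("SELECT sku, on_hand FROM inventory").fetchall()
    # Re-running a snapshot for the same date overwrites it, not duplicates it.
    warehouse.executemany(
        "INSERT OR REPLACE INTO inventory_history VALUES (?, ?, ?)",
        [(snapshot_date, sku, qty) for sku, qty in rows],
    )
    warehouse.commit()
    return len(rows)


def trend(warehouse, sku):
    """Return the stock level over time, which the source cannot provide."""
    return warehouse.execute(
        "SELECT snapshot_date, on_hand FROM inventory_history "
        "WHERE sku = ? ORDER BY snapshot_date",
        (sku,),
    ).fetchall()
```

Once a snapshot has been taken, analysts can query it repeatedly and get the same answer, regardless of what has happened in the source system since.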
You can see, then, that both data consolidation and data federation have a role to play in data warehousing and right-time data integration, and that companies will need to implement both technologies. Rather than buying two separate products, one for data consolidation and one for data federation, you should start looking at vendors that support both in a single integrated product with shared metadata. This trend is beginning to emerge in the market, and to be successful in the future, data integration vendors will need to offer both capabilities in a single product.