Streaming, connectivity new keys to data integration architecture

The growing use of the cloud, big data and data from outside sources has complicated data integration, making the addition of data streaming and broader connectivity a must.

Though never simple, the data integration process used to at least be relatively straightforward.

The development of a data warehouse or a set of data marts to pull together information from different source systems was driven by the need for better reporting and analysis. But, initially, both of those activities were largely operational in nature -- operational reporting to let business managers know what was going on in an organization, and operational analytics to examine the factors driving the current state of affairs and evaluate opportunities for improvement.

Because those were the objectives, most data warehouse environments were populated with transaction data originating from systems inside a company. That meant a data integration architecture dealt mostly, if not only, with extracting data from internal systems, moving it to a staging area, transforming it into a consistent format and then loading the consolidated data into the target data warehouse or data mart.
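The conventional flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems, their field names and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the target data warehouse.

```python
import sqlite3

# Hypothetical rows from two internal systems that format data differently.
ORDERS_SYSTEM = [{"cust": "ACME ", "amount": "120.50", "date": "2016-03-01"}]
BILLING_SYSTEM = [{"customer_name": "acme", "total_cents": 9900, "day": "03/02/2016"}]


def extract():
    """Copy raw rows into a staging list, tagged with the system they came from."""
    staging = [("orders", row) for row in ORDERS_SYSTEM]
    staging += [("billing", row) for row in BILLING_SYSTEM]
    return staging


def transform(staging):
    """Convert staged rows to one format: (customer, amount, ISO date)."""
    consistent = []
    for source, row in staging:
        if source == "orders":
            customer = row["cust"].strip().lower()
            amount = float(row["amount"])
            sale_date = row["date"]
        else:
            customer = row["customer_name"].strip().lower()
            amount = row["total_cents"] / 100
            month, day, year = row["day"].split("/")
            sale_date = f"{year}-{month}-{day}"
        consistent.append((customer, amount, sale_date))
    return consistent


def load(rows, conn):
    """Load the consolidated rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load(transform(extract()), conn)
```

Calling `run_etl(sqlite3.connect(":memory:"))` would populate the `sales` table with both rows in a single, consistent format.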

However, many organizations are now extending their analytics data domains beyond traditional transaction processing systems and their own corporate walls -- a transition that's motivated by these factors:

Cloud computing. The economic and IT management benefits of using cloud-based services, combined with the improved security and data protection provisions built into cloud platforms, make the cloud an increasingly viable alternative to on-premises system deployments.

For example, the emergence of software as a service (SaaS) vendors offering customer relationship management tools and other business applications via the cloud can simplify implementation and reduce the time it takes to bring the applications online -- and to get business value from them. The next step up from SaaS is migrating entirely to a database platform running in a hosted cloud environment. Platform as a service (PaaS) setups can similarly lower costs, speed time to value and enable organizations to take advantage of a PaaS vendor's IT expertise.

Big data. Hadoop, Spark and other emerging big data platforms make it more feasible to incorporate unstructured and semi-structured data into analytics applications. Among other things, that includes internet clickstream data, system and application log files, satellite images and sensor data collected from devices connected to the internet of things.

External data. Increasingly, organizations are augmenting their own data with information from a wide variety of outside sources, including open data initiatives, data aggregators, analytics services providers and streaming feeds of social media, stock market, weather, traffic and other types of real-time data.

Together, those three factors have fundamentally changed data integration. As enterprise data extends beyond its traditional boundaries, modern data integration approaches must combine data streaming with conventional data extraction and loading, followed by harmonization and transformation. In addition to more data sources, there may be new target systems on the receiving end -- for example, Hadoop clusters and NoSQL databases in big data environments.

More called for on data integration

As a result, there's a need to augment a conventional data integration architecture to include better support for cross-platform data connectivity. Instead of custom-coding the routines for accessing and extracting data from each source, explore the use of data integration tools that have built-in connectivity to a wide range of source and target systems.

Such tools provide mechanisms for accessing commonly used database systems, with parameters that let data integration developers create data access queries using a limited set of details -- such as the name of the database, the server on which it runs and the tables where the relevant data is located. In some cases, the tools allow you to write SQL-compliant queries, even if the underlying data source isn't a SQL database; the queries are transformed into a set of native access routines to pull out the data and configure it as a SQL-style result set.
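To make the idea concrete, here is a toy sketch of both mechanisms: a query built from a limited set of details, and a SQL-style statement translated into native access against a non-SQL source. Everything in it is an assumption made for illustration. `SourceSpec`, `build_query`, `DOCUMENT_STORE` and `run_sql_on_documents` are hypothetical names, and the translator handles only a single `SELECT ... FROM ... WHERE field = 'value'` pattern, far less than a commercial tool would support.

```python
import re
from dataclasses import dataclass


@dataclass
class SourceSpec:
    """The limited set of details a developer supplies (hypothetical shape)."""
    server: str
    database: str
    table: str


def build_query(spec, columns):
    """Generate a data access query from the spec; the server name would be
    used to open the connection, which this sketch does not do."""
    return f"SELECT {', '.join(columns)} FROM {spec.database}.{spec.table}"


# Stand-in for a non-SQL source: named collections of JSON-like documents.
DOCUMENT_STORE = {
    "customers": [
        {"name": "Ann", "region": "east"},
        {"name": "Bo", "region": "west"},
        {"name": "Cy"},  # documents need not share the same fields
    ]
}

_SELECT = re.compile(
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<coll>\w+)"
    r"(?:\s+WHERE\s+(?P<field>\w+)\s*=\s*'(?P<value>[^']*)')?\s*$",
    re.IGNORECASE,
)


def run_sql_on_documents(sql, store=DOCUMENT_STORE):
    """Translate a very restricted SELECT into native document lookups and
    return the matches as a SQL-style result set (a list of tuples)."""
    match = _SELECT.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query: " + sql)
    columns = [col.strip() for col in match["cols"].split(",")]
    documents = store[match["coll"]]
    if match["field"] is not None:
        documents = [d for d in documents if d.get(match["field"]) == match["value"]]
    # Missing fields become None, much as NULL would appear in a result set.
    return [tuple(d.get(col) for col in columns) for d in documents]
```

For example, `run_sql_on_documents("SELECT name, region FROM customers WHERE region = 'east'")` returns rows shaped like a relational result even though the underlying source is a collection of documents.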

Stream processing engines are another key addition to a modern data integration architecture. They collect streaming data within defined, real-time windows and enable it to be reconfigured and prepared for loading into Hadoop or other analytics systems.
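The windowing behavior can be illustrated with a small generator. It is a simplified sketch, not how any particular stream processing engine is implemented: it assumes events arrive in timestamp order as hypothetical `(timestamp, key, value)` tuples, and it uses fixed, non-overlapping (tumbling) windows.

```python
def tumbling_windows(events, window_seconds):
    """Yield one summary per fixed-size time window.

    Assumes events are (timestamp, key, value) tuples in timestamp order.
    A window is emitted as soon as an event for a later window arrives.
    """
    current_start = None
    totals = {}
    for timestamp, key, value in events:
        start = int(timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is closed: hand it off for loading.
            yield {"window_start": current_start, "totals": totals}
            totals = {}
        current_start = start
        totals[key] = totals.get(key, 0) + value
    if current_start is not None:
        # Flush the final, still-open window when the stream ends.
        yield {"window_start": current_start, "totals": totals}
```

Each yielded dictionary is a reconfigured, load-ready summary of one window. A production engine would also need to handle late or out-of-order events, which this sketch ignores.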

Staying in sync gets more complicated

Incorporating these new techniques does have implications for data synchronization and consistency. In the conventional extract, transform and load approach to data integration, all of the data is brought to a staging area and synchronized as the data sets are processed in preparation for loading into the target system. However, as the number of origination points expands and the speed at which data is produced and delivered increases, it becomes more challenging to manage the synchronization process.

One option is a modified real-time batching approach: collect streaming data in a virtual holding pen, then coordinate its migration to the target system with data from other sources according to predefined synchronization criteria. That enables you to enforce data consistency, timeliness and currency policies for subsequent analysis and reporting uses.
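One way to picture the holding pen is the sketch below. The class name, the batch identifiers and the two synchronization criteria it applies are all assumptions chosen for illustration: a batch is released when every required source has contributed to it, or when it has waited longer than a maximum age.

```python
class HoldingPen:
    """Buffers records per batch until the synchronization criteria are met."""

    def __init__(self, required_sources, max_age_seconds):
        self.required = set(required_sources)
        self.max_age = max_age_seconds
        self.pending = {}     # batch_id -> {source: [records]}
        self.first_seen = {}  # batch_id -> arrival time of the first record

    def add(self, batch_id, source, record, now):
        """Hold a record that arrived from a source at time `now` (seconds)."""
        self.pending.setdefault(batch_id, {}).setdefault(source, []).append(record)
        self.first_seen.setdefault(batch_id, now)

    def release_ready(self, now):
        """Return (batch_id, records_by_source, complete) for releasable batches.

        complete is True when all required sources are present (consistency)
        and False when the batch is released only because it aged out
        (timeliness).
        """
        ready = []
        for batch_id in sorted(self.pending):
            sources = self.pending[batch_id]
            complete = self.required.issubset(sources)
            expired = now - self.first_seen[batch_id] >= self.max_age
            if complete or expired:
                ready.append((batch_id, sources, complete))
        for batch_id, _, _ in ready:
            del self.pending[batch_id]
            del self.first_seen[batch_id]
        return ready
```

The `complete` flag lets the loading step decide how to treat a batch that was released on the timeout alone, for example by marking it as partial in the target system.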

Next Steps

More from David Loshin: Self-service data prep broadens info access

Consultant Rick Sherman on the key features of data integration tools

Users turn to real-time data streaming to accelerate big data analytics