Whether we're discussing the impact that the tsunami of big data is having on organizations or the cloud application takeover of traditional on-premises applications, the common foundation of such trends is an increasing demand for data. More accurately, there is a need for data that has been integrated and translated into a business context for analysis. That demand is making effective data integration -- already a key component of data warehouse environments -- even more important to business success.
Data integration involves taking data -- often from multiple sources -- and transforming it into meaningful information for business executives, data analysts and other enterprise users. As the need to share the growing volumes of data being generated and collected by organizations increases, turning to commercial data integration platforms is one way to help manage -- and simplify -- the process.
What are data integration platforms?
Packaged data integration software began with extract, transform and load (ETL) tools designed to automate efforts to pull data from source systems, convert it into a consistent format and load it into a data warehouse or other target database. The first generation of ETL tools consisted of simple but expensive code generators with limited functionality. Many of the companies that evaluated these tools found it more effective to develop their own custom integration code. Second-generation ETL software offered more functionality, but was primarily batch-oriented and didn't perform well. Based on those two sets of tools, many IT managers felt that ETL software wasn't worth the cost or the effort to learn, as it wouldn't meet their performance needs.
But, over the years, ETL tools have evolved in several key areas, including development, operational processing and integration functionality. To make them a more viable development platform, ETL vendors added support for code management, version control, debugging and documentation generation. For operational processing, the tools now have built-in functionality such as error handling, recovery/restart, run-time statistics and scheduling.
As the industry gained experience and sophistication in data integration, best practices were developed that were then added into ETL tools as prebuilt transformations. These transformations include mechanisms for change data capture, slowly changing dimensions, hierarchy management, data connectivity, data merging, reference lookups and referential integrity checks. Data integration performance has increased significantly by leveraging memory, parallelism and various data transport architectures.
In addition, a variant of ETL tools emerged called extract, load and transform (ELT). These tools eliminate the need for a separate application server dedicated to ETL functionality and can be deployed at either the data sources or target systems based on their capacity and configurations. The ELT approach lets users store raw data as is and then transform all or subsets of it as needed for specific business intelligence (BI) and analytics applications.
ETL tools evolve into data integration platforms
Data integration needs also expanded beyond the core ETL use of loading enterprise data warehouses; data marts; and BI data stores, such as OLAP cubes, to include these tasks:
- B2B integration
- Cloud integration
- Application and business process integration
- Data migration
- Data consolidation
- Data quality and cleansing
- Master data management
As that happened, the following integration categories emerged as well, targeting specific uses and technologies:
Enterprise application integration (EAI). Often referred to simply as application integration, this subcategory, which supports interoperability among different applications, is enabled through Web or data services created using service-oriented architecture and industry standards such as electronic data interchange. An enterprise service bus is a common architectural approach to implementing EAI functionality.
Enterprise information integration (EII). EII -- initially known as data federation -- provided a virtual view of disparate data sources, but had limited integration capabilities. The current generation, called data virtualization software, provides both data abstraction and data services layers to a wide variety of sources, including structured, semi-structured and unstructured data.
Cloud-based integration. Also referred to as integration platform as a service (iPaaS), cloud-based integration emerged to provide real-time interoperability between cloud-based applications and databases. These tools are deployed as a cloud service leveraging EAI and EMS functionality.
Eventually, vendors put the various pieces together and began offering full-fledged data integration suites that provide hybrid capabilities spanning ETL, application integration, cloud-based integration, real-time integration and data virtualization, as well as data cleansing, and data profiling tools. The suites can support data integration processes in traditional batch mode, or in real- or near-real-time through the use of Web services. They can also handle both on-premises and cloud data and less structured information -- system logs, text and other forms of big data, for example -- along with structured transaction data.
Dispelling data integration tool myths
If used correctly, data integration platforms greatly improve user productivity and integration flexibility, scalability and expandability over custom manual coding (see sidebar). But hand coding, either by IT workers writing SQL scripts or business people using spreadsheets, is still being done extensively in organizations.
Tool-based data integration development vs. hand coding
Tool-based data integration development provides the following benefits:
- Reusable processes based on industry best practices.
- Robust data quality processes.
- Workflow, error handling and restart/recovery functionality.
- Self-documentation of processes and entire workflow.
- The enablement of data governance.
- Impact analysis and where-used (lineage) functionality.
There are several reasons why IT groups believe they should manually write code rather than use a data integration platform; however, these beliefs are usually based on the following misconceptions or myths:
Integration tools are too expensive. There's a lingering perception left over from the early days of ETL that expensive tools are the only choice, but many data integration platforms priced for cost-sensitive budgets are now available in the market.
Highly skilled resources are required. Another false perception is that an enterprise looking to use commercial software needs data integration developers experienced in the legacy ETL tools that required extensive skills rather than the newer, easier to use data integration platforms.
Coding is cost-free. There's an inherit bias for the IT staff to generate SQL code: They know SQL and can create code in it quickly, and there are no license or subscription costs. But what starts out as a simple SQL script can quickly snowball into numerous scripts or stored procedures, creating a hodgepodge of often undocumented integration processes. Making changes to that code takes longer as it gets more complex, consuming more and more resources just to maintain it.
The data integration platform market
A variety of data integration platforms are available, but the market is led by IBM, Informatica, Information Builders, Microsoft and Oracle. Other vendors considered leaders either by market share or thought leadership include Pentaho, SAP, SAS and Talend.
All of these vendors sell data integration products that are deployed on premises, but can integrate data that resides on premises or in the cloud. Also, both Pentaho and Talend offer open source versions of their products along with paid-for enterprise versions. Although pricing is a separate discussion and will be covered more in-depth in a later article, Microsoft is unique in that it bundles its data integration product with its database rather than selling it separately.
Data integration continues to primarily be an IT-centric activity based on the data, database and technology know-how needed. Typically, data integration platforms are purchased, managed and used by IT groups responsible for BI, data warehousing, master data management and other data management programs. These groups should have the skills and experience to successfully utilize the platforms. Some leading-edge enterprises with multiple integration use cases and separate IT groups addressing those uses have created integration competency centers to manage their data integration platforms from an enterprise-wide perspective in an effort to avoid integration and, ultimately, data silos.
Why data integration is a fundamental part of a big data analytics platform
Data integration No. 1 challenge for healthcare agencies
How data curation tools offer a step forward on data integration.