Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses. Integration is one of the core elements of the overall data management process; its primary objective is to produce consolidated data sets that are clean and consistent and meet the information needs of different end users in an organization.
Integrated data is fed into transaction processing systems to drive business applications and into data warehouses and data lakes to support business intelligence (BI), enterprise reporting and advanced analytics. Various data integration methods have been developed for different types of uses, including batch integration jobs run at scheduled intervals and real-time integration done on a continuous basis.
Importance of data integration
Most organizations have a collection of data sources, often including external ones. In many cases, business applications and operational workers need to access data from different sources to complete transactions and other tasks. For example, an online order entry system requires data from customer, product inventory and logistics databases to process orders; call center agents must be able to see the same combination of data to resolve issues for customers.
Loan officers have to check account records, credit histories, property values and other data before approving mortgages. Financial traders need to keep an eye on incoming streams of market data from internal systems and external sources. Pipeline operators and plant managers depend on data from various sensors to monitor equipment. In these and other applications, data integration automatically pulls together the necessary data for users so they don't have to combine it manually.
It's the same in BI and analytics systems: Data integration gives data analysts, corporate executives and business managers a complete picture of key performance indicators (KPIs), customers, manufacturing and supply chain operations, regulatory compliance efforts, financial risks and other aspects of business processes. As a result, they have better analytical information available for uses such as tracking business performance, managing operations, and planning advertising and marketing campaigns.
How data integration works
At a basic level, data integration connects source and target systems and routes data from the former to the latter. In some cases, the actual data is moved to a target system, as in real-time integration of different data streams. In others, copies of data sets from source systems are fed into a target system -- for example, when transactional data is consolidated in a data warehouse or smaller data mart for analysis.
From a technical standpoint, data integration architects and developers create software programs that automate and manage the process of integrating data sets. Some forms of data integration are relatively straightforward -- replicating data from one system to another is a case in point. Often, though, different database schemas in separate source systems need to be harmonized as part of integration projects.
One common approach for doing so is to create a mediated schema that merges local source schemas into a global one and then use data mapping to bridge the differences between them and match data elements to the mediated schema. That can be done in a target system, such as a data warehouse, or in a virtual architecture that creates a unified view of data from different systems without physically loading it into a new repository.
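As a rough illustration of that idea, the sketch below maps records from two source systems onto one mediated schema. The source names, field names and sample records are hypothetical and chosen only for the example; a real project would typically derive the mappings from documented source schemas.

```python
# Hypothetical records from two source systems with different local schemas.
crm_rows = [{"cust_id": 1, "full_name": "Ada Lovelace", "email_addr": "ada@example.com"}]
billing_rows = [{"customer_no": "2", "name": "Alan Turing", "mail": "alan@example.com"}]

# Data mappings: mediated-schema field -> local field, one mapping per source.
MAPPINGS = {
    "crm": {"customer_id": "cust_id", "name": "full_name", "email": "email_addr"},
    "billing": {"customer_id": "customer_no", "name": "name", "email": "mail"},
}


def to_mediated(source, row):
    """Translate one local record into the mediated (global) schema."""
    mapping = MAPPINGS[source]
    record = {target: row[local] for target, local in mapping.items()}
    # Harmonize types: the billing system stores the ID as text.
    record["customer_id"] = int(record["customer_id"])
    # Keep track of where the record came from.
    record["source"] = source
    return record


unified = [to_mediated("crm", r) for r in crm_rows]
unified += [to_mediated("billing", r) for r in billing_rows]
```

After this runs, every entry in unified has the same four fields regardless of which system it came from, which is the property the mediated schema is meant to provide.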
Types of data integration
The most prevalent data integration method is extract, transform and load (ETL), which is commonly used in data warehousing. In ETL jobs, data is extracted from source systems and run through a data transformation process to consolidate and filter it for analytics uses; the resulting data sets are then loaded into a data warehouse. ETL is a batch process that typically involves bulk amounts of data, and it can also be used to feed varied sets of big data into Hadoop clusters and other data lake platforms.
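The three ETL steps can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not a production job: two in-memory SQLite databases stand in for the source system and the data warehouse, and the orders and customer_sales tables, their columns and the sample rows are all hypothetical.

```python
import sqlite3


def extract(conn):
    """Extract: read raw order rows from the source system."""
    return conn.execute(
        "SELECT order_id, customer, amount_cents, status FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: filter out incomplete orders and consolidate totals per customer."""
    totals = {}
    for _order_id, customer, amount_cents, status in rows:
        if status != "completed":
            continue
        key = customer.strip().lower()  # normalize inconsistent customer names
        totals[key] = totals.get(key, 0) + amount_cents
    return sorted(totals.items())


def load(conn, rows):
    """Load: write the consolidated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_sales "
        "(customer TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_sales VALUES (?, ?)", rows)
    conn.commit()


# Stand-in source system with a few sample orders.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Acme ", 1000, "completed"),
        (2, "acme", 2500, "completed"),
        (3, "Globex", 700, "cancelled"),
        (4, "Globex", 300, "completed"),
    ],
)

# Stand-in data warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
result = warehouse.execute(
    "SELECT customer, total_cents FROM customer_sales ORDER BY customer"
).fetchall()
```

Because the transformation happens before loading, the warehouse only ever sees the filtered and consolidated rows; the cancelled order never reaches it.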
However, an alternative extract, load and transform (ELT) method is often used in big data systems. ELT inverts the second and third steps of the ETL process, loading raw data into a target system and then filtering and transforming it as needed for individual analytics applications. That's a popular option for data scientists, who often do their own data preparation work and want to have access to full data sets for predictive modeling, machine learning applications and other forms of advanced analytics.
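For comparison, a minimal ELT sketch might look like the following. An in-memory SQLite database stands in for the target platform, and the raw_events table, the purchases_by_user view and the sample rows are hypothetical. The point is only the ordering: raw data is loaded untouched, and the transformation is defined later inside the target.

```python
import sqlite3

# Stand-in target system (for example, a data lake or warehouse).
lake = sqlite3.connect(":memory:")

# Load: raw events go in as-is, including rows that some analyses won't need.
lake.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, value REAL)")
raw = [
    ("u1", "purchase", 20.0),
    ("u1", "page_view", None),
    ("u2", "purchase", 5.5),
    ("u1", "purchase", 4.5),
]
lake.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw)

# Transform: done afterward, in the target, for one specific analytics use.
lake.execute(
    """
    CREATE VIEW purchases_by_user AS
    SELECT user_id, SUM(value) AS total
    FROM raw_events
    WHERE event = 'purchase'
    GROUP BY user_id
    """
)

purchases = lake.execute(
    "SELECT user_id, total FROM purchases_by_user ORDER BY user_id"
).fetchall()

# The full raw data set is still available for other applications.
raw_count = lake.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

A different application could define its own view over raw_events, for instance one that counts page views, without any change to the load step.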
Forms of real-time data integration include change data capture (CDC), which captures updates made to data in source systems and applies them to data warehouses and other repositories, and streaming data integration, which integrates real-time data streams and feeds the combined data sets into databases for operational and analytical uses. Another data integration method that can be done in either real time or batch mode is data replication, which copies data from one data source to another system to synchronize them for operational, backup and disaster recovery (DR) purposes.
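The core of the CDC idea, applying a stream of source changes to a target copy, can be sketched as follows. The change-event format used here (op, key and data fields) is an assumption made for the example; real CDC tools define their own event formats, and the target would normally be a database rather than a Python dictionary.

```python
def apply_changes(target, changes):
    """Apply ordered change events to a target data set keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the changed columns into the existing row, if there is one.
            target[key] = {**target.get(key, {}), **change["data"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical target copy and a log of changes captured from the source.
replica = {1: {"name": "Ada", "city": "London"}}
change_log = [
    {"op": "update", "key": 1, "data": {"city": "Paris"}},
    {"op": "insert", "key": 2, "data": {"name": "Alan", "city": "Wilmslow"}},
    {"op": "insert", "key": 3, "data": {"name": "Temp", "city": "Nowhere"}},
    {"op": "delete", "key": 3},
]
apply_changes(replica, change_log)
```

Only the changed rows travel from source to target, which is why this approach suits continuous integration better than repeatedly copying whole tables.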
Data integration methods also include data virtualization, which evolved from an earlier approach known as data federation. It uses a virtual data layer to integrate data instead of doing so physically. That gives business users and data analysts an integrated view of different data sets without requiring an IT team to load the data into a data warehouse, an operational database or another target system. Data virtualization can augment an existing analytics architecture for specific applications or be used as part of a logical data warehouse or data lake environment that includes a mix of different platforms.
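A very small sketch of the virtualization concept is shown below. Two in-memory SQLite databases stand in for separate source systems, and the VirtualCustomerView class, the table layouts and the sample rows are hypothetical. The class holds no data of its own; it queries the sources each time a combined record is requested.

```python
import sqlite3

# Two separate stand-in source systems.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Alan")])

orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
orders.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 30.0), (1, 12.0), (2, 8.0)]
)


class VirtualCustomerView:
    """Virtual data layer: combines the sources at query time, storing nothing."""

    def __init__(self, crm_conn, orders_conn):
        self.crm = crm_conn
        self.orders = orders_conn

    def get(self, customer_id):
        row = self.crm.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        if row is None:
            return None
        total = self.orders.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()[0]
        return {"id": customer_id, "name": row[0], "order_total": total}


view = VirtualCustomerView(crm, orders)
before = view.get(2)

# A new order in the source is visible immediately, with no reload step.
orders.execute("INSERT INTO orders VALUES (2, 2.0)")
after = view.get(2)
```

The trade-off in this design is that every request goes back to the source systems, so the view is always current but its performance depends on theirs.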
Data integration tools and techniques
Developers can hand-code data integration jobs, typically in the form of scripts written in Structured Query Language (SQL), the standard programming language used in relational databases. For many years, that was the most common approach to integration. However, packaged data integration tools that automate, streamline and document the development process have become available from various IT vendors. These include Dell's Boomi subsidiary, Hitachi Vantara, IBM, Informatica, Information Builders, Microsoft, Oracle, SAP, SAS Institute and Talend, as well as other companies.
ETL tools were among the first data integration software products, reflecting the ETL method's central role in the data warehouse systems that emerged in the mid-1990s. Now, many vendors offer more expansive data integration platforms that also support ELT, CDC, data replication, big data integration and other forms of integration; in addition, associated data quality, data catalog and data governance software is often included as part of the platforms.
Some of the integration platform vendors provide data virtualization tools, too. They're also available from data virtualization specialists and other data management vendors, including Actifio, Astera Software, AtScale, Data Virtuality, Denodo Technologies, IBM's Red Hat unit, Stone Bond Technologies and Tibco Software.
The growth of cloud computing has created new needs for organizations to integrate data in different cloud applications and between cloud and on-premises systems. That led to the development of integration platform as a service (iPaaS), a product category that provides cloud-based integration tools. Most of the major data integration platform vendors now also offer iPaaS technologies; other companies in the iPaaS market include Jitterbit, MuleSoft, SnapLogic, Tibco and Workato.
Data integration examples and benefits
Integrating customer data is one of the foremost data integration use cases. It involves consolidating data about customers from all available sources, including contact details, account records, customer lifetime value (CLV) scores and information collected through customer service calls, website visits, direct marketing programs, surveys, social media posts and other interactions.
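A simplified version of that consolidation might look like the sketch below. It assumes, purely for illustration, that every source identifies customers by email address and that the first non-empty value found for a field should win; the source names, fields and records are hypothetical, and real customer matching is usually more involved.

```python
from collections import defaultdict


def consolidate(*sources):
    """Merge customer records from several sources into one record per customer."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            # Normalize the matching key so formatting differences don't split customers.
            key = record["email"].strip().lower()
            for field, value in record.items():
                if field == "email" or value in (None, ""):
                    continue
                # Assumption: the first non-empty value seen for a field is kept.
                merged[key].setdefault(field, value)
            merged[key]["email"] = key
    return dict(merged)


accounts = [{"email": "Ada@Example.com", "name": "Ada Lovelace", "plan": "premium"}]
support = [{"email": "ada@example.com ", "name": "", "open_tickets": 2}]
web = [{"email": "alan@example.com", "last_visit": "2024-01-05"}]

customers = consolidate(accounts, support, web)
```

Here the account and support records for the same person end up in a single entry, even though the two systems formatted the email address differently.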
Done properly, the data integration work ensures that departments across a business can get a current and complete view of customers. That information can help companies better target their marketing efforts and identify opportunities to cross-sell and upsell customers on products, which can lead to increased sales. Well-integrated customer data can also improve customer service by giving call center workers and field service technicians immediate access to all the information they need.
Data integration initiatives typically also include the integration of data on revenues, expenses, profits, productivity and other performance metrics from different business units and regional operations. Commonly, that information is then made available to corporate and business executives in reports or BI dashboards; it can help improve both operations management and strategic planning. Similarly, integrating data on an organization's employees can aid in human resources management and support related people analytics applications to optimize HR processes.
Increasingly, organizations are moving to integrate data collected by multiple sensors installed on industrial equipment, including manufacturing machines, vehicles, elevators, pipelines, electrical grids, oil rigs and other devices connected to the internet of things (IoT). Integrated sets of sensor data can be used to monitor operations and run predictive maintenance models that aim to identify potential equipment failures before they occur, which can help reduce unplanned downtime for repairs.
In the healthcare industry, integrated data from different clinical systems and patient records helps doctors diagnose diseases and other medical conditions. For medical insurers, effective data integration both internally and with healthcare providers and employers can increase claims processing accuracy and ensure that the names and addresses of members are consistent and correct in their systems.
Data integration challenges
Common challenges that IT and data management teams encounter in data integration work include keeping up with growing data volumes; unifying inconsistent data silos; dealing with the increasingly broad array of databases and other data platforms in IT infrastructures; integrating cloud and on-premises data; and resolving data quality issues. In large organizations with global operations, the number and distributed nature of the systems that need to be integrated add to the complexity.
The amount of data being generated and collected by organizations creates particularly big integration challenges. Data volumes continue to grow quickly, and the rate of that growth is only likely to increase as big data applications expand, the use of low-cost cloud object storage services rises and the IoT develops further. Data integration is essential to realizing the full potential business value of all that data, but successfully planning and managing the required integration work is a complicated process.
To start with, data managers and data integration developers need full documentation of the source and target systems in an organization's data architecture so they can do the required mapping between them. They also must have a solid understanding of both internal and external data sources, the business rules that are embedded in the data, and how often data is updated and modified.
As a result, it's imperative that they work closely with business users. Data integration efforts should also be aligned with data governance programs, as well as related data quality and master data management (MDM) initiatives, to ensure that data is clean and consistent and that data lineage documentation is available to help integration developers better understand what's in data sets.