Enterprise data managers face a variety of data integration challenges, stemming partly from growth in the amounts and types of data coming into corporate systems. These challenges include the inability to locate siloed data, keeping up with growing data volumes and dealing with different streaming data sources.
Automation can help with many of these challenges and also speed up the integration process. Manual data integration efforts often slow the ability of organizations to combine data sets to support the development of analytics, machine learning and AI applications that can generate business value.
For example, Mr. Cooper, a home loan provider based in Dallas, has seen its data footprint expand due to organic growth and acquisitions. In the past, that made it challenging to provide near-real-time data analysis capabilities to business users, said Sridhar Sharma, the company's CIO. As a result, there were instances in which decision-makers had to wait for insights from the analytics team.
"These delays stemmed from an inability to locate siloed data and, subsequently, apply the right algorithms," Sharma said.
Sharma's team made significant investments to eliminate data silos and establish a common scalable data platform. Those efforts involved strengthening the core architecture that supports a hybrid cloud environment, along with moving away from traditional batch data processing in favor of streaming data exchange patterns.
Sharma said it also has been important to focus on data quality and building a rich, domain-specific corpus of data as the company expands into more machine learning and AI projects. "That has also meant a constant feedback loop that allows our agents and our engineers to constantly tag and enrich our data sets," he said.
With that background, let's look at the five most common data integration challenges that data management teams face: keeping up with the volume of data, working with different streaming sources, manual integration, different source technologies and assimilating SaaS data.
1. Keeping up with data volumes
The biggest data integration challenge is the exponential growth of data from many different sources, said Mitch Gibbs, a cloud consultant at Candid Partners, based in Atlanta. This causes problems with the available capacity for data retention and, more important, efforts to make actionable information out of all the data being created and collected.
The use of machine learning, in combination with the cloud's elastic storage capacity, allows targeted experiments to find and then retain the most valuable data. "The difficulty is in creating a balance for data retention for data that isn't important now and what data may be important in the future," Gibbs said.
Organizations need a strategy to proactively manage and integrate growing data volumes, while making data accessible for analytics when it's needed, according to Gibbs. That also needs to be balanced with the cost of storing all the data, he noted.
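One way to strike the balance Gibbs describes is a simple tiering rule that assigns data sets to storage classes by how recently they were accessed. The sketch below illustrates the idea; the tier names, age thresholds and per-gigabyte costs are made-up assumptions for illustration, not figures from Candid Partners.

```python
# Illustrative retention/tiering rule: keep recently used data on fast,
# expensive storage and push stale data to cheaper archival tiers.
# Thresholds and costs below are hypothetical.

def choose_tier(days_since_last_access: int) -> str:
    """Assign a data set to a storage tier by how recently it was used."""
    if days_since_last_access <= 30:
        return "hot"    # frequently analyzed: fast, costly storage
    if days_since_last_access <= 365:
        return "warm"   # occasionally needed: cheaper, slower storage
    return "cold"       # may matter later: archival storage

# Hypothetical monthly cost per GB for each tier
MONTHLY_COST_PER_GB = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}

datasets = [("clickstream", 5, 2000), ("audit_logs", 400, 500)]
for name, age_days, size_gb in datasets:
    tier = choose_tier(age_days)
    cost = size_gb * MONTHLY_COST_PER_GB[tier]
    print(f"{name}: {tier}, ~${cost:.2f}/month")
```

A real policy would also weigh regulatory retention requirements and expected reprocessing needs, but even a crude rule like this makes the cost-versus-accessibility trade-off explicit.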
2. Different streaming data sources
Many enterprises struggle to integrate data from disparate sources. For example, that's a particularly big challenge in the power industry, where utilities have to ingest data from different systems to create a single, seamless data flow, said Farnaz Amin, principal digital product manager for GE Digital's grid analytics platform.
As a result, a utility may have point services offered by multiple vendors that run in silos and have little or no integration between them. In addition to tackling integration, it needs to ensure that all the data is safely stored to address risks associated with security, reliability and financial fines from regulatory authorities, Amin said.
As part of their overall data strategy, companies should spend time evaluating the kinds of data they're capturing and how different data sets need to be integrated, she added. "It's important to have a clear picture of the use cases that are possible by leveraging this data and what business problems they can solve."
Another key to building a strong data infrastructure and integration strategy is knowing how your end users operate. Two questions you should ask early on, Amin advised, are how will analysts use a data set and how frequently will they need to access it. "This gives you the opportunity to right-size the frequency and reduce the strain on infrastructure," she said.
3. Manual data integration
In many companies, roughly 80% of a data scientist's time is spent finding and prepping data, leaving only the equivalent of one day a week for actual data science work, said Brendan Stennett, CTO of ThinkData Works, a data wrangling tools provider based in Toronto.
Enterprises may want to evaluate tools to automate data ingestion and integration. They can be used to create and document integration jobs, make consistent changes to all the ingested data and position it in a standard way. It's also helpful to use information on entity relationships to craft common keys between disparate data sets based on similar fields, according to Stennett.
In addition, tracking data lineage can help when ingesting raw source data and performing minimal transformations, he said. For example, a unified schema for mapping source data can enable any issues in the common data set to be traced back to a row from the raw source data.
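The two ideas Stennett mentions, common keys across disparate sources and row-level lineage, can be sketched together. In this minimal example, each raw record is mapped into a unified schema, given a normalized join key derived from a similar field, and tagged with its source name and original row number so a bad value in the combined set can be traced back to the raw source row. The field names (`email`, `name`) and source shapes are illustrative assumptions, not any vendor's actual schema.

```python
# Lineage-aware ingestion sketch: map rows from disparate sources into a
# unified schema, derive a common join key, and record where each row
# came from so problems can be traced back to the raw source data.

def normalize_key(value: str) -> str:
    """Derive a common join key from a similar field (here, an email)."""
    return value.strip().lower()

def ingest(source_name, rows, field_map):
    """Map raw rows into the unified schema, preserving lineage columns."""
    unified = []
    for row_num, raw in enumerate(rows, start=1):
        record = {target: raw.get(source) for target, source in field_map.items()}
        record["join_key"] = normalize_key(record.get("email") or "")
        record["_source"] = source_name    # lineage: which system it came from
        record["_source_row"] = row_num    # lineage: which raw row
        unified.append(record)
    return unified

# Two sources that name the same fields differently
crm = [{"Email": " Alice@Example.com ", "Name": "Alice"}]
billing = [{"contact_email": "alice@example.com", "full_name": "A. Smith"}]

combined = (
    ingest("crm", crm, {"email": "Email", "name": "Name"})
    + ingest("billing", billing, {"email": "contact_email", "name": "full_name"})
)

# Despite different source schemas, both records now share a join key,
# and each still points back to its raw source row.
assert combined[0]["join_key"] == combined[1]["join_key"]
print(combined[0]["_source"], combined[0]["_source_row"])  # crm 1
```

Because the transformations are minimal and the lineage columns travel with every record, an analyst who spots a suspect value in the unified set can go straight to the offending row in the original source file.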
4. Different source system technologies
"IT teams today not only face the challenge of dealing with the explosion in the variety of the data itself, but also the increased variety in the underlying database technologies," said Shekhar Vemuri, CTO of Clairvoyant, an IT consultancy based in Chandler, Ariz. This can include a mix of SQL, NoSQL and big data systems, deployed both on premises and in the cloud.
The problem is compounded by the growing use of purpose-built databases and the adoption of microservices architectures that break up databases into smaller pieces and spread them across enterprise systems. Although shifting to an architecture with smaller components simplifies application development, it can make data integration more complex.
"What used to be in one large database before is now split across tens of databases, some of which are loosely linked with each other," Vemuri said.
A good practice is to start by cataloging all the data that exists in an organization, he recommended. Historically, this was the purview of enterprise architects working with teams from IT and the various business units. However, an approach based solely on that process often won't work in today's complex data landscape.
"We have to look at augmenting the process with tools that allow teams to collaboratively manage and maintain a catalog of data sets in the enterprise," Vemuri said, adding that these tools can scan and discover data sets and speed up the creation of the initial data catalog.
5. Assimilating SaaS data
"As software as a service witnesses rapid growth, the challenges of integrating with legacy and homegrown applications compound," said Sayid Shabeer, chief product officer at HighRadius, an accounts receivable software vendor based in Houston.
Some of the data integration challenges he sees include security, cloud integration and IT infrastructure issues. Engineers should start the integration process by answering questions about how the data could be used in real-time or batch processing applications, Shabeer said. "This is the most important factor in determining the kind of interface required."
Many SaaS products have connectors available for common business applications, and the vendors are often willing to work with IT teams on custom integrations. That can involve figuring out how to bring all the underlying systems and data sources together in a way that allows them to be seamlessly integrated, analyzed in real time and mined for insights.
"The idea is to create a true integrated platform," Shabeer said.