Enterprise data managers face a variety of data integration challenges, which all stem from the growth in the amounts and types of data. These challenges can include the inability to locate siloed data, keeping up with growing data volumes and dealing with different streaming sources.
Automation can help with many of these challenges. Manual data integration efforts can slow the ability to create new analytics, machine learning or AI applications that generate business value.
For example, Mr. Cooper, a home loan provider, has seen its data footprint expand due to organic growth and acquisitions. This made it challenging to provide near-real-time analysis for its business users in the past, said Sridhar Sharma, CIO at Mr. Cooper, based in Dallas. As a result, the company dealt with instances in which its decision-makers had to wait for insights from the analytics team.
"These delays stemmed from an inability to locate siloed data and, subsequently, apply the right algorithms," Sharma said.
Sharma's team made significant investments to break down data silos and establish a common scalable data platform. This involved strengthening the core architecture that supports a hybrid cloud environment, along with moving away from traditional batch data processing in favor of streaming data exchange patterns.
Sharma said it has also been important to focus on data quality and building a rich, domain-specific corpus of data as the company expands into more machine learning and AI projects.
"That has also meant a constant feedback loop that allows our agents and our engineers to constantly tag and enrich our data sets," he said.
In this article, let's look at the five most common data integration challenges: keeping up with the volume of data, working with different streaming sources, manual integration, different source technologies and assimilating SaaS data.
Keeping up with data volumes
The biggest data integration challenge is the exponential growth of data from many different sources, said Mitch Gibbs, a cloud consultant at Candid Partners, based in Atlanta. This strains data retention capacity and, more importantly, makes it harder to turn all that data into actionable information.
The use of machine learning, in combination with the cloud's capacity, allows targeted experiments to find and then retain the most valuable data.
"The difficulty is in creating a balance for data retention for data that isn't important now and what data may be important in the future," Gibbs said.
Organizations need a proactive strategy for managing data volumes while keeping the data accessible for analytics when needed. This also needs to be balanced against the cost of storing the data.
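One way to make that balance concrete is a tiering policy that keeps recent data on fast storage and ages the rest into cheaper tiers. The sketch below is purely illustrative (the tier names and age thresholds are assumptions, not anything Gibbs describes):

```python
from datetime import date

# Hypothetical tiering rule: keep recently accessed data "hot" for
# analytics, and age the rest into cheaper tiers to control storage cost.
def storage_tier(last_accessed, today=date(2020, 6, 1)):
    age_days = (today - last_accessed).days
    if age_days <= 30:
        return "hot"    # fast, expensive storage for active analytics
    if age_days <= 365:
        return "warm"   # cheaper storage, still directly queryable
    return "cold"       # archival storage, restored on demand

print(storage_tier(date(2020, 5, 20)))  # recently accessed -> hot
print(storage_tier(date(2019, 1, 1)))   # over a year old -> cold
```

In practice, the thresholds would be tuned to how often analysts actually query each data set, which is exactly the balance Gibbs describes between data that isn't important now and data that may be important later.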
Different streaming sources
Many enterprises struggle to integrate data from disparate sources. This is particularly a challenge in the power industry, where utilities have to ingest data from different systems to allow one seamless data flow, said Farnaz Amin, principal digital product manager for GE's Grid Analytics, an energy analytics platform.
As a result, a utility may have point services offered by multiple vendors that work in silos and have little or no integration between them. It's also important to ensure this data can be safely stored to address risks associated with security, reliability and financial fines from regulatory authorities.
As part of their data strategy, companies should spend time evaluating what kind of data should be captured.
"It's important to have a clear picture of the use cases that are possible by leveraging this data and what business problems they can solve," Amin said.
Another key to building strong data infrastructure is knowing how your end user operates. One question you should ask early on is, "How will my analysts use this data, and how frequently will they need to access it?" Amin said.
"This gives you the opportunity to right-size the frequency and reduce strain on infrastructure," Amin said.
Manual data integration
In many companies, roughly 80% of a data scientist's time is spent finding and prepping data, leaving only a day a week for actual data science, said Brendan Stennett, CTO of ThinkData Works, a data wrangling tools provider based in Toronto.
Enterprises may want to evaluate tools to automate data ingestion, as they make consistent changes to all the data ingested and position it in a standard way. It's also helpful to use entity resolution to craft a common key between disparate data sets based on similar fields.
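The entity-resolution idea, building a common key from similar fields across disparate data sets, can be sketched minimally as follows. The field names, records and normalization rules here are illustrative assumptions, not part of any specific product:

```python
import re

def common_key(record, fields):
    """Build a normalized join key from the given fields of a record."""
    parts = []
    for field in fields:
        value = str(record.get(field, "")).lower().strip()
        value = re.sub(r"[^a-z0-9]", "", value)  # drop punctuation and spaces
        parts.append(value)
    return "|".join(parts)

# Two data sets describing the same customer with different field names
crm = [{"Name": "ACME Corp.", "ZIP": "90210"}]
billing = [{"name": "Acme Corp", "zip": " 90210 "}]

crm_keys = {common_key(r, ["Name", "ZIP"]) for r in crm}
billing_keys = {common_key(r, ["name", "zip"]) for r in billing}

print(crm_keys & billing_keys)  # both records resolve to "acmecorp|90210"
```

Real entity-resolution tools add fuzzy matching and scoring on top of this, but the core move is the same: normalize similar fields into one deterministic key that links records across sources.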
Data lineage tracking can help when ingesting raw source data and performing minimal transformations. A unified schema for mapping source data can enable any issues in the common data set to be traced back to a row from the raw source data.
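A minimal sketch of that lineage pattern, assuming a hypothetical mapping of two sources into one unified schema, might tag each output row with its origin so problems in the common data set can be traced back to a raw source row:

```python
# Hypothetical sketch: map rows from two sources into a unified schema,
# tagging each output row with (source name, original row index).
def to_unified(source_name, rows, field_map):
    unified = []
    for i, row in enumerate(rows):
        out = {target: row.get(src) for target, src in field_map.items()}
        out["_lineage"] = (source_name, i)  # where this row came from
        unified.append(out)
    return unified

source_a = [{"ts": "2020-01-01", "val": 3.2}]
source_b = [{"time": "2020-01-02", "reading": None}]  # a bad reading

combined = (to_unified("source_a", source_a, {"timestamp": "ts", "value": "val"})
            + to_unified("source_b", source_b, {"timestamp": "time", "value": "reading"}))

for row in combined:
    if row["value"] is None:  # quality issue found in the common data set
        print("bad row came from", row["_lineage"])
```

Because the transformations are minimal and the lineage tag travels with each row, a null value discovered downstream points straight back to row 0 of source_b.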
Different source technologies
"IT teams today not only face the challenge of dealing with the explosion in variety of the data itself, but also the increased variety in the underlying database technologies," said Shekhar Vemuri, CTO of Clairvoyant, an IT consultancy based in Chandler, Ariz. This can include myriad technologies, such as SQL, NoSQL, Hadoop and SaaS offerings.
The problem is compounded by the adoption of microservices, which has driven an increase in the use of purpose-built databases, with component databases broken into smaller pieces and spread across the enterprise. Although smaller components can simplify application development, they can make integration more complex.
"What used to be in one large database before is now split across tens of databases, some of them which are loosely linked with each other," Vemuri said.
A good practice is to start by cataloging the data that exists in the organization. Historically, this was the purview of enterprise architects working with the various teams. However, a manual, process-based approach can't keep pace with today's complex landscape.
"We have to look at augmenting the process with tools that allow teams to collaboratively manage and maintain a catalog of data sets in the enterprise," Vemuri said.
These tools can scan and discover data sets and speed up the process of creating the initial catalog.
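The seed of such a catalog can be sketched in a few lines: scan each source, record its name and column headers, and build an inventory that teams can then enrich. The data sets and format here are invented for illustration (real catalog tools scan databases and object stores, not just CSV text):

```python
import csv, io

# Hypothetical sketch: seed a data catalog by scanning CSV sources and
# recording each data set's name, columns and a row count.
def scan_csv(name, text):
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    return {"dataset": name,
            "columns": header,
            "rows_sampled": sum(1 for _ in reader)}

catalog = [
    scan_csv("orders", "order_id,customer,total\n1,acme,9.99\n"),
    scan_csv("customers", "customer_id,name\n7,acme\n"),
]

for entry in catalog:
    print(entry["dataset"], entry["columns"])
```

Automated discovery like this only produces the initial inventory; as Vemuri notes, the catalog still needs to be collaboratively maintained and enriched by the teams that own each data set.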
Assimilating SaaS data
"As software as a service witnesses rapid growth, the challenges of integrating with legacy and homegrown applications compound," said Sayid Shabeer, chief product officer at HighRadius, an accounts receivables platform based in Houston.
Some of the data integration challenges he sees include security, cloud integration and IT infrastructure.
Engineers should start the integration process by answering questions about how the data could be used in real time or batch processing applications.
"This is the most important factor in determining the kind of interface required," Shabeer said.
Many SaaS products have connectors available for common business apps, and the vendors are often willing to work with IT leadership teams on custom integrations. This can involve figuring out how to bring all the underlying systems and data sources together in a way that allows them to be seamlessly integrated, analyzed in real time and mined for insights.
"The idea is to create a true integrated platform," Shabeer said.