Evolving data integration strategies target new analytics needs

Expert: For BI, you must know the data integration process

Understanding the data integration process is central to self-service BI and data architecture design, consultant Rick Sherman says in an end-of-year look at data management trends.

As 2017 gives way to 2018, data management is buffeted by winds of change. Unstructured data is far more prevalent, the cloud is emerging as a bigger target platform, and the call for self-service BI from business users is heard more than ever. Despite the tumult, the data professional can't lose sight of basics. One key is planning, advises veteran data management and BI consultant Rick Sherman.

SearchDataManagement caught up with Sherman, founder of Athena IT Solutions, to survey the data landscape. Long at work in the field of data warehousing, Sherman also teaches classes on the data integration process and data architecture at Northeastern University in Boston. In addition, he recently led sessions on data architecture at a Dataversity conference dedicated to the topic. Today's tools are useful, he tells us, but knowing how to use them is still the top challenge.

Data on the cloud was almost a forbidden topic not that long ago. Today, the question seems to be how fast one can move to the cloud, especially for experimentation. What are you seeing on that front?

Rick Sherman: There are go-to applications for the cloud, but that doesn't cover everything. You're not going to have a wholesale movement to the cloud.

You have a set of existing applications, data warehouses and the like, and, if they're running fine, you won't want to migrate them just for the sake of migrating them. It's not necessarily that compelling to move existing things that work to the cloud if they're running and they're large.

It will be more the newer work: new applications and new technologies. More and more applications -- Salesforce is an example -- are on the cloud, and as such applications move to the cloud, people will move work there. That's nothing new. And now the database vendors are making a big enough push that people are much more comfortable.

You mention the word experimentation. That is a place where cloud is being used. It's much more cost-effective for people to leverage the cloud vendors' capabilities there, especially if they're going to be setting up Hadoop or NoSQL projects, than it is to try and establish that on premises.

With BI, it's different. Certainly, we're in a span of time where the number of vendors discussing cloud on the BI side has exploded. But I see much more of a trend toward data discovery tools on premises, as opposed to cloud.

One of the tenets of the data discovery folks is that ETL-related data integration processes -- i.e., extract, transform and load -- can move out of IT and into the hands of front-line data workers. How real is that?

Sherman: Well, one of the things that enterprises have always been bad at is data integration. They use tons of ETL tools, but they use them to create custom SQL. Enterprises hire cheaper and cheaper data integration resources. They aren't really using the tools as intended; they're using them to write custom scripts. As a result, a lot of the established tools have taken a reputation hit.

There is a feeling that you can have what Gartner might call citizen data scientists, and you have people proceeding with the idea that they don't need the ETL tools, that you can use data prep tools, which in a way is just making an ETL tool for somebody else to use.

But I don't think that things have gotten automated. I think what you have is more time spent by a data science or data engineering team in writing custom SQL scripts, or in moving stuff in or out of Excel spreadsheets. The ETL tool is probably not being used much. It's not because things are getting better -- things are getting worse. More and more time is being spent on custom extractions, custom Python and the like, and less time is spent analyzing the data. The work is not being automated.

I agree that the whole idea of a citizen data scientist is worthwhile, but the fact is that people are still spending an inordinate amount of time moving data in and out of tools.

You lead extended sessions on data architecture at conferences, and you teach some of the same principles at Northeastern. How do you try to guide people? What should be top of mind in the data integration process today?

Sherman: The key takeaway I try to emphasize is that all the BI tools and analytics are great, and they're expanding, but the real guts of effectively using data is having a data architecture and understanding how things fit within it.

Not everything has to be integrated -- but lots of stuff does need to be integrated. Too many companies are doing it manually, or doing it point-to-point. The more effective way to do it, if you want to be a data engineer or a data scientist, or if you just want to support BI, is to focus on data integration. You need to understand when you need data integration, and how you do it.

There are tons of reasons why companies haven't been as successful as they should be with data warehousing, and almost none of them has anything to do with the tools. It has to do with the fact that they don't have an architecture. The fallback position is to react to things and to custom code things point-to-point, as opposed to planning it out.

If people designed phones or cars the way companies design their data backbone architectures, we wouldn't be traveling much -- we'd be walking everywhere. We'd be back to the days when, to make a call, you talked to an operator who physically plugged in lines to connect it; or, with computers, to the days before Grace Hopper, when the system was a mass of patch cords. The technology is there, and there are best practices, but most companies' data architectures are reactive, as opposed to being planned.
