Data integration is often still done by hand, through SQL coding and stored procedures, according to Rick Sherman, founder of consultancy Athena IT Solutions in Maynard, Mass. Writing SQL integration scripts can be straightforward and seemingly inexpensive -- but data integration might be one case where handmade is no longer best, Sherman says in this interview with SearchDataManagement.com. He also discusses the different kinds of data integration tools now available and offers advice on how to evaluate them and choose the technology that best fits an organization's needs.
How prevalent is manual coding for data integration these days? Is it still a common practice in a lot of organizations?
Rick Sherman: It's very, very prevalent today, and I'll say unfortunately. It's still, I think, the most pervasive type of data integration that's happening in companies, big and small. Most large corporations with data warehouses are using data integration software. But even in large corporations, once you go beyond the enterprise data warehouse, data marts or reporting databases -- people getting data off of enterprise applications to do some simple reporting, almost all of that's done with manual coding, meaning SQL scripts or stored procedures. And in small and medium-sized businesses, they are predominantly using SQL to move data from various points and do integration.
Why do companies stick with manual coding?
Sherman: First off, it's quick and easy to move data that way. Most companies have enterprise application or SQL developers who know how to code and know how to do SQL. That's the skill set they have, and you generally use what you know. The second thing is that oftentimes data integration processes aren't these well-planned-out projects. The business needs to get some data into a report, and it's easy to do it in SQL or a stored procedure. Then the next thing you know is they need to add 12 more data sources, and it escalates.
Another thing is the overall market impression of extract, transform and load software -- ETL -- and other data integration products. The top-tier products are perceived by many in IT as being too expensive. Most companies look at that expense and think it's too big, so they [resort] to coding. There also are a lot of ETL and data integration products that are bundled with various databases, as well as open source integration products; but they are perceived not to be very sophisticated. And then there are a lot of unknown products and technologies that most people in IT don't get enough visibility into. So they either see [data integration tools] as too expensive, or the products that are within their cost reach they either don't know about or they don't think are sophisticated. And that brings us back to SQL as sort of the default.
What are the potential downsides of relying on manual coding, and some of the potential benefits of using automated tools?
Sherman: It's important to note that many of the ETL tools have evolved into more full-fledged data integration suites. When those tools first came on board, they did simply move data from point A to B and didn't provide very much functionality, but that's no longer true. There are a lot of data integration processes that now are built into the ETL tools. If you're manually coding, you have to redevelop all those processes. So one of the benefits of using the tools is there's a whole lot of code that you don't have to develop, plus you get code management, process management and built-in documentation. With all the clients that I have worked with, if it's SQL code, there is not likely to be any documentation of where the stored procedures are. They're scattered across multiple databases; you don't know how they're interconnected and neither do the developers. The long-term cost looks pretty inexpensive: You're just creating code; you're not asking to buy a tool. But over the long run, you get a mess. The code just builds up.
Are ETL tools still the most widely used form of data integration software? And if so, do you see that changing or is ETL likely to remain the primary integration technology?
Sherman: Outside of manual coding, ETL is certainly the most prevalent data integration tool. I want to qualify that and say that ETL to some degree, especially with newer vendors, is almost considered a dinosaur. The reason is because ETL is considered sort of this batch-driven code that's running in the background on a nightly basis. But ETL tools, certainly the more sophisticated ones, have greatly evolved and have incorporated newer technologies -- real-time complex event processing, data virtualization and Web or data services. I don't think the old batch-style ETL tools would be relevant to as many people today as they were before. But the new tools are and will be the mainstay for a while.
Do technologies such as data virtualization and other real-time data integration tools have specialized use cases, or can they be applied broadly for integration purposes?
Sherman: Each of these tools does have special use-case scenarios. Data virtualization allows you in real time to integrate both structured and unstructured data when you need it. A classic example is a call center in a financial services enterprise, and you have data about your customers scattered across different systems. Rather than having everything brought into a data warehouse, you can gather it real time from the different locations. But it's not an either-or situation. The problem we run into is that oftentimes we try to have everything be an end-all-be-all -- there's one thing that does everything. The danger with data virtualization is assuming you don't need to do warehousing anymore. The danger with just promoting a warehouse is that you think you can do everything in a warehouse -- well, you can't. So you need to have both, or multiple technologies for multiple business uses. It really gets driven by what the business needs to use the data for. We're going to need a mix of integration tools, especially when we have so much variety of data now at our disposal.
Speaking of which, "big data" is a big topic these days. What kind of impact do you think the growing amounts of unstructured data in particular that companies are looking to capture, store, and analyze will have on data integration needs and processes?
Sherman: Certainly, the three V's of big data -- velocity, variety and volume -- are hitting everybody, big and small, and corporations in every type of business. Going after this unstructured data -- social media data, Web data, all sorts of communications data, machine data, et cetera -- that type of data is certainly a different variety than what existing data integration tools were built for. So the whole NoSQL [movement] and the text search tools that have come up are addressing that part of the market. But I think those are complementary tools. I don't think they really address structured data as well. And I don't think the existing data integration tools address unstructured data very well. So a company needs both.
Various vendors have introduced cloud-based versions of their integration tool sets. Do you see much integration work being done in the cloud at this point?
Sherman: The first wave of cloud integration has been to move data in and out of cloud applications like Salesforce. That's been the predominant use-case scenario: move data in and out of those cloud-based applications and back to on-premises applications. As cloud applications become more and more used by companies of all sizes, cloud integration certainly will become more and more prevalent. But there are a couple things holding companies back from more pervasive use of cloud integration. The first wave of cloud integration tools are generally a little more lightweight [than on-premises tools are]. They're geared toward data movement and sort of simple integration. The second part is because of security, privacy, network bandwidth and other practical aspects, people aren't really extending [the cloud] to be used as much with their on-premises data. I think as all those things get addressed, people will get more comfortable with it and cloud integration will get to be more expansive than it is now.
How should organizations go about evaluating all the different data integration technology options and deciding which one, or which combination of tools, is the right choice for them?
Sherman: What they need to do first is get a feel for where they are now as far as data silos go. What sort of state are their data integration processes in, and is that holding back what they can deliver to the business? Second, where do they want to go in the next year and where do they want to go beyond that? Based on that information, they can start categorizing what kind of tools they need. From the technology point of view, are they generally dealing with on-premises data -- say, enterprise applications, where they want to load a data warehouse and build out business intelligence applications? Do they need to be able to access things in real time? Do they need to integrate with the cloud? And then they need to have a real critical look at where they are skill-wise and where they think they can get skill-wise.
There's a lot of ETL or data integration tools out there, at all different price points and all levels of complexity. Too often, companies only look at the most expensive ones and then it gets back to what I mentioned earlier: "Well, that's too expensive, so we'll just go back to SQL coding." I think what they want to do is look at a number of price points and then look at whether the ones at the lower price points can solve their needs or do they need to go up to the more expensive ones. But all of them are probably better than hand coding, especially if you have a lot of data integration needs and a lot of silos.