Gartner predicts that 60% of organizations will deploy data virtualization software as part of their data integration tool set by 2020. That's a big jump from the roughly 35% adoption rate that the consulting and market research company cited in a November 2018 report on the data virtualization market. But the technology "is rapidly gaining momentum," a group of four Gartner analysts wrote in the report.
The analysts said data virtualization use cases are on the rise partly because IT teams are struggling to physically integrate a growing number of data silos, as relational database management system (DBMS) environments are augmented by big data systems and other new data sources. They also pointed to increased technology maturity that has removed deployment barriers for data virtualization users.
Mark Beyer, one of the report's authors, discusses those trends and the use of data virtualization tools as an alternative to traditional extract, transform and load (ETL) integration processes in this Q&A.
Do you still see data virtualization use cases as niche in nature, or has the technology gone beyond that from an adoption standpoint?
Mark Beyer: We're seeing data virtualization used more frequently in broader use cases. It used to be for taking sort of prebuilt data stores, with some data quality and integration built in, and putting virtualization over the top of them -- almost like a see-through layer. Or the other one was just taking three or four data stores from transactional applications and creating a view of them.
But now, what we're seeing is it being introduced as a true semantic layer. In a semantic layer, you can have several different use cases. Power users can build their own trust factor or confidence model [in virtualized data sets]. Another thing you can do is build multiple tiers of data -- two or three different layers in the same data model, each with its own trust or quality level. And, in a logical data warehouse, you have the ability to take a very traditional operational data store and put it together with things you do in a data lake. You can get a view into warehouse data and, through virtualization, be able to run a data science job. To the user, it just looks like the two data sets coming together.
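The logical data warehouse idea Beyer describes can be sketched in a few lines of SQL. The following is a minimal, hypothetical illustration using Python's built-in sqlite3 module: two stand-in stores (a "warehouse" table and an attached "lake" database) are unified behind a single view, so a query sees one data set even though nothing is physically integrated. All table, column, and database names here are invented for the example.

```python
import sqlite3

# Stand-in for the operational warehouse store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 120.0), (2, 75.5)])

# Stand-in for a separate lake-style store. In a real deployment this
# would be an entirely different system reached over the network; here
# an attached in-memory database plays that role.
warehouse.execute("ATTACH DATABASE ':memory:' AS lake")
warehouse.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")
warehouse.executemany("INSERT INTO lake.clicks VALUES (?, ?)",
                      [(1, "home"), (1, "checkout"), (2, "home")])

# The virtual layer: one view spanning both stores. (A TEMP view is
# used because SQLite only lets temporary views reference attached
# databases.) The join runs at read time; no data is copied.
warehouse.execute("""
    CREATE TEMP VIEW customer_activity AS
    SELECT o.customer_id, o.amount, COUNT(c.page) AS click_count
    FROM orders o
    LEFT JOIN lake.clicks c ON c.customer_id = o.customer_id
    GROUP BY o.customer_id, o.amount
""")

rows = warehouse.execute(
    "SELECT customer_id, amount, click_count "
    "FROM customer_activity ORDER BY customer_id"
).fetchall()
print(rows)  # -> [(1, 120.0, 2), (2, 75.5, 1)]
```

To the user querying `customer_activity`, the warehouse rows and the lake rows simply look like the two data sets coming together, which is the effect Beyer describes.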
Gartner's 2018 market guide report on data virtualization said that more than 35% of organizations you surveyed were using the technology in production deployments. Is that a healthy number, given how long data virtualization software has been around?
Beyer: Data management teams have always been very cautious about the idea of giving access to data, and data virtualization goes down to the very bottom layer of the data -- you have to give it all the permissions. In the past, there was a reluctance to put too much data virtualization into an environment. Now, the cloud has taught people that data is flying all over the place. Let's admit that and not try to lock down the data -- let's lock down the use case instead. So, yeah, 35% is kind of robust.
The report does talk about 'renewed interest' in data virtualization tools. Did user interest diminish for a while? And if so, what happened to make it rebound?
Beyer: Data virtualization kind of sat around doing nothing for a long time. Then, six years ago, it started flying up to the peak of expectations, but being able to see views of data sets doesn't mean it ran fast and had all the management and security capabilities of a DBMS. What has happened in the data virtualization space is [the vendors] solved some of the performance problems, they solved some schema management problems and they solved problems with the ability to read and write data in multiple formats. It doesn't have to be SQL anymore -- you can write in XML now, for example.
What are some data virtualization use cases and applications where the technology is a particularly good fit?
Beyer: No. 1 is prototyping [of data models]. Think of power users: They don't build the models they want the first time out of the box. The process is iterative. Pilot projects are another good use of data virtualization -- it's a great place to test things or try them in a small operational area. The other one now is in a logical data warehouse as the semantic layer.
And then another one is as a kind of embedded semantic layer in IoT applications. You might have 50 different makers of sensors for different types of readings, [all with] slightly different data structures and formats. You can use data virtualization to put a semantic model on the data stream right as it comes out of the device to say, 'This is the way I want the data.' We're starting to see that idea.
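The IoT scenario Beyer sketches amounts to mapping each vendor's payload onto one agreed-upon schema as readings arrive. Here is a minimal, hypothetical sketch of that kind of semantic model; the field names and the two vendor formats are invented for illustration.

```python
def normalize(reading: dict) -> dict:
    """Map vendor-specific sensor payloads onto one canonical schema."""
    if "temp_f" in reading:  # hypothetical vendor A: Fahrenheit, 'ts' field
        return {"device": reading["id"],
                "celsius": round((reading["temp_f"] - 32) * 5 / 9, 2),
                "time": reading["ts"]}
    if "temperature_c" in reading:  # hypothetical vendor B: Celsius
        return {"device": reading["device_id"],
                "celsius": reading["temperature_c"],
                "time": reading["timestamp"]}
    raise ValueError("unknown sensor format")

# A mixed stream from two makers of sensors, normalized on the way in.
stream = [
    {"id": "a-17", "temp_f": 98.6, "ts": 1700000000},
    {"device_id": "b-04", "temperature_c": 21.5, "timestamp": 1700000001},
]
canonical = [normalize(r) for r in stream]
print(canonical)
```

Downstream consumers only ever see the canonical shape, regardless of which of the 50 sensor makers produced the reading: "This is the way I want the data."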
What advice do you have for IT and data management teams on deciding if data virtualization is right for their organizations?
Beyer: If flexibility outweighs consistency in data delivery, you should start thinking about data virtualization -- for example, if multiple users demand different output schemas. If 55% of your users like one schema, you can write ETL for that. But, for the other 45%, you can put a variety of models in a semantic layer and let them pick the one they want to use. That's almost a no-brainer.
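Beyer's 55/45 split can be made concrete with two virtual schemas over one physical table. This is a minimal, hypothetical sketch using sqlite3 views; the table and view names are invented. One group gets a wide schema, the other a long one, and neither requires a second copy of the data or a second ETL pipeline.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One physical data set.
db.execute("CREATE TABLE sales (region TEXT, q1 REAL, q2 REAL)")
db.execute("INSERT INTO sales VALUES ('east', 100.0, 150.0)")

# Schema preferred by one user group: wide, one row per region.
db.execute("CREATE VIEW sales_wide AS SELECT region, q1, q2 FROM sales")

# Schema preferred by another group: long, one row per quarter.
db.execute("""
    CREATE VIEW sales_long AS
    SELECT region, 'q1' AS quarter, q1 AS amount FROM sales
    UNION ALL
    SELECT region, 'q2' AS quarter, q2 AS amount FROM sales
""")

wide = db.execute("SELECT * FROM sales_wide").fetchall()
long_rows = db.execute(
    "SELECT * FROM sales_long ORDER BY quarter").fetchall()
print(wide)       # -> [('east', 100.0, 150.0)]
print(long_rows)  # -> [('east', 'q1', 100.0), ('east', 'q2', 150.0)]
```

Each group simply picks the view it wants; the semantic layer absorbs the schema disagreement instead of forcing one ETL output shape on everyone.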
Is performance still a common roadblock when you deploy a data virtualization architecture?
Beyer: That was always sort of a myth. People would say that data virtualization ran slow, and it did. They were basically paying a performance tax every time they ran a virtualization process against a data source. In a data warehouse, you pay that tax once when you're loading data into the warehouse, but you still have to pay it. You always have to pay the tax when the job is going to run, and you optimize for what your users need. It's always a balancing act.