Data virtualization can integrate and provide access to data from different back-end systems without physically moving the data to a data warehouse or other central repository. Developers and data scientists benefit by more easily combining data sets stored across the enterprise in various formats.
A data virtualization platform can also simplify security and governance around accessing this data for different user roles. And it offers the potential to reduce ongoing data integration costs compared to traditional extract, transform and load processes. Getting those data virtualization benefits isn't a given, however.
Collaboration is important. Business leaders need to work closely with their data management teams when planning, deploying and using a data virtualization platform. Other crucial upfront steps include testing for possible performance issues and creating a data governance plan, according to experienced data virtualization practitioners.
When selecting a platform, compare various options available in the market and make a shortlist of platforms according to cost and workloads supported, suggested Amaresh Tripathy, senior vice president and global business leader at IT professional services firm Genpact. Next, it's a good idea to start with a pilot program to test out the platform's benefits -- and cost.
Stand-alone data virtualization platform vendors include DataVirtuality, Denodo Technologies, Red Hat, SAS Institute Inc. and Tibco Software, while vendors like IBM, Informatica and SAP offer data virtualization tools as part of broader data integration platforms.
Data virtualization benefits include rapidly prototyping new data flows involving data from multiple federated sources. Full-blown data virtualization platforms might provide a higher level of reusability in the long run. In the short run, Tripathy recommended tools that are optimized for rapid prototyping while also making it possible to persist data for production workflows.
Data virtualization projects, he added, have also worked out well for operational reporting at Genpact clients when the performance hit on data sources is acceptable. It's also important to ensure that the query latency of a virtual view is acceptable to end users, Tripathy said.
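One way to check the latency point during a pilot is to time a virtual view directly. The following is a minimal Python sketch, not any vendor's implementation: two in-memory SQLite databases stand in for separate back-end systems, and the table names, sample rows and the `revenue_by_customer` and `timed` helpers are all hypothetical. The view queries both sources on demand and joins the results in memory rather than copying data into a central store.

```python
import sqlite3
import time


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one back-end system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Two hypothetical back-end systems with illustrative data.
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
billing = make_source(
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL)",
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)


def revenue_by_customer():
    """Virtual view: query both sources at request time and join in memory."""
    totals = dict(
        billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
        )
    )
    return [
        (name, totals.get(customer_id, 0.0))
        for customer_id, name in crm.execute(
            "SELECT id, name FROM customers ORDER BY id"
        )
    ]


def timed(view):
    """Return the view's rows and its wall-clock latency in seconds."""
    start = time.perf_counter()
    rows = view()
    return rows, time.perf_counter() - start


rows, latency = timed(revenue_by_customer)
print(rows, f"{latency * 1000:.2f} ms")
```

In a real pilot, the same kind of timing would be run against production-sized sources and compared with whatever response time end users consider acceptable.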
Data quality, provenance and governance
A data virtualization platform can make it easy for business analysts, security teams and data protection officers to audit the flow of data across the company. Matt Shaw, director and head of the enterprise data practice at consultancy Synechron, worked with a U.K.-based global investment bank on a data virtualization project. The bank, which he declined to identify, needed to comply with global banking regulation BCBS 239, which is designed to ensure effective risk data aggregation and reporting with a strong focus on data quality and provenance.
The project had to ingest data across 42 data centers in 16 countries to provide a single golden source of data. A data virtualization tool was selected that could perform data lineage attribution and entity resolution for recording merge recommendations. Solid data protection controls were also implemented, Shaw said. Even though the project was delivered at the end of 2018, the work continues -- ingesting, cataloguing, conforming, curating and protecting additional data to support the bank's analytics projects.
Companies should approach a data virtualization project with a governance plan in place, particularly in privacy-sensitive industries. The plan should include consideration of how to protect the data, where the data is coming from and where it's intended to be used.
Synechron's banking project required that management establish clear ownership and accountability for data elements throughout the bank, Shaw explained. That involved policy and operational model changes as well as collaboration among the chief data officer and IT stakeholders. Implementing a data catalog and ontology to support a framework for organizing data was also helpful in searching for data without compromising data privacy, Shaw noted.
Know the limitations
Data virtualization can also impose limitations on a project, such as performance overhead. "Data virtualization as a concept has seen limited success in the market given the cost versus performance trade-offs they come with," Genpact's Tripathy said.
Genpact has used several different data virtualization platforms for its clients, including tools from Lore IO, Informatica, DataVirtuality and Denodo, but it has seen mixed results. Data virtualization projects can incur significant hardware costs or deliver unacceptable performance, particularly for bigger or growing data volumes, Tripathy said. Data virtualization may also complicate development and analytics work when the data structure changes or when teams try to understand deeper correlations across data sources.
"Before diving into data virtualization, understand the complexities of your data sources," said Vikas Khorana, co-founder and CTO at Ntooitive Digital LLC, a digital ad agency and software vendor for publishers in the media industry. Data virtualization benefits can be limited by changes in the source data structure.
"Once you pull data sources in and draw up correlation logic, these changes, no matter how small, inevitably send you back to the drawing board to tweak or fix any issues," Khorana said.
Ntooitive experimented with data virtualization to help make the overall computational architecture behind its core platform more agile. But it ran into problems because the data structures exposed by vendor platforms like Facebook Business Manager and Google Analytics changed frequently. The company ultimately decided to build its own tool that can fetch, simplify, unify and integrate information from multiple sources as part of N2Hive, a sales automation platform it sells.
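One lightweight guard against the kind of source-structure churn Khorana describes is to compare the columns a virtual view was built against with the columns a source returns today, before any correlation logic runs. The Python sketch below illustrates the idea under stated assumptions: the `schema_drift` helper and the column names are hypothetical and are not taken from any real vendor API or from Ntooitive's tooling.

```python
def schema_drift(expected, actual):
    """Compare the columns a view expects with the columns a source returns now."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),      # the view will break on these
        "unexpected": sorted(actual_set - expected_set),   # new or renamed fields
    }


# Hypothetical column lists for an ad-reporting source.
expected_columns = ["campaign_id", "clicks", "impressions"]
current_columns = ["campaign_id", "click_count", "impressions"]

drift = schema_drift(expected_columns, current_columns)
if drift["missing"] or drift["unexpected"]:
    print("Source structure changed:", drift)
```

A check like this does not remove the rework a change causes, but it can surface the change before a broken view reaches end users.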