Pfizer swaps out ETL for data virtualization tools

When a research and development division of pharmaceutical maker Pfizer Inc. ran into data integration struggles, it decided to go virtual.

Pfizer Inc.'s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.

What is data virtualization?

Data virtualization technology adds an abstraction, or services, layer to IT architectures that enables information from multiple, heterogeneous data sources to be integrated in real time, near real time or batch, as needed.

At the time, the pharmaceutical company was using one of the most common integration practices: extract, transform and load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.
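
The copy-and-move pattern described above can be sketched in a few lines of Python. This is only an illustration, assuming SQLite in-memory databases stand in for the source system and the data mart; the table names (`results`, `results_copy`) and the helper functions are hypothetical and are not drawn from Pfizer's environment.

```python
import sqlite3


def make_source():
    """Create a hypothetical source database with two assay results."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE results (compound_id TEXT, potency REAL)")
    source.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [(" c-001 ", 0.824), ("c-002", 0.471)],
    )
    source.commit()
    return source


def run_etl(source, mart):
    """Extract rows from the source, transform them, load them into the mart."""
    # Extract: read the requested data set from the source system.
    rows = source.execute("SELECT compound_id, potency FROM results").fetchall()
    # Transform: normalize identifiers and round measurements.
    cleaned = [(cid.strip().upper(), round(potency, 2)) for cid, potency in rows]
    # Load: replace the physical copy held in the data mart.
    mart.execute(
        "CREATE TABLE IF NOT EXISTS results_copy (compound_id TEXT, potency REAL)"
    )
    mart.execute("DELETE FROM results_copy")
    mart.executemany("INSERT INTO results_copy VALUES (?, ?)", cleaned)
    mart.commit()
    return len(cleaned)
```

Because the mart holds a physical copy, any row added to the source after `run_etl` finishes stays invisible to mart users until the next batch run.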

But that's not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company's Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.

"Back then, we were basically kind of in this data warehousing information factory mode," said Michael Linhares, a research fellow and the BIS team leader.

Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.

Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn't outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.

"That's just a sign that something takes too long," he said.

Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field or introduce a new data source. Instead, they would have to build another ETL job for that data source to be added to the data mart.

Entering a virtual world

The demand for more agility, accompanied by a shrinking budget, pushed Linhares to search for an alternative. He eventually settled on a new direction: data virtualization. Rather than copying and moving data, the technology lets Pfizer keep the tools it already has: It abstracts data from multiple sources and presents a virtual view to the user through a Web portal. This enables users to quickly query, share and, most importantly, integrate data, whether it resides in flat files, an Oracle database or SQL Server.
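
The idea of a virtual view over several sources can be approximated with a small Python sketch. It assumes two attached SQLite in-memory databases stand in for separate systems such as an Oracle database and SQL Server; the schema names (`assay_db`, `batch_db`), the tables and the view name are hypothetical, and this is not how Composite's platform is implemented.

```python
import sqlite3


def build_federated_connection():
    """Attach two independent databases and define one view that joins them."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates a separate in-memory database, standing in for
    # a distinct source system.
    conn.execute("ATTACH DATABASE ':memory:' AS assay_db")
    conn.execute("ATTACH DATABASE ':memory:' AS batch_db")
    conn.execute("CREATE TABLE assay_db.results (compound_id TEXT, potency REAL)")
    conn.execute("CREATE TABLE batch_db.lots (compound_id TEXT, lot_no TEXT)")
    conn.executemany(
        "INSERT INTO assay_db.results VALUES (?, ?)",
        [("C-001", 0.82), ("C-002", 0.47)],
    )
    conn.executemany(
        "INSERT INTO batch_db.lots VALUES (?, ?)",
        [("C-001", "LOT-A"), ("C-002", "LOT-B")],
    )
    # The view stores only the query, not the data. In SQLite, a TEMP view
    # is the kind that may reference tables in more than one attached database.
    conn.execute(
        """
        CREATE TEMP VIEW compound_overview AS
        SELECT r.compound_id, r.potency, l.lot_no
        FROM assay_db.results AS r
        JOIN batch_db.lots AS l ON l.compound_id = r.compound_id
        """
    )
    return conn


def query_overview(conn):
    """Read the integrated view; rows are assembled from the sources on demand."""
    return conn.execute(
        "SELECT compound_id, potency, lot_no "
        "FROM compound_overview ORDER BY compound_id"
    ).fetchall()
```

Since nothing is copied, a row added to either source appears in the next call to `query_overview` without any reload step.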

Linhares and his team selected a data virtualization platform from Composite Software Inc. back in 2005, just as the tools were entering the market. As the years ticked by, data management and integration requirements grew as businesses added more data sources, such as cloud-based CRM systems and business intelligence (BI) tools, according to Cambridge, Mass.-based IT consulting firm Forrester Research Inc.

In 2011, Forrester released a report called Data Virtualization Reaches Critical Mass.

"Driven by new capabilities and fueled by customer successes, data virtualization delivers on the promise of Information as a Service by enabling tactical solutions that also deliver a stepping stone to enterprise data management," the report read.

According to Forrester, data virtualization can also be relatively inexpensive compared with traditional data integration methods, such as database consolidation.

For Linhares, Composite, labeled a data virtualization leader in Forrester's 2012 review of the market, presented an easy-to-use product that met a couple of other important criteria. The new platform could cache data so that even if a server crashed, users could still access views of the data held in memory. The platform also supported "pure SQL," a standard query language, according to Linhares.
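
The caching behavior described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `fetch` callable that queries the underlying source and raises `ConnectionError` when the server is unreachable; the `CachedView` class is invented for illustration and does not represent Composite's actual API.

```python
class CachedView:
    """Serve the last successfully fetched result when the source is down."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical callable that queries the source
        self._cached = None    # last good result, kept in memory

    def read(self):
        try:
            # Normal path: query the source and refresh the in-memory copy.
            self._cached = self._fetch()
        except ConnectionError:
            # Source is down: fall back to the cached result if one exists.
            if self._cached is None:
                raise
        return self._cached
```

A view that has been read at least once keeps answering from memory during an outage, while a view that was never populated still surfaces the error.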

Linhares has noted in past interviews that it's important to be ready to face the first hurdle before getting started with data virtualization: Organizations need to make sure that the data being accessed is treated and defined consistently across the sources. Otherwise, "virtualization won't work," he said.
