Big data architecture adds integration options -- and tool needs

Mixing data warehousing and big data technologies creates new opportunities for data integration, but also calls for new types of integration tools.

Big data technologies open up new options for storing and managing data -- potentially in concert with data warehouse systems, not as an alternative to them. That in turn creates new data integration opportunities, which might require additional tools to effectively support a big data architecture.

Big data systems make it more feasible to store data "in a very crude fashion and refine it as needed" for particular uses, said Shawn Rogers, who heads business intelligence (BI) and data warehousing research at Enterprise Management Associates Inc. in Boulder, Colo.

Rogers said Hadoop systems and NoSQL databases can serve as "a sort of loading dock" for raw data, with data models and schemas being applied to data sets "much later in the game than they used to be." In such scenarios, data integration morphs from conventional extract, transform and load (ETL) processes into more malleable extract, load and transform (ELT) approaches. And once data is ready for BI and analytics uses, it can be put on the system that's the best fit, whether that's a data warehouse, a Hadoop cluster or a special-purpose analytical database. "It doesn't have to be so rigid now," Rogers said. "We can apply some freedom, and some common sense, to our architectures."
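The "loading dock" pattern Rogers describes can be sketched in a few lines: data lands raw, and the schema is applied at read time instead of load time. This is an illustrative sketch only; the record fields (`user`, `ts`, `amount`) are hypothetical, not drawn from any system mentioned in the article.

```python
import json

# Hypothetical raw records, landed as-is ("extract and load");
# no schema is enforced at write time.
raw_records = [
    '{"user": "u1", "ts": "2013-06-01T10:00:00", "amount": "19.99"}',
    '{"user": "u2", "ts": "2013-06-01T10:05:00"}',           # missing field
    '{"user": "u3", "ts": "bad-timestamp", "amount": "5"}',  # dirty value
]

def apply_schema(line):
    """The 'transform' step, deferred to read time (ELT, not ETL)."""
    rec = json.loads(line)
    return {
        "user": rec.get("user"),
        "ts": rec.get("ts"),
        "amount": float(rec["amount"]) if "amount" in rec else None,
    }

typed = [apply_schema(line) for line in raw_records]
print(typed[0]["amount"])  # 19.99
```

The raw store keeps every record intact; different downstream uses can apply different schemas to the same data, which is exactly the flexibility Rogers points to.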

Another factor that encourages ELT over ETL in big data environments is a desire on the part of data scientists doing advanced analytics to have access to unfiltered information. "Data scientists are used to working with 'dirty data' and dealing with the noise," said Michele Goetz, an analyst at Forrester Research Inc. in Cambridge, Mass. In fraud detection applications, for example, "you don't clean the data at all," Goetz said. The goal is to find anomalies in the information that point to suspicious transactions and activities.
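The fraud detection point can be illustrated with a toy example: the outlier is the signal, so cleansing it away would defeat the purpose. The figures below are invented for illustration, and a real system would use far more sophisticated methods than a z-score cutoff.

```python
import statistics

# Hypothetical transaction amounts, left uncleaned: the outlier is the signal.
amounts = [20.0, 22.5, 19.0, 21.0, 18.5, 950.0, 20.5]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag anything more than two standard deviations from the mean.
suspicious = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(suspicious)  # [950.0]
```

Had the 950.0 record been scrubbed as "dirty data" during an ETL cleansing pass, there would be nothing left to detect.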

To help grease the data integration skids in a federated systems environment, Goetz recommends that organizations create a "contextual services" layer consisting of components such as a metadata repository, data quality and governance policies, master data management models, and an enterprise-wide glossary of business terms. "Unless you have that, you're not going to be able to put all the pieces together," she said.
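One slice of such a contextual services layer can be sketched as a business glossary that maps a shared term to its technical metadata in each federated system. Everything here (the term, the definitions, the column names) is a hypothetical example, not an actual metadata standard.

```python
# Hypothetical business glossary entry: one shared term, mapped to the
# physical columns that hold it in each federated system.
glossary = {
    "customer_id": {
        "definition": "Unique identifier for a customer across all systems",
        "sources": {"warehouse": "DIM_CUST.CUST_ID", "hadoop": "cust.raw_id"},
    },
}

def resolve(term, system):
    """Look up the physical column a shared business term maps to."""
    entry = glossary.get(term)
    return entry["sources"].get(system) if entry else None

print(resolve("customer_id", "hadoop"))  # cust.raw_id
```

A real contextual services layer would back this with a metadata repository and governance policies rather than a dictionary, but the lookup role is the same: without a shared vocabulary, the federated pieces don't fit together.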

Too much going on in a big data architecture

Another danger in developing data integration applications involving Hadoop clusters and other big data systems is overloading them with too much data movement. "It's easy to write a MapReduce program, but it's also easy to write one that doesn't perform very well," said David Loshin, president of consultancy Knowledge Integrity Inc. in Silver Spring, Md. "You don't want to flood your network with just sloshing data back and forth."
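Loshin's point about sloshing data can be made concrete with a word-count toy, simulated in-process: a combiner pre-aggregates each map task's output locally, so fewer key-value pairs cross the network during the shuffle. This is a simulation sketch, not actual Hadoop code.

```python
from collections import Counter

# Two hypothetical input splits, as a map task on each node would see them.
splits = [
    "error warn error info",
    "error info info warn error",
]

def map_split(text):
    """Map phase: emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: aggregate map output locally before it crosses the network."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

mapped = [map_split(s) for s in splits]
shuffled_without = sum(len(m) for m in mapped)   # pairs shipped with no combiner
combined = [combine(m) for m in mapped]
shuffled_with = sum(len(c) for c in combined)    # pairs shipped after combining

# Reduce phase: merge the per-node partial counts.
final = Counter()
for part in combined:
    for word, n in part:
        final[word] += n

print(shuffled_without, shuffled_with)  # 9 6
```

Even on this tiny input the shuffle shrinks from nine pairs to six; on skewed, high-volume data the difference between a naive MapReduce job and one with local aggregation is what separates a program that runs from one that floods the network.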

The good news is that vendors of big data technologies and data integration tools are trying to one-up each other in developing automated integration capabilities for big data environments. For some users, the tools that are available now are good enough to get them over at least basic integration hurdles. For example, Amadeus IT Group SA, a travel reservations system operator based in Madrid, Spain, is using Hadoop, MapReduce and NoSQL technologies to reduce its IT costs and support new services for travel agencies and other users of its system -- including an application called Extreme Search, which provides proposed trip itineraries to consumers based on a variety of customizable parameters.

A June 2013 report on big data usage and issues in the travel industry, written by university professor and author Thomas Davenport, puts creating integrated data sources first on a list of challenges that need to be overcome. Integration can be a particularly thorny task for travel companies because of their continuing use of mainframe systems at the heart of their IT architectures, according to the report, which was sponsored but not controlled by Amadeus.

Hervé Couturier, head of research and development at Amadeus, said during a joint interview with Davenport that the company's mainframe isn't going away anytime soon. But the integration problem is solvable, he added. "The challenge is using 30-year-old technology and [how you] merge that with new technology," Couturier said. "But we can do that. The technology is here, and now the question, to a large degree, is how you can get to a usable business case."

Multiple slots on big data integration tool belt

There's no shortage of packaged tools to choose from for use in big data integration, and there isn't necessarily one right answer. ETL technology isn't completely out of the picture -- it still has viable applications in big data environments. Data virtualization software that pulls together information from source systems without physically moving it is another option offered by various integration vendors. Data replication, change data capture and compression technologies can all play valuable roles in integrating big data, Loshin said.
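Change data capture, one of the options Loshin mentions, can be sketched as a diff between two snapshots keyed by primary key, so that only inserts, updates and deletes move between systems instead of a full reload. The table and values below are hypothetical, and production CDC tools typically read transaction logs rather than comparing snapshots.

```python
# Minimal snapshot-diff CDC sketch; keys are primary keys, values are rows.
old_snapshot = {1: ("Alice", "gold"), 2: ("Bob", "silver"), 3: ("Carol", "gold")}
new_snapshot = {1: ("Alice", "gold"), 2: ("Bob", "gold"), 4: ("Dave", "bronze")}

def capture_changes(old, new):
    """Return only the deltas: new rows, changed rows and removed keys."""
    inserts = {k: v for k, v in new.items() if k not in old}
    updates = {k: v for k, v in new.items() if k in old and old[k] != v}
    deletes = sorted(k for k in old if k not in new)
    return inserts, updates, deletes

inserts, updates, deletes = capture_changes(old_snapshot, new_snapshot)
print(inserts)  # {4: ('Dave', 'bronze')}
print(updates)  # {2: ('Bob', 'gold')}
print(deletes)  # [3]
```

Shipping three deltas instead of the whole table is the appeal: the less data moved, the less network pressure on the big data environment.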

Database vendors that offer a mix of relational, columnar and appliance technologies are integrating the products up front to enable data to flow between them, although Rogers said that creates the potential for "stack lock-in" with a single vendor. In addition, vendors of all stripes have introduced connector software that can shuttle data between Hadoop systems and SQL databases. Gartner Inc. analyst Merv Adrian also pointed to Apache HCatalog, a table and storage management technology that's being developed by the Apache Software Foundation; it's designed to provide a shared schema and table abstraction capabilities to free Hadoop users from having to worry about where and in what format their data is stored.

But taken as a whole, the current set of integration tools still has some maturing to do. "Some easy things have been done," Rogers said. "Over the next 18 to 24 months, I think we'll see more sophisticated tools."

Tony Baer, an analyst at London-based Ovum Ltd., has a similar expectation. Baer said the state of big data integration tools today is similar to that of BI and data warehouse software circa 1996. "Back then, the industry had to introduce things like data cleansing because, for the most part, people had simply been dealing with transactional data up until then," he said, adding that more functional tools are needed "to help civilize and manage big data integration."

Craig Stedman is executive editor of SearchDataManagement. Email him at [email protected] and follow us on Twitter: @sDataManagement.

Freelance writer Alan R. Earls contributed to this story.

