Big data software deployments might be launched because existing data warehouse systems are beginning to sag under the weight of the data that's flooding into organizations. But that doesn't mean data warehouses are suddenly obsolete -- just that the nature of warehousing data is changing to make room for big data.
"Different styles of data warehouse architecture have come and gone over the years," said Philip Russom, data management research director at The Data Warehousing Institute (TDWI) in Renton, Wash. "As we move to bigger volumes and diversity of data, we have to again evolve the data warehouse, just as we have in the past."
Hadoop-based big data systems initially were viewed as potential data warehouse killers, but that sentiment has largely given way to expectations of peaceful coexistence. For example, 78% of 263 IT professionals, business users and consultants surveyed by TDWI in November 2012 said they thought Hadoop systems could be a useful complement to their data warehouses for supporting advanced analytics applications. In addition, 41% saw Hadoop as an effective staging area for information on its way to an enterprise data warehouse (EDW). Asked if Hadoop clusters could fully replace an EDW, more than half of the respondents said no; just 4% said yes.
Russom thinks that using Hadoop to stage data for loading into data warehouses is a "beachhead" for big data technologies in companies. But the staging process itself is one aspect of data warehousing that has changed significantly in recent years, he said. In many cases, raw data is likely to pile up in Hadoop clusters and initially be analyzed there. "In the old days, the data staging area was pretty temporary," Russom said. "But it has evolved to become a kind of archive."
Comrades in data processing arms
Even so, he doesn't expect those archives to exist in isolation, disconnected from data warehouses. Some of the data will be moved into EDWs, perhaps in the form of aggregated analytics results, and the two technologies increasingly are being used in tandem, according to Russom. "Hadoop-enabled analytics are sometimes deployed in silos, but the trend is toward integrating Hadoop and EDW data at analysis time for maximal visibility into business performance," he wrote in a report about the TDWI survey.
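As a rough illustration of that hand-off, the sketch below aggregates raw records and loads only the summarized results into a warehouse table. It is a minimal sketch under stated assumptions, not a description of any surveyed company's pipeline: the `raw_events` list stands in for data staged in a Hadoop cluster, an in-memory SQLite database stands in for the EDW, and the `sales_summary` table and its columns are hypothetical names chosen for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw events, standing in for data staged in a Hadoop cluster.
raw_events = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 30.0},
]


def aggregate_by_region(events):
    """Summarize raw events into one (region, order_count, total_amount) row per region."""
    totals = defaultdict(lambda: [0, 0.0])
    for event in events:
        totals[event["region"]][0] += 1
        totals[event["region"]][1] += event["amount"]
    return [(region, count, total) for region, (count, total) in sorted(totals.items())]


def load_into_warehouse(conn, rows):
    """Load the aggregated rows into a summary table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary ("
        "region TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_summary VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database is an assumption standing in for the EDW.
warehouse = sqlite3.connect(":memory:")
summary_rows = aggregate_by_region(raw_events)
load_into_warehouse(warehouse, summary_rows)

loaded = warehouse.execute(
    "SELECT region, order_count, total_amount FROM sales_summary ORDER BY region"
).fetchall()
print(loaded)
```

The raw events stay in the staging layer; only the per-region totals reach the warehouse, which is one way the two systems might divide the work.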
Big data projects begun as skunkworks or standalone undertakings do run the risk of creating information silos. To prevent that, organizations should incorporate them into an overall data management strategy from the start, said Gartner Inc. analyst Mark Beyer. That means asking many of the same questions IT teams ask about conventional data as part of data quality and governance programs, he added. For example, where did a particular set of big data come from, how long must it be kept and does it need to be remediated before being used?
Beyer said applying proven data management processes to pools of big data is especially important with information that comes from external sources, including what he described as "crowdsourced" data collected from Facebook, Twitter and other social networks. With such data, "you don't know if the 'create case' matches the use case." Understanding the origins of data and factors such as how fast it changes is crucial to effective big data management, he advised.
The bottom line, Beyer said, is that "big data assets are no more accurate than any other digital information" -- and often less so. As a result, he warned IT managers to get ready for a bumpy ride: "Big data is an invader. Big data breaks things. You don't control it." Asserting control over the data once it's in an organization's systems could mean the difference between success and failure in making effective use of the information, according to Beyer.
Big data software brings technical challenges
In addition to the data quality and governance challenges, technical complexities lurk around the corners of big data environments. Not least is the need for MapReduce programming skills in organizations that are implementing Hadoop clusters. Maintaining high Hadoop performance levels can also be difficult, partly because of scalability limitations in the first version of the distributed processing framework. In addition, Hadoop 1.x was limited to running MapReduce batch-processing applications.
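To give a sense of what MapReduce programming involves, here is a minimal single-process sketch of the map, shuffle and reduce steps applied to a word count. It uses plain Python rather than the Hadoop API, so the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are illustrative assumptions; a real Hadoop job would distribute the same steps across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


lines = ["big data breaks things", "big data is an invader"]
counts = run_job(lines)
print(counts)
```

Even this toy version shows why the model demands a different way of thinking: every analysis has to be expressed as independent per-record map work followed by per-key reduce work.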
The new Hadoop 2 release, which became generally available in October 2013, addresses the scalability and batch-only limitations of Hadoop 1.x by opening up the framework to non-MapReduce applications and adding federation and high-availability features designed to increase scalability and cluster uptime. Several vendors have also introduced query engines that support real-time analysis of Hadoop data, while Yahoo Inc. and other users have paired the open source Storm complex event processing engine with Hadoop 2's new YARN resource manager to capture streaming data.
Those technologies and others might well help the big data management and analytics cause, but they further add to the vast forest of big data software that IT, data warehousing and data management professionals need to find their way through in planning and managing deployments. And it's a challenge that likely will be faced in more and more companies. In the TDWI survey, only 10% of the respondents said their organizations had Hadoop systems in production use -- but another 51% said they expected to be Hadoop users within three years.
The corporate spotlight increasingly will be on the IT teams responsible for building scalable big data systems and integrating them into existing data warehousing, analytics and operational environments. Finding the right technologies, and managing the process in a way that gets the most out of them, will help keep the glare of that light from getting too hot.