Big data applications: Real-world strategies for managing big data
The growing interest in capturing, storing and analyzing “big data” has prompted many a data management trend spotter to predict the impending demise of the enterprise data warehouse (EDW). But as companies dive deeper into big data deployments, Mark Twain’s famous remark that the report of his death was an exaggeration may turn out to be a more accurate commentary on the EDW’s future prospects.
There’s little doubt that the flood of big data -- large amounts of both structured and unstructured information, often involving multiple data types and frequent data updates -- will require changes in many corporate data warehousing strategies. For the past two decades, IT groups, particularly in large companies, have pursued the development of a single data warehouse that serves as the central repository for all of the structured data within their organizations. Now the validity of that approach is being challenged by the meteoric increase in social media posts and a surge in non-transactional data from sources such as application and Web server logs, network monitoring devices and sensors.
The traditional relational database world of the EDW typically isn’t equipped to accommodate the incoming tide of text and other forms of unstructured data. In response, pockets of users within companies -- often operating outside of the control of the IT department or data warehouse team -- have embraced new technologies like Hadoop, MapReduce and NoSQL databases in an effort to gain control over expanding volumes of big data so they can be mined for insights that can lead to competitive advantages and other business benefits.
But even with the rapid emergence of big data technologies as alternatives to relational database management system (RDBMS) platforms, the EDW isn’t headed for extinction anytime soon, according to data warehousing analysts. Instead, they see it morphing into a different animal as companies that are trying to formalize their internal big data best practices look to extend their existing data warehouse systems and processes to help manage the new data types.
“The EDW is not going away -- in fact, the enterprise data warehouse itself was always a vision and never a fact,” said Mark Beyer, research vice president for information management at Gartner Inc. in Stamford, Conn. “Now the vision of the EDW is evolving to include all the information assets in the organization. It’s changing from a repository strategy into an information services platform strategy.”
What Beyer and other analysts envision is a modified version of the EDW, in which structured and unstructured data sets are stored and managed where it makes the most sense as part of an extended but well-coordinated architecture.
Big data a match for data warehouse discipline?
“We’re seeing the trend of applying the technology and disciplines learned in established data warehouses to a more federated set of data sources,” said David Menninger, a vice president and research director at Ventana Research in San Ramon, Calif.
In a survey on big data management conducted by Ventana early last year, 89% of the 163 respondents said their organizations were using mainstream relational databases on conventional hardware platforms to support large-scale data processing activities, and 73% said relational software was their primary tool for addressing big data.
But 93% of the respondents said they were using or evaluating other technologies for managing big data, according to Ventana, which released the survey results in January. That includes flat files (being used by 70% of the respondents); data warehouse appliances (34%); in-memory databases (33%); Hadoop (22%); and specialized analytical databases (15%).
“The big federation scenario used to be geographic servers or [database] instances acting as if they were one cohesive unit, but all [with] very similar structures and all relational,” Menninger said. “What’s dying is the concept of a single-instance RDBMS as the one and only enterprise data warehouse. Now it’s evolving so the various parts may not be the same technology, but the idea is to make somewhat disparate technologies behave and act as if they are one cohesive data set.”
Most organizations aren’t there yet. In the Ventana survey, for example, 64% of the respondents cited a lack of integration between big data systems and existing business intelligence (BI) and data warehousing tools as one of the technical challenges they were facing.
But analysts say that the shift under way today is not all that dissimilar to what happened in the early 1990s, when the big data warehouse challenge centered on trying to consolidate the bumper crop of data marts seeded throughout organizations into an EDW that was under the domain of the IT department. As part of that effort, companies also tried to find common ground among different business units on data warehousing projects, both to take advantage of cost efficiencies and to foster data consistency and reuse, according to Ralph Kimball, founder of Kimball Group, a data warehouse consulting and training company in Boulder Creek, Calif.
“Eventually, it dawned on everyone that [having] islands of expertise in the end-user departments wasn’t an effective way to scale upwards and have a coherent strategy,” Kimball said, adding that the same scenario is likely to play out with the rise of big data systems, many of which get their start in functional areas outside of IT. “At some point, you try to unify these things -- not by some communist control by IT, but the Wild West of end-user departments building their own systems has to be corralled. It simply costs too much.”
Brickbats give way to building bridges
While we’re still early in that corralling process, a lot has changed even within the last six months. Initially, there was a flurry of jockeying for position between traditional data warehouse vendors and startups offering Hadoop and other big data technologies, but that appears to have subsided. The two sets of vendors are now channeling their efforts more into creating links between their respective platforms, said Colin White, president of BI Research, a consulting firm in Ashland, Ore.
Various vendors are already offering connectors for moving data between Hadoop clusters and conventional databases, with promises of more to come. As the integration and bridge-building process proceeds, Beyer said the EDW potentially can become more of a logical data warehouse that’s able to automatically retrieve data from different systems while channeling workloads to the best available platform based on factors such as cost and performance requirements.
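The routing idea behind a logical data warehouse can be illustrated with a minimal Python sketch. It is not drawn from any vendor's product or from Gartner's definition: the platform names, cost figures, latency figures and the `route` function are all hypothetical, chosen only to show a workload being sent to the cheapest platform that still meets its performance requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    cost_per_query: float       # hypothetical relative cost units
    typical_latency_s: float    # hypothetical typical response time in seconds
    handles_unstructured: bool  # can it process text, logs and similar data?


@dataclass(frozen=True)
class Workload:
    description: str
    max_latency_s: float        # performance requirement
    needs_unstructured: bool


# Illustrative catalog only; the numbers are assumptions, not benchmarks.
PLATFORMS = [
    Platform("relational_edw", 5.0, 2.0, False),
    Platform("analytical_db", 3.0, 10.0, False),
    Platform("hadoop_cluster", 1.0, 600.0, True),
]


def route(workload, platforms=PLATFORMS):
    """Return the cheapest platform that satisfies the workload's requirements."""
    candidates = [
        p for p in platforms
        if p.typical_latency_s <= workload.max_latency_s
        and (p.handles_unstructured or not workload.needs_unstructured)
    ]
    if not candidates:
        raise LookupError(f"no platform can serve: {workload.description}")
    return min(candidates, key=lambda p: p.cost_per_query)


if __name__ == "__main__":
    for w in (
        Workload("executive dashboard", 5.0, False),
        Workload("ad hoc analysis", 60.0, False),
        Workload("Web log mining", 3600.0, True),
    ):
        print(w.description, "->", route(w).name)
```

In a real deployment the cost and latency figures would come from monitoring and metadata rather than a hard-coded list, and the routing layer would also have to retrieve and combine the data itself.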
Similarly, another emerging model views the EDW as a hybrid system that virtually combines multiple data processing technologies. For example, departmental users in an organization could use Hadoop to sift through Web data in an effort to find information that’s relevant to a particular business problem, then move that subset of data to an analytical database for more heavy-duty analysis. Once the analytical processing was complete, the aggregated results could be rolled up into a data warehouse and made available to a wider group of users.
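The three-stage flow in that example can be sketched in Python. This is a toy under stated assumptions: a plain generator stands in for the Hadoop job, an in-memory SQLite database stands in for both the analytical database and the data warehouse, and the log lines, table names and the `filter_relevant` and `run_pipeline` functions are invented for illustration.

```python
import sqlite3

# Hypothetical raw Web log lines: date, requested path, HTTP status.
RAW_WEB_LOG = [
    "2012-03-01 /products/widget 200",
    "2012-03-01 /products/widget 200",
    "2012-03-01 /about 200",
    "2012-03-02 /products/gadget 200",
    "2012-03-02 /products/widget 404",
]


def filter_relevant(lines):
    """Stage 1 (stand-in for a Hadoop job): keep successful product-page hits."""
    for line in lines:
        day, path, status = line.split()
        if path.startswith("/products/") and status == "200":
            yield day, path


def run_pipeline(lines):
    conn = sqlite3.connect(":memory:")
    # Stage 2 (stand-in for an analytical database): load the relevant subset.
    conn.execute("CREATE TABLE product_hits (day TEXT, path TEXT)")
    conn.executemany("INSERT INTO product_hits VALUES (?, ?)", filter_relevant(lines))
    # Stage 3 (stand-in for the warehouse): roll up the aggregated results.
    conn.execute(
        "CREATE TABLE warehouse_daily_hits AS "
        "SELECT day, path, COUNT(*) AS hits FROM product_hits GROUP BY day, path"
    )
    rows = conn.execute(
        "SELECT day, path, hits FROM warehouse_daily_hits ORDER BY day, path"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_pipeline(RAW_WEB_LOG):
        print(row)
```

The point of the staging is that each step shrinks the data: the raw log never reaches the warehouse, only the small aggregated table that a wider group of users would query.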
“Trying to do advanced analytics on top of a traditional data warehouse architecture is daunting, which is why analytical databases have done so well,” said Shawn Rogers, vice president of BI and data warehousing research at Enterprise Management Associates Inc. in Boulder, Colo. “And it’s proven that a traditional data warehouse architecture can’t handle the unbelievable amounts of information coming off of new data sources like Web logs or social data, which is where Hadoop is a better platform.”
But the data warehouse is likely to still be part of the big data best practices equation as well: “There’s an opportunity,” Rogers added, “for all the different platforms to play a unique role in solving the problem.”
Beth Stackpole is a freelance writer who has been covering the intersection of technology and business for 25-plus years for a variety of trade and business publications and websites.