What do MapReduce and in-database technology mean for data warehouses?

By placing the analytic logic inside the data warehouse, in-database technology, including open source MapReduce, can quickly process large sets of Web-based data.

Collecting large volumes of data is easy enough. But analyzing that data fast enough for the resulting insights still to be relevant is another matter.

That's where in-database analytics comes in. In-database analytics puts analytic logic inside the database itself – rather than in individual business intelligence and analytic applications that access a data warehouse, for example – for faster results on ever-larger data sets. Data no longer has to be cleansed and transferred to a separate destination.

Most of the big database vendors, including IBM, Oracle and Teradata, offer some sort of in-database functionality, said Jim Kobielus, an analyst with Cambridge, Mass.-based Forrester Research. The in-database technology at these vendors is mainly proprietary, he explained, meaning that it cannot be transferred from one vendor's database – say, IBM's – to another's – Oracle's or Teradata's, for example.

But an open source in-database analytics framework originally developed by Google has emerged that could change that, Kobielus said. Called MapReduce, it is essentially an open source stored procedure that lets developers put application logic in the data warehouse, enabling high-performance analytics against large data sets inside the warehouse, much like proprietary in-database technology.

The important difference between MapReduce (which is supported by just two smaller database vendors, Greenplum and Aster Data Systems) and proprietary technology is that MapReduce's open source nature makes it easier to reuse, Kobielus said.

"The benefit of MapReduce's vendor-neutral standard is that the logic you develop on Greenplum or Aster, you can in theory port those apps to execute against bigger and faster databases of your choice," Kobielus said, just as you might with SQL-based logic. "It gives flexibility to the developer."

Introduced to the world via a Google whitepaper in 2004, MapReduce supports the full gamut of data management processes, from extract, transform and load (ETL) to data cleansing and data analysis, said Wayne Eckerson, director of research and services for The Data Warehousing Institute in Renton, Wash. It can also handle more complex processes than frameworks written in the relatively simple SQL language, he added.

"MapReduce is a way to insert custom programs into a database. In many ways, it's like a user-defined function," Eckerson explained. "Very complex analytics that are hard to do in SQL would be easy to do in MapReduce."

It works by first breaking up data into multiple parts and distributing them to multiple nodes for analysis – that's the "map" part. It then "reduces" the partial results from those nodes, aggregating them into a combined answer that it returns.
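The map and reduce steps described above can be sketched in a few lines of Python. This is a minimal, single-process illustration, not how Greenplum, Aster or Hadoop implement the framework: the "nodes" are simulated by list partitions, and the function names, the record layout and the page-view counting task are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def split_into_partitions(records, num_partitions):
    """Break the input into parts, one per simulated node."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_page_views(partition):
    """Map step: emit a (page, 1) pair for every click record in one partition."""
    return [(record["page"], 1) for record in partition]


def shuffle(mapped_pairs):
    """Group the emitted values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, num_partitions=3):
    """Run map on each partition, then shuffle and reduce the results."""
    partitions = split_into_partitions(records, num_partitions)
    mapped = chain.from_iterable(map_page_views(p) for p in partitions)
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical click records, invented for illustration only.
clicks = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/cart"},
    {"user": "u3", "page": "/home"},
    {"user": "u2", "page": "/checkout"},
]

page_views = run_mapreduce(clicks)
```

In a real deployment each partition would be mapped on a separate node in parallel; here the loop over partitions simply stands in for that distribution.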

MapReduce is usually implemented on data warehousing platforms or in a cloud environment in conjunction with Hadoop, another open source framework that Kobielus described as a cloud-based user-defined function. "[Together] MapReduce over Hadoop is more than in-database analytics. It's really in-cloud analytics," he said. "The sky is the limit as to what data streams you can analyze in this cloud environment."

Of course, in-database technology, including MapReduce, is not ideal in all situations. Business analysts and executives still want to see corporate data in the form of reports and dashboards. In-database analytics is particularly helpful, though, in speeding up click-stream and other Web-based data analysis, like tracing the path a user took through a website to better understand buying patterns, Kobielus said.
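The click-path tracing Kobielus mentions fits the same pattern. The sketch below is an assumption-laden illustration rather than any vendor's method: the record fields (`user`, `ts`, `page`), the function names and the "/checkout" page are hypothetical. The map step keys each click by user, and the reduce step orders that user's clicks by timestamp to recover the path through the site.

```python
from collections import defaultdict


def map_click(record):
    """Map step: key each click by user, carrying (timestamp, page) as the value."""
    return record["user"], (record["ts"], record["page"])


def reduce_path(user, events):
    """Reduce step: sort one user's clicks by timestamp and keep the page sequence."""
    return user, [page for _, page in sorted(events)]


def trace_paths(records):
    """Group mapped clicks by user, then reduce each group to an ordered path."""
    grouped = defaultdict(list)
    for record in records:
        user, event = map_click(record)
        grouped[user].append(event)
    return dict(reduce_path(user, events) for user, events in grouped.items())


# Hypothetical, deliberately out-of-order click-stream records.
clicks = [
    {"user": "u1", "ts": 3, "page": "/checkout"},
    {"user": "u2", "ts": 1, "page": "/home"},
    {"user": "u1", "ts": 1, "page": "/home"},
    {"user": "u1", "ts": 2, "page": "/cart"},
    {"user": "u2", "ts": 2, "page": "/search"},
]

paths = trace_paths(clicks)

# Users whose path ended in a purchase, as a simple buying-pattern signal.
buyers = [user for user, path in paths.items() if path and path[-1] == "/checkout"]
```

Because every click for a given user lands in the same reduce call, each path can be rebuilt independently, which is what makes this kind of analysis easy to spread across many nodes.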

Greenplum and Aster are the only two database vendors to apply MapReduce to their database products, however, and this limits the open source framework's reach. That may be because other database vendors utilizing proprietary in-database technology of their own are reluctant to embrace a potentially revenue-draining open source technology.

Greenplum's and Aster's core databases are also based on massively parallel processing (MPP) architectures, which makes MapReduce a better fit for them than for non-MPP databases, Eckerson said. "It seems to be very useful in a parallel environment."

But the open source framework is paying off for some. While Google developed MapReduce to analyze its users' Web behavior, today Greenplum and Aster customers are taking advantage of MapReduce for a wider variety of uses, including supporting click-stream analysis, recommendation engines, and fraud detection.

An online gaming firm, for example, is using an Aster data warehouse with MapReduce to catch money launderers and cheats, said Shawn Kung, senior product manager at Aster Data Systems, which is based in San Carlos, Calif. Previously, the firm pulled data from its operational database and transferred it to an analytics engine on a single SMP server, Kung explained. Then a custom-written algorithm was applied to the data to detect behavioral patterns that indicated a high likelihood of illegal activity.

But the firm could afford to run the fraud detection analytics only once a week, and each run took between an hour and a half and two hours, Kung said. After migrating to an Aster data warehouse with MapReduce, the firm was able to do the analytics in the database itself, eliminating the need to transfer the data to a separate analytics engine. The MPP architecture also cut the run time to just 90 seconds, roughly 60 to 80 times faster than before.

As TDWI's Eckerson explained: "What [MapReduce] is replacing is the need to create a custom database environment and to run custom programs in a separate environment."

Not all database vendors are as impressed with MapReduce as Greenplum and Aster are, however. "A lot of people don't want to use MapReduce inside the database as much as we database vendors probably want them to," said Dan Graham, marketing director at Teradata.

Graham said MapReduce has a number of useful applications, including for "ETL on steroids," transferring massive data sets to a separate data warehouse. But for in-database analytics, Teradata has written its own proprietary framework that seems to suit developers just fine, he said.

"It's not a burning need, because the MapReduce and Hadoop programmers don't really think much about databases," Graham said. "It turns out if you have a data warehouse at a site where there is a MapReduce community of programmers, the most common and valuable function is this ETL on steroids."

Still, Teradata is experimenting behind the scenes with MapReduce, Graham said, although he wouldn't provide details. And whether other database vendors will ultimately embrace MapReduce as Aster and Greenplum have done is very much an open question.

Holding back MapReduce may be a lack of accepted industry-wide standards, such as SQL standards, which ensure that developers are writing code in a common language, Aster's Kung said. "That's what it's going to take to [move] MapReduce to the next level."

"The reality is if you look at leading data warehousing and database vendors on the market today and their degree of adoption of MapReduce, it's not all there yet," Forrester's Kobielus said. "[But] there's a lot of push from customers -- at least, leading-edge customers -- to support this kind of framework. The smart money is riding on MapReduce and Hadoop now."
