What do MapReduce and in-database technology mean for data warehouses?

By Jeff Kelly, News Editor
22 Jul 2009 | SearchDataManagement.com

Collecting large volumes of data is easy enough. But analyzing that data fast enough for the resulting insights to still be relevant is another matter.

That's where in-database analytics comes in. In-database analytics puts analytic logic inside the database itself – rather than in individual business intelligence and analytic applications that access a data warehouse, for example – for faster results on ever-larger data sets. Data no longer has to be cleansed and transferred to a separate destination.
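The idea of putting analytic logic inside the database can be sketched in a few lines. The example below is only an illustration under stated assumptions: Python's built-in `sqlite3` module stands in for a data warehouse, and the `payments` table, the `risk_score` rule and the `score_in_database` helper are all hypothetical names invented here, not anything described by the vendors or analysts quoted in this article.

```python
import sqlite3


def risk_score(amount, prior_payments):
    """Hypothetical scoring rule, invented purely for illustration."""
    return round(amount / (1 + prior_payments), 2)


def score_in_database(rows):
    """Load rows into an in-memory database and score them inside a query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (player TEXT, amount REAL, prior_payments INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)

    # Register the analytic logic with the database engine, so the query
    # calls it row by row instead of exporting the rows to another system.
    conn.create_function("risk_score", 2, risk_score)

    scored = conn.execute(
        "SELECT player, risk_score(amount, prior_payments) "
        "FROM payments ORDER BY player"
    ).fetchall()
    conn.close()
    return scored


if __name__ == "__main__":
    print(score_in_database([("alice", 100.0, 1), ("bob", 90.0, 2)]))
```

The point of the sketch is only where the logic is invoked: the scoring function is called from within the SQL query, so the data never has to be copied out to a separate analytics application first.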

Most of the big database vendors, including IBM, Oracle and Teradata, offer some sort of in-database functionality, said Jim Kobielus, an analyst with Cambridge, Mass.-based Forrester Research. The in-database technology at these vendors is mainly proprietary, he explained, meaning that it cannot be transferred from one vendor's database – say, IBM's – to another's – Oracle's or Teradata's, for example.

But an open source in-database analytics framework originally developed by Google has emerged that could change that, Kobielus said. Called MapReduce, it is essentially an open source stored procedure that lets developers put application logic in the data warehouse, enabling high-performance analytics against large data sets inside the warehouse, much like proprietary in-database technology.

The important difference between MapReduce – which is supported by just two smaller database vendors, Greenplum and Aster Data Systems – and proprietary technology is that MapReduce's open source nature makes it easier to reuse, Kobielus said.

"The benefit of MapReduce's vendor-neutral standard is that the logic you develop on Greenplum or Aster, you can in theory port those apps to execute against bigger and faster databases of your choice," Kobielus said, just as you might port SQL-based logic. "It gives flexibility to the developer."

Introduced to the world via a Google whitepaper in 2004, MapReduce supports the full gamut of data management processes, from extract, transform and load (ETL) to data cleansing and data analysis, said Wayne Eckerson, director of research and services for The Data Warehousing Institute in Renton, Wash. It can also handle more complex processes than those written in relatively simple SQL, he added.

"MapReduce is a way to insert custom programs into a database. In many ways, it's like a user-defined function," Eckerson explained. "Very complex analytics that are hard to do in SQL would be easy to do in MapReduce."

MapReduce works by first breaking up data into multiple parts and distributing them to multiple nodes for analysis – that's the "map" part. It then "reduces" the partial results, aggregating them and returning a single answer.
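The flow described above can be sketched as a small single-process program. This is a toy illustration, not Google's, Greenplum's or Aster's implementation: the function names (`split_into_parts`, `map_part`, `shuffle`, `reduce_group`, `map_reduce`) are hypothetical, the "nodes" are simulated by list slices, and the task (counting page views per URL) is assumed for the example.

```python
from collections import defaultdict


def split_into_parts(records, n_parts):
    """Break the input into n_parts slices, one per simulated node."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_part(part):
    """Map step: each node emits a (key, value) pair for every record it holds."""
    return [(url, 1) for url in part]


def shuffle(mapped_parts):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_parts:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return key, sum(values)


def map_reduce(records, n_parts=3):
    parts = split_into_parts(records, n_parts)
    # On a real cluster each part would be mapped on a different node in parallel.
    mapped = [map_part(part) for part in parts]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    views = ["/home", "/cart", "/home", "/buy", "/home"]
    print(map_reduce(views))
```

Because each part is mapped independently and each key is reduced independently, both steps can be spread across as many nodes as are available, which is why the approach suits parallel hardware.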

MapReduce is usually implemented on data warehousing platforms or in a cloud environment in conjunction with Hadoop, another open source framework that Kobielus described as a cloud-based user-defined function. "[Together] MapReduce over Hadoop is more than in-database analytics. It's really in-cloud analytics," he said. "The sky is the limit as to what data streams you can analyze in this cloud environment."

Of course, in-database technology, including MapReduce, is not ideal in all situations. Business analysts and executives still want to see corporate data in the form of reports and dashboards. In-database analytics is particularly helpful, though, in speeding up click-stream and other Web-based data analysis, like tracing the path a user took through a website to better understand buying patterns, Kobielus said.
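Tracing a user's path through a website fits the map-and-reduce pattern naturally. The sketch below is a hedged illustration only: the `(user, timestamp, page)` click format and the function names `map_click`, `reduce_path` and `trace_paths` are assumptions made for this example, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_click(click):
    """Map step: key each click by user so one user's clicks end up together."""
    user, timestamp, page = click
    return user, (timestamp, page)


def reduce_path(events):
    """Reduce step: order one user's clicks by time and keep the page sequence."""
    return [page for _, page in sorted(events)]


def trace_paths(clicks):
    """Rebuild every user's path through the site from an unordered click log."""
    grouped = defaultdict(list)
    for click in clicks:
        user, event = map_click(click)
        grouped[user].append(event)
    return {user: reduce_path(events) for user, events in grouped.items()}


if __name__ == "__main__":
    log = [
        ("u1", 3, "/buy"),
        ("u2", 1, "/home"),
        ("u1", 1, "/home"),
        ("u1", 2, "/cart"),
    ]
    print(trace_paths(log))
```

Once the paths are rebuilt, a further pass could count which sequences most often end in a purchase, which is the kind of buying-pattern question click-stream analysis is meant to answer.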

Greenplum and Aster are the only two database vendors to apply MapReduce to their database products, however, and this limits the open source framework's reach. That may be because other database vendors utilizing proprietary in-database technology of their own are reluctant to embrace a potentially revenue-draining open source technology.

Greenplum's and Aster's core databases are also based on massively parallel processing (MPP) architectures, which makes MapReduce a better fit for them than for non-MPP databases, Eckerson said. "It seems to be very useful in a parallel environment."

But the open source framework is paying off for some. While Google developed MapReduce to analyze its users' Web behavior, today Greenplum and Aster customers are taking advantage of MapReduce for a wider variety of uses, including supporting click-stream analysis, recommendation engines, and fraud detection.

An online gaming firm, for example, is using an Aster data warehouse with MapReduce to catch money launderers and cheats, said Shawn Kung, senior product manager at Aster Data Systems, which is based in San Carlos, Calif. Previously, the firm pulled data from its operational database and transferred it to an analytics engine on a single SMP server, Kung explained. A custom-written algorithm was then applied to the data to detect behavioral patterns that indicated a high likelihood of illegal activity.

But the firm could afford to run the fraud detection analytics only once a week, and each run took between an hour and a half and two hours, Kung said. After migrating to an Aster data warehouse with MapReduce, the firm was able to do the analytics in the database itself, eliminating the need to transfer the data to a separate analytics engine. The MPP architecture also cut the time it took to run the analytics to just 90 seconds.

As TDWI's Eckerson explained: "What [MapReduce] is replacing is the need to create a custom database environment and to run custom programs in a separate environment."

Not all database vendors are as impressed with MapReduce as Greenplum and Aster are, however. "A lot of people don't want to use MapReduce inside the database as much as we database vendors probably want them to," said Dan Graham, marketing director at Teradata.

Graham said MapReduce has a number of useful applications, including for "ETL on steroids," transferring massive data sets to a separate data warehouse. But for in-database analytics, Teradata has written its own proprietary framework that seems to suit developers just fine, he said.

"It's not a burning need, because the MapReduce and Hadoop programmers don't really think much about databases," Graham said. "It turns out if you have a data warehouse at a site where there is a MapReduce community of programmers, the most common and valuable function is this ETL on steroids."

Still, Teradata is experimenting behind the scenes with MapReduce, Graham said, although he wouldn't provide details. And whether other database vendors will ultimately embrace MapReduce as Aster and Greenplum have done is very much an open question.

What may be holding MapReduce back is a lack of accepted industry-wide standards, like those that exist for SQL, which ensure that developers are writing code in a common language, Aster's Kung said. "That's what it's going to take to [move] MapReduce to the next level."

"The reality is if you look at leading data warehousing and database vendors on the market today and their degree of adoption of MapReduce, it's not all there yet," Forrester's Kobielus said. "[But] there's a lot of push from customers – at least, leading-edge customers – to support this kind of framework. The smart money is riding on MapReduce and Hadoop now."


