I’m hearing a lot about Hadoop and MapReduce, but I’m still a little unclear as to how those two emerging database technologies relate to each other. Can you clarify?

Hadoop is a framework for distributed data storage and distributed computing. On the storage side, it’s excellent for holding large sets of semi-structured data. (Whether a collection of semi-structured data can truly be considered a “set” is an interesting question, but you can probably guess what I mean.) The data can be stored redundantly, with copies kept on more than one machine, so the failure of one disk doesn’t result in data loss. On the computing side, Hadoop is very good at processing large sets of data rapidly across multiple machines.
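To illustrate why redundant storage protects against a single failure, here is a toy sketch in plain Python. It is not Hadoop's storage code and does not reproduce its placement policy: the node names, block names, round-robin placement, and the choice of three copies are all assumptions made for illustration.

```python
def place_blocks(blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes (hypothetical round-robin policy)."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for i, block in enumerate(blocks):
        # Consecutive nodes, wrapping around, so the holders are always distinct.
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one copy on a surviving node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Illustrative cluster: four machines, five blocks of one file.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = ["block-0", "block-1", "block-2", "block-3", "block-4"]
placement = place_blocks(blocks, nodes, replicas=3)

# Losing one machine leaves every block readable from another copy.
survivors = readable_blocks(placement, failed={"node-b"})
print(survivors == set(blocks))
```

With three copies of each block, data only becomes unreadable in this sketch once every machine holding a given block has failed.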

MapReduce is a programming model for processing large sets of semi-structured data. What is a programming model, you ask? It’s a way of approaching and solving a given problem. For example, in a relational database, we perform queries using a set-based language, namely SQL. We tell the language the result we want and leave it to the system to work out how to produce it. With an imperative language such as C++ or Java, we tend to spell out, step by step, how to solve the problem. Those are two different programming models. MapReduce is yet another: we express the work as a map function that turns each input record into key/value pairs and a reduce function that combines all the values sharing a key, and the framework takes care of spreading that work across machines.
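To make the MapReduce model a little more concrete, here is a minimal word-count sketch in plain Python. It runs on a single machine and does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines and the framework would do the grouping.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


lines = ["Hadoop stores data", "MapReduce processes data"]
counts = word_count(lines)
print(counts)
```

Note that the programmer only writes the map and reduce functions; nothing in them says which machine handles which line, which is what lets a framework distribute the work.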

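MapReduce, as a programming model, is independent of Hadoop and can be implemented on other systems. In practice, though, Hadoop includes its own MapReduce implementation and the two work well together, hence we often find them mentioned in the same breath.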

This was first published in May 2011
