BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
I’m hearing a lot about Hadoop and MapReduce, but I’m still a little unclear as to how those two emerging database technologies relate to each other. Can you clarify?
Hadoop is a framework for distributed data and computing. In other words, it’s excellent for storing large sets of semi-structured data. (Whether a collection of semi-structured data can truly be considered to be a “set” is an interesting question, but you can probably guess what I mean). The data can be stored redundantly, so the failure of one disk doesn’t result in data loss. Hadoop is also very good at distributed computing – processing large sets of data rapidly across multiple machines.
MapReduce is a programming model for processing large sets of semi-structured data. What is a programming model, you ask? It’s a way of approaching and solving a given problem. For example, in a relational database, we perform queries using a set-based language – i.e., SQL. We tell the language the result we want and leave it to the system to work out how to produce it. With a more traditional language (C++, Java), we tend to spell out, step by step, how to solve the problem. Those are two different programming models. MapReduce is yet another.
MapReduce and Hadoop are independent of each other but, in practice, work well together – hence we often find them mentioned in the same breath.