I’m hearing a lot about Hadoop and MapReduce, but I’m still a little unclear as to how those two emerging database technologies relate to each other. Can you clarify?
Hadoop is a framework for distributed data storage and computing. In other words, it’s excellent for storing large sets of semi-structured data. (Whether a collection of semi-structured data can truly be considered a “set” is an interesting question, but you can probably guess what I mean.) The data can be stored redundantly, so the failure of one disk doesn’t result in data loss. Hadoop is also very good at distributed computing – processing large sets of data rapidly across multiple machines.
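To give a concrete sense of what “stored redundantly” means in practice: in HDFS (Hadoop’s distributed file system), redundancy is controlled by a replication factor, configurable in hdfs-site.xml. The snippet below is an illustrative fragment, not a complete configuration file:

```xml
<!-- hdfs-site.xml: store each data block on 3 separate machines,
     so losing any single disk (or node) loses no data -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

A replication factor of 3 is the conventional default; raising it trades disk space for greater fault tolerance.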
MapReduce is a programming model for processing large sets of semi-structured data. What is a programming model, you ask? It’s a way of approaching and solving a given problem. For example, in a relational database, we perform queries using a set-based language – i.e., SQL. We describe the result we want and leave it to the system to work out how to produce it. With a more traditional, procedural language (C++, Java), we tend to spell out, step by step, how to solve the problem. Those are two different programming models. MapReduce is yet another.
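To make the model concrete, here is a minimal word-count sketch in plain Python. The function names are illustrative, not Hadoop’s actual API: a map step emits a (key, value) pair for every word, and a reduce step groups the pairs by key and sums the values.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce step: group pairs by key and sum the values per key.
    # (In a real MapReduce system, a "shuffle" stage does the grouping
    # across machines before reduce runs.)
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
all_pairs = (pair for doc in documents for pair in map_phase(doc))
word_counts = reduce_phase(all_pairs)
print(word_counts["the"])  # prints 3
```

The point of the model is that map and reduce are independent, side-effect-free steps, which is what lets a framework like Hadoop run them in parallel across many machines.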
MapReduce and Hadoop are independent of each other but, in practice, work well together – hence we often find them mentioned in the same breath.