

Make the right choice between Hadoop clusters and a data warehouse

Consultant David Loshin outlines a process for comparing specific criteria and variables to help guide decisions on deploying Hadoop or using an enterprise data warehouse.

A major motivation for the ongoing evolution of the Hadoop software stack is a desire to position it for broader use in enterprise computing environments. And the appetite for Hadoop and related technologies among user organizations is growing -- justifiably so, because the economics of Hadoop clusters provide the ability to implement a scalable data management architecture with a low barrier to entry and to incremental investments as processing needs increase.

At the same time, much confusion remains about what Hadoop is and does, and how it differs from -- or resembles -- an enterprise data warehouse (EDW). In many cases, this confusion becomes more acute as buzz terms such as data lake or data reservoir are introduced into the Hadoop vs. data warehouse conversation. Uncertainty about which capabilities are or are not part of Hadoop adds further complexity.

There's no doubt that Hadoop has a place in the enterprise, especially as big data applications take hold. But the venerable EDW has a well-established presence in data centers, and after years of refinement plays a significant role in meeting the reporting and analytics needs of most organizations. Does the emergence of Hadoop mean it's time to abandon the EDW? Some IT and data management professionals are aching to use Hadoop as a replacement for the data warehouse -- but are companies really prepared to abandon their decades-long investments in EDW infrastructure, software, staffing and development?

Criteria to help drive buying decisions

To determine when one technology is better than another for a particular task, it's worth examining the criteria that are important to consider. In this case, they can be grouped into four high-level categories:

Operational performance. Examples include query response times, data integration and loading throughput, the volume of data that can be stored, the ability to support a mixed workload of applications and the number of simultaneous users.

Support for desired functionality. This includes things like SQL support, the ability to ingest and manage both structured and unstructured data, high availability and fault-tolerance features, integrated security and data protection, data compression and stream processing functionality.

Cost factors. As a matter of course, you should take into account elements such as the base system price, the average cost per storage volume unit and the cost of scaling up systems, as well as maintenance and support costs.

Strategic value. Utilizing a single platform for both data management and high-performance computing that powers advanced analytics applications might be beneficial to an organization. Other issues that could come into play include replacing legacy systems for which there is a shrinking pool of skilled workers and being able to use data virtualization techniques that allow you to pull together different data sets without physically moving them.

Hadoop or data warehouse, side by side

You can compare the different technologies by lining up the choices side by side and seeing how they each address the specific criteria that the people in your organization care most deeply about, in accordance with the planned applications. As an example, consider "SQL compliance," which might be relevant for a more traditional reporting application. A relational database running on a data warehouse appliance is likely to satisfy that requirement. On the other hand, even though query engines that layer SQL interfaces on top of Hadoop have become available, they may still need some additional refinement and tweaking before they can be trusted to execute all types of SQL queries correctly and efficiently.

Another example is "cost to scale," which might be relevant in the context of an application for analyzing a growing number of text documents. For a traditional EDW running on a mainframe or large Unix server, the cost of scaling up the system is likely to be high. A Hadoop cluster based on commodity hardware is better positioned to scale incrementally, and the required investment in a few more compute nodes and storage devices should be relatively small.

For each business requirement, you can repeat this evaluation process for all the criteria relevant to the desired outcome. Create a grid in which the technology choices form the columns and the performance measures form the rows. If you'd like, add a weighting factor to each of the criteria. Assess how well the corresponding technology satisfies each criterion. When you're done, you can use your weightings to formulate a comparative score that can guide the decision on how to move forward.
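The scoring grid described above is straightforward to mechanize. The sketch below is a minimal, hypothetical example: the criteria, weights and 1-to-5 ratings are illustrative placeholders, not recommendations, and should be replaced with the criteria and weightings your own stakeholders agree on.

```python
# Hypothetical weighted scoring grid for an EDW vs. Hadoop comparison.
# Each criterion maps to (weight, EDW rating, Hadoop rating) on a 1-5 scale.
# All numbers below are illustrative only.
criteria = {
    "SQL compliance":      (0.30, 5, 3),
    "Cost to scale":       (0.25, 2, 5),
    "Unstructured data":   (0.20, 2, 5),
    "Query response time": (0.25, 4, 3),
}

def weighted_score(option_index):
    """Sum of weight * rating for one technology column (0 = EDW, 1 = Hadoop)."""
    return sum(w * ratings[option_index]
               for w, *ratings in criteria.values())

edw_total = weighted_score(0)
hadoop_total = weighted_score(1)
print(f"EDW: {edw_total:.2f}, Hadoop: {hadoop_total:.2f}")
```

The comparative totals only guide the decision; a large gap on a single must-have criterion (for example, strict SQL compliance for an existing reporting application) may override the aggregate score.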

Eventually, a broader mix of traditional data management, reporting and analytics tools will be adapted to and deployed on the low-cost, high-performance framework that Hadoop embodies. It's safe to say that, in a way that mirrors the maturity curve for any new technology, the underlying Hadoop platform will gel into an environment that is production-hardened and satisfies the dimensional analysis, high-performance computing and advanced analytics needs that span the corporate user community. But that may take some time -- and in the interim, choices between Hadoop clusters and data warehouses must be made. Considering performance criteria and specific variables will help you decide which technology is best suited to solving a particular business problem.

About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting and development services company that works with clients on big data, business intelligence and data management projects. He also is the author or co-author of various books, including Using Information to Develop a Culture of Customer Centricity.




Join the conversation



Has your organization compared a Hadoop cluster vs. a data warehouse for a deployment? If so, which way did it go?
We have been using a method called "Information Characteristic Assessment Methodology" to look at the appropriate information storage approach. ICAM is based on 17 factors of use of information in an enterprise (or public entity):

Monetization Value, Monetary Value, Intrinsic Value, Real-time Consistency, Time Value, Security, Regulatory, Availability, the 4 V's (Volume, Variety, Veracity, Velocity), Geographical Dispersion, Data Sparsity, Variability

Using a combination of prioritization, and rating assignment of goals, application usage, and business capabilities, we can focus on whether the solution should be CA, AP, CP (CAP Theorem) and somewhat directional to graph, relational, key-value, document, etc.
I definitely agree that organizations should have both Hadoop and SQL/data warehouse technologies in their quiver to hit different points on the price/performance curve as dictated by different analytical workloads. But isn’t the same true for other “big data” processing technology options like Spark, search and others? Perhaps the framework could be extended to include additional technology evaluation columns.

For organizations interested in the speed and cost savings that the cloud enables, a “Big Data as a Service” solution can automatically deploy the right technology (Hadoop or MPP SQL/data warehouse or Spark) for each analytical workload based on the SLA and analytical tools involved.