A major motivation for the ongoing evolution of the Hadoop software stack is a desire to position it for broader use in enterprise computing environments. And the appetite for Hadoop and related technologies among user organizations is growing -- justifiably so, because the economics of Hadoop clusters make it possible to implement a scalable data management architecture with a low barrier to entry and with incremental investments as processing needs increase.
At the same time, much confusion remains about what Hadoop is and does, and how it differs from -- or is the same as -- an enterprise data warehouse (EDW). In many cases, this confusion becomes more acute as additional buzz terms are introduced into the Hadoop vs. data warehouse conversation, such as data lake or data reservoir. Uncertainty over which capabilities are or aren't part of Hadoop itself only compounds the complexity.
There's no doubt that Hadoop has a place in the enterprise, especially as big data applications take hold. But the venerable EDW has a well-established presence in data centers, and after years of refinement plays a significant role in meeting the reporting and analytics needs of most organizations. Does the emergence of Hadoop mean it's time to abandon the EDW? Some IT and data management professionals are aching to use Hadoop as a replacement for the data warehouse -- but are companies really prepared to abandon their decades-long investments in EDW infrastructure, software, staffing and development?
Criteria to help drive buying decisions
To determine when one technology is better than another for a particular task, it's worth examining the criteria that are important to consider. In this case, they can be grouped into four high-level categories:
Operational performance. Examples include query response times, data integration and loading throughput, the volume of data that can be stored, the ability to support a mixed workload of applications and the number of simultaneous users.
Support for desired functionality. This includes things like SQL support, the ability to ingest and manage both structured and unstructured data, high availability and fault-tolerance features, integrated security and data protection, data compression and stream processing functionality.
Cost factors. As a matter of course, you should take into account elements such as the base system price, the average cost per storage volume unit and the cost of scaling up systems, as well as maintenance and support costs.
Strategic value. Utilizing a single platform for both data management and high-performance computing that powers advanced analytics applications might be beneficial to an organization. Other issues that could come into play include replacing legacy systems for which there is a shrinking pool of skilled workers and being able to use data virtualization techniques that allow you to pull together different data sets without physically moving them.
Hadoop or data warehouse, side by side
You can compare the different technologies by lining up the choices side by side and seeing how they each address the specific criteria that the people in your organization care most deeply about, in accordance with the planned applications. As an example, consider "SQL compliance," which might be relevant for a more traditional reporting application. A relational database running on a data warehouse appliance is likely to satisfy that requirement. On the other hand, even though query engines that layer SQL interfaces on top of Hadoop have become available, they may still need some additional refinement and tweaking before they can be trusted to execute all types of SQL queries correctly and efficiently.
Another example is "cost to scale," which might be relevant in the context of an application for analyzing a growing number of text documents. For a traditional EDW running on a mainframe or large Unix server, the cost of scaling up the system is likely to be high. A Hadoop cluster based on commodity hardware is better positioned to scale incrementally, and the required investment in a few more compute nodes and storage devices should be relatively small.
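The scaling comparison comes down to simple arithmetic: cost of the next capacity increment divided by the capacity it adds. The sketch below illustrates that calculation; all prices and capacities are made-up assumptions for the sake of the example, not vendor figures.

```python
# Back-of-the-envelope cost-to-scale comparison. Every number here
# is an illustrative assumption, not a real price quote.

edw_upgrade_cost = 500_000   # assumed cost of the next EDW capacity tier
edw_added_tb = 50            # terabytes that tier adds

node_cost = 5_000            # assumed cost of one commodity cluster node
node_tb = 10                 # usable TB per node after replication overhead
nodes_added = 5              # incremental purchase: just a few nodes

hadoop_cost = node_cost * nodes_added
hadoop_added_tb = node_tb * nodes_added

edw_per_tb = edw_upgrade_cost / edw_added_tb
hadoop_per_tb = hadoop_cost / hadoop_added_tb

print(f"EDW:    ${edw_per_tb:,.0f} per added TB")
print(f"Hadoop: ${hadoop_per_tb:,.0f} per added TB")
```

With these assumed figures, the EDW upgrade works out to $10,000 per added terabyte versus $500 for the cluster -- the point being not the specific numbers, but that the cluster lets you buy capacity in small increments.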
For each business requirement, you can repeat this evaluation process for all the criteria relevant to the desired outcome. Create a grid in which the technology choices form the columns and the performance measures form the rows. If you'd like, add a weighting factor to each of the criteria. Assess how well the corresponding technology satisfies each criterion. When you're done, you can use your weightings to formulate a comparative score that can guide the decision on how to move forward.
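The grid-and-weighting process described above is a standard weighted decision matrix, and it can be sketched in a few lines of code. The criteria, weights and 1-to-5 scores below are illustrative assumptions you would replace with your own assessments, not recommendations.

```python
# A minimal sketch of the weighted scoring grid described above.
# Rows are criteria; each maps to (weight, EDW score, Hadoop score).
# Weights sum to 1.0; scores are hypothetical 1-5 assessments.

criteria = {
    "SQL compliance":         (0.30, 5, 3),
    "Cost to scale":          (0.25, 2, 5),
    "Unstructured data":      (0.20, 2, 5),
    "Mixed-workload support": (0.15, 4, 3),
    "Stream processing":      (0.10, 2, 4),
}

def weighted_score(column):
    """Sum weight * score down one technology column (0 = EDW, 1 = Hadoop)."""
    return sum(weight * scores[column] for weight, *scores in criteria.values())

edw_total = weighted_score(0)
hadoop_total = weighted_score(1)
print(f"EDW: {edw_total:.2f}  Hadoop: {hadoop_total:.2f}")
```

With these sample weights and scores, the cluster comes out ahead (4.00 vs. 3.20), but a reporting-heavy shop that weights SQL compliance more highly could easily see the opposite result -- which is exactly why the weighting step matters.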
Eventually, a broader mix of traditional data management, reporting and analytics tools will be adapted to and deployed on the low-cost, high-performance framework that Hadoop embodies. It's safe to say that, in a way that mirrors the maturity curve for any new technology, the underlying Hadoop platform will gel into an environment that is production-hardened and satisfies the dimensional analysis, high-performance computing and advanced analytics needs that span the corporate user community. But that may take some time -- and in the interim, choices between Hadoop clusters and data warehouses must be made. Considering performance criteria and specific variables will help you decide which technology is best suited to solving a particular business problem.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting and development services company that works with clients on big data, business intelligence and data management projects. He also is the author or co-author of various books, including Using Information to Develop a Culture of Customer Centricity. Email him at [email protected].