With all the hype, it's easy to forget that there are still some pretty big obstacles to open source Hadoop adoption and "big data" analysis, according to James Kobielus, a senior data management analyst with Cambridge, Mass.-based Forrester Research Inc. Those obstacles include the high cost of storing multiple terabytes of data and a lack of Hadoop interoperability standards.
SearchDataManagement.com recently spoke with Kobielus to get a reality check on the current state of Hadoop adoption. Kobielus talked about storage issues related to big data analysis and explained why he thinks a standards body would bode well for Hadoop adoption levels. Here are some excerpts from that conversation:
There has been a lot of hype lately around Hadoop and big data analysis. So why isn't everybody doing it?
James Kobielus: However you implement big data [analysis], whether it's through a data warehouse or a Hadoop cluster, you're talking about petabytes or multiple hundreds of terabytes worth of storage. That's expensive. So really the gating factor in the big data universe paradigm is price of storage. How expensive is storage? How much data can you really afford to retain in hot storage versus offline storage, cheaper storage, tapes and so on?
Have you seen many early Hadoop adopters analyzing petabytes of data?
Kobielus: Most Hadoop clusters in the real world are nowhere near a petabyte. Many are in the hundreds-of-terabytes range. But, you know, I ask the implementers why and when are they going toward petabytes, and they say storage is a big issue there. That's why we don't see that many petabyte-scalable data warehouses on traditional platforms. It's just too expensive.
What else stands in the way of Hadoop adoption and big data analysis?
Kobielus: The whole Hadoop ecosystem is still immature. The majority of the [enterprise data warehouse] vendors, for example, either don't have Hadoop [distributions] of their own, or they do but haven't really fully integrated them with their core data warehousing tools. That is one indicator of immaturity.
Are there any other signs of immaturity?
Kobielus: The Hadoop community is not really standardized. I mean, it's standardized in the way that open source initiatives standardize. A lot of people [and companies] get together, they build software and they open source it. It's used and it's adopted but with no formal standardization or ratification process. Now, there's a lot of people in the Hadoop and open source community who would say that [standardization and ratification is] the wrong way to go. I can respect that point of view. But part of the problem there is that without standardization there is a lot of risk for implementers.
Why is a lack of standardization potentially risky?
Kobielus: The fact is that there is no common reference architecture for Hadoop clusters -- a reference architecture that would lay out clearly the interfaces for the pluggable storage layer [and] the standard interfaces for MapReduce interoperability across different vendor platforms. There is no common reference framework equivalent to the reference framework that the SOA community built in the last decade all around things like SOAP and WSDL and UDDI for interoperability within that whole web services world. [With Hadoop, there is no] interoperability and certification testing, no Good Housekeeping Seal of Approval on interoperability. That can become critical in lots of areas where, for example, you're a big company and you've got Hadoop clusters in various business units and they want to play together or federate with each other. There are no federation standards for Hadoop clusters yet. There is no common specification for real-time data manipulation and access to functionality within the whole Hadoop world.
How do early adopters get around these interoperability issues?
Kobielus: If you want to do truly real-time analytics in a distributed Hadoop world with multiple vendors' distributions, then you're going to have to write a lot of custom code or it might not work or work well. There is a lot of risk there. I think the industry needs to at least begin to think through how to build a common reference framework for interoperability and certification testing, hopefully with some formal standards body of some sort to, for example, ratify things like versions of HDFS, the core file system used for most of these clusters.