When evaluating open-source Hadoop technology usage in big data projects, it can help to take a closer look at the total cost of data, according to a veteran data management lifecycle consultant.
Because hidden costs lurk behind the ostensibly free data framework, true cost can be hard to uncover, consultant Richard Winter told attendees at last month's 2014 TDWI Executive Summit in Boston. Mere hardware costs can be outweighed by other concerns.
"Much of the cost of Hadoop [is] outside of the system per se," Winter said. Instead, he noted, the costs of developing and managing systems are crucial.
Application development for Hadoop on commodity clusters, along with its associated tool ecosystem, is still a bit of a Wild West experience. Still, Winter said, Hadoop can be less expensive overall for some types of data work.
Data managers should look at specific application types when gauging Hadoop's suitability for their organization, according to Winter.
How much is my Java?
To measure Java-based Hadoop in the enterprise, Winter looked at the cost of storing, managing and using data over time for analytics, as well as development and systems costs. He admitted that he had to employ some generalities during his research. He obtained the average expense of contracting a Java developer, for example, from a site that tracks salaries, and added 50% for the general overhead that's required for employees. Winter has posted background material on his methodology for estimating total cost of data in Hadoop projects on the Winter Corp. website.
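The salary adjustment described above is simple arithmetic, and a minimal Python sketch makes it concrete. The function names, the example salary and the 2,080-hour work year are assumptions made for illustration; only the 50% overhead figure comes from Winter's description, and this is not his published methodology.

```python
# Illustrative sketch only: names and sample figures are hypothetical.
# The 50% overhead rate is the one figure taken from the article.

def loaded_annual_cost(base_salary: float, overhead_rate: float = 0.50) -> float:
    """Return the yearly cost of a developer: salary plus general overhead."""
    return base_salary * (1.0 + overhead_rate)


def loaded_hourly_cost(
    base_salary: float,
    overhead_rate: float = 0.50,
    hours_per_year: int = 2080,  # assumption: 40 hours x 52 weeks
) -> float:
    """Return the loaded cost per working hour."""
    return loaded_annual_cost(base_salary, overhead_rate) / hours_per_year


if __name__ == "__main__":
    example_salary = 100_000.0  # hypothetical salary, not a figure from the talk
    print(loaded_annual_cost(example_salary))
    print(round(loaded_hourly_cost(example_salary), 2))
```

Keeping the overhead rate as a parameter lets a data manager substitute the organization's own figure instead of the general 50% estimate.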
Winter's measure for total cost of data includes estimates of the development cost of queries, which can run high given the especially skilled developers the Hadoop world requires. He also made some general estimates of the number of lines of code and the cost to create both simple and complex queries in both data warehouse and Hadoop-oriented settings.
He said a usual Hadoop combination needed to create complex queries -- the Hadoop file system, MapReduce, Java and SQL alternatives such as Hive -- can call for far more lines of code than are found in today's SQL-based data warehouse, which can be difficult work for some types of companies.
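To show how a lines-of-code estimate turns into a development cost, here is a rough Python sketch. The class, the function and every number in it are placeholders invented for illustration; the article does not report Winter's actual line counts or per-line costs, so the output says nothing about real platforms.

```python
# Illustrative sketch only: all names and numbers are hypothetical placeholders,
# not figures from Winter's research.
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryEstimate:
    """A rough estimate for writing one query on a given platform."""
    platform: str
    lines_of_code: int
    cost_per_line: float  # loaded developer cost attributed to one line

    def development_cost(self) -> float:
        """Cost to develop a single query of this kind."""
        return self.lines_of_code * self.cost_per_line


def total_query_cost(estimate: QueryEstimate, query_count: int) -> float:
    """Cost to develop query_count distinct queries of the same kind."""
    return estimate.development_cost() * query_count


if __name__ == "__main__":
    # Placeholder inputs chosen only to show the arithmetic.
    warehouse = QueryEstimate("SQL data warehouse", lines_of_code=20, cost_per_line=25.0)
    hadoop = QueryEstimate("Hadoop (MapReduce/Java)", lines_of_code=200, cost_per_line=25.0)
    for estimate in (warehouse, hadoop):
        print(estimate.platform, total_query_cost(estimate, query_count=50))
```

Because the cost scales with both lines per query and the number of distinct queries, a workload with many complex queries amplifies any per-query difference between platforms.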
"In a small percentage of companies there is wide use of Hadoop. It works because they have been able to develop large complements of experts, principally in Java," he said. Outside of those companies -- ones such as Yahoo -- the opportunities for Hadoop are a pretty modest set of jobs, he added.
Don't forget to count the queries
Winter walked TDWI Summit attendees through a comparison of the total project cost for building with well-known styles of data warehouse technology versus Hadoop alternatives, and got different results for different end uses.
When surrounding costs were factored in, recreating an enterprise data warehouse using Hadoop was, perhaps not surprisingly, a much more expensive proposition than using a traditional SQL-based enterprise data warehouse platform. But for building a data staging refinery or data lake-style application that takes on a portion of an analytics job, Hadoop showed overall cost benefits, despite relatively high development costs.
Hadoop has a place where one encounters tremendous amounts of data with only slight variation, Winter said, pointing to Internet of Things applications such as airlines' analysis of engine performance data, where information only becomes interesting on the occasions that it "trends away from normal."
Some of the factors that make up a use case especially influence the decision to go -- or not go -- with Hadoop. For example, the benefits of time-tested enterprise data warehouse technology are more pronounced when systems have more data sources, more users and more complex query types, Winter said. When there are fewer sources, fewer users and simpler queries, Hadoop may be the path.
Going forward, both Hadoop and enterprise data warehouses are likely to be put to use, he said. The important task for the data manager is not only to pick the right platform for the right use, but also to understand that, at times, the answer may be to employ both technologies, carefully splitting different portions of the work between the two.