Managing Hadoop projects: What you need to know to succeed
Joe Caserta is in a good position to analyze the current status of Hadoop architecture deployments. Along with Ralph Kimball, he co-authored The Data Warehouse ETL Toolkit, a book published in 2004 that details extract, transform and load techniques for feeding data warehouses. But the founder and president of New York-based consultancy Caserta Concepts has seen Hadoop and other big data tools change traditional methods of warehousing data -- and aided in that process by helping organizations to implement Hadoop clusters. In an interview with SearchDataManagement at the 2014 Strata + Hadoop World conference in New York, Caserta offered his perspective on the bubbling stew that is Hadoop.
Is Hadoop architecture in the enterprise ready for broad adoption? Sometimes it seems like it isn't going to break out of what some would call a niche.
Joe Caserta: We were very early adopters of Hadoop. I thought that by now it would be completely widespread. I think it will get there, but the timing has a lot to do with the fact that what it does is hard to do. First, the tools that are out there today, such as relational databases, ETL tools and SQL, have had 30-plus years to mature. You can argue about the age of Hadoop, but many of its tools are essentially just three or so years old. So there is a lot of maturing that still has to happen.
Second, there are no best practices yet. There are no graphical interfaces. You really have to be a programmer to work with Hadoop. You can't get away with being a really smart power user and start diving into Hadoop. Most work is done at the command line.
Third, governing data that does not have a structure is virtually impossible. It's hard to comply with HIPAA or SEC regulations when you don't have structured columns to mask or encrypt. That's probably the biggest challenge for the enterprise to embrace Hadoop.
It seems, looking at some data, that a lot of projects are stuck at the proof-of-concept (POC) stage.
Caserta: Yes. When we started in 2009 and into 2010, the work was mostly with academics. In 2011 and into 2012, it was mostly POCs. And increasingly, we see what people are calling proofs of value, which focus on business needs.
The term big data is kind of a misnomer, because it really doesn't have to be big. But for the first couple of years, the main impetus with Hadoop was for doing big data. The reason for that was that people really wanted low-cost data. The cost difference between installing, configuring and maintaining a Hadoop cluster, versus purchasing licenses and installing hardware, software and infrastructure for some established data warehouse like Netezza or Teradata, is very compelling. There's no question that it's an economic savings. But now people are looking for more.
Last year and this year are when those proof-of-concept projects started moving into production and being used in the operations of a business. It's now that people are beginning to notice all the flaws. For a single use case, it's great, but once you start expanding it to more users and more use cases, it's just like traditional data marts. Building [single-use data marts] can be pretty easy. But once you start expanding them into a data warehouse where you have to support all these disparate systems and business processes working cohesively, that's when you start realizing, 'I need something a little bit more sophisticated and more mature.'
A lot of Hadoop's potential expanded use has to do with back-end analytics. But some of those Hadoop tools are really just brand new.
Caserta: Right. Another reason why Hadoop has not taken the world by storm like we thought it would is its inability to do interactive queries. When tools like Impala and Drill start to mature, then I think it may get embraced more widely.
Still, the data scientists and data engineers and the very sophisticated database developers and ETL people are starting to embrace it. It's very similar to the old days when we had a floor full of COBOL programmers, and then object-oriented programming came out. Some were able to make the leap, and some just weren't. I think we're going to have a similar shift. Nowadays most of our ETL is being done in Python: We're using Python, Pig, Hive and MapReduce. It requires a different skill set. Some developers can make the leap. Sometimes we just have to find new people with new skills.
The real thing that is changing today, though, is that Hadoop is allowing businesses to be run without human beings involved. SQL and SQL-like languages and BI tools -- those are really made for human beings. The concept of machine learning means that you can feed some data to a machine running Hadoop and run some algorithms on that, and the machine gets smarter and makes predictions and recommendations about what we should do.
The baby steps for this were the recommendation engines on Amazon.com. But we're doing that with everything now. We're doing it with stock picks, and with which ads to serve on ad servers. And the more prevalent this becomes, the less dependent we will actually be on humans to make decisions for us. Data warehousing was primarily made for people to interact with a BI tool. But the downstream consumers of most of the Hadoop systems we have been working on are other machines.