By one measure, Hadoop turns 10 today. On this date a decade ago, the development effort that resulted in the distributed processing framework became a separate open source subproject within The Apache Software Foundation, after being split off from another one, called Nutch. Other dates in the history of Hadoop can also lay claim to being its birthdate -- for example, April 1 will mark the 10-year anniversary of the first Hadoop code release. But Hadoop distribution vendor Cloudera Inc. and others are celebrating the occasion today. Doug Cutting was one of the two co-creators of Hadoop, which he famously named after his son's stuffed elephant; now, he's chief architect at Cloudera. In an interview yesterday, Cutting discussed the current status of the big data platform and shared his thoughts on what the Hadoop future holds.
Where do you think Hadoop stands at the 10-year mark?
Doug Cutting: I guess I'd call it adolescence. We're seeing a lot of broad adoption across industries, but there's still a long way to go. The vast majority of Fortune 500 companies are using Hadoop in production, but many in fairly kind of minor ways -- and I think that's going to grow and it's going to move more into the operational aspects of organizations. There's a lot more potential.
Gartner and TDWI surveys conducted last year got almost identical results on Hadoop adoption by organizations: 26% of the respondents in Gartner's case, and 28% in TDWI's. Do you see those as high numbers, relatively speaking? Or did you think usage would be higher by now?
Cutting: I certainly didn't think it would be higher by now. More than half of the Fortune 500, we have engagements with. If you count companies that are big, the adoption rate is way more than 50%. If you count smaller companies, too, it could be less. But engagements, conference attendance, inquiries -- in all these metrics, we see an annual doubling in the number of users. I have a hard time imagining it growing faster than that. People don't change over these things that quickly. And from Cloudera's perspective, it's as fast as we can go.
What are some of the things that still need to be added to or improved on in Hadoop, or the broader Hadoop ecosystem?
Cutting: We don't have good transaction [processing] support in the Hadoop world yet. It's coming -- there are some open source projects where people are working on this. But it's not likely to be an overnight thing, where organizations move their transactional systems to Hadoop all at once -- companies just don't work that way. Some of the things we continue working on are pretty boring. Security is one; another is multi-tenancy. They're not very exciting, but tremendously necessary.
Another thing we focus on is more real-time processing, with technologies like Kafka, Spark and Kudu. And the cloud is a tremendous focus for us. Moving [Hadoop systems] to the cloud has been slower than a lot of people anticipated -- a lot of our customers are still running clusters on premises. But I think that's going to change in the coming year.
Speaking of Kudu, Cloudera introduced it last fall as an alternative data store to the Hadoop Distributed File System (HDFS) for real-time applications. And the YARN resource manager that was added to Hadoop 2 in late 2013 made it possible to run applications without MapReduce. If Hadoop systems don't include the two original core components of Hadoop, will they really still be Hadoop clusters?
Cutting: It's a semantic question that comes back to, what is Hadoop? Properly speaking, it's one project at Apache. But it's also become a synonym for a whole ecosystem. If you imagine a future where everybody tends to move away from HDFS to Kudu, or from MapReduce to Spark, or from YARN to Mesos, there may be a little less in Hadoop itself. But I think it's a success that it's architected to be responsive [to the development of new technologies]. You can't imagine the Oracle world moving away from the [relational database] -- Oracle wouldn't allow it to happen. Whereas in this universe, we're happy to see that kind of thing happen.
What about Spark? Is it ultimately a potential replacement for, or a successor to, Hadoop?
Cutting: I think it's a successor to MapReduce, certainly. It's a better execution engine, and it does more things. But a lot of people look at this as an all-or-nothing thing, and I don't think that's the case at all. Spark isn't going to control the universe any more than Hadoop has. What it will replace is MapReduce. When you need a general-purpose execution engine, it's a clear step up from MapReduce.
Looking to the Hadoop future, how do you think things will change in the next five years?
Cutting: The [data management] world already looks very different than it did five years ago. People are much more accepting of the use of open source software for data processing than they were back then. That's a huge change, and I think in five years, it's going to be even more widespread. It's still new in a lot of companies. In another five to 10 years, it will be where we are. It'll be what's normal.
Will the same technologies still be part of the picture then?
Cutting: It will be a long time before MapReduce, HDFS and YARN go away. The established IT vendors, too -- their market shares will get smaller, but they won't disappear. It takes a long time for a technology to go away. I think the lesson people should learn is to be cautious about what technologies you take in. It amazes me the percentage of our customers who just grab and download some other open source technology, unsupported, and start using it in production. I admire their bravery, but they might eventually come to regret it.
What strikes you the most about Hadoop, 10 years on?
Cutting: It's way bigger than I ever expected. I wanted to write an open source project that would survive and people would use. What I totally didn't see was that the scale would be so big. If you go back to 10 years ago, the traditions of enterprise software and open source, sort of hacker software, were totally separate, and now they aren't. You can go to a big bank, and people there are contributing to open source projects. It's pretty neat to see that big of a change.