Q&A: Dinsmore sees open source Apache Spark moving to new stage

Analytics vet Thomas Dinsmore says Apache Spark is entering a new phase of adoption, one in which hype gives way to clearer assessment. He also discusses the ascent of the R programming language for analytics.

Analytics software vet Thomas Dinsmore has tracked open source Apache Spark since the technology first emerged. In this Q&A, the independent consultant, based in Newton, Mass., says he sees a new level of maturity for the often-hyped data processing and analytics platform. Spark may not be as fast as early benchmarks suggested, but it still warrants the attention of IT teams, according to Dinsmore. He spoke to SearchDataManagement shortly after the recent Spark Summit East 2016 in New York.

You have followed Spark from the get-go. What is the takeaway from this latest event?

Thomas Dinsmore: There's an overall maturation of the Spark community. There is a greater sense that Spark has arrived, as opposed to "it's the next thing." Whether it was last year's announcement that IBM would support Spark, or Hadoop distribution providers like Cloudera announcing that they would make it the default selection in their distribution, there is no need anymore to take a position that Spark is overhyped. Now, I think, it is a given that Spark is going to be a part of the future.

Spark does seem to be entering another stage. Discussions have evolved from how fast it is to how to make open source Apache Spark faster yet.

Dinsmore: Yes, you get a lot of contradictory information regarding speed. For example, a couple of years ago, folks were talking about Spark being "100-times" faster than MapReduce. But at Spark Summit East, IBM's Anjul Bhambhri [vice president of big data and analytics] said they are getting five- and six-times faster operations when IBM SPSS pushes down to Spark rather than MapReduce. Now, five-times faster is still a good thing, but it is not 100-times faster.

When people talk about performance, you always have to frame it in terms of workload. Are we talking about a sort? A logistic regression problem? What is it we are comparing? I think you hear these numbers floating around -- but in terms of rigorous studies, I am seeing more along the lines of five- and six-times improvement versus "100 times" on MapReduce, in terms of, say, a sort.

And we've recently seen from Hortonworks and the folks from HPE [Hewlett Packard Enterprise] that they have rewritten the [Spark] shuffle, and they are claiming "15-times" speedup. This is reasonable, given they are rewriting it in C++, but we will see.

But all this is a sign of maturing. Instead of wild-eyed claims, we are seeing thoughtful benchmarks.

In several ways, Hadoop greased the skids for Spark. Certainly, recent data analytics platforms tend to support open source, as Hadoop does.

Dinsmore: First of all, there is no question that open source is becoming more pervasive in the enterprise stacks. And open source is part of the DNA of Hadoop. It is an essential part of the business model of Hadoop.

Thinkers on disruptive technology will tell you it's not actually technology per se that disrupts industries, it's disruptive business models. Companies like Teradata aren't hurting because Hadoop is a new technology -- it's because Hadoop is a new business model, and it's a completely disruptive way for software to be developed and delivered.

Because it is open source you have a much more rapid cadence of enhancements -- particularly when you have a community of the type that open source Spark has. It's grown rapidly because it has attracted contributors.

IT organizations like open source partly because open source is easier to integrate. They can open it up, look at it, inspect it -- but also, if you've grown up in the Hadoop ecosystem, you simply expect open source. It's just the way it is. It is a completely different ecosystem from the data warehousing ecosystem where everything was strictly commercial. The freely available software means more people can download it and learn it.

It goes beyond Hadoop and Spark. In the advanced analytics world, the reason the R language has become so popular is that in colleges, where they used to primarily use SAS, in the last 10 years colleges, universities, academic researchers and so forth have switched over to open source R. If you're a student, you can teach yourself R without having to pay out a lot of money. 

It does seem that SAS has realized this and has increased its support of college programs for analytics while also making its software more widely available.

Dinsmore: It's true that SAS has introduced a university edition; it's a virtualized version of their software that is offered for free. But it is too little, too late. It's something that would have been good if they had done it 10 years ago.

A community has grown up around R, and that community is very sticky. Once people get involved in an open source community, they have an almost negative reaction to commercial software. If you are a commercial software vendor, you are really going to have to prove that you are delivering something that is otherwise unavailable.
