Apache Spark meets the PDP-11 -- in the end, it's just processing

At times, it seems a Hadoop usurper, but the open source Spark processing engine is best viewed as another step on the long road of computing technology.

Long ago and far away, in 1992, I sat with my then-boss, Jon Titus, discussing the IT news of the day. Titus was a great editor, and very understated; he was also a PC pioneer, having created the early Mark-8 Personal Minicomputer that now sits in the Smithsonian. The newsroom conversation was much as it might be today, when we talk about Apple's surprise purchase of FoundationDB, or the rapid ascent of the Apache Spark processing engine in the headlines of the Hadoop world.

The news of the moment that day was the ouster of Ken Olsen, Digital Equipment Corp.'s co-founder and CEO. His departure was an inflection point along a trail that saw DEC go from being a gutsy mill town startup in Massachusetts to being a serious threat to IBM's industry leadership to being a forlorn acquisition candidate.

Like those in other editorial offices, we wondered what went wrong. Ultimately, what went wrong was that the company got confused about what business it was really in. It seems absurd, but it can happen. Titus had a unique perspective on Olsen's quandary as smaller computers and new kinds of software came along to unseat the company's flagship PDP-11 and VAX computers. "DEC came to think they were selling minicomputers," he said. "But what they were selling was computing."

The simple things can sometimes be the hardest to remember. That's good to keep in mind in light of the growth of distributed data processing, and the highly touted Apache Spark framework's recent rise to prominence.

What's in a name?

There's some confusion today as Hadoop distributed processing is joined by Spark distributed processing in the Apache Software Foundation data ecosystem. Spark was the hot topic at last fall's Strata + Hadoop World conference in New York, and its echo was heard even more loudly at the recent Spark Summit East -- the first east coast edition of that event, also held in New York.

Some developers will urge their managers to jump from MapReduce-based Hadoop to Spark, and some -- with good reason -- will. Spark proponents claim it can run batch processing jobs up to 100 times faster than MapReduce can; it can also run stream processing and machine learning applications, which MapReduce can't. But other managers will wonder if Spark isn't just the shiny new object on the distributed computing block -- or, worse, just fodder for a developer's resume. Or both.
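For readers who haven't worked with MapReduce directly, the batch pattern at issue is easy to sketch. The following is a toy pure-Python illustration -- not actual Hadoop or Spark code, and the function names are invented -- of the map, shuffle and reduce phases behind the classic word-count job:

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word occurrence
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group the emitted counts by key (the word)
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word
    return {word: reduce(lambda a, b: a + b, counts)
            for word, counts in groups.items()}

lines = ["spark meets hadoop", "hadoop meets the pdp-11"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {'spark': 1, 'meets': 2, 'hadoop': 2, 'the': 1, 'pdp-11': 1}
```

In a real cluster, the map and reduce phases run in parallel across many machines, with the shuffle moving data between them -- a strong fit for one-pass batch jobs, and a weaker one for the iterative and streaming workloads Spark targets.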

Using the DEC story as a guide, if your organization has deployed a Hadoop cluster, you'd be advised to think about what you've been doing to date not as Hadoop computing, but, more simply, as computing. It isn't incorrect to think of Hadoop as a precursor to Spark, or Spark as a descendant of Hadoop -- but such generalities can only be taken so far.

Still, looking at the similarities between Apache Spark and Hadoop is a good first step. It's helpful to realize that Hadoop, in a way, greased the skids for Spark by bringing into wider currency basic notions of distributing workloads and managing compute clusters.

Hadoop computing, Spark computing

In Hadoop, open source APIs link different tools together as applications demand. That approach has become part and parcel of Spark as well. In fact, people looking to bring Spark into an organization often start with systems that run alongside the Hadoop Distributed File System, which can serve both as an input source and as a persistent data store for Spark output.

Looking at the differences is valuable, too, of course. On a basic level, at least for some types of jobs, Spark seems to provide a superior compute engine to MapReduce, the processing engine that powered the original version of Hadoop before the Hadoop 2 release opened the framework up to other processing engines. As always, your mileage may vary. Also, Spark's reliance on an in-memory architecture is a plus -- or maybe a minus, depending on your IT environment.
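The in-memory difference matters most for iterative jobs. A chain of MapReduce jobs rereads its input from disk on every pass, while Spark can cache a data set in memory and iterate over it. This pure-Python sketch -- invented helper names, with a sleep standing in for an expensive disk read -- shows the shape of the trade-off:

```python
import time

def load_dataset():
    # Stand-in for an expensive read from disk or HDFS
    time.sleep(0.001)
    return list(range(1000))

def iterate_from_disk(passes):
    # Disk-oriented style: reload the data on every pass,
    # the way a chain of MapReduce jobs rereads its input
    total = 0
    for _ in range(passes):
        data = load_dataset()  # reread every iteration
        total += sum(x * x for x in data)
    return total

def iterate_in_memory(passes):
    # In-memory style: load once, keep the data cached, iterate --
    # roughly the idea behind Spark's cached data sets
    data = load_dataset()  # read once
    return sum(sum(x * x for x in data) for _ in range(passes))

assert iterate_from_disk(3) == iterate_in_memory(3)
```

The results are identical; only the I/O cost differs, and it grows with the number of passes in the disk-oriented style. The flip side, of course, is that the cached approach needs enough memory to hold the working set.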

When Olsen's company put on its DECWorld user conference in Boston back in the day, it was flush enough to bring in the Queen Elizabeth 2 ocean liner to host part of the event. Thereafter, DEC went from industry titan to a desperate company that disappeared into Compaq Computer. That all went down in about 10 years' time.

Hadoop vendors will do well to see Apache Spark as a partner technology. In most cases, they have been doing just that thus far. For users, there must be a realization that they aren't doing Hadoop computing -- and they aren't doing Spark computing, either. They're doing distributed data processing, and the particular engine is only one step in an ongoing progression.

Jack Vaughan is SearchDataManagement's news and site editor. Email him at [email protected], and follow us on Twitter: @sDataManagement.

Next Steps

Find out about Spark and MapReduce

Discover more about a new Hadoop initiative

Read about Spark's summer of buzz

Dig Deeper on Hadoop framework