Apache Spark steps up, offers Hadoop 2 a new take on processing

Apache Spark is moving up a notch in the Apache stable; Embarcadero gets set to buy ERwin; and JSR-107 crosses finish line.

The changes in architecture brought on by Hadoop 2 can be expected to usher in a new round of innovative software. A prime contender appears to be Spark, which late last month graduated from the Apache Incubator to become a top-level Apache Software Foundation project. Spark is an in-memory cluster computing framework that takes Hadoop further away from its original MapReduce roots.

The Spark framework is meant to better enable analytics, interactive queries and stream processing. It can read from HDFS and other Hadoop data sources. As such, it is something of an alternative to MapReduce, which some detractors have portrayed as complex and limited.

Spark supports more generalized computing methods than the specialized MapReduce format, according to Matei Zaharia, who began work on Spark at U.C. Berkeley in 2009. He is now vice president of Apache Spark, chief technology officer at Spark startup Databricks, and an associate professor at the Massachusetts Institute of Technology.

"Overall, what Spark does is let you use existing types of Hadoop clusters to do different types of computation more easily or quickly. It has higher-level APIs [application programming interfaces] in Java, Python and Scala, as well as libraries for graph processing, machine learning and so on," Zaharia said.

As many as 150 developers from over 30 companies have been involved in the Spark project since its inception at U.C. Berkeley. It has been used in systems at Yahoo, and was made available commercially by Hadoop distribution provider Cloudera, which formed an alliance with Databricks.

Embarcadero set to buy CA's ERwin data modeling product line

Application and database tool house Embarcadero said it has signed a letter of intent to acquire CA Technologies' ERwin data modeling product line. If concluded, the deal would thrust Embarcadero into the forefront of a data architecture technology that may be ready to ride the general uptick in big data interest.

"We serve the market for database developers' and application developers' tools," said Michael Swindell, senior vice president of products at Embarcadero. "With this acquisition, the database side of the business becomes the biggest part of what we do."

ERwin, which became part of CA with its 1999 purchase of Platinum Technology, has long been a mainstay in data-oriented modeling, but was somewhat lost in the mix over the years as CA pursued "core" systems and business management capabilities. Like Embarcadero, CA has tried to expand data modeling beyond just data modelers.

IDC analyst Al Hilwa, program director of software development research, said the merger would put Embarcadero in the lead in the data modeling market. Growth of these tools can be accelerated if vendors realize the emerging big data opportunity, he said.

Say hello to JCache, JSR-107

Part of the battle in big data as it applies in operational business intelligence (BI) today is moving the data around quickly, or ensuring the data is not moved unnecessarily. Java is still a very central programming language for dealing with such development issues, and various means of distributed and elastic caches have been employed by Java developers to feed this big data pipeline.

While some approaches have become common, they really haven't become standard. This changed with word that a Java workgroup had completed deliberations on JSR-107, the JCache API. Including a key-value API, the JCache API defines interfaces for caches and operational stores.
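To make the key-value idea concrete, here is a minimal sketch of a read-through cache in plain Python. It is an illustration only and is not the JCache (javax.cache) API: the class name `ReadThroughCache`, its `get`, `put` and `remove` methods, the `loads` counter and the `slow_lookup` function are all hypothetical names chosen for this example. It shows the basic bargain a cache offers, which is that a value is fetched from the slower backing store once and then served from memory.

```python
from typing import Callable, Dict, Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class ReadThroughCache(Generic[K, V]):
    """Hypothetical in-memory key-value cache; NOT the JCache API."""

    def __init__(self, loader: Callable[[K], V]) -> None:
        self._loader = loader            # called only on a cache miss
        self._entries: Dict[K, V] = {}   # the in-memory key-value store
        self.loads = 0                   # how many times the loader ran

    def get(self, key: K) -> V:
        # On a miss, read through to the backing store and keep the result.
        if key not in self._entries:
            self._entries[key] = self._loader(key)
            self.loads += 1
        return self._entries[key]

    def put(self, key: K, value: V) -> None:
        # Store or overwrite a value without touching the backing store.
        self._entries[key] = value

    def remove(self, key: K) -> bool:
        # Return True if an entry was present and has been removed.
        return self._entries.pop(key, None) is not None


def slow_lookup(customer_id: int) -> str:
    # Stand-in for an expensive database or network call (assumption).
    return f"customer-{customer_id}"


cache = ReadThroughCache(slow_lookup)
first = cache.get(42)    # miss: the loader runs
second = cache.get(42)   # hit: served from memory, the loader does not run
```

A standard API matters because application code written against one set of `get` and `put` style calls can, in principle, move between cache providers without being rewritten.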

Standardization was a long, multi-year effort -- dedicated members of the coffee-swigging Java army will recognize that simply by the initiative's designator number, "107." In any case, it could find wide use.

"Historically, people have looked at caching after the fact. They think in terms of 'build, deploy, scale,' which is sort of a 'ready, shoot, aim' type of model," said Miko Matsumura, vice president of marketing and developer relations at Hazelcast, a maker of in-memory data grids.

He said software architects now need to consider scaling long before deployment because "with cloud and mobile, your user base can grow exponentially all of a sudden."

Contributing to the JSR-107 effort were technologists from Hazelcast, Oracle and others. The standardization effort began in 2001. Who says Java isn't moving faster under Oracle's stewardship?

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.
