Apache Impala gets top-level status as open source Hadoop tool

Born at Cloudera, the MPP query engine known as Apache Impala has become a top-level open source project. It's one of various tools bringing SQL-style interactivity to big data analytics.

Apache Impala this week gained the status of top-level project within the Apache Software Foundation -- a key step in the SQL-on-Hadoop software's development progress, according to the open source standards group.

The Impala massively parallel processing (MPP) query engine is one of several technologies -- the list also includes Hive, Spark SQL, Drill, HAWQ, Presto and others -- that strive to bring SQL-style interactivity to distributed big data applications. Impala, which in effect puts SQL on top of Hadoop, originated at Hadoop distribution provider Cloudera, which has contributed considerable resources to the Apache Impala effort.

Companies such as Caterpillar, Cox Automotive and the New York Stock Exchange have used Impala, which employs a data architecture that separates query processing from storage management.

That separation of processing and storage was a deliberate part of the software's initial conception, according to Marcel Kornacker, who created the Impala technology as a software engineer at Palo Alto, Calif.-based Cloudera, where he worked until last June.

Cloudera originally released Impala under an Apache license in 2012. Impala formally entered the Apache Incubator process for new projects in December 2015. That year, Impala was opened up to community contributions, and it has had four releases since then. Google, Oracle, MapR and Intel are among the vendors that have developed integrations with Impala.

"Graduation from incubator to top-level project status is recognition of the strong team behind the Impala project," Kornacker said. "Along the way, Impala's stability has increased. The community made that possible."

Cluster at scale

"With Impala, you can perform analytics on a cluster at scale," said Brock Nolan, chief architect and co-founder of phData, a Minneapolis-based managed service provider and consultancy that works specifically with the Cloudera big data platform. Queries run more quickly, he said, and data scientists don't have to take their analytical jobs home with them.

Apache Impala benefited from efficient integration with other Hadoop ecosystem components, Nolan said. One useful integration he noted in particular is with Kudu. That technology, a column-oriented data store, is likewise an Apache project within the Hadoop ecosystem; it also originated at Cloudera and is also named after a species of antelope.

Nolan said he has found Kudu, together with Impala, useful in dealing with fast-moving internet-of-things data. As with much of the data that feeds big data tools these days, a large part of what is being gathered is unstructured and nonrelational, but SQL support is still vital for analytics on that data, in his view.

"Graduating to being a top-level project has really been about the ability to develop a community of contributors [to Impala]," he said. "When something becomes a top-level project, it means there is traction, and it is mature, and that there are multiple companies that are using it and contributing, not just customers of Cloudera."

Sting like a butterfly, query like an Impala

Software like Apache Impala moves the Hadoop ecosystem, originally based on the MapReduce processing scheme, deeper into the realm of real-time processing, according to Mike Matchett, an analyst and consultant at Taneja Group in Hopkinton, Mass.

"When you look at MapReduce, the original Hive and the like, you see a batch approach to analytics. They are not designed as interactive tools," Matchett said. Meanwhile, he continued, data managers want to deliver interactivity to their audience of users in the enterprise.

Impala, like other emerging tools, is intended to bring SQL capabilities, widely supported in organizations, to distributed data processing. These tools may find different fits within big data pipelines.

"We see Impala getting broad adoption," said Nolan, who also employs the Apache Spark processing engine's Spark SQL module for different use cases as part of the work phData does for its clients.

"Spark SQL is really good at ETL [extract, transform, load]," Nolan said. "At the same time, Impala is our go-to SQL on Hadoop for presenting results to data scientists and business analysts. We use both, but Impala is what broadens the user base of the data."

Technology creator Kornacker said the design of Apache Impala puts special emphasis on low latency and high concurrency, characteristics that have sometimes eluded Hadoop-style applications when they go into actual production and armies of users start to ask questions of their data trove.

Making the grade

The Impala advances are part of a broader move, one that sees much of the innovation in data processing tools centered on open source software, rather than proprietary tooling. That move has also seen the concept of Hadoop data processing gain a wider definition.

"These days, Hadoop is a collection of things," Nolan said. "Two years ago, it meant something specific. But because of its open architecture, it is different now."

Such open architecture has led to the seemingly sudden arrival of a host of data options. It allows users to, in Nolan's words, "bring in different layers of software now to enable new functions."

Software such as Impala drives forward the notion that data integration these days has more aspects, according to Rick Sherman, founder of consultancy Athena IT Solutions in Maynard, Mass. Sherman, who also teaches big data analytics at Northeastern University, counts Impala, as well as Hive, among the tools his students employ as part of their education.

"They have to learn that data integration isn't just relational," he said. "Today, there are different use cases for Hadoop, for NoSQL or for relational and columnar processing. Figuring out where the best uses of these tools are -- that is what you have to learn to do."
