HPE adapts Vertica analytical database to world with Hadoop, Spark

Vertica 8.0 expands the analytical database's support for Kafka, Spark and Hadoop. That's an important step, as the Hewlett Packard Enterprise technology tries to compete in a field of diverse data tools.

Hewlett Packard Enterprise this week rolled out a version of its Vertica analytical database system intended to improve Apache Kafka pipeline management, as well as Apache Spark and Hadoop integration.

The updates are part of HPE's effort to adapt to a data management space that has seen major proliferation of open source big data tools since Vertica first appeared -- an altered technology landscape that contributed to a recent decision by rival vendor Actian Corp. to drop out of the analytical database market.

Vertica has been able to access Hadoop data before, but with Vertica 8.0 the analytical engine can work with Hadoop data in place, thus reducing data movement.

That's part of a general trend with such engines, according to IDC analyst Carl Olofson. Still, he cautioned that Hadoop is far from a full replacement for analytical databases such as Vertica. "This means you can expand the types of data that you query. But it doesn't mean Hadoop takes over," Olofson said. "It's not an either-or situation."

Instead, he continued, better links between Vertica and Hadoop show that the different data processing types can coexist. High-performance querying capabilities of Vertica, he said, can in effect "reach into [Hadoop] data and bring valid result sets back to the database environment."
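As a rough illustration of why querying data in place cuts data movement, the toy Python sketch below compares two approaches: copying every record into the database before filtering, and applying the filter where the data lives so that only the result set travels back. It is a simplified model under stated assumptions, not a depiction of how Vertica or Hadoop work internally. All names (`REMOTE_ROWS`, `load_then_query`, `query_in_place`, `is_heavy_us_user`) are hypothetical.

```python
# Toy model: a list of dicts stands in for records stored in Hadoop.
# Every name here is hypothetical and used only for illustration.
REMOTE_ROWS = [
    {"user_id": i, "country": "US" if i % 4 == 0 else "DE", "clicks": i % 7}
    for i in range(1000)
]


def load_then_query(remote_rows, predicate):
    """Copy everything into the database first, then filter locally."""
    local_copy = list(remote_rows)  # every row is moved
    result = [row for row in local_copy if predicate(row)]
    return result, len(local_copy)  # (result set, rows moved)


def query_in_place(remote_rows, predicate):
    """Apply the filter where the data lives; ship back only the matches."""
    result = [row for row in remote_rows if predicate(row)]
    return result, len(result)  # (result set, rows moved)


def is_heavy_us_user(row):
    return row["country"] == "US" and row["clicks"] >= 5


copied_result, copied_moved = load_then_query(REMOTE_ROWS, is_heavy_us_user)
in_place_result, in_place_moved = query_in_place(REMOTE_ROWS, is_heavy_us_user)

print(f"bulk load moved {copied_moved} rows; in-place query moved {in_place_moved}")
```

Both functions return the same answer. The difference is how many rows have to leave the storage layer to produce it.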

The in-place processing update for Hadoop, along with new links to Apache Spark, is intended to enable Vertica to play alongside open source Hadoop and Spark tools. While less mature, the open source offerings are finding use for new types of analytics, especially ones dealing with massive amounts of web data.

To this end, HPE Vertica 8.0 supports faster data loading, visual monitoring of Apache Kafka data streams, and in-database machine learning libraries. The new Apache Spark connector is said to support faster data exchange between Vertica and Spark systems.

Also on tap is support for the Apache Parquet storage format that complements the ORC Hadoop file format support already in place. The Vertica enhancements were discussed at the company's Big Data Conference 2016 in Boston.

Crowded field gets less crowded

Highly scalable analytical databases like Vertica arose during the past 10 years as an alternative to general-purpose relational database management systems for some types of data warehousing and analytics number crunching.

Based largely on fast column-store architectures powered by massively parallel processing, the early field also included Netezza, Greenplum, ParAccel and others in addition to Vertica. Collectively, they made a mark in data management by running queries more quickly than established databases and data warehouses, where many such jobs were taking too long. Large vendors quickly took notice and bought up technologies one by one -- for example, IBM acquired Netezza, EMC purchased Greenplum and HP took over Vertica.
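To make the column-store and parallel-processing ideas concrete, here is a minimal Python sketch, not any vendor's implementation. It stores the same hypothetical orders table row by row and column by column, then totals one field three ways: by walking whole records, by reading a single column, and by splitting that column into slices that separate workers sum before the partial results are combined. The table, field names and `num_nodes` parameter are assumptions made for illustration. The threads only demonstrate the scatter-and-gather pattern and should not be expected to speed up pure Python code.

```python
from concurrent.futures import ThreadPoolExecutor

# Row-oriented layout: each record keeps all of its fields together.
# The table and its fields are hypothetical.
rows = [
    {"order_id": i, "region": ("east", "west")[i % 2], "amount": float(i % 50)}
    for i in range(10_000)
]

# Column-oriented layout: one contiguous list per field.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}


def total_amount_row_store(table):
    """Walk every full record to pull out the one field the query needs."""
    return sum(record["amount"] for record in table)


def total_amount_column_store(table):
    """Read only the single column the query needs."""
    return sum(table["amount"])


def total_amount_parallel(amounts, num_nodes=4):
    """Split the column into slices, sum each slice on its own worker
    (a stand-in for a node), then combine the partial sums."""
    chunk = max(1, -(-len(amounts) // num_nodes))  # ceiling division
    slices = [amounts[i:i + chunk] for i in range(0, len(amounts), chunk)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
print(total_amount_parallel(columns["amount"]))
```

All three functions compute the same total. In a real analytical database, the gain comes from reading far less data from disk per query and from spreading the remaining work across many machines.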

But with all the entries, the analytical database field is crowded and competitive -- and sales as a whole haven't lived up to the original optimistic expectations. That combination was enough to drive Actian out of the market: The company this week confirmed it's pulling the plug on its Actian Analytics Platform, which includes analytical database Actian Matrix, in order to focus on operational data management and data integration technologies.

Actian Matrix was based on technology the vendor gained with its 2013 acquisition of analytical DBMS startup ParAccel. Other products being discontinued along with it include Actian Vector, another database that's built on a symmetric multiprocessing architecture; VectorH, Actian's SQL-on-Hadoop query engine; and the company's DataFlow processing engine. In a statement, Actian said it was shifting its resources "to other, more predictable business segments."

Some of what could be called "the ParAccel torch" is carried forward in the increasingly popular Amazon Redshift cloud data warehouse. Amazon Redshift is based in large part on the ParAccel technology, which Amazon Web Services licensed when ParAccel was still an independent company.

Diversity in data processing

That an analytical database like Vertica often sits alongside other, diverse data technologies is evident from a quick inventory at Etsy Inc., an online marketplace for artisans. Rafe Colburn, Etsy's director of engineering, lists Kafka, AWS, Scalding (for developing machine learning routines), Hadoop MapReduce and Parquet as just some of the software the company employs along with Vertica -- not to mention that Etsy is "looking into Spark."

Colburn said Etsy is on Version 7.1 of Vertica, and looking at 7.2 features. He added that Vertica is used for supporting internal dashboards and financial reporting, among other jobs, and has improved users' ability to query Etsy customer activity over an earlier Postgres DB implementation.

Vertica 8.0's support of Parquet is of interest, Colburn said, because his shop has begun to work with the Parquet format. "Parquet is the data format of the future for us," he explained, while acknowledging that the future may hold still more data formats to support.

Vertica, he said, provided horizontal scalability that was welcome, wasn't difficult to install in Etsy's data center and has proved relatively easy to use for ingesting data into the system. He said HPE's engineering improvements to the Spark-Vertica connector showed promise in terms of performance.

Enter machine learning, exit databases?

SQL queries requiring high concurrency have been a sweet spot for analytical databases like Vertica. Where they may be challenged going forward, according to some analysts, is in the statistically oriented machine learning approaches now making headway among some big web companies.

Analytical RDBMSs were successful because initially they offered radical price-performance advantages over existing database and data warehouse alternatives in analytical SQL, according to Curt Monash, president of Monash Research.

"They scaled out well to many nodes, and a number either started out with columnar systems or added columnar capabilities early on," he said. But in response, the incumbents cut prices and improved their own capabilities for analytical SQL use cases, in Monash's view.

In a recent blog post that ponders the future of the analytical RDBMS, Monash said the systems still excel at key business intelligence jobs, such as complex ad hoc queries and high-concurrency reporting and dashboards. But he also suggested that new types of advanced analytics, such as machine learning, may find a better home in Spark.

[UPDATE: Following the Vertica software update described here, Hewlett Packard Enterprise announced plans for a spin-off of "non-core" software assets, including its Vertica software, to mainframe and operating system software vendor Micro Focus in a transaction the companies valued at about $8.8 billion. By the terms of the deal, HPE shareholders will own 50.1% of the newly combined operations. The transaction is expected to be completed in the second half of 2017.]
