HP said it will boost the data streaming proficiency of its Vertica analytics platform and improve its SQL-on-Hadoop capabilities. According to the company, the next Vertica release, code-named "Excavator," will support Apache Kafka, the up-and-coming open-source distributed messaging system.
Excavator details emerged during the company's recent Big Data Conference in Boston, at a time when HP was still in the midst of preparations to split into separate server-data center and PC-printer businesses, a move expected later this year.
Getting data to the right systems for processing is a big problem today, and the HP Vertica updates appear helpful, according to Chris Bohn, a senior database engineer with Etsy, Inc., which runs an online marketplace for artisans.
"What we want is a unified place for data, and Kafka can be used for that," said Bohn, who is an Excavator beta user. He said his team will be looking at the SQL-on-Hadoop advances as well, adding that existing software of this variety has yet to meet his requirements.
What is now known as Apache Kafka emerged in 2011 from the work of development teams at LinkedIn Corp. It is a distributed messaging system built to support a publish-and-subscribe computing architecture, and it has often been used to feed Hadoop-based data processing systems.
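To make the publish-and-subscribe idea concrete, here is a minimal in-memory sketch. It is not the Kafka API and has none of Kafka's distribution or durability; the `ToyBroker` class, the topic name and the subscriber names are all hypothetical, chosen only to show how several consumers can read the same appended stream at their own pace.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe log (hypothetical, not Kafka)."""

    def __init__(self):
        # Each topic is an append-only list of messages.
        self._topics = defaultdict(list)
        # Each (subscriber, topic) pair tracks its own read position.
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, subscriber, topic):
        """Return the messages this subscriber has not yet seen on the topic."""
        log = self._topics[topic]
        start = self._offsets[(subscriber, topic)]
        self._offsets[(subscriber, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("weblogs", "GET /listing/1")
broker.publish("weblogs", "GET /listing/2")

# Two independent subscribers consume the same stream separately.
first_batch = broker.poll("hadoop_loader", "weblogs")
broker.publish("weblogs", "GET /cart")
second_batch = broker.poll("hadoop_loader", "weblogs")
analytics_batch = broker.poll("analytics_db", "weblogs")
```

Because each subscriber keeps its own position in the log, a slow consumer such as a batch loader does not hold back a faster one, which is part of why this style of messaging suits a "unified place for data."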
From batch to real time
HP and others have used Apache open-source code to create versions of Kafka that can put large-scale data -- including log and other file types -- in motion. Although not on par with the fastest "real-time" middleware, software like Kafka moves data handling closer to the realms of operations, and further away from roots in batch processing.
"The ability to take log files that are streaming in is important," said Dan Vesset, program vice president of IDC's Business Analytics research.
"This is not Wall Street trading speed streaming, but it is still a much higher speed than the traditional ETL, where you may be loading batches," he said, noting that Vertica's roots were in high-performance batch processing.
Vertica is "growing up," adding features like Kafka into its portfolio, Vesset said. "We see Vertica and others moving to streaming from the traditional data warehousing scenario where batch jobs were loaded often at nightly intervals."
Digging into Excavator
"There is a need for innovations that the traditional data warehouse market hasn't serviced," said Colin Mahony, general manager of Vertica within HP Software. In that regard, the Hadoop application framework has been an example, he said.
Hadoop takes a looser approach to data schema design than was common in traditional data warehousing, where full-fledged, set schemas were often an obstacle to agile development.
Mahony said Vertica has addressed such needs with Flex Table technology that adapts to the unique demands of ever-changing semi-structured data. To further the use of Flex Table, he said, HP open-sourced the Vertica Flex Zone Table Library for designers looking to pursue schema-on-the-fly.
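The schema-on-the-fly idea can be illustrated with a small sketch. This is not Vertica's Flex Table implementation or its library; it is a generic Python example, with made-up event records and hypothetical helper names, showing semi-structured data loaded without a predeclared schema and columns discovered only when they are queried.

```python
import json

# Made-up semi-structured events; each record carries a different set of fields.
raw_events = [
    '{"user": "ann", "action": "view"}',
    '{"user": "bob", "action": "buy", "price": 12.5}',
    '{"user": "cy", "device": "mobile"}',
]


def load(lines):
    """Parse records as-is, without checking them against a fixed schema."""
    return [json.loads(line) for line in lines]


def discovered_columns(records):
    """Collect every field name that appears in at least one record."""
    columns = set()
    for record in records:
        columns.update(record)
    return sorted(columns)


def select(records, column):
    """Read one column; records lacking the field yield None."""
    return [record.get(column) for record in records]


records = load(raw_events)
columns = discovered_columns(records)
prices = select(records, "price")
```

A new field such as `device` can show up in the data without any table change; the trade-off is that missing values have to be handled at query time rather than rejected at load time.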
He said Excavator will support high-performance Hadoop data access by working directly on the Optimized Row Columnar (ORC) file format. The format has been developed and promoted by HP technology partner Hortonworks, which provides a Hadoop distribution.
Excavator's new SQL-on-Hadoop tooling will enable existing Vertica SQL queries to run directly on this new variety of columnar files to improve performance of Hive SQL-on-Hadoop data reads. The release is also expected to include advanced machine-log text search and native integration with Apache Spark. The Vertica Excavator software is due for general release this fall.
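Why a columnar file format can speed up analytic reads is easier to see in a toy example. The sketch below is not ORC's actual on-disk layout and does not use any ORC or Vertica library; the order records and the `to_columns` helper are hypothetical, and the point is only that an aggregate over one field needs to touch a single column rather than every full row.

```python
# Made-up order records in a row-oriented layout.
rows = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 10.0},
]


def to_columns(rows):
    """Regroup row-oriented records so each field's values are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads only that column's values.
total_amount = sum(columns["amount"])
```

In a real columnar file the same principle lets a query engine skip the bytes of columns a query never references.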