Startup Dremio has tapped into the budding Apache Arrow columnar format with an eye toward speeding interactive queries and improving self-service capabilities for data analytics.
The Dremio Self-Service Data Platform, released earlier this month, targets data engineers' needs to connect multiple big data processing components with less data movement and without undue data reformatting.
At the same time, the software provides data curation and lineage interfaces, as well as a ready interface for data scientists working in languages such as Python and R.
The platform is also built to speed queries, automatically rewriting SQL to run in the native query language of data sources such as Elasticsearch, HBase and MongoDB, as well as relational data stores. That fast querying is what matters to an early Dremio adopter at a French cloud services and hosting provider.
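To make the rewriting idea concrete, here is a toy sketch of the kind of translation such a pushdown performs for a simple GROUP BY. This is not Dremio's actual translator; the SQL string, the pipeline and the tiny in-memory evaluator are all illustrative, and the pipeline mirrors the shape of a MongoDB `$group` stage.

```python
# Illustrative sketch only -- not Dremio's rewriter. It shows the kind of
# native query a SQL GROUP BY could be pushed down to for MongoDB.

sql = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"

# A MongoDB-style aggregation pipeline equivalent to the SQL above:
mongo_pipeline = [
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}}
]

def run_pipeline(docs, pipeline):
    """Tiny in-memory evaluator for the single $group/$sum stage above."""
    stage = pipeline[0]["$group"]
    key_field = stage["_id"].lstrip("$")
    sum_field = stage["total"]["$sum"].lstrip("$")
    totals = {}
    for doc in docs:
        totals[doc[key_field]] = totals.get(doc[key_field], 0) + doc[sum_field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

orders = [
    {"region": "EU", "amount": 10},
    {"region": "EU", "amount": 5},
    {"region": "US", "amount": 7},
]
result = run_pipeline(orders, mongo_pipeline)
```

The point is that the aggregation runs where the data lives, in the source's own query language, rather than pulling raw rows out first.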
"For us, the main key in the picture is the ability of the software to speed up queries," said Alexis Gendronneau, data convergence team leader at OVH, based in Roubaix, France. "With Dremio, we experienced up to 100 times faster queries on some data sets."
Gendronneau said the software's ability to create aggregations, or caches, of specific data sets and then optimize for specific queries helped accelerate operations.
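A hedged, toy sketch of that caching idea follows; it is not Dremio's implementation. The field names and helper are invented for illustration. The principle is that one pass over the raw data builds a compact per-key aggregate, and repeated queries then read the small cache instead of rescanning the rows.

```python
# Toy illustration of aggregation caching (not Dremio's implementation).
from collections import defaultdict

raw_rows = [
    {"service": "vps", "gb_used": 120},
    {"service": "vps", "gb_used": 80},
    {"service": "storage", "gb_used": 500},
]

def build_aggregation_cache(rows, key, measure):
    """One pass over the raw data produces a compact per-key aggregate."""
    cache = defaultdict(int)
    for row in rows:
        cache[row[key]] += row[measure]
    return dict(cache)

# Built once...
cache = build_aggregation_cache(raw_rows, "service", "gb_used")

# ...then every later query reads the small cache, not the raw rows.
vps_total = cache["vps"]
```

Queries that match the cached aggregation avoid the full scan entirely, which is the kind of speedup Gendronneau describes.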
Roots of the Dremio team
To forge its offering, Dremio tapped into several Apache open source projects. Dremio co-founder and CEO Tomer Shiran came by way of IBM and MapR, and was part of the project that created the Apache Drill query engine.
Meanwhile, co-founder and CTO Jacques Nadeau, also from MapR, was both an Apache Drill and Apache Arrow committer. The Arrow software is intended to support the columnar style of analytical processing that has become increasingly prevalent.
"Dremio has assembled a team of data veterans who methodically set out to solve the big problem of accessing high-scale data from many, disparate sources," said Doug Henschen, an analyst at Constellation Research.
Important among several open source projects underlying Dremio's work, he said, is Apache Arrow. "It provides columnar, in-memory analytical speed and a way to establish consistent representation of data across disparate technologies."
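The row-versus-column contrast that Arrow's format is built around can be sketched in plain Python; this is not the Arrow API itself (the `pyarrow` library provides that), just an illustration of why the columnar layout suits analytics.

```python
# Plain-Python sketch of row-oriented vs. columnar layouts -- not the
# Arrow API, just the idea behind it.
from array import array

# Row-oriented: each record's fields are stored together.
rows = [("eu-west", 10), ("us-east", 7), ("eu-west", 5)]

# Columnar: all values of one field sit together, here in one contiguous
# buffer of C integers, which is what makes column scans fast.
regions = ["eu-west", "us-east", "eu-west"]
amounts = array("i", [10, 7, 5])

# An analytical query touching only one column reads one buffer straight
# through, instead of skipping over the other fields of every row.
total = sum(amounts)
```

Arrow standardizes such a columnar layout in memory, which is also what lets different engines share data without converting it.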
Dremio is like "big data middleware," in the estimation of Wayne Eckerson, founder and principal consultant at Eckerson Group.
"It is really designed to connect any BI tool to any data source, without the need to build an interim data warehouse, or the need for [extract, transform and load] and modeling," he said. This will make it easier for business users to get the data they want from any source, according to Eckerson, without having to wait for a data warehouse to be built.
Building data sets
Dremio's potential for better self-service speaks to some of the needs of OVH's Gendronneau, whose job it is to provide views of cloud services' use for decision-making both inside and outside the company.
It is not easy, according to Gendronneau, if only because data resides in a range of data stores, including MySQL, PostgreSQL, MongoDB, Elasticsearch and SQL Server.
"Building the right data sets for users can be painful and has a large cost," he said. "Our team has had to talk with the users to understand what they need, and then create usable data sets."
With Dremio, he continued, users were able to create their own data sets without explaining their needs to a technical team member. Software like Dremio can enable users to serve themselves with helpings of analytical data and break a bottleneck in the big data pipeline, he said.