
DataTorrent data ingestion tool aims to speed Hadoop feeds

A new data ingestion and extraction tool supporting the Hadoop Distributed File System is at the heart of startup vendor DataTorrent's efforts to broaden its big data analytics engine's appeal.

Big data analytics platform vendor DataTorrent has released its first standalone application, a fault-tolerant data ingestion and extraction tool for users of the Hadoop Distributed File System (HDFS). The software, called dtIngest, can move data between HDFS, Kafka, the Java Message Service and other data sources and targets.

The new tool includes a point-and-click user interface and runs on Hadoop 2 clusters as a native YARN application. The software is designed to support both large and small files, allowing smaller ones to be aggregated into larger files to reduce the overall number of ingestion jobs that Hadoop systems have to process. Besides running in standalone mode, dtIngest can also work with DataTorrent's flagship RTS 3 in-memory analytics engine. As such, it demonstrates the Santa Clara, Calif., company's goal to go beyond streaming data and cover batch processing as well. And coupled with other recent moves, the release of dtIngest signals DataTorrent's intention to establish itself more visibly as a member of the Apache Hadoop ecosystem.
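The small-file aggregation idea described above addresses a well-known HDFS pain point: each file adds metadata overhead, so many tiny files are costlier to manage than a few large ones. As a rough illustration only, the sketch below compacts small local files into a single larger one; it is a hypothetical stand-in using plain file I/O, not dtIngest's actual implementation or any Hadoop API, and the 128 MB threshold simply mirrors a common HDFS block size.

```python
import os

def compact_small_files(paths, out_path, size_threshold=128 * 1024 * 1024):
    """Concatenate files smaller than size_threshold into one larger file.

    Returns the list of paths that were absorbed; files at or above the
    threshold are left alone for the caller to handle individually.
    """
    absorbed = []
    with open(out_path, "wb") as out:
        for p in paths:
            if os.path.getsize(p) < size_threshold:
                with open(p, "rb") as f:
                    out.write(f.read())
                absorbed.append(p)
    return absorbed
```

A real ingestion tool would also track which source files landed in which output file so that records remain traceable after compaction.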

For example, DataTorrent said it would allow organizations unlimited free use of the ingestion software. The company also said that an open source implementation of RTS called Project Apex is now available on GitHub, following earlier word that the core engine would be released as open source technology under the Apache 2.0 license.

Data ingestion as foot in door


Data ingestion could be an entry point into user organizations for DataTorrent, which was formed by expatriates from Yahoo in 2012 as the Hadoop software that originated at the Internet services company took early flight. Now, several years into the Hadoop experience, the challenge of loading data into HDFS remains one of several factors cited when people ponder slow Hadoop uptake in mainstream organizations.

"Getting data into Hadoop can be hard, and getting it out can be just as difficult," said John Fanelli, vice president of marketing at DataTorrent. Fanelli said dtIngest enables point-and-click configuration of Hadoop ingestion and extraction jobs, easing the development burden.

According to a 2014 report by Jason Stamper, an analyst at 451 Research LLC, Hadoop data analysis becomes much more troublesome without the right ingestion and data management tools. He noted that DataTorrent's founders have a strong real-time engineering pedigree.

Spark, Storm lie in wait

Stamper's report notes as well that DataTorrent RTS faces serious competition from the Apache Storm and Apache Spark processing engines, both of which have gained attention since the advent of Hadoop 2.0 in late 2013.

Customers cited by DataTorrent include PubMatic, which uses RTS as part of a real-time ad analytics platform, and Silver Spring Networks, which has deployed it to help power a sensor networking application. The competitive environment marked by Spark and Storm can be seen as a likely driver of the company's recent moves to open up more access to its offerings.

Jack Vaughan is SearchDataManagement's news and site editor. Email him at [email protected], and follow us on Twitter: @sDataManagement.

Next Steps

Take a look at Hadoop performance bottlenecks

Learn more about big data in motion

Find out how online ad companies use Spark for data streaming

Data ingestion means a bigger role for NoSQL in IoT
