
Spark helps power Paxata big data preparation platform

Count Paxata among startups set to apply machine learning to problems of big data preparation.

Paxata is among the startups looking to apply machine learning techniques to the growing problems of big data preparation. Like Tamr, Trifacta and others, the company has set out to meet the needs of data scientists and business analysts whose days are too often consumed by data preparation tasks that keep them from their chief objective: data analysis.

A recent release of the Paxata platform is intended to prepare larger, more varied sets of data for use by back-end tools. The software couples a model-free, in-memory pipeline processor and a Spark-based distributed processing engine with the Hadoop Distributed File System.

Machine learning and semantic indexing capabilities are part of Paxata's effort to bring a higher degree of automation to the task of data preparation. These are intended to take on the data transformation and related work that has kept data scientists and business analysts too busy.
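
To make the kind of work being automated more concrete, here is a minimal sketch of one common preparation chore: grouping inconsistent spellings of the same value. It is not Paxata's implementation, and it does not use machine learning or Spark. It assumes plain string similarity from Python's standard-library `difflib` as a simple stand-in. The function names, the sample values and the 0.85 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def cluster_values(values, threshold=0.85):
    """Greedily group values whose normalized forms look alike.

    Each value joins the first cluster whose representative (its first
    member) scores at or above `threshold`; otherwise it starts a new
    cluster. The threshold is an illustrative assumption, not a tuned value.
    """
    clusters = []
    for value in values:
        key = normalize(value)
        for cluster in clusters:
            representative = normalize(cluster[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                cluster.append(value)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([value])
    return clusters


# Hypothetical messy column of company names.
raw = ["Acme Corp", "acme corp ", "ACME Corp.", "Globex Inc", "Globex  Inc"]
for group in cluster_values(raw):
    print(group)
```

A greedy, single-pass grouping like this is easy to follow but depends on input order and compares every value with every cluster, so it would not scale to the data volumes the article describes. That gap is the sort of problem a distributed engine is meant to address.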

Paxata co-founder and vice president of products Nenshad Bardoliwalla said the software is meant to address what could be called the "first mile" in the quest for data analytics: the front-end preparation and integration of big data for consumption. While Bardoliwalla gives high marks to recent advances in "last mile" data visualization software, singling out provider Tableau Software for its product's ease of use, he says more is needed.

Bardoliwalla said Paxata's "Spring '15" platform release supports data extraction with a REST API toolkit. This is meant to pave the way for a more general style of data extraction.

"If getting data to the visualization tool requires considerable effort, then you have just punted the real problem further down the road, even if you have easy-to-use visual tools," he said.

Paxata seeks to provide an interface to its machine learning capabilities that bears some resemblance to the spreadsheet view that Tableau and others have pursued. Bardoliwalla described it as a "table view."

"The most difficult part of the analytics process is pulling a lot of data from a lot of sources," said Bardoliwalla. "In the same way Tableau has been in the forefront of putting visual analytics into the hands of Excel users, we would like to lead in a new way of thinking about data preparation."

Jack Vaughan is SearchDataManagement's news and site editor. Email him at [email protected], and follow us on Twitter: @sDataManagement.
