
Data preparation process comes under new scrutiny

Data preparation processes are changing as business users cope with big data in all its variations, according to the latest Talking Data podcast.

The data preparation process was something of a sleepy backwater in recent years -- its component tools were often overlooked as attention poured into new methods of data storage and data analytics. But, as new software storage frameworks and analytical engines appear, so, too, do new breeds of data preparation. This makes for an exciting amalgam.

In our most recent edition of the Talking Data podcast, we take a look at aspects of data preparation today, for which there really may be something novel in store. Coming in for scrutiny are SQL-on-Hadoop offerings that prepare Hadoop data for use by business analysts skilled in SQL; data curation software meant to bring greater rigor to the integration of structured and semi-structured data; and a newly developed breed of tools that apply machine learning algorithms to data integration in order to put data manipulation in the hands of business users.

SQL-on-Hadoop, particularly, highlights the renewed importance that the data preparation process enjoys. Filling up bins of Hadoop with data, it turns out, is just step one for modern big data processing. People actually want to do something with the data. That means cleaning and formatting the data for general consumption within an organization, which may require SQL tools familiar to wide ranks within companies.

Also discussed in the podcast is a presentation by MIT Associate Professor -- and recent ACM Turing Award winner -- Michael Stonebraker, who spoke at the 2015 MIT Chief Data Officer & Information Quality Symposium. While soundly criticizing traditional methods of data preparation, Stonebraker promoted new data preparation techniques formed to cope with the growing variety and velocity of data.

Finally, the discussion turns to startups looking to improve the results of the data preparation process in the face of the challenges Stonebraker outlined. Alation, Paxata, Tamr, Trifacta and others are among the software vendors bringing artificial-intelligence-style machine learning to the problem of data preparation.
