Self-service data preparation tools tap into machine learning

Machine learning is a hot technology in analytics applications -- and it also underpins new data preparation tools that let business analysts and users integrate data for analysis.

Interest in machine learning technology often revolves around its capacity to automate and improve analytical predictions. But it has other uses, too. In one emerging example, machine learning underlies data management advancements in the zone that lies between IT-based data developers and analysts working in business units, powering a new category of self-service data preparation tools.

Such tools can search for and access data throughout an organization, combine it with other data sets and do format conversions as needed, before feeding the integrated data into back-end business intelligence systems for analysis. Software vendors assert that the machine learning techniques built into the tools enable them to learn as they go and improve integration performance with continued use.

Machine learning itself may not be the first concern for business users trying to exploit their enterprise data for use in analytics applications. Still, they approve of the results produced by the new tools, which make it possible for useful sets of queries and data management task sequences to be saved and re-used.

"I don't see any machine learning happening per se," said Kunal Patel, a data analyst at Inflection Inc., a maker of identity management, recruiting and public-records search software in Redwood City, Calif. But Patel, whose group is using a data preparation platform from Alation Inc., said the software makes real-time suggestions concerning tactics for mixing data streams. The suggestions can be based on scenarios he has implemented before, or on similar jobs that colleagues have run -- evidence of the machine learning functionality in action.

Patel said users of the Alation software at Inflection sign in and search for existing queries that they can use to do business analytics. They can also make copies of query sets they or others have created, without calling on the IT department for development help.

"We now have the foundations for non-technical people to get up to speed," he said. "We're already seeing more engagement from, for example, product managers -- ones who just don't want to spend from four to eight hours a day writing queries."

Data, prepare thyself

Along with Alation, startups like Alteryx, Paxata, Tamr and Trifacta are pursuing self-service data preparation with puzzle pieces or full offerings. More established companies such as IBM, Informatica, Progress Software and Salesforce are entering the fray, too.

Alation CEO Satyen Sangani said the company's software is designed to help users "increase their data literacy" by capturing information about things such as who is using a particular table or query and how often. Sangani doesn't emphasize the machine learning technology that underpins the knowledge capture process. Instead, it's "something under the hood, in the way a transmission is under the hood of a car," he said.
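As a rough illustration of that kind of knowledge capture, the sketch below logs who runs which query against which table and then suggests the most-used queries for a table. It is a plain frequency count, not machine learning and not Alation's implementation; the `UsageCatalog` class, its methods and the sample names are all hypothetical.

```python
from collections import Counter, defaultdict


class UsageCatalog:
    """Hypothetical log of who queries which table, and how often."""

    def __init__(self):
        self.runs_by_table = defaultdict(Counter)  # table -> query -> run count
        self.users_by_table = defaultdict(set)     # table -> users who touched it

    def record(self, user, table, query):
        # Capture one query run against a table.
        self.runs_by_table[table][query] += 1
        self.users_by_table[table].add(user)

    def suggest(self, table, limit=3):
        # Most frequently run queries for this table, most popular first.
        counts = self.runs_by_table.get(table, Counter())
        return [query for query, _ in counts.most_common(limit)]

    def users(self, table):
        # Sorted list of everyone who has queried this table.
        return sorted(self.users_by_table.get(table, set()))


# Sample activity, invented for illustration only.
catalog = UsageCatalog()
catalog.record("analyst_a", "orders", "weekly_revenue")
catalog.record("analyst_b", "orders", "weekly_revenue")
catalog.record("analyst_b", "orders", "refund_rate")

suggestions = catalog.suggest("orders")
```

A real product would presumably weigh far more signals than raw counts, but the principle of reusing colleagues' earlier work is the same.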

Self-service data prep is becoming central to big data analytics applications, according to Nenshad Bardoliwalla, co-founder and vice president of products at Paxata. The most difficult part of the analytics process is pulling a lot of data from a lot of different sources, said Bardoliwalla, whose company's data preparation software employs semantic indexing, Spark machine learning and the Hadoop Distributed File System in an effort to help users handle large and diverse data workloads.

Del Monte migrates to new process

Another benefit of self-service data preparation is that IT resources are freed up to focus on other tasks, said Matthew Heinze, who heads BI at Del Monte Foods Co. in Walnut Creek, Calif. He and his team deployed Paxata's platform as part of a companywide migration to a cloud architecture.

A limited transaction set involved critical SAP general ledger data and product-shipment information. For reporting purposes, that needed to be blended with other types of data. The data preparation tools helped the BI team streamline the process of creating the reports and make it easier for business users to integrate data themselves.

"Before," Heinze said, "people would take SAP data and do offline integration with, say, Nielsen point-of-sale data, using Excel models. But that didn't run very well." Now, the two data sets are sent to a cloud-based Paxata implementation, where the vendor's specialized machine learning algorithms can be applied.

In addition, each step in an integration routine is saved in the Paxata environment, which "helps you form repeatable integrations that can be consumed by the reporting platform," Heinze said. He added that business users can load data, put integrations together themselves and see the effects of what they've done immediately, without taking up IT staff time.
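The idea of saving each step so an integration can be repeated can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Paxata's software: the `Recipe` class, the step functions, the field names and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


class Recipe:
    """Hypothetical record of preparation steps, kept in order for replay."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Step]] = []

    def add(self, name: str, step: Step) -> "Recipe":
        # Save the step under a readable name and allow chaining.
        self.steps.append((name, step))
        return self

    def run(self, rows: List[Row]) -> List[Row]:
        # Replay every saved step, in order, on a fresh copy of the data.
        for _name, step in self.steps:
            rows = step(rows)
        return rows


def normalize_sku(rows: List[Row]) -> List[Row]:
    # Format conversion: trim whitespace and upper-case the join key.
    return [{**r, "sku": str(r["sku"]).strip().upper()} for r in rows]


def blend_with(pos_rows: List[Row]) -> Step:
    # Build a step that attaches point-of-sale units to each row by SKU.
    units_by_sku = {r["sku"]: r["units_sold"] for r in pos_rows}

    def step(rows: List[Row]) -> List[Row]:
        return [{**r, "units_sold": units_by_sku.get(r["sku"])} for r in rows]

    return step


# Sample data, invented for illustration only.
shipments = [{"sku": " ab-1 ", "shipped": 120}, {"sku": "cd-2", "shipped": 80}]
point_of_sale = [{"sku": "AB-1", "units_sold": 95}]

recipe = Recipe().add("normalize_sku", normalize_sku).add("blend_pos", blend_with(point_of_sale))
blended = recipe.run(shipments)
```

Because the steps are stored rather than performed by hand, the same recipe can be rerun whenever new shipment data arrives, which is the repeatability Heinze describes.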

Data prep meets cloud, data lake

The move toward cloud applications with new data integration requirements and the growing need to navigate Hadoop data lakes filled with a wide variety of data are propelling the interest in self-service data preparation, according to Philip Howard, an analyst at London-based Bloor Research International.

With the advent of the data lake, IT is no longer acting as the data gatekeeper, as it did in the past, Howard said. "You have a data lake and you want to explore it. But the issue for analysts has remained the same over time -- that is, how you get access to that information." Meanwhile, he added, people increasingly want to take application data that is in the cloud and bring it together with in-house data, without a long wait for an IT project to do the integration work.

And Howard thinks many business users are ready to take on data preparation chores. "Most of the vendors are addressing the needs of people who are reasonably tech-savvy," he said, although he noted that users who aren't data scientists probably don't care much about the fact that the new breed of data prep tools is driven by machine learning technology. "If the software running inside has some smarts, if it can give you recommendations based on what you or others have done, that's what is useful," Howard said.

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.
