
Data prep for deep learning applications means careful planning

Deep learning applications often require a mix of data, and assorted preprocessing techniques. That makes data preparation a priority, and conventional machine learning may have a role to play.

With all the talk about predictive machine learning and deep learning applications, one can lose sight of the data engineering -- some might call it data art -- that is needed to prepare the data to work on.

Many questions go into the planning for deep learning applications that tap into deep banks of neural networks. Here are just a few: Should the processing be distributed? Is "labeled data" -- or, known good data -- available to help supervise learning? How much noise obscures the signal arriving from internet of things devices such as cellphones?

In the case of mobile phone sensors, data preparation for deep learning applications can present unique problems, as described by a data scientist working to uncover safe or unsafe driving patterns via data from the cellphones that drivers take along for the ride.

In such cases, data preparation can involve considerable preprocessing, according to Dan Shiebler, a data scientist at startup TrueMotion, who discussed the problems of messy sensor data at the 2017 Deep Learning Summit in Boston last week.

Boston-based TrueMotion offers free downloadable apps for people who want to measure their driving skills, while also working with insurance companies that offer safe driving incentive programs for customers who allow their driving habits to be remotely monitored.

Insurance and other industries are entering "the golden age of sensor data," Shiebler said. That sensor data often includes data from gyroscopes, accelerometers, magnetometers and GPS chips inside the cellphone a typical driver carries.

But the data needs preprocessing. TrueMotion's systems may first churn through combinations of this sensor data using more traditional, and more supervised, machine learning algorithms, before finding the right mix of data for subsequent deep neural learning.

"The data we get is very noisy. There are a lot of signals in there that we are not interested in," Shiebler said in a follow-up interview. "We figure out what algorithms we need to learn things best. Given enough data, the algorithms in turn will figure out the right transformations for the data."

With experience, he and his colleagues have learned how to filter data from different sensors to, for example, ascertain a cellphone's orientation -- knowing whether a phone is in landscape or portrait position helps clarify the meaning of accelerometer and other data, he said.
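As an illustration of the kind of filtering Shiebler describes, the sketch below low-pass filters raw accelerometer samples to isolate the slowly changing gravity component, then classifies a coarse portrait-versus-landscape orientation from the dominant gravity axis. This is not TrueMotion's actual pipeline; the smoothing factor, axis convention and thresholds are assumptions for illustration.

```python
# Illustrative sketch: inferring coarse phone orientation from raw
# accelerometer samples. ALPHA and the axis convention are assumptions.

ALPHA = 0.9  # smoothing factor for the exponential low-pass filter


def lowpass(samples, alpha=ALPHA):
    """Isolate the slowly changing gravity component of each axis."""
    gravity = list(samples[0])
    smoothed = []
    for x, y, z in samples:
        gravity = [alpha * g + (1 - alpha) * v
                   for g, v in zip(gravity, (x, y, z))]
        smoothed.append(tuple(gravity))
    return smoothed


def orientation(gravity_sample):
    """Classify portrait vs. landscape from the dominant gravity axis."""
    gx, gy, gz = gravity_sample
    return "portrait" if abs(gy) > abs(gx) else "landscape"


# A phone held upright sees gravity mostly along its y axis.
readings = [(0.1, 9.7, 0.3), (0.0, 9.8, 0.2), (0.2, 9.8, 0.1)]
print(orientation(lowpass(readings)[-1]))  # portrait
```

Once orientation is known, accelerometer readings can be rotated into a consistent frame before being fed to any learning algorithm, which is the clarifying effect Shiebler refers to.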

Press here for signal-to-noise

The process of preparing data for automated analysis is undergoing some change, as deep learning gains ground, according to Deep Learning Summit attendee Sean Cantrell, a senior consultant and data scientist at Excella Consulting Inc. in Arlington, Va. There is a temptation to leave more of the data sorting to the deep learning engine, he suggested.

"One of the appeals of neural networks to a lot of people seems to be that they help mitigate the necessity for properly engineering features, or attempting to enhance signal-to-noise [ratios]," Cantrell said.

He said there is still much merit in "properly grooming some data sets before doing deep learning on them." As an example, Cantrell pointed to TrueMotion's work to determine the relative orientation of a user's phone with respect to his car.

The nature of problems varies, Cantrell emphasized, and some require more data preparation than others. Deep learning that relies only on untagged data gathered for unsupervised learning may not always be best.

"Many problems really require supervised learning," he said. This supervised learning produces what are called "labeled data sets" where data is labeled, or tagged, as either good or bad.

"Even with an abundance of data all around us, tagging the sets appropriately can really be a challenge at times," he said.

The art and science of deep learning

A key question to ask when preparing data for deep learning is whether the data can fit in available memory, according to Sam Zimmerman, CTO and co-founder of Freebird, a mobile booking service in Cambridge, Mass., that offers tools for tracking and responding to flight delays or cancellations and other travel issues.

Zimmerman spoke at the Deep Learning Summit about Freebird's use of deep learning to estimate the risk involved in ensuring that travelers can quickly rebook delayed or canceled flights without paying extra. In an interview, he said the way data engineers answer the question of available memory leads them either to engineer the system to run on a single computer or to run it in a distributed manner.
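The memory question Zimmerman raises can often be settled with a back-of-envelope estimate before any system is built. The sketch below compares the size of a dense feature matrix against available RAM; the row counts, feature counts and the 16 GiB budget are assumptions for illustration.

```python
# Back-of-envelope check of whether a data set fits in memory,
# assuming a dense matrix of 8-byte (float64) values. The 16 GiB
# RAM budget is an assumption for illustration.

def fits_in_memory(n_rows, n_features, bytes_per_value=8,
                   available_bytes=16 * 1024**3):
    """Rough test: dense matrix footprint vs. available RAM."""
    needed = n_rows * n_features * bytes_per_value
    return needed <= available_bytes

print(fits_in_memory(10_000_000, 100))     # 8 GB of floats -> True
print(fits_in_memory(1_000_000_000, 100))  # 800 GB -> False
```

A "False" answer is the point at which, in Zimmerman's framing, engineers start weighing distributed processing against sampling.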

Data sampling, which some practitioners see as contrary to the spirit of deep learning, can be useful for curbing memory requirements. Some problems can be handled well through sampling, Zimmerman said, and using too much data can sometimes force the engineer into a more difficult programming environment.
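One standard way to sample while keeping memory bounded is reservoir sampling, which maintains a fixed-size uniform random sample of a stream of any length. This is a generic technique, not one Zimmerman named; the sample size is an assumption for illustration.

```python
# Sketch: reservoir sampling keeps a fixed-size uniform random sample
# of a stream, so memory stays bounded no matter how large the raw
# data set is. The sample size k is an assumption for illustration.
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Sample 5 records from a million-row stream without loading it all.
sample = reservoir_sample(range(1_000_000), k=5)
print(len(sample))  # 5
```

Because only `k` items are held at once, the same code works whether the stream has a thousand rows or a billion, which is exactly the memory-curbing property sampling offers.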

Preparing data for deep learning today, Zimmerman said, is a combination of art and science. "It is certainly more of a frontier experience than the one you find with SQL databases," he mused.

Next Steps

Discover the difference: Machine learning vs. deep learning

Learn about infrastructure for deep learning applications

Find out how to navigate deep learning data prep hurdles
