As machine learning moves closer to production use in mainstream organizations, more and more is being said about the need for large amounts of data to train and run machine learning models. While some technical changes are also required to adapt to the processing demands of machine learning, one thing is increasingly clear: Good data quality is paramount.
Data quality is a core part of data management -- and of making the results of analytics applications believable. We've been talking about it for as long as we've had an IT industry; I worked on data quality and management issues on mainframes in the 1980s and Unix servers in the 1990s.
User experience is important, and being able to explain analytical findings to business users is a must. The ongoing advances in analytics capabilities continue to amaze. But for all the talk about data quality, it often still isn't given equal footing with those other aspects. That's sad -- and potentially hazardous, particularly with machine learning data.
More data to deal with
Machine learning is still very early in the adoption cycle. You may ask whether its requirement for large data sets is really new, since many existing analytics systems use them, too. That's one of the reasons why I laugh at the phrase big data. Enterprises have always dealt with a lot of data, and calling one specific time period the big data era is like calling one specific period of art modern. Where do you go from there?
However, the reality is that deep learning, an advanced offshoot of machine learning, does need larger data sets than conventional analytics applications.
In deterministic analytics and statistical processing, the relationships between data elements are fixed, and fixed expectations for analyzing the data are codified in algorithms. The power of deep learning is that it can build and refine algorithms on its own as it learns from the data. Doing so requires a data set large enough to provide the variation necessary for algorithms to evolve accurately.
Given the increased volume of data needed and the evolutionary methods used, good data quality has become even more important than it already was. Before a company starts adding machine learning or deep learning applications, it needs to understand -- and improve -- its data. Which systems provide data, how it can be accessed and how data sets can be combined for analysis are questions that must be addressed before you seriously investigate machine learning systems.
Data cleanliness is good in machine learning
You also need a solid process for cleansing data. For machine learning and deep learning models to accurately learn, the data sets being used to train them must be trustworthy. That elevates the need to provide clean data to machine learning systems.
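To make the idea of a cleansing process concrete, here's a minimal sketch in plain Python of the kinds of steps such a process typically includes -- normalizing values, validating required fields and removing duplicates -- before records reach a training pipeline. The record fields (`name`, `email`) and the sample data are hypothetical, chosen only for illustration.

```python
def cleanse(records):
    """Normalize, validate and de-duplicate a list of record dicts."""
    seen = set()
    clean = []
    for rec in records:
        # Normalize: trim whitespace and standardize case.
        name = rec.get("name", "").strip().title()
        email = rec.get("email", "").strip().lower()
        # Validate: skip records missing required fields.
        if not name or "@" not in email:
            continue
        # De-duplicate on a key that identifies the same real-world entity.
        key = (name, email)
        if key in seen:
            continue
        seen.add(key)
        clean.append({"name": name, "email": email})
    return clean

raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},  # duplicate
    {"name": "", "email": "no-name@example.com"},          # missing name
]
print(cleanse(raw))
# → [{'name': 'Ada Lovelace', 'email': 'ada@example.com'}]
```

A production cleansing pipeline would add fuzzy matching, reference-data lookups and audit logging, but the core loop -- normalize, validate, de-duplicate -- is the same.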
In turn, machine learning algorithms can aid in the data quality process by checking large data sets for matching issues, anomalies and other errors.
Back in the late 1990s, I worked for a company that built analytics software to look for patterns in data; the work was done on a Unix box with an algorithm developed as part of a Ph.D. thesis. The company was using statistical processes to better understand both individual data points and patterns within a data set.
The product had early success, but the company failed because of poor sales decisions. Still, in its short life, it demonstrated the power of using analytics to solve data quality issues. That's becoming more feasible now as machine learning tools mature.
Build a solid foundation before buying in
The early state of machine learning adoption shouldn't prevent you from talking to technology vendors that promise to add machine learning to your analytics toolbox. But before you move beyond the discussion stage, some preparation is needed to ensure you have good data quality for machine learning uses.
Catalog your existing analytics applications and look at the data that they use and create. Then, consider how to increase the accuracy of that data to make it usable in machine learning and deep learning applications without concerns about its quality and consistency. The benefits of doing so are twofold.
First, resources spent on data quality will help improve data throughout your information infrastructure, not just in machine learning applications.
Second, good data quality is critical for the increasingly regulated data environment. The European Union's General Data Protection Regulation, popularly known as GDPR, is the most visible aspect of the growing need to better understand, secure, track and control corporate data.
The machine learning bandwagon has barely left the stables, but now is the time to get your data quality ducks in a row -- to mangle my animal-related metaphors. The technology has such strong potential; starting to build the required data foundation is imperative so you can take full advantage of machine learning tools when you're ready to adopt them.
Data matters, and it will continue to do so; the same goes for good data quality. Listen to the promise of machine learning, but focus on preparing your data and ensuring that it's up to the task.