As machine learning moves closer to production use in mainstream organizations, more and more is being said about the need for large amounts of data to train and run machine learning models. While some technical changes are also required to adapt to the processing demands of machine learning, one thing is increasingly clear: Good data quality is paramount.
Data quality is a core part of data management -- and of making the results of analytics applications believable. We've been talking about it for as long as we've had an IT industry; I worked on data quality and management issues on mainframes in the 1980s and Unix servers in the 1990s.
User experience is important, and being able to explain analytical findings to business users is a must. The ongoing advances in analytics capabilities continue to amaze. But for all the talk about data quality, it often still isn't given equal footing with those other aspects. That's sad -- and potentially hazardous, particularly with machine learning data.
More data to deal with
Machine learning is still very early in the adoption cycle. You may ask whether its requirement for large data sets is really new, since many existing analytics systems use them, too. That's one of the reasons why I laugh at the phrase big data. Enterprises have always dealt with a lot of data, and calling one specific time period the big data era is like calling one specific period of art modern. Where do you go from there?
However, the reality is that deep learning, an advanced offshoot of machine learning, does need larger data sets than conventional analytics applications.
In deterministic analytics and statistical processing, the relationships between data elements are fixed, and fixed expectations for analyzing the data are codified in algorithms. The power of deep learning is that it can build and refine algorithms on its own as it learns from the data. Doing so requires a data set large enough to provide the variation necessary for algorithms to evolve accurately.
Given the increased volume of data needed and the evolutionary methods used, good data quality has become even more important than it already was. Before a company starts adding machine learning or deep learning applications, it needs to understand -- and improve -- its data. Which systems provide data, how it can be accessed and how data sets can be combined for analysis are questions that must be addressed before you seriously investigate machine learning systems.
Data cleanliness is good in machine learning
You also need a solid process for cleansing data. For machine learning and deep learning models to accurately learn, the data sets being used to train them must be trustworthy. That elevates the need to provide clean data to machine learning systems.
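To make the idea of a cleansing process concrete, here's a minimal sketch in plain Python of the kinds of steps such a process typically includes -- normalizing values, validating required fields and removing duplicates -- before records reach a training pipeline. The record fields (`name`, `email`) and the sample data are hypothetical, chosen only for illustration.

```python
def cleanse(records):
    """Normalize, validate and de-duplicate a list of record dicts."""
    seen = set()
    clean = []
    for rec in records:
        # Normalize: trim whitespace and standardize case.
        name = rec.get("name", "").strip().title()
        email = rec.get("email", "").strip().lower()
        # Validate: skip records missing required fields.
        if not name or "@" not in email:
            continue
        # De-duplicate on a key that identifies the same real-world entity.
        key = (name, email)
        if key in seen:
            continue
        seen.add(key)
        clean.append({"name": name, "email": email})
    return clean

raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},  # duplicate
    {"name": "", "email": "no-name@example.com"},          # missing name
]
print(cleanse(raw))
# → [{'name': 'Ada Lovelace', 'email': 'ada@example.com'}]
```

A production cleansing pipeline would add fuzzy matching, reference-data lookups and audit logging, but the core loop -- normalize, validate, de-duplicate -- is the same.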
In turn, machine learning algorithms can aid in the data quality process by checking large data sets for matching issues, anomalies and other errors.
Back in the late 1990s, I worked for a company that built analytics software to look for patterns in data; the work was done on a Unix box with an algorithm developed as part of a Ph.D. thesis. The company was using statistical processes to better understand both individual data points and patterns within a data set.
The product had early success, but the company failed because of poor sales decisions. Still, in its short life, it demonstrated the power of using analytics to solve data quality issues. That's becoming more feasible now as machine learning tools mature.
Build a solid foundation before buying in
The early state of machine learning adoption shouldn't prevent you from talking to technology vendors that promise to add machine learning to your analytics toolbox. But before you move beyond the discussion stage, some preparation is needed to ensure you have good data quality for machine learning uses.
Catalog your existing analytics applications and look at the data that they use and create. Then, consider how to increase the accuracy of that data to make it usable in machine learning and deep learning applications without concerns about its quality and consistency. The benefits of doing so are twofold.
First, resources spent on data quality will help improve data throughout your information infrastructure, not just in machine learning applications.
Second, good data quality is critical for the increasingly regulated data environment. The European Union's General Data Protection Regulation, popularly known as GDPR, is the most visible aspect of the growing need to better understand, secure, track and control corporate data.
The machine learning bandwagon has barely left the stables, but now is the time to get your data quality ducks in a row -- to mangle my animal-related metaphors. The technology has such strong potential; starting to build the required data foundation is imperative so you can take full advantage of machine learning tools when you're ready to adopt them.
Data matters, and it will continue to do so; the same goes for good data quality. Listen to the promise of machine learning, but focus on preparing your data and ensuring that it's up to the task.