Data quality assessment helps identify, fix data quality problems
Data quality problems can't be fixed until they are identified. In this section of the tactical data quality guide, you'll find a podcast and interview transcript with author and data quality expert Arkady Maydanchik. In this Q/A with SearchDataManagement.com, Maydanchik reveals how a data quality assessment can help pinpoint data quality problems and how some companies are successfully managing data quality initiatives.
Don't miss the other installments in this data quality management guide:
- Managing data quality efforts during a recession
- Trends in the data quality market
- Avoiding data quality pitfalls and using data quality tools for discovering new opportunities
- Q/A: Identifying data quality problems with a data quality assessment
- FAQ: Best practices/tips for data quality
What's the point of analyzing incorrect, out-of-date or mislabeled data? Actually, that's a trick question. There really is no point, which is why achieving comprehensive data quality management – that is, ensuring the accuracy, reliability and effectiveness of data – and overcoming data quality problems is so critical to business success. But data quality is still an emerging field, one often overlooked by companies and organizations. To help understand how to start a data quality program, SearchDataManagement.com caught up with Arkady Maydanchik, a member of the Data Quality Group and author of the recent book Data Quality Assessment. The 20-minute podcast is appropriate for both business and IT professionals.
About the speaker: Arkady Maydanchik is a recognized practitioner, author, and educator in the field of data quality and information integration. His data quality management methodology was used to provide data quality services to numerous Fortune 500 companies. Arkady is a frequent speaker at various conferences and seminars, author of the aforementioned Data Quality Assessment book, and a contributor to many journals and on-line publications.
Editor's note: The following Q/A is an edited transcript based on the podcast with Arkady Maydanchik. Please contact editor@searchdatamanagement.com with any questions. SearchDataManagement.com: Hello and welcome to a SearchDataManagement.com podcast. I'm Jeff Kelly, news editor. What's the point of analyzing incorrect, out-of-date or mislabeled data? Actually, that's a trick question. There really is no point, which is why achieving comprehensive data quality — that is, ensuring the accuracy, reliability and effectiveness of data – and overcoming data quality problems is so critical to business success. But data quality is still an emerging field, one often overlooked by companies and organizations. To help understand how to start a data quality program, SearchDataManagement.com caught up with Arkady Maydanchik, a member of the Data Quality Group and author of the recent book, Data Quality Assessment. But before we get started, here's a little more background on our guest speaker. He is a recognized practitioner, author and educator in the field of data quality and information integration. His data quality management methodology was used to provide data quality services to numerous Fortune 500 companies. Arkady is a frequent speaker at various conferences and seminars, author of the aforementioned Data Quality Assessment book, and a contributor to many journals and online publications. So Arkady, you said that data quality assessment is the starting point for any data quality program. Can you explain to our listeners what a data quality assessment is and why it is so important when starting out on any data quality program? Maydanchik: Sure. Basically, we've been talking about data quality for probably the last 10 or 15 years, and back 10 years ago, it wasn't a very popular topic. Now it seems like it is a very popular topic. Everybody wants to talk about it; everybody says, "Our data is really bad." 
A lot of people are saying, "Hey, let's try to do something about it. Let's try to fix the problems. Let's try to improve the processes." The reality is you can't fix the problem until you know what it is. That's actually one of the reasons why, even though we've been talking about data quality for 10 or 12 years, not much has really been done. Frankly, data quality deteriorated over the last 10 years quite a bit. Data quality assessment is the process of taking your existing data and systematically going through it and identifying what's wrong with it, [figuring out] where the data problems are, and then connecting that to your real business process and understanding how the data problems impact your processes and what is the cost of your data quality problems. Once you have done that, you can do a lot of things. Then you can say, "OK, we understand now how much it costs, so now we can possibly do a return on investment (ROI) analysis and figure out if it makes sense for us to invest money into fixing some problems." Once you know where the problems are, you can try to analyze where they came from — do a root-cause analysis and try to prevent problems from happening in the future. All this relies on the fact that you know where your problems are and what they cost your organization. It's an important step for both the technical reasons and the political reasons. Political reasons are also important. Most organizations today don't really have data quality departments. Most of them are taking the first step, and I oftentimes have people come to me and say, "Hey, you know we think our data is bad; we want to start a data quality management program, but we need some way of selling it to our management." They say, "Hey, is there any template, is there any way we can sell this to our management and tell them, prove to them, that we need a data quality management program?" 
The truth is you can't just go and say, "Hey, everybody is talking about it, and there are many reasons why you need to do data quality management." What you need to do is show how it affects your specific organization. Until you have that knowledge, you really can't get any budget or start any programs. So, both on the side of starting a program and on the side of getting anything done, you need to have a specific, systematic knowledge of what is wrong with your data and how it impacts your business. SearchDataManagement.com: You've also talked about data quality rules. What role do they play, and why are they so crucial to the outcome of any data quality program, as you wrote in your recent book? Maydanchik: Now, we said we want to find the problems with our data. Now, fundamentally, there are three approaches you can follow. One is what we'll call a complete manual validation. Basically, you take every piece of data and compare it with some trusted source. Maybe it's a paper file or maybe it's checking with some real people who have the knowledge of that information. Now, that is totally impractical because our databases are made of millions or billions of data pieces, so we really cannot do it. Another way is to take a page from conventional quality assessment. Data quality assessment is something reasonably new. But quality assessment has always been done in many industries. Automobiles go through quality assessment, vacuum cleaners, airplanes, every product that is made out there goes through quality assessment. So the second approach is to say, "Let's try and pick up a sample of the data. So we have millions of records. Instead of doing all of them, we'll take a few hundred and analyze them. Based on their quality, we'll try and project how good the quality of the rest of the data is."
Well, that also doesn't work because it can give us an idea of how bad the data is, but it can never point us to any specific data elements that are wrong because we'll only know that about our sample. The beauty of the data is it describes real, concrete objects. It describes people. It describes products. It describes customers. It describes processes. And they have many attributes and characteristics that are all intertwined and interrelated. And so the attributes of the data itself are also tied by millions of different relationships. Those are the constraints on the valid value. Those are the relationships across data elements. The relationships in the order certain events can occur, on the condition in which events can occur, the timing of when certain measurements are made. So there are hundreds and hundreds and hundreds and for bigger databases thousands, of different constraints and those are the ones we call data quality rules. Constraints that we can apply to the data -- and anytime data violates those constraints then there's something wrong with it. Now, the beauty of this is that those constraints can be implemented in computer programs. We can run all of our data no matter how large our database. We can run all of the data through these constraints, and basically, within the reasonably short time frame of the project, we can find all of the inconsistencies. It's not going to be 100%, but we can find probably 95%, 98% of all potential data problems. So, data quality rules are at the heart of this because they are a tool, they're the main tool of a data quality professional. SearchDataManagement.com: How do you ensure that your data quality assessment is comprehensive and complete? Maydanchik: That's a very good question because it's quite easy to design a bunch of rules, but the challenge, the difficulty is making sure we really did find all of the problems, and also the challenge is to make sure the discrepancies we find are true problems. 
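The idea of data quality rules as constraints that a program can run over every record can be illustrated with a minimal Python sketch. The record layout, the attribute names and the three rules below are hypothetical, chosen only to show one constraint of each kind mentioned in the answer (valid values, a relationship across data elements, and the order of events); they are not taken from the book.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

Record = dict  # one row of data, keyed by attribute name (assumed layout)


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record satisfies the constraint


# Hypothetical rules, one per kind of constraint mentioned in the interview.
RULES = [
    # Constraint on the valid values of a single attribute.
    Rule("gender_is_M_or_F", lambda r: r["gender"] in {"M", "F"}),
    # Relationship across data elements: nobody is hired before being born.
    Rule("hired_after_birth", lambda r: r["hire_date"] > r["birth_date"]),
    # Order in which events can occur: termination cannot precede hiring.
    Rule(
        "termination_after_hire",
        lambda r: r["termination_date"] is None
        or r["termination_date"] >= r["hire_date"],
    ),
]


def find_violations(records, rules=RULES):
    """Run every record through every rule; return (record id, rule name) pairs that fail."""
    return [
        (r["id"], rule.name)
        for r in records
        for rule in rules
        if not rule.check(r)
    ]


records = [
    {"id": 1, "gender": "F", "birth_date": date(1970, 5, 1),
     "hire_date": date(1995, 3, 1), "termination_date": None},
    {"id": 2, "gender": "X", "birth_date": date(1980, 1, 1),
     "hire_date": date(1979, 6, 1), "termination_date": None},
    {"id": 3, "gender": "M", "birth_date": date(1965, 7, 4),
     "hire_date": date(1990, 1, 15), "termination_date": date(1989, 12, 31)},
]

violations = find_violations(records)
for record_id, rule_name in violations:
    print(record_id, rule_name)
```

Because each rule is just a named check, the same loop covers the whole database regardless of size, and the resulting list of violations is the raw material for the error reports discussed later in the interview.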
The kind of analogy I usually give is if I was asked to be an umpire in a Major League Baseball (MLB) game. Now, if you look at the rule book, it has 136 or 137 rules. Now let's say I learn all but 10% of them, so I learn 120 rules. The odds that I'm going to cause a riot during the games because I missed some play are pretty high. Just knowing 90% of the rules doesn't really do the job. It's the same problem with the assessment: because there are so many different constraints on the data you can think of, it's not hard to quickly write a short list of scores or hundreds of them. But how do we know we're finding them all? Here is where the other approach, sampling, comes to help. What we do is first come up with all these rules, and there's a very systematic approach to that in my book. I dedicate half of the book to systematic processes of how to identify all of the different rules. Then what we do is take a sample of the data and we use the true data expert, somebody who has a knowledge of and access to the data sources. And we ask that expert to validate a sample of the data, and then we compare the findings with the results of the rules. Basically, the objective is, even though we're using the rules, we want to find all the same problems that the data expert would've found using the objective trusted data approach. So, once we match what the data expert finds with what the rules find, we'll see whether the rules have shortcomings. There might be some errors that the expert found that our rules didn't. Well, then we should try to find what other type of constraints and rules we can design that'll find those errors. On the other hand, we can say that there are some discrepancies the rules indicate that the expert believes are really not errors. Now the question is: How can we refine and fine-tune the rules to make sure they really don't catch those false positives?
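The comparison between what the rules flag and what a data expert finds in a validated sample can be sketched in a few lines of Python. This is only an illustration under simple assumptions: records are identified by hypothetical integer IDs, and both the rules and the expert produce a set of IDs they consider erroneous.

```python
def compare_with_expert(sample_ids, rule_flagged, expert_flagged):
    """Compare rule results with expert validation, restricted to the sample."""
    sample = set(sample_ids)
    # Only records the expert actually reviewed can be compared.
    rules = set(rule_flagged) & sample
    expert = set(expert_flagged) & sample
    return {
        # Errors found by both the rules and the expert.
        "confirmed": sorted(rules & expert),
        # Errors the expert found that no rule caught: new rules are needed.
        "missed_by_rules": sorted(expert - rules),
        # Discrepancies the rules reported that the expert says are not errors:
        # the rules need fine-tuning.
        "false_positives": sorted(rules - expert),
    }


sample_ids = [1, 2, 3, 4, 5, 6]
rule_flagged = {2, 3, 5, 9}   # record 9 is outside the sample and is ignored
expert_flagged = {2, 4, 5}

result = compare_with_expert(sample_ids, rule_flagged, expert_flagged)
print(result)
```

In this made-up sample, record 4 would prompt a search for a missing rule and record 3 would prompt a review of whichever rule flagged it.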
So really the process of fine-tuning the rules is based on comparing the results of the automated assessments with a sampling validation of the data by data experts. SearchDataManagement.com: We've heard a lot about data quality scorecards. In your own words, what really is a data quality scorecard and what role does it play? How is it developed and how does one work in the real world? Maydanchik: Data quality scorecards are really the final product of the data quality assessments. It's a common mistake that a lot of people will think the final product of data quality assessments is a bunch of listings with errors. Say we have a rule that says the gender attribute must have the value M or F, and we're going to try to find a list of all records that have other values. Those are the errors. Now if we have 1,000 rules, we're going to have 1,000 error listings. The problem is you're going to end up with listings that have thousands or tens of thousands of lines, but that doesn't really tell you how good or how bad your data is at the higher level. You really can't translate that into real dollars, you can't really translate it into, "How does it impact our business process? How much does it cost our organization?" That's where data quality scorecards come into play. It's a hierarchy. On top of it are a few meaningful numbers that we call aggregate scores. Each of them ties the data quality to some specific data use. So, let's say you have a process that uses a certain part of your data, and let's say we ask ourselves, "Given the existing data, how many times, how often is the process going to fail? What fraction of the data records relevant to this specific process is correct?" Once we find all the errors from the error reports, we narrow them down to the target subsets that are relevant to this business purpose. Then we calculate the fraction of good records among those that are relevant.
So then we can say, "For this specific process, data quality is 95%, and for another process, data quality is 90%, and for a third process or for another use, data quality is 85%." We can differentiate it by business processes by the different groups that use the data and in many other ways. Once we have those aggregate scores, we can basically translate that into the true costs of the business. We can say, "OK, let's see what the significance of any specific type of data error is." Let's say we have a certain kind of data error and it costs us three hours of rework. OK, that translates directly into costs, and now this 95% number can translate it into something meaningful. That's the top level of data quality scorecards. It tells you what the data quality is for different business purposes or different business uses. Of course, below that we have a hierarchy, so we have an ability to drill down and say, "OK, so let's say for this specific business use, data quality is 95%. Well, what's really contributing to that?" Maybe there are 10 different data elements or 100 different data elements that are useful for this purpose. Which of them really contributed to most of the errors? So we can say that data in one of the data elements is 99% and the other one is maybe 89%. That helps us prioritize if we decide to try to improve the data. That helps us prioritize and say, "OK, if we want to improve this number from 95% to 97%, where should we start?" And then the next layer from there is that you can drill down and get into the specific error reports, and that's when you're really looking into any specific data errors and trying to fix them. So scorecards are a pyramid that provides both the high-level numbers that'll tell you, overall, how good or how bad the data is and how it impacts your business. And on the bottom, the atomic level of data quality that'll tell you if any specific records are good or bad. And then, obviously, there are layers in between that. 
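The scorecard pyramid described above (an aggregate score per business use, a drill-down by data element, and a translation into cost) can be sketched as follows. This is a minimal illustration, not the author's scorecard design: the error report format, the record IDs and the hourly rate are all assumptions, and only the three hours of rework per error comes from the example in the interview.

```python
from collections import Counter


def aggregate_score(relevant_ids, errors):
    """Fraction of records relevant to one business use that have no errors.

    errors is an iterable of (record_id, data_element) pairs taken from
    the error reports (an assumed format).
    """
    relevant = set(relevant_ids)
    bad = {record_id for record_id, _ in errors if record_id in relevant}
    return (len(relevant) - len(bad)) / len(relevant)


def errors_by_element(relevant_ids, errors):
    """Drill-down: how many errors each data element contributes."""
    relevant = set(relevant_ids)
    return Counter(elem for record_id, elem in errors if record_id in relevant)


def rework_cost(relevant_ids, errors, hours_per_error, hourly_rate):
    """Translate errors into cost, assuming every error needs the same rework."""
    relevant = set(relevant_ids)
    n_errors = sum(1 for record_id, _ in errors if record_id in relevant)
    return n_errors * hours_per_error * hourly_rate


# Hypothetical inputs: 20 records feed this business process.
relevant_ids = range(1, 21)
errors = [
    (3, "hire_date"),
    (7, "hire_date"),
    (7, "gender"),
    (25, "gender"),  # record 25 is not used by this process, so it is ignored
]

score = aggregate_score(relevant_ids, errors)
by_element = errors_by_element(relevant_ids, errors)
cost = rework_cost(relevant_ids, errors, hours_per_error=3, hourly_rate=50)

print(f"aggregate score: {score:.0%}")
print("errors by element:", by_element.most_common())
print("estimated rework cost:", cost)
```

Here two of the 20 relevant records contain errors, so the aggregate score for this use is 90%, and the drill-down points at hire_date as the first element to fix. A real scorecard would likely weight error types differently rather than charging every error the same rework time.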
SearchDataManagement.com: What role in a data quality program does the metadata warehouse play and, ultimately, how does it help improve data quality? Maydanchik: The quality of the data is in direct correlation to the quality of the metadata. Let's start with a basic sentence. In a database, we have hundreds of different attributes and each has different meanings, all with different codes that have different meanings. The codes' meanings change over time and can be different for different subsets of the data. Usually, the metadata, which is basically data dictionaries, data models, data catalogues … the metadata, unfortunately, is usually very much out of sync with reality. It's incorrect, it's incomplete, it's obsolete. Organizations usually lack the discipline and tools and staff to ensure that the metadata -- the description of the data -- is really correct and up-to-date. Now, unfortunately what happens is that the metadata somehow tends to drag the quality of the data down. If the metadata is not accurate, inevitably, over time, the data quality suffers. This happens a lot of times during conversions, any system upgrades and conversion exercises, because usually the conversion mappings are based on the metadata, and if it's wrong, then we end up with wrong data. Data users suffer because people assume that they have something or it means something which it doesn't. And data quality efforts suffer because even when we talk about data quality assessment, we design the rules, and the rules are based on an understanding of what the data is, so if we don't have good metadata, our rules are going to be incorrect and our assessment is going to be incorrect. So metadata is really very important, and the reality is that we have many different tools to create and collect metadata, we can do data profiling, and we can gather lots of metadata. 
Also, the assessment itself produces lots of metadata -- it produces listings of errors and it produces rules; it's a rules catalog. So there is lots of metadata. You'll find, actually, that the volume of metadata sometimes approaches the volume of the data itself, or at least the complexity, so if you cannot find a way to efficiently organize your metadata, then you can collect a lot of it, but you cannot use the information and it becomes useless. So the challenge is a similar challenge to 20 years ago. Our technology made a huge leap forward, and we gained the ability to gather large volumes of data, to organize and process large volumes of data with relational databases and later data warehousing. But the data quality started suffering, since we didn't have the discipline to think about the quality of the data. So now, kind of the same thing with metadata, we have the ability to collect a lot of metadata through data profiling and data quality assessments, but if we don't plan how to organize all of the metadata we produce, then the same thing happens -- the quality of it suffers. If the metadata is no good, the data is no good. SearchDataManagement.com: I think a good way to round out our conversation is to get your take on the data quality tools and technologies available on the market. You mentioned that it is still a pretty young market. What would be your advice to companies just starting out with data quality? Maydanchik: That's a tough question. There are many vendors of data quality and data profiling tools out there. Most of the tools focus on several reasonably narrow areas of data quality management – in those areas, the tools have made tremendous progress and are really very good. For example, if you need a tool for records matching and deduplication, that's something that's been out on the market for seven, eight, 10 years. And there are great tools for that.
If you are looking for tools for column profiling, or attribute profiling -- things that are going to take attributes one at a time in your database and go through all of the records and give you distributions of values, frequency charts, minimums, maximums, statistical characteristics -- there are phenomenal tools. They have come a long way, and they have good interfaces and they do a lot of things. Unfortunately, at the same time, the tools focus on these few, well-defined, very narrow types of problems, and obviously that was where it was easiest to start, because everybody had the same problem there. But where we stand now is that everyone is starting to deal with data quality across all kinds of data, and as we want to do a more comprehensive data quality assessment -- not just look at names and addresses, not just look at customer data, but look at financial data, HR data, payroll data, insurance claims data -- there's really a very big gap between what tools have and what people need. I know that a couple of years ago, in a survey, something to the tune of more than 80% of respondents said they were basically custom coding. And I don't think it has changed much, because as people are looking at data quality problems, they probably find that the tools can only give them limited mileage. So the bottom line is there is a big need there, on the market, and at this point overall the data quality needs of the customers are not met by the existing vendors. But in certain niches, the tools have made great progress. SearchDataManagement.com: Does this mean that companies that want to improve their data quality are going to have to do a lot of this work in-house? Maydanchik: Yes, that's definitely the case. It's kind of a Catch-22.
I talk to a lot of vendors, and the reality is there isn't enough demand for some advanced tools for them to make this a serious investment, because the market is still young and most companies are just starting and feeling their way through it. So vendors are not making a leap forward, and they are not creating a tool that could help companies do what they really need to do. Now part of it, of course, is that up until a few years ago, there wasn't a good place to go to learn how to do it [data quality management]. Right now, between me and a couple of other experts, with books coming out, with all of the training courses that we are teaching at various conferences, there's more and more people. I'd say that just last year I taught more than a thousand people in my classes. So I'd say we're kind of giving it a new beginning. As more people know how to do it, and what exactly they need to do, they're starting to do it [data quality management]. They are going to vendors with more pointed questions and are asking for more specific features, so I think eventually the market is going to start to pick up. SearchDataManagement.com: Thanks so much for your time today.