DataFlux demonstration of product data quality software


A high level of data quality is critical to understanding product inventory and supply chain operations. Product data must be accurate if you're to successfully conduct spend analysis and spot trends and inefficiencies. Good product data is also critical to master data management (MDM) initiatives.

In this screencast, Ron Agresta, a solutions manager for DataFlux, a subsidiary of SAS Institute based in Cary, N.C., demonstrates the vendor's Accelerator for Materials Data Classification (AMDC). AMDC helps companies improve the quality of inventory and product data, and is designed for business users, according to Agresta. It has a relatively simple interface that lets business users, those with domain expertise, manually and automatically classify product data, he said.

Read the full transcript from this video below:  

Jeff Kelly: Hello, I'm Jeff Kelly, the News Editor for Today we're joined by Ron Agresta, solutions manager at DataFlux. Ron's going to give us a demo of DataFlux's Accelerator for Materials Data Classification. Welcome, Ron, and why don't you tell us a little bit about the product first, and then we can move right into your demo.

Ron Agresta: Sounds good, thanks for having me. Okay, so today we're going to look at the Accelerator for Materials Data Classification. What it does is take product or service description data and match it to a taxonomy. That could be an industry taxonomy like UNSPSC or [eCl@ss], or it could be a taxonomy specific to your organization. What the taxonomy, or the coding process, ultimately provides is a means to categorize your data, the way data might be categorized on a retail site like Amazon or something like that. But usually the classification is done for spend analysis, so you can put your products in a hierarchy of product categories, and then you can drill in and manage what you're spending with various vendors, how you can save money, how you can optimize your spending, things like that. So it's part of a bigger process, but it's an important piece because it draws on the strengths of data matching, cleansing of data, and things like that.

So I guess I'll just dive right in. What I've already done is brought about a thousand rows of data through our application here. It can handle hundreds or thousands or millions of rows of data; I've just kept it smaller for this demo. And I've already tried to match it against the UNSPSC taxonomy. So I've got the datamart here, I'll let it load, and when it comes up what we'll see is a screen, it's really our match confirmation screen, that lets me see whether the system is matching your product description data to the taxonomy correctly. And if it didn't make a good guess, it allows you to modify the match that it makes. So you get a screen here that says we evaluated 999 rows and managed to confirm 836 of these automatically, which is pretty good. We've got a match threshold that we set. As we're matching we also score things, and if a match scores above that level then the system is reliable at that point and you probably do not have to review those matches. Anything below that threshold ends up here on our unconfirmed screen that we'll see in a second, and that lets me go and see, out of the educated guesses that the system makes, what matches we should actually make.

Jeff Kelly: Now how do you come up with a number, a threshold that's acceptable to you?

Ron Agresta: Okay, so a couple of things go on. The score itself is generated by the number of phrases and words that match, how closely they match, things like that, and how many times they match the taxonomy. The threshold is usually set at 100 to start out with, and then what organizations usually do is run the match processes over time. They may feel that 105 or 110 is a better threshold, or maybe it's 90 or 85. They get comfort levels for how reliable they think the system is for matching. So it improves with use, I guess, over several iterations.
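The threshold step described above can be pictured with a small sketch. This is only an illustration, not DataFlux's implementation: the `CandidateMatch` type, the `split_by_threshold` function, the sample scores and the category labels are all hypothetical, and treating a score equal to the threshold as confirmed is an assumption.

```python
from typing import NamedTuple


class CandidateMatch(NamedTuple):
    description: str  # incoming product description
    category: str     # taxonomy category proposed by the matcher (illustrative)
    score: float      # match score; the values below are made up


def split_by_threshold(matches, threshold=100.0):
    """Return (confirmed, unconfirmed) lists based on the score threshold.

    Assumption: a score equal to the threshold counts as confirmed.
    """
    confirmed = [m for m in matches if m.score >= threshold]
    unconfirmed = [m for m in matches if m.score < threshold]
    return confirmed, unconfirmed


# Hypothetical sample data, loosely based on items named in the demo.
matches = [
    CandidateMatch("tracing paper", "Tracing paper", 118.0),
    CandidateMatch("plug, one inch", "Electrical plugs", 72.0),
    CandidateMatch("cover", "Hardware components", 41.0),
]

confirmed, unconfirmed = split_by_threshold(matches, threshold=100.0)
print(len(confirmed), "auto-confirmed;", len(unconfirmed), "left for review")
```

Raising or lowering `threshold` changes how much work lands on the review screen, which is the tuning Agresta describes organizations doing over several runs.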

Jeff Kelly: Okay.

Ron Agresta: So the screen that we see first is all the unconfirmed matches, and real quickly, if you look on the far left of the screen you see the ID field. That ID represents the fact that a product description called color, which is not very descriptive, came in. And we said how many matches are here, four, so we said we think that it could be matched to one of these eight or so categories in this UNSPSC taxonomy. You can see the class and the class description fields; that's what we're actually matching to. That was pulled right in from the taxonomy. And the classes are hierarchical, so they start out at the two-digit level all the way to the left there, 44 for the first row, and then 44 has a lot of sub levels, 10, 11, 12, and so on.

And then the next level down has lots of sub levels, and so it goes. And then from product segments, class, family, I'm not exactly sure if I got that hierarchy right. It's down here at the bottom, got that totally wrong: segment, family, class, yeah, I was close. We'll go into this browser here in a second, but what I do want to show is the confirmed matches, to give you an idea of what's going on. So if we look across this, what we see is each product description that the system saw. Here there is one that came in as just tracing, which should have had paper there, but input data sometimes gets truncated and we may have to figure out that it should match tracing paper. So if you look across to class description, that's the taxonomy category that we managed to map this item to.
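To make the hierarchy concrete, here is a minimal sketch that splits an eight-digit code into its levels. It assumes the taxonomy in the demo is UNSPSC, where each level adds two digits (segment, family, class, commodity). The function name `unspsc_levels` is hypothetical, and the sample code 45111805 is the one quoted later in the demo.

```python
def unspsc_levels(code: str) -> dict:
    """Split an 8-digit UNSPSC-style code into its hierarchical prefixes."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit numeric code")
    return {
        "segment": code[:2],    # first two digits
        "family": code[:4],     # first four digits
        "class": code[:6],      # first six digits
        "commodity": code,      # full eight digits
    }


levels = unspsc_levels("45111805")
for name, prefix in levels.items():
    print(f"{name:>9}: {prefix}")
```

Because each level is a prefix of the next, drilling down the taxonomy amounts to filtering codes by a progressively longer prefix.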

You can see the scores there, lots of good scores. You can also see confirmed by, which says whether it was auto-confirmed. In the case of the second one here, tracing paper, that was something that I manually confirmed. It wasn't originally in my confirmed results, and I did some investigation and managed to figure out that it really ought to be matched to tracing paper, and I confirmed that. What probably happened is I had several options to pick from, and I as a business user went and said okay, that's a good match, I'm going to confirm that. And that's what you see in the confirmed by field there.

There are a lot of different ways we can find matches. DataFlux matching plays to the strength of data matching: data that is basically the same but has some differences we can match pretty reliably. There are also business rules, match rules, and a couple of other mechanisms that we use to try and get the best matches. All this other data is configurable per process, so if there are other pieces of information that might help the business user determine a proper match, then you can show them on the screen here. Things like supplier name give insight into how a part might need to be described or which category it might need to be matched to.

Jeff Kelly: Just quickly, you mentioned a business user can use this tool. Why is it important for business users to be able to use a tool like this as opposed to a more technical user?

Ron Agresta: That's a good question. The main reason is that a technical user, who may be good at moving data around, communicating, or hooking up your database and evaluating or generating reports, is not going to understand the actual terms they're looking at. So usually it's a domain expert, somebody that knows parts data identification in the line of business, or a business user who understands that if I see something like substrates, that's probably a particular category of product that they know just as a matter of course in doing business. So the business users are really going to be the ones to tell if this is matching correctly. And if it comes up with several possible matches, they don't need that threshold; they will be able to choose the best match much better than a technical user.

Jeff Kelly: Okay.

Ron Agresta: But really the heart of this program is looking at unconfirmed matches and trying to figure out what to match them to. If we look at some of these, like plug, one inch, when we try to match that to a category in the taxonomy, we'll probably see plug in a lot of different places. Drain plugs, [race] plugs, spark plugs, and so on. So it would be up to a business user to determine what kind of plug it is, if they have enough information. And that's why these other additional fields become very useful: what was the supplier name? What division is it from? If it was in the division that works in electronics then it might be a particular kind of plug, and the business user will very quickly be able to confirm these matches.

So to confirm a match, you can right click on whatever the preferred categorization is, let's say in this case it's an electrical plug, and we'll confirm that match. What that does is tell the database in our datamart that whenever I see that term, I now know it is an electrical plug. And what will happen once all of the data is categorized is that it will probably be moved into a data warehouse, maybe integrated with some ERP applications, to help people start to analyze their spend data. In fact DataFlux has another component that does just that, a spend analysis piece: how much am I spending on what. And this is a critical piece of that larger process.

But there are some other tools available here to help me figure out the match, things like a web search. For example, I might see something like Panasonic AJ-250. That doesn't mean anything to me ordinarily, but if I do a search on that particular model number it may come back as video editor or something like that. In this case I think that's what it is, a video editing machine. You can see we didn't know what that was because there weren't really enough clues there beyond just the name, so what we can do is use some of the other tools we have here. For example, the quick code tool lets me look up something like video editor and figure out what the code might be for video editor.

In this case you can see 45111805, so what I can do, let's see if I can do this a little bit better, use a slightly different way to do it. I can browse my taxonomy for video editor, here's my commodity description here, and I can drag that up to that particular line, and now I've categorized that piece of data there. I can do a lot of different things with sorting all this data. I can look at all of the items confirmed, and I can set filters, so getting everything that contains the word circuit, which lets me categorize those things in that fashion. And I can also just do general browsing as well. So, for example, maybe I'm looking at something like this cover, and I just go through the taxonomy itself, and if I'm familiar with the business I can just start to drill down right into the categories that are usable for me.

So I might go down to hardware components, all the way down to the lowest level, to find something that might be appropriate for this cover. But cover is pretty general, and there's not a whole lot of information, so it's hard to make a determination in this case what the categorization ought to be. Business users who know the data really well can usually figure out what they ought to apply there. So one whole piece of this is actually assigning all these codes. We saw that 84 percent of the data was automatically assigned a code, either by our fuzzy match process or by some business rules. The business rules themselves live in a different place in the application. They're usually company specific. They help to figure out product numbers and synonyms and brand names and things like that, and they go beyond our fuzzy matching.

So fuzzy matching is good at figuring out what data should be regardless of typos, regardless of ordering, regardless of extra noise characters and things that might be in the data. Business rules start to be specific to lines of business or to an industry. We might see something like some of these chemicals, sulfate [09] is also an antibiotic, something that an ordinary fuzzy search wouldn't necessarily know, but your match rules are designed to figure those things out for you. So what the business rules really are saying is: any time I come to a product description that has some combination of these words, and they don't have to be in order and they can be in any part of the product description, then I know to assign a particular category to that product description.
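A business rule of the kind just described, where a set of words may appear in any order and anywhere in the description, can be sketched roughly as follows. This is an illustrative assumption about how such a rule could work, not the product's actual rule engine. The names `tokenize`, `RULES` and `apply_rules` are hypothetical; the "integrated circuit" label is made up, while 45111805 is the code the demo shows for video editor.

```python
import re


def tokenize(description: str) -> set:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", description.lower()))


# Hypothetical rules: (words that must all be present, category to assign).
RULES = [
    ({"integrated", "circuit"}, "Integrated circuits"),
    ({"video", "editor"}, "45111805"),
]


def apply_rules(description: str, rules=RULES):
    """Return the category of the first rule whose words all occur, else None."""
    words = tokenize(description)
    for required, category in rules:
        if required <= words:  # subset test ignores order and position
            return category
    return None


print(apply_rules("CIRCUIT, 8-pin, integrated"))
print(apply_rules("drain plug"))
```

Using a set for the description's words is what makes the rule insensitive to word order and to where in the description the words occur.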

So you can add business rules to your, excuse me, natural [tier], and you can also affect the way that some of the fuzzy matching works beyond just the things that I mentioned earlier, the noise words and things like that, or the noise characters. You can set up common words that should be combined or should be split, so that if I come across something like a ball point pen, whether it's split out as ball point or jammed together, I can still assign a correct code. I can do the same type of editing and rule building with abbreviations, brand names, synonyms, and so on.
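The combine/split and abbreviation handling described above could look something like the sketch below. Everything here is a hypothetical illustration: the `COMBINE` and `ABBREVIATIONS` tables and the `normalize` function are invented names, "ball point" comes from the example above, and "Intg" for "integrated" is the abbreviation Agresta mentions later in the demo.

```python
import re

# Hypothetical lookup tables a business user might maintain.
COMBINE = {("ball", "point"): "ballpoint"}   # adjacent words to join
ABBREVIATIONS = {"intg": "integrated"}       # abbreviation -> full word


def normalize(description: str) -> list:
    """Lowercase, expand abbreviations, then join configured word pairs."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]

    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMBINE:
            result.append(COMBINE[pair])  # replace the pair with one word
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


print(normalize("Ball Point Pen"))
print(normalize("ballpoint pen"))
print(normalize("Intg circuit"))
```

After normalization, both spellings of the pen produce the same tokens, so a single match rule or taxonomy entry can cover them.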

So what this does for us is, as business users are evaluating the matches that they get, and you saw we did pretty well on our first pass, about 84, 85 percent automatic matches, for that additional 15 percent, as the business users go through, they may see themes there. They may say okay, every time I see circuit with the word integrated, for whatever reason that wasn't coded correctly. If I build a business rule for those two things, or a match rule, then I can make sure that every time I see those two words in a product description it always matches a particular code.

And likewise, if I see integrated abbreviated as Intg, then I also want to make sure that matches to a particular category. That's how all these rules come into play. What is important to mention is that every time I classify a product description I can instruct the system to use that as a rule moving forward, so I never have to categorize the same product description twice. It improves every time: the more data the system sees, the better the matching becomes, and the better and bigger the match library becomes. And it really makes that classification process very easy. A lot of companies struggle with it because they've got five, ten, or more business users who are manually combing through data, trying to match these things up in spreadsheets. But we give them an automated process for pulling in data and assigning codes that lets them review the codes and build up some libraries around how to do the matching, and then that repeatable process gets better and better every time.

Jeff Kelly: And ultimately this supports, as you've said, kind of doing supply chain analysis and other types of analyses.

Ron Agresta: That's right, yeah, supply chain analysis is really common. It would also be common to use this to take product data and prep it for online retail sites, like I mentioned, the Amazons of the world, where you have a lot of different products that you want to put in certain buckets and let users drill through those terms. And you can assign those categories using this type of application.

Jeff Kelly: Great.

Ron Agresta: And there are things that I haven't shown, but you can also run all of your datamart creation processes from here. So if I want to create a new run of a match job or create a whole new datamart using a whole new set of rules or a different taxonomy, you can manage that all from here. And there are also some reports that you can run that will help you figure out how the system is working: what are the overall match rates across different taxonomy levels, what are common words that come into my system that maybe don't have rules built around them already. So that's what some of these [war] reports do. They go and analyze your data and try to determine how many or what types of rules you can build to support additional matches, or get your match rate higher. And so it tells me things like, when I see assembling cables, we already have a rule for that, and it assigns this code. But there are some other things, like cable and SP, that occur a fair amount together, twenty-one times they occur together, but we have no automated rule for that. That doesn't mean we won't get it using a fuzzy match, but you can explicitly tell the system, whenever you see those two together, assign it a particular category. This report helps you do those kinds of things.
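A report like the one described, listing word pairs that often occur together but have no rule yet, could be approximated with a simple co-occurrence count. This is a sketch under assumptions, not how the DataFlux report is built: `uncovered_pairs`, the sample descriptions and the existing rule pair are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def uncovered_pairs(descriptions, rule_pairs, min_count=2):
    """List word pairs that co-occur at least min_count times with no rule."""
    counts = Counter()
    for text in descriptions:
        # Distinct words per description, sorted so each pair has one form.
        words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
        counts.update(combinations(words, 2))

    covered = {tuple(sorted(pair)) for pair in rule_pairs}
    return [
        (pair, n)
        for pair, n in counts.most_common()
        if n >= min_count and pair not in covered
    ]


# Hypothetical sample data; a real run would read rows from the datamart.
descriptions = [
    "cable sp 10ft",
    "SP cable, black",
    "assembly cable",
    "cable assembly 2m",
]
rule_pairs = [("assembly", "cable")]  # pairs that already have a rule

report = uncovered_pairs(descriptions, rule_pairs)
print(report)
```

Each pair the report returns is a candidate for a new match rule, which is the use Agresta describes for the cable and SP example.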

Jeff Kelly: Oh okay.

Ron Agresta: But the whole idea here is you've got this match confirmation and rule building application that really helps you streamline your commodity coding process and really enables higher data quality and higher precision for all the downstream processes that might be built on this product description data.

Jeff Kelly: Well Ron thank you so much today.

Ron Agresta: I think that's about it.

Jeff Kelly: Okay. Well Ron thanks again, we really appreciate it.
