Integrating information quality and entity resolution

Trends in entity resolution research and applications

In this chapter excerpt from the book Entity Resolution and Information Quality by John R. Talburt, readers will learn about the integration of entity resolution and information quality processes. Readers will also learn about large-scale entity analytics and entity-based data integration.

Large-Scale Entity Analytics

Commercial IT vendors are also exploiting the advantages of high-performance computing (HPC). IBM has been a leader in the area of large-scale entity resolution (ER) and entity analytics. Its Entity Analytic Solutions (EAS) product “... scales to process hundreds of millions of entities ...” (IBM EAS, 2006) and was reported to be able to perform 2,000 identity resolutions/second against a 3-billion-row database describing 600 million resolved identities (Jonas, 2005). IBM is also developing the Extreme Analytics Platform (XAP) at its Almaden Research facility. The goal of XAP is to create business intelligence by processing and integrating large volumes of both structured and unstructured information (IBM Research, 2010).

XAP and other data-intensive computing platforms are taking advantage of two recently developed HPC technologies: Hadoop and MapReduce. Hadoop, sponsored by Apache, is an open-source project to develop software for reliable, scalable, distributed computing (Apache Hadoop, 2010). Hadoop comprises a collection of software modules written in Java that facilitate the development and implementation of distributed HPC systems. MapReduce, originally developed and patented by Google, is a programming model and an associated implementation for processing and generating large data sets (Dean, Ghemawat, 2010).
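To make the model concrete, the following Python sketch imitates the two MapReduce phases for a simple ER-style task: grouping references by a blocking key so that each group can be matched independently. It is purely illustrative and does not use the Hadoop or Google APIs; the field names and the choice of blocking key are assumptions.

    from collections import defaultdict

    def map_phase(references):
        # Map: emit a (blocking-key, reference) pair for each input reference.
        # The key (upper-cased last name plus ZIP code) is an illustrative choice.
        for ref in references:
            yield (ref["last"].upper(), ref["zip"]), ref

    def reduce_phase(pairs):
        # Shuffle/reduce: group the pairs by key; each group is a small candidate
        # set that a matcher could process independently and in parallel.
        groups = defaultdict(list)
        for key, ref in pairs:
            groups[key].append(ref)
        return {key: refs for key, refs in groups.items() if len(refs) > 1}

    refs = [
        {"last": "Smith", "zip": "72201"},
        {"last": "SMITH", "zip": "72201"},
        {"last": "Jones", "zip": "72034"},
    ]
    print(reduce_phase(map_phase(refs)))   # one candidate group: ('SMITH', '72201')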

Integration of ER and IQ

Another area of ER research is the development of methods and tools that more closely integrate ER and information quality (IQ) processes. In most ER processes, the IQ activities are part of a separate reference preparation step (ERA2) that executes prior to the actual resolution step (ERA3), similar to the flow shown in Figure 5.9 in Chapter 5. The coupling between the ERA2 and ERA3 processes is often simply a mapping of the sources into a fixed target layout. Although tools are available to assist with this mapping, it is typically a very labor-intensive, manual process. Acxiom Corporation faces this problem on a daily basis in the processing of client data and has been active in supporting research in this area through the Acxiom Laboratory for Applied Research (research.acxiom.com).

One approach to this problem is to first describe the desired ER outcome (business objective) and the reference sources available to produce it. The descriptions of the outcome and of the sources then become the input to a system that automatically generates the item mappings, ETL flows (including data cleansing) and other ER steps necessary to produce the desired outcome. Deneke, Eno, Li, et al. (2008) proposed a declarative language called DSML (Domain-Specific Modeling Language) as a means of expressing ER goals as business objectives.

In a similar vein, Gibbs (2010) has proposed a declarative approach to entity resolution that has yet to be realized as an operational system. He compares the current state of ER processing to the way databases were queried before the advent of a standardized query language (SQL). The objective of his proposed research is to formulate a declarative language that hides the implementation of common ER operations such as matching, transitive closure and blocking, which are currently carried out through procedural, ETL-style processes.

Another related line of research is the problem of layout inference -- that is, an automated process for analyzing the raw byte content of files to determine their physical and logical structure, such as record length, field layout and field content. In many large-scale ER operations, the preparation of input references can consume a disproportionate amount of the overall time and labor, much of it expended in recreating and verifying missing or untrustworthy layout metadata. Talburt, Chiang, Howe, Wu, et al. (2008) proposed a semiotic approach to the layout inference problem that not only analyzes character patterns but also generates and analyzes semantic token patterns. Semantic patterns extend the notion of the syntactic patterns provided by many data profiling systems.

In the rows of Table 7.1, three character strings are shown as both syntactic patterns and semantic patterns in the domain of customer contact information. A common syntactic pattern scheme is to replace each uppercase letter with a single token such as “A,” lowercase letter with “a” and digits with “9.” Punctuation characters and spaces are typically left embedded in the syntactic pattern.
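The syntactic scheme is straightforward to implement. The short Python sketch below applies it to two illustrative strings; the actual strings of Table 7.1 are not reproduced here.

    def syntactic_pattern(s):
        # Replace uppercase letters with 'A', lowercase with 'a', digits with '9';
        # punctuation and spaces are left embedded in the pattern.
        out = []
        for ch in s:
            if ch.isupper():
                out.append("A")
            elif ch.islower():
                out.append("a")
            elif ch.isdigit():
                out.append("9")
            else:
                out.append(ch)
        return "".join(out)

    print(syntactic_pattern("John R. Talburt"))   # Aaaa A. Aaaaaaa
    print(syntactic_pattern("123 Main St"))       # 999 Aaaa Aa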

The third column of Table 7.1 shows a method of creating a semantic pattern by replacing each token in the original string with a single character that most likely signifies its meaning. In the scheme shown here, common first-name tokens such as “John” are replaced by the single character “F,” common last names by “L” and common street suffix tokens by “S.” Numeric tokens are replaced by the single character “9,” single letters are replaced by “i” and unrecognized alphabetic tokens by “a.” Together, the syntactic and semantic patterns can be used to classify each string into its most likely category in the customer contact ontology.
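A sketch of the semantic scheme follows. The token lists are tiny, illustrative stand-ins for the large reference tables of names and street suffixes a production system would use.

    FIRST_NAMES = {"JOHN", "MARY", "ROBERT"}          # illustrative token lists
    LAST_NAMES = {"TALBURT", "SMITH", "JONES"}
    STREET_SUFFIXES = {"ST", "AVE", "RD", "BLVD"}

    def semantic_pattern(s):
        # Replace each token with a single character signifying its likely meaning.
        symbols = []
        for token in s.split():
            t = token.strip(".,").upper()
            if t in FIRST_NAMES:
                symbols.append("F")                   # common first name
            elif t in LAST_NAMES:
                symbols.append("L")                   # common last name
            elif t in STREET_SUFFIXES:
                symbols.append("S")                   # common street suffix
            elif t.isdigit():
                symbols.append("9")                   # numeric token
            elif len(t) == 1 and t.isalpha():
                symbols.append("i")                   # single letter
            else:
                symbols.append("a")                   # unrecognized alphabetic token
        return " ".join(symbols)

    print(semantic_pattern("John R. Talburt"))   # F i L
    print(semantic_pattern("123 Main St"))       # 9 a S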

Again, HPC has a role to play. For example, suppose that an incoming file was found to have a fixed field-length record format with a record length of 200 bytes. In this case, any given field of contact information could occupy any one of 20,100 possible combinations of start and end positions in the record. Systematically testing each of these candidate fields over a significant sample of records for the syntactic and semantic patterns just described can be a very computationally intensive process.
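The 20,100 figure is simply the number of (start, end) byte positions with start no greater than end in a 200-byte record, that is, 200 x 201 / 2. A few lines of Python make the combinatorics, and the resulting workload, explicit.

    def candidate_fields(record_length):
        # Every possible field is a (start, end) pair of byte positions,
        # 1-based and inclusive, with start <= end.
        return [(start, end)
                for start in range(1, record_length + 1)
                for end in range(start, record_length + 1)]

    print(len(candidate_fields(200)))     # 20100, i.e., 200 * 201 / 2
    # Testing every span against a 10,000-record sample would mean roughly
    # 201 million pattern evaluations -- a natural fit for HPC.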

Entity-Based Data Integration

As discussed in Chapter 4, EBDI is an area in which various soft computing and machine-learning techniques are being applied to create more accurate and efficient integration operators. Kooshesh, Zhou and Talburt (2010) are currently conducting a series of experiments on the application of genetic programming to the problem of maximizing the accuracy of EBDI selection operators -- in particular, how to move beyond the accuracy of naïve and naïve-voting selection schemes into the opportunity region, as shown in Figure 4.1 in Chapter 4.

Their current approach is based on defining an integration hypothesis that has the form:

Ht(e) = {(C1(e), S1), (C2(e), S2), …, (Cn(e), Sn), D}

where Sj and D represent sources in an integration context, Cj represents a logical proposition related to a particular integration attribute (t) and integration entity (e) as defined in Chapter 4. For example, suppose there are four sources A, B, C, and D that provide entity references where the integration attribute (t) represents a telephone number. In this case, the condition C1 might look like:

C1(e) = {(A.t(e) : NE : null) : AND : ((A.t(e) : EQ : C.t(e)) : OR : (A.t(e) : EQ : D.t(e)))}

In this notation, A.t(e) represents the telephone number given for entity (e) by source A. Similarly, B.t(e) represents the number given for (e) by B, and so on. In this example, C1 would only return a true value for an entity (e) when the telephone number for (e) provided by A is not missing (null) and the number provided for (e) by A agrees with either the number provided by C or the number provided by D. The telephone number for (e) provided by B does not affect whether C1 is true or false.
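As a concrete and purely illustrative rendering, the Python predicate below evaluates C1 against three hypothetical sources represented as dictionaries keyed by entity identifier.

    SOURCE_A = {"e1": "5015551234", "e2": None}    # hypothetical telephone data
    SOURCE_C = {"e1": "5015551234"}
    SOURCE_D = {"e1": "5015559999"}

    def A_t(e): return SOURCE_A.get(e)             # A.t(e) in the notation above
    def C_t(e): return SOURCE_C.get(e)
    def D_t(e): return SOURCE_D.get(e)

    def C1(e):
        # True when A's number for e is not null and agrees with C's or D's.
        return A_t(e) is not None and (A_t(e) == C_t(e) or A_t(e) == D_t(e))

    print(C1("e1"))   # True: A's value is present and matches C's
    print(C1("e2"))   # False: A's value for e2 is missing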

A hypothesis condition can be visualized as a binary tree. Figure 7.3 shows the same condition C1 as a binary tree, where the leaves of the tree are the attribute values and the interior nodes are the logical operators.

For a given integration entity (e) and hypothesis H, the value for the integration attribute (t) is always selected from the source paired with the first condition in the hypothesis (reading left to right) that evaluates to true. For example, if C1 is true for entity (e), the value for (t) is selected from source S1; else if C2 is true, it is selected from S2, and so on. In the case that none of the conditions in the hypothesis is true, the final value for (t) is selected from D, the default source.
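The selection rule itself is easy to state in code. In the sketch below -- an assumed representation, not the authors' implementation -- a hypothesis is an ordered list of (condition, source) pairs plus a default source, and conditions are ordinary Python callables.

    def select_value(hypothesis, default_source, e):
        # Return the value from the source paired with the first true condition
        # (left to right); fall back to the default source if none are true.
        for condition, source in hypothesis:
            if condition(e):
                return source.get(e)
        return default_source.get(e)

    # Hypothetical sources keyed by entity identifier
    A = {"e1": "5015551234"}
    B = {"e1": "5015550000"}
    C = {"e1": "5015551234"}
    D = {"e1": "5015559999"}

    H = [
        (lambda e: A.get(e) is not None and A.get(e) in (C.get(e), D.get(e)), A),  # (C1, S1)
        (lambda e: B.get(e) is not None, B),                                       # (C2, S2)
    ]

    print(select_value(H, D, "e1"))   # '5015551234', taken from A because C1 is true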

Hypotheses constructed in this way represent a very expressive family of selection operators. For example, it is not difficult to design a hypothesis of this format that coincides with the naïve selection operator or the naïve-voting selection operator that are described in Chapter 4. It is also easy to see how these hypotheses lend themselves to a genetic programming (GP) approach. In the case where the true set of selections is known for some sample of the integration entities, the accuracy of each hypothesis can be calculated. Using accuracy as the “survival fitness” associated with each hypothesis, it becomes a simple matter to rank a given set of hypotheses from the most to the least fit with respect to their integration accuracy.
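Computing that fitness and ranking takes only a few lines of Python. In the sketch below, each hypothesis is abstracted as a callable that maps an entity identifier to its selected value, and the truth set is a small hand-made sample; both are assumptions made for illustration.

    def accuracy(hypothesis, truth):
        # Fraction of entities for which the hypothesis selects the true value.
        hits = sum(1 for e, true_value in truth.items() if hypothesis(e) == true_value)
        return hits / len(truth)

    def rank_by_fitness(hypotheses, truth):
        # Order hypotheses from most to least fit with respect to accuracy.
        return sorted(hypotheses, key=lambda h: accuracy(h, truth), reverse=True)

    truth = {"e1": "A", "e2": "B", "e3": "C"}
    h1 = lambda e: "A"                              # always selects the same value
    h2 = {"e1": "A", "e2": "B", "e3": "D"}.get      # a lookup-style hypothesis
    ranked = rank_by_fitness([h1, h2], truth)
    print([round(accuracy(h, truth), 2) for h in ranked])   # [0.67, 0.33]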

In the experiments currently being conducted at the ERIQ Research Center, the truth set and sources are synthetically generated and are represented in tabular format. Each row of the table represents an integration entity and each column one of the contributing sources for a given integration attribute. The attribute values are represented as single uppercase letters A through Z. A particular integration context is created by first selecting the number of sources (columns), along with their desired accuracy and completeness, the number of values for the attribute and the number of entities to be generated (rows). Typically, half the entities are used for training and the other half are saved for testing. For example, an integration context might comprise four sources, 12 attribute values (A–L) and 2,000 rows (1,000 training and 1,000 testing).

The first step in creating such a context is to randomly generate a list of 2,000 values that represent the true values. Next, each source is generated by probabilistic selection of the true values based on desired completeness and accuracy of the source. For example, if a source is intended to be 50% complete and 80% accurate, the source generator would randomly select 50% of the rows in the source to have null values, and the non-null values would be randomly selected to agree with the true values 80% of the time.
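The generation procedure can be sketched directly from that description. The parameter values below (2,000 rows, 12 attribute values, a source that is 50% complete and 80% accurate) mirror the example above; the function and variable names are only illustrative.

    import random
    import string

    def generate_truth(n_entities, n_values):
        # Randomly generate the column of true attribute values, e.g., 'A'-'L'.
        values = string.ascii_uppercase[:n_values]
        return [random.choice(values) for _ in range(n_entities)]

    def generate_source(truth, completeness, accuracy, n_values):
        # Null out a (1 - completeness) share of rows; each remaining row agrees
        # with the true value with probability `accuracy`, otherwise it gets a
        # randomly chosen wrong value.
        values = string.ascii_uppercase[:n_values]
        null_rows = set(random.sample(range(len(truth)),
                                      int((1 - completeness) * len(truth))))
        column = []
        for i, true_value in enumerate(truth):
            if i in null_rows:
                column.append(None)
            elif random.random() < accuracy:
                column.append(true_value)
            else:
                column.append(random.choice([v for v in values if v != true_value]))
        return column

    truth = generate_truth(2000, 12)
    source = generate_source(truth, completeness=0.5, accuracy=0.8, n_values=12)
    non_null = [i for i, v in enumerate(source) if v is not None]
    print(len(non_null))                                                 # 1000
    print(sum(source[i] == truth[i] for i in non_null) / len(non_null))  # about 0.8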

©2011 Elsevier, Inc. All rights reserved. Printed with permission from Morgan Kaufmann Publishers, an imprint of Elsevier. Copyright 2011. For more information on this title and other similar books, please visit elsevierdirect.com.