In this chapter excerpt from the book Entity Resolution and Information Quality by John R. Talburt, readers will learn about the genetic programming approach to entity resolution for integrating data and ensuring information quality. Readers will also get examples of how to use genetic programming principles and learn about ongoing research on entity resolution models.
In this part:
More on Entity-Based Data Integration
The first step in creating such a context is to randomly generate a list of 2,000 values that represent the true values. Next, each source is generated by probabilistic selection of the true values based on desired completeness and accuracy of the source. For example, if a source is intended to be 50% complete and 80% accurate, the source generator would randomly select 50% of the rows in the source to have null values, and the non-null values would be randomly selected to agree with the true values 80% of the time.
After all the sources are generated, the next step in a genetic programming (GP) approach is to randomly generate an initial generation of hypotheses within a proscribed set of limits on the number and complexity of the conditions. These initial hypotheses are the initial seed generation for an iterative process, creating a sequence of generations whereby each new generation is created from its predecessor by a process of evaluation and mutation. For example, if each generation comprises 100 hypotheses, the next generation might be created by carrying forward without change the 10 hypotheses with the highest fitness scores, then probabilistically selecting another 10 based on their relative fitness, then creating the final 80 hypotheses for the next generation by random mutation.
Hypothesis mutations fall into two primary categories: crossover mutations that rearrange the condition-source pairs within and between hypotheses, and internal changes to a specific condition-source pair. The condition-source pairs are also called the clauses of the hypothesis. One example of a crossover mutation is to randomly select two hypotheses in the current generation and, within each hypothesis, also randomly select a clause. The crossover is to divide the two hypotheses at the selected clauses and exchange (splice) the clauses between the two hypotheses past the selected clauses. Another might be to simply randomly change the order of the clauses in a single hypothesis. The crossover mutations change the overall structure of the hypotheses but leave each clause intact.
In addition to crossover mutations, there are also a series of possible mutations that make changes internal to a specific clause in the hypothesis. These range from simple changes such as randomly changing the selection source of the clause or randomly changing the Boolean operator in the condition of a clause to more complex operations such as pruning a subtree of a condition and splicing it into the tree of a condition in another clause.
The following is a description of a GP experiment to optimize the integration of four synthetically generated sources with respect to the accuracy of one integration attribute. The attribute can take on any of 12 possible values, represented by the letters A through L. Each source was generated with 2,000 rows (integration entities) so that the GP algorithm could be trained on the first 1,000 rows, with the best-performing hypothesis tested against the remaining 1,000 rows.
Table 7.2 shows the accuracy and completeness of each generated source. Due to random variation, these percentages vary slightly from the target values given to the source generator. For example, the actual targets for the first source were 85% accuracy and 50% completeness. This particular experiment simulates the situation commonly found when integrating multiple sources, namely that the sources with the higher completeness (coverage) often have lower accuracy. Table 7.2 also shows that in the best case, the integration of these four sources could have at most an accuracy of 93.1%, with almost 100% completeness. However, the naıve integration operator in the context only achieves 64% accuracy.
Table 7.3 shows the results of applying a GP algorithm to this integration context based on successive generations of 100 hypotheses each. In this algorithm, the best 10 hypotheses and a random selection of 10 other hypotheses were always carried forward into the next generation without change. The remaining 80 hypotheses in the new generation were created by a combination of crossover and clause mutations, as previously described. For the experiment shown here, the crossover mutation rate was set at 80% and the clause mutation rate at 50%. It is interesting to note the in the first 100 randomly generated hypotheses, the most accurate hypothesis (66.7%) was already higher than the naıve selection operator (64%). After 100 generations, the highest accuracy achieved was 81.5% on the training set and 81.8% on the test set. These values are much higher than for the naı¨ve selection operator but are still below the best possible accuracy of 93.1%.
Figure 7.4 shows the same information as in Table 7.3 but showing all the generations on the horizontal axis. The pattern of the experiment was a sharp increase in accuracy in the first few generations followed by longer periods punctuated with increases. More information about these and similar experiments can be found on the ERIQ Research Center website (ualr.edu/eriq).
Fundamental ER Research
There is still continuing research toward refining the basic models of entity resolution (ER). As noted earlier, Winkler (1988, 1989a, 1989b) and others have developed a number of refinements to the basic Fellegi-Sunter record linkage model, and Benjelloun, Garcia-Molina, et al. (2006) have extended the basic R-Swoosh algorithm to exploit the efficiencies of distributed and parallel processing. Originally, the calculation of the weights associated with the Fellegi-Sunter agreement patterns was done manually by validating samples of each pattern. Much of the research in probabilistic matching is directed toward developing methods and techniques to automate and optimize the pattern assessment process. In some cases, these techniques have been patented for commercial applications, e.g., Automatic weight generation for probabilistic matching (Schumacher, Adams, Ellard, 2007). These automated machine-learning techniques are also called unsupervised learning (Michalowski, Thakkar, Knoblock, 2004; Christen, 2007). Some of these unsupervised learning techniques have been incorporated into toolkits that support ER processes such as the FRIL (Jurczyk, Lu, Xiong, et al., 2008) and TAILOR (Elfeky, Elmagarmid, Verykio, 2002) toolkits.
Jonas (2006) has emphasized the need to develop ER systems that are sequence-neutral, as described in Chapter 3. He contends that sequence-neutral systems create an effect he calls “data find the data,” meaning that each new piece of data coming into the system is dynamically related to existing information without waiting for a user query or batch process to run. Jonas (2007) also promotes the superiority of identity resolution and identity capture architectures, which he describes as record-to-set-based matching, over merge/purge-based batch architectures, which he describes as record-to-record matching. This is mainly because identity resolution systems support real-time transactional processing and the ongoing capture and retention of identity information in a process that he describes as “perpetual analytics.”
The open source ER systems OYSTER and Febrl discussed in Chapter 6 were originally developed for specific applications -- education and health care, respectively. However, both are evolving into general-purpose ER systems that support ER instruction and research. In particular, the OYSTER system supports strategies for identity capture, identity management and asserted equivalence. With the growing interest in hub architectures and information fusion centers, there has been commensurate increased research around the problem of identity and identity management.
Several corporate, government and academic organizations, such as the Center for Applied Identity Management Research (CAIMR, caimr.org), have joined together to promote research in this area. Working in collaboration with the Center for Excellence in Distributed Global Environments (EDGE, www.edge.utexas.edu) at the University of Texas at Austin, the CAIMR group is developing a comprehensive model for identity management that includes ER and identity resolution. Another effort in this direction is the Consortium for Identity Systems Research and Education (CISRE, autoid.astate.edu), hosted by the Laboratory for the Study of Automatic Identification at Arkansas State University (autoid.astate.edu/autoid/). The CISRE focus is on how automatic identification such as biometrics for humans and animals and radio frequency identification (RFID) tags for products (Moeeni, 2008) can facilitate identity management solutions in complex systems such as health care, food supply chains and transportation logistics.
Identity resolution is proving to be a key component of the information hub architecture that is increasingly being used to harness entity-based information dispersed over different systems and organizations. High-performance computing (HPC) is having an impact on ER, making it possible to process immense numbers of records in real, or near-real, time. HPC is also opening doors to new data-intensive applications that use data-driven statistical models in lieu of traditional rule-based models. This is especially true in the area of entity reference extraction and named-entity recognition.
Fundamental research also continues around the basic activities of ER, including unsupervised learning in the optimization of probabilistic matching models, the application of graph and network methods to association analysis, identity management, open source ER systems and tighter integration of ER and information quality (IQ) activities.
©2011 Elsevier, Inc. All rights reserved. Printed with permission from Morgan Kaufmann Publishers, an imprint of Elsevier. Copyright 2011. For more information on this title and other similar books, please visit elsevierdirect.com.