Dealing with association analysis, unstructured data in ER processes

Trends in entity resolution research and applications

In this chapter excerpt from the book Entity Resolution and Information Quality by John R. Talburt, readers will learn how association analysis techniques can aid entity resolution and information quality strategies on social network data and other types of unstructured data. Readers will also learn how high-performance computing systems are affecting entity resolution on unstructured data.

Association Analysis and Social Networks

Most commercial entity resolution (ER) systems were developed around customer contact information, which has traditionally been name, mailing address and telephone number. With the growth of the World Wide Web and, most recently, social networks, customer contact information increasingly includes email address, URL, IP address and social network usernames and handles. Usernames and handles are aliases for the user’s real name, and people often use different usernames in different networks. When users establish multiple identities in the same network for malicious purposes, these are referred to as Sybil attacks (Viswanath, Post, Gummadi, Mislove, 2010). Other factors contributing to the complexity of ER in the context of the Internet are that messages and other communications tend to be less structured and more compressed, and often employ slang and abbreviations.

The overall result is that direct matching tends to be less effective for linking equivalent references than in the traditional structured IT environment. For this reason, much of the current research on ER for the Web and social networks focuses on determining equivalence through association analysis, as discussed in Chapter 1. Associations among references are often modeled as a network graph in which references are represented as nodes (vertices) and the edges between the nodes represent associations between the references. The benefit of using this model is that it allows researchers to draw on the existing body of knowledge on network analysis in addressing the problems related to ER in social networks.

Figure 7.2 shows an undirected network graph representing 23 associations among 16 references. In general, these network models try to determine equivalent references as “clusters” of nodes that share a large number of mutual associations. The problem is that in an actual application it is not always clear where one cluster ends and another begins. Figure 7.2 illustrates four situations that commonly occur. Cluster D, formed by nodes 15 and 16, is clearly not connected to any of the other nodes in the graph and is easily identified. On the other hand, even though the other 14 nodes are all interconnected, they appear to cluster into three groups, designated by the circles labeled A, B, and C. For example, nodes 1, 2, 3, and 4 in circle A form a closed, interconnected group except for the single connection between nodes 4 and 5. Cutting or removing this single connection would leave A as a separate group similar to D.
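
To make the graph model concrete, the following minimal sketch builds a small association graph with the Python networkx library and shows how removing a single edge separates a cluster. The edge list is purely illustrative; it mimics the structure described for Figure 7.2 (cluster A linked to the rest by one edge, cluster D isolated) but is not the book's actual figure.

```python
# Illustrative sketch of references modeled as nodes and associations as edges.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3), (2, 4), (3, 4),            # cluster A
    (4, 5),                                            # the single "cut" edge
    (5, 6), (5, 7), (6, 7), (6, 8), (7, 8),            # cluster B (illustrative)
    (8, 9), (9, 10),                                   # node 9 bridging B and C
    (10, 11), (10, 12), (11, 12), (11, 13), (12, 13),  # cluster C
    (13, 14),                                          # node 14 as an outlier
    (15, 16),                                          # cluster D
])

# Cluster D is found immediately as a separate connected component.
print([sorted(c) for c in nx.connected_components(G)])

# Removing the single association between nodes 4 and 5 leaves cluster A
# as its own component, as described in the text.
G.remove_edge(4, 5)
print([sorted(c) for c in nx.connected_components(G)])
```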

The situation with the nodes in circles B and C is somewhat different. Node 9 is connected to several nodes in each group. In their SCAN (Structural Clustering Algorithm for Networks) model, Xu, Yuruk, Feng, and Schweiger (2007) identify node 9 as a “hub” node. They also term node 14 an “outlier” because it is connected to a cluster but is not part of one.
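
As a sketch of the idea behind SCAN's hub and outlier designations, the function below computes the structural similarity of two connected nodes, assuming the commonly used closed-neighborhood form of the measure; it illustrates the concept rather than reproducing the published algorithm, and the small graph is hypothetical.

```python
import math
import networkx as nx

def structural_similarity(G, u, v):
    """sigma(u, v) = |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), where N[x] is x's neighbors plus x."""
    nu = set(G.neighbors(u)) | {u}
    nv = set(G.neighbors(v)) | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Hypothetical fragment: node 9 bridges into another group, node 10-12 form a tight group.
G = nx.Graph([(8, 9), (9, 10), (10, 11), (10, 12), (11, 12)])
print(structural_similarity(G, 10, 11))   # high: 10 and 11 share most neighbors
print(structural_similarity(G, 9, 10))    # lower: 9 acts as a bridge (hub-like)
```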

What constitutes a cluster, and what determines equivalent references, depends on the underlying model. For example, Elmacioglu, Kan, Lee and Zhang (2007) defined Web-based linkage in terms of the commonality of “representative data” -- data that is likely to be associated with a particular entity. In this model, the representative data is not necessarily an identity attribute of the entity in the sense that it describes some feature of the entity, only that it is typically found in association with that entity. For example, if an author typically writes papers with certain co-authors, the presence or absence of these co-authors can help disambiguate references to that author from references to another, similarly named author. Bekkerman and McCallum (2005) proposed a similar model for disambiguation of Web references based on a link structure model. In their model, the nodes (references) are Web pages, and an edge between two nodes means that “their hyperlinks share something in common.”
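
The co-author example can be sketched very simply: compare the co-authors on a new reference with the co-author sets typically seen with each candidate author. This is a toy illustration of the representative-data idea, not the cited authors' methods; the names, profiles, and use of Jaccard similarity are hypothetical.

```python
def coauthor_overlap(reference_coauthors, known_coauthors):
    """Jaccard similarity between two sets of co-author names."""
    if not reference_coauthors and not known_coauthors:
        return 0.0
    return len(reference_coauthors & known_coauthors) / len(reference_coauthors | known_coauthors)

# Hypothetical "representative data" profiles for two similarly named authors.
profile_smith_ml = {"A. Gupta", "L. Chen", "R. Novak"}   # the machine-learning J. Smith
profile_smith_db = {"P. Ortega", "D. Kim"}               # the database J. Smith

new_paper_coauthors = {"L. Chen", "R. Novak"}
scores = {
    "J. Smith (ML)": coauthor_overlap(new_paper_coauthors, profile_smith_ml),
    "J. Smith (DB)": coauthor_overlap(new_paper_coauthors, profile_smith_db),
}
print(max(scores, key=scores.get))   # 'J. Smith (ML)' -- the new reference links there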

In many models, the goal of the ER algorithm is to partition the graph into disconnected subgraphs by removing a minimal number of associations (edges) in such a way that the references represented in each subgraph are equivalent. For example, in Figure 7.2 the edge between nodes 4 and 5 is a likely “cut point” in the graph that would make cluster A into its own partition class. Bilgic et al. (2006) developed an interactive tool called D-Dupe to assist in the visual analysis of these clusters. D-Dupe provides a number of visual features to assist in the analysis, such as different node shapes to highlight references of interest and different widths of edges to show the strength of the association.
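
A rough sketch of the minimal edge-removal idea is a minimum edge cut between two references believed to be non-equivalent. The fragment below uses networkx's built-in minimum edge cut on an illustrative edge list that echoes the A-to-B portion of the earlier sketch; it is not the book's figure or any particular published algorithm.

```python
import networkx as nx

# Illustrative association graph: nodes 1-4 form one tight group, nodes
# 5-7 another, joined only by the single edge (4, 5).
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5),
              (5, 6), (5, 7), (6, 7)])

# Smallest set of associations whose removal separates reference 1 from
# reference 7 -- here the single bridging edge.
cut = nx.minimum_edge_cut(G, 1, 7)
print(cut)   # e.g. {(4, 5)}

G.remove_edges_from(cut)
print(list(nx.connected_components(G)))   # [{1, 2, 3, 4}, {5, 6, 7}]
```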

In an interesting extension of the Fellegi-Sunter Model, Schweiger and Xu (2009) have proposed a method for applying network cluster analysis to traditional probabilistic matching. In their model, the graph nodes represent references and the edges represent a probabilistic match between two references. Using a graph metric called modularity, introduced by Newman (2004), their algorithm makes resolution decisions based on the overall weight of matches in a cluster of references rather than on successive pairwise comparisons. Their work has since been extended to take into account negative (conflicting) information (Schweiger, Xu, 2010).
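
The general flavor of cluster-level resolution over a weighted match graph can be illustrated with networkx's greedy modularity heuristic. This is not Schweiger and Xu's specific algorithm, and the reference identifiers and match weights below are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical references with pairwise match probabilities as edge weights.
G = nx.Graph()
G.add_weighted_edges_from([
    ("r1", "r2", 0.95), ("r2", "r3", 0.90), ("r1", "r3", 0.40),
    ("r4", "r5", 0.85), ("r5", "r6", 0.80),
    ("r3", "r4", 0.15),   # a weak cross-cluster match
])

# Greedy search for a partition with high modularity over the match
# weights: the resolution decision is made per cluster, not per pair.
clusters = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in clusters])   # expected: [['r1', 'r2', 'r3'], ['r4', 'r5', 'r6']]
```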

Network graphs are also very useful for entity relationship analysis (ER Activity 5) and its visualization, such as the exploration and discovery functions of the Infoglide Identity Resolution Engine shown in Chapter 5. Association analysis can be thought of in terms of degrees of separation. In this terminology, zero degrees of separation between references would mean that the references are equivalent. One degree of separation would be when distinct entities share a direct relationship, such as the same residential address. Two degrees of separation means that two entities both share a relationship with a third entity, but not directly with each other. For example, Agarwal, Liu, Murthy, Sen and Wang (2009) have developed a network analysis algorithm for identifying “familiar strangers” in social networks -- that is, persons who exhibit similar behaviors but who are not directly connected. In other research, Agarwal, Liu, Tang and Yu (2008) have developed an association analysis for identifying “influential bloggers” in a community of interest.
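
Degrees of separation can be sketched as shortest path length over an association graph whose nodes are entities and whose edges record shared direct relationships. The entities and relationships below are hypothetical.

```python
import networkx as nx

# Hypothetical entities; each edge records one shared direct relationship.
G = nx.Graph()
G.add_edge("Alice", "Bob", relationship="same residential address")   # one degree apart
G.add_edge("Bob", "Carol", relationship="same employer")              # one degree apart

# Alice and Carol each share a relationship with Bob but not with each
# other: two degrees of separation.
print(nx.shortest_path_length(G, "Alice", "Carol"))   # 2
```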

HPC in ER

The advent of low-cost, high-performance computing (HPC) and cloud computing is changing the IT landscape in many ways and is already having an impact on ER. As discussed in Chapter 5, Acxiom Corporation was one of the first to explore the application of HPC to ER in the development of its AbiliTec CDI technology through the use of a massively parallel grid computing system. Much of this work was done before the general availability of HPC, requiring its R&D teams to put almost as much effort into developing the HPC platform, operating system, and storage network as into the ER application itself. Now that HPC and cloud computing are readily available, they are not only increasing the performance of the basic preparation, resolution, and identity management processes (ERA2, ERA3, ERA4), but are also being brought to bear on some of the difficult problems related to entity reference extraction (ERA1) and entity analytics (ERA5).

Unstructured Data

Inmon and Nesavich (2008) estimate that the majority of information available in most organizations is in unstructured formats, most of it in documents that they call unstructured textual information (UTI). HPC is making headway in the race to unlock this reservoir of information that is largely unavailable for use in the structured environment of most organizations’ IT systems. Vanover (2010) has even suggested that the processing of unstructured data is one of the main drivers for business adoption of cloud computing.

The reason that HPC is so important to processing unstructured data is that it supports data-intensive computing -- that is, the ability to process very large amounts of data in or near real time. One of the side effects of data-intensive computing capability is a shift from rule-based computing models to data-driven statistical computing models (Talburt, Bell, DeClue, 2000). As an example, consider natural language translation. In any given language, the number of ways in which a sentence can be constructed is virtually uncountable. Until recently, the approach to managing this complexity has been to try to construct models of the language’s grammar rules. In the case of translation, there must be an additional model (a meta-model) that describes how the grammatical rule model of the source language is mapped into the grammatical rule model of the target language, such as English to Spanish.

However, an alternative approach to language translation that has been around for some time (Weaver, 1955) is statistical machine translation (SMT). SMT is basically an application of Bayesian and other probabilistic models to estimate the relative frequency with which a given phrase in the source language has been translated into the target language. Until the advent of HPC, SMT was largely a theoretical model because it was impractical to find and count instances of previous translations on a large scale. HPC has made it possible to compute these statistics automatically, and Google has done just that with its translation service. The effort began in 2007 with the analysis of more than 200 billion words of parallel translations from archives of the United Nations and continues with new translations produced by both the United Nations and the European Union (Tanner, 2007).
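
The statistical idea can be shown in miniature: choose the target phrase that maximizes P(source | target) × P(target), with both probabilities estimated as relative frequencies from a parallel corpus. The corpus below is tiny and hypothetical; systems of the kind described estimate these counts over billions of aligned words.

```python
from collections import Counter

# Hypothetical aligned phrase pairs (source, target) observed in a parallel corpus.
aligned = [
    ("thank you", "gracias"), ("thank you", "gracias"),
    ("thank you", "muchas gracias"),
    ("good morning", "buenos dias"),
]

pair_counts = Counter(aligned)
target_counts = Counter(t for _, t in aligned)
total_targets = sum(target_counts.values())

def best_translation(source):
    """Pick the target phrase maximizing P(source | target) * P(target)."""
    def score(target):
        p_src_given_tgt = pair_counts[(source, target)] / target_counts[target]
        p_tgt = target_counts[target] / total_targets
        return p_src_given_tgt * p_tgt
    return max(target_counts, key=score)

print(best_translation("thank you"))   # 'gracias' -- the most frequently observed pairing
```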

Similar data-driven approaches are now being used to extract entity references from unstructured and semi-structured information. For example, Osesina and Talburt (2010) have proposed an approach to recognizing named entity references in unstructured text by exhaustively extracting phrases from the unstructured document and assessing their similarity to statistical profiles of known references in an annotated corpus of documents of the same type, such as business news articles or customer service notes. The statistics are collected for some of the indicative properties exhibited by the entity references in the annotated corpus. These properties are both contextual, such as commonly found phrases that precede or succeed the reference, and intrinsic, such as the length and character pattern of the reference. Without HPC, this kind of approach would be impractical for anything other than very small test samples.
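
As a rough sketch of this kind of data-driven extraction (not the cited implementation), the code below exhaustively enumerates candidate phrases from a text and scores them against simple contextual and intrinsic statistics. The cue words, length range, and weights are hypothetical stand-ins for profiles that would be learned from an annotated corpus.

```python
import re

PRECEDING_CUES = {"at", "by", "from", "with"}   # hypothetical contextual cue words
TYPICAL_LENGTH = range(1, 4)                    # hypothetical: references of 1-3 tokens

def candidate_phrases(tokens, max_len=3):
    """Exhaustively enumerate phrases up to max_len tokens long."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            yield i, tokens[i:j]

def score(tokens, start, phrase):
    """Combine a contextual cue score with intrinsic features of the phrase."""
    contextual = 1.0 if start > 0 and tokens[start - 1].lower() in PRECEDING_CUES else 0.0
    intrinsic = 1.0 if len(phrase) in TYPICAL_LENGTH and all(t[:1].isupper() for t in phrase) else 0.0
    return 0.5 * contextual + 0.5 * intrinsic

text = "The customer called from Acme Widgets about a late shipment"
tokens = re.findall(r"\w+", text)
ranked = sorted(
    ((score(tokens, i, p), " ".join(p)) for i, p in candidate_phrases(tokens)),
    reverse=True,
)
print(ranked[0])   # (1.0, 'Acme Widgets')
```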

One of the greatest advantages of data-driven statistical models over rule-based models is in how the models are updated and refined. In a data-driven model, the statistical calculations remain fixed and refinements to the model come primarily from acquiring and processing more information. In rule-based models, refinements typically require continual introspection and reprogramming of the rule structure.

©2011 Elsevier, Inc. All rights reserved. Printed with permission from Morgan Kaufmann Publishers, an imprint of Elsevier. Copyright 2011. For more information on this title and other similar books, please visit elsevierdirect.com.