Huang et al, ACL 2009: Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
Latest revision as of 00:45, 1 December 2010
Citation
Jian Huang, Sarah M. Taylor, Jonathan L. Smith, Konstantinos A. Fotiadis and C. Lee Giles. 2009. Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 414–422.
Online version
An online version of this paper is available [1].
Summary
This paper addresses the problem of Cross Document Coreference (CDC) by using Information Extraction tools to build profiles of entities, measuring the distance between profiles with a learned distance function, and finally clustering the profiles using kernelized fuzzy relational clustering.
Constructing entity profiles using IE and WDC
An information extraction tool first extracts Named Entities (NEs) and their relationships. For the NEs of interest, a Within Document Coreference (WDC) module then links the entities deemed to refer to the same underlying identity into a WDC chain. The authors use the information extraction tool AeroText for this purpose. AeroText extracts two types of information for an entity: attribute information about the person named entity (first/middle/last names, gender, mention, etc.) and relationship information between named entities (Family, List, Employment, Ownership, Citizen-Resident-Religion-Ethnicity and so on, as specified in the ACE evaluation). AeroText resolves the references to entities within a document and produces the entity profiles used as input to the CDC system. Each entity is represented as a profile that contains the NE, its attributes and its associated relationships.
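A profile of this kind can be pictured as a small record per WDC chain. The sketch below is only an illustration: the class name, field names and example values are hypothetical and do not reflect AeroText's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class EntityProfile:
    """One WDC chain: a named entity plus what the IE tool extracted about it.

    All field names here are hypothetical, chosen for illustration only.
    """
    doc_id: str
    name: str
    attributes: dict = field(default_factory=dict)     # e.g. {"gender": "male"}
    relationships: dict = field(default_factory=dict)  # e.g. {"Employment": ["Acme Corp"]}

    def shared_relationship_types(self, other):
        """Relationship types present in both profiles, in sorted order."""
        return sorted(set(self.relationships) & set(other.relationships))


a = EntityProfile("doc1", "John Smith", {"gender": "male"},
                  {"Employment": ["Acme Corp"], "Family": ["Mary Smith"]})
b = EntityProfile("doc2", "J. Smith", {"gender": "male"},
                  {"Employment": ["Acme Corporation"]})
print(a.shared_relationship_types(b))
```

Only the relationship types that appear in both profiles can be compared directly, which is why the helper returns their intersection.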
Kernelized Fuzzy Relational Clustering
For clustering the entities, the authors use the Kernelized Fuzzy Relational Clustering algorithm (KARC). This algorithm is based on the Any Relation Clustering Algorithm (ARCA), which represents relational data as object data using their mutual relation strengths and uses Fuzzy C-Means for clustering. Each chained entity is represented as a vector of its relation strengths with all the entities. Fuzzy clusters can then be obtained by grouping closely related patterns using an object clustering algorithm.
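The "objectification" step can be sketched as building a matrix whose i-th row holds the relation strengths of entity i with every entity. In the sketch below the function names are hypothetical, and the token-overlap (Jaccard) score is only a stand-in for the learned relation strength described later on this page.

```python
def objectify(entities, strength):
    """Return R where R[i][j] = strength(entities[i], entities[j]).

    Row i is then the object-data representation of entity i.
    """
    return [[strength(a, b) for b in entities] for a in entities]


def jaccard(a, b):
    """Placeholder relation strength: token overlap of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


entities = ["John Smith IBM", "John Smith Acme", "Jane Doe Acme"]
R = objectify(entities, jaccard)
for row in R:
    print(row)
```

Any object clustering algorithm can then treat the rows of R as ordinary feature vectors.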
The kernelized fuzzy clustering algorithm KARC works as follows. The chained entities E are first objectified into a relation strength matrix R using Specialist Exponentiated Gradient (SEG). A Gram matrix K is then computed from the relation strength vectors using the kernel function. For a given number of clusters C, initialization is done by randomly picking C patterns as cluster centers. The kernel distance matrix D is initialized, and KARC then alternately updates the membership matrix U and the kernel distance matrix D until convergence or until a maximum number of iterations is reached. Finally, the soft partition is generated from the membership matrix U, which gives the desired cross document coreference result.
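The alternating updates described above can be sketched as a kernel fuzzy C-means loop over the Gram matrix. This is a minimal sketch under stated assumptions, not the authors' implementation: the RBF kernel, the fuzzifier m = 2, the tolerance, and every function and parameter name are assumptions. The `centers` argument is added so the example is reproducible; when it is omitted, C patterns are picked at random as in the description.

```python
import math
import random


def rbf_gram(R, gamma=1.0):
    """Gram matrix K over the rows of R (RBF kernel is an assumption)."""
    n = len(R)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(R[i], R[j])))
             for j in range(n)] for i in range(n)]


def update_memberships(D, m):
    """Fuzzy C-means membership update from squared distances D[i][j]."""
    C, n = len(D), len(D[0])
    U = [[0.0] * n for _ in range(C)]
    for j in range(n):
        zeros = [i for i in range(C) if D[i][j] <= 1e-12]
        if zeros:  # pattern coincides with a center: split membership among them
            for i in zeros:
                U[i][j] = 1.0 / len(zeros)
            continue
        inv = [D[i][j] ** (-1.0 / (m - 1)) for i in range(C)]
        total = sum(inv)
        for i in range(C):
            U[i][j] = inv[i] / total
    return U


def update_distances(K, U, m):
    """Squared feature-space distance from each pattern to each implicit center."""
    n = len(K)
    D = []
    for u in U:
        w = [x ** m for x in u]
        sw = sum(w)
        center_norm = sum(w[k] * w[l] * K[k][l]
                          for k in range(n) for l in range(n)) / sw ** 2
        D.append([max(K[j][j]
                      - 2.0 * sum(w[k] * K[k][j] for k in range(n)) / sw
                      + center_norm, 0.0) for j in range(n)])
    return D


def karc(R, C, m=2.0, gamma=1.0, max_iter=100, tol=1e-6, centers=None, seed=0):
    """Return the membership matrix U (C rows, one column per entity)."""
    K = rbf_gram(R, gamma)
    n = len(K)
    if centers is None:  # random initialization, as in the description above
        centers = random.Random(seed).sample(range(n), C)
    D = [[K[c][c] - 2.0 * K[c][j] + K[j][j] for j in range(n)] for c in centers]
    U = update_memberships(D, m)
    for _ in range(max_iter):
        D = update_distances(K, U, m)
        U_new = update_memberships(D, m)
        change = max(abs(U_new[i][j] - U[i][j])
                     for i in range(C) for j in range(n))
        U = U_new
        if change < tol:
            break
    return U


# Toy relation strength matrix with two obvious groups: {0, 1} and {2, 3}.
R = [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.0, 0.1],
     [0.1, 0.0, 1.0, 0.9],
     [0.0, 0.1, 0.9, 1.0]]
U = karc(R, 2, centers=[0, 2])
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(4)]
print(labels)
```

The cluster centers are never materialized; they exist only implicitly as membership-weighted combinations of patterns in the kernel feature space, so all distances are computed from K and U.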
The number of true underlying identities may vary depending on the entities' level of ambiguity (e.g. name frequency). To select the optimal number of clusters, the authors adopt the Xie-Beni Index (XBI) (Xie and Beni, 1991) as in ARCA, which is one of the most popular cluster validity indices for fuzzy clustering algorithms.
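The Xie-Beni index is the ratio of fuzzy within-cluster compactness to the minimum separation between cluster centers, so lower values indicate a better partition. The sketch below is a simplified illustration that uses plain Euclidean distances and explicit centers rather than the kernel distances of KARC; the function names and toy data are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(X, centers, U, m=2.0):
    """Xie-Beni index: compactness / (n * minimum center separation)."""
    n, C = len(X), len(centers)
    compactness = sum(U[i][j] ** m * sq_dist(X[j], centers[i])
                      for i in range(C) for j in range(n))
    separation = min(sq_dist(centers[i], centers[k])
                     for i in range(C) for k in range(i + 1, C))
    return compactness / (n * separation)


X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# A sensible partition: left pair versus right pair.
good = xie_beni(X, [(0.0, 0.5), (10.0, 0.5)], [[1, 1, 0, 0], [0, 0, 1, 1]])
# A poor partition: each cluster mixes a left point with a right point.
bad = xie_beni(X, [(0.0, 0.0), (0.0, 1.0)], [[1, 0, 1, 0], [0, 1, 0, 1]])
print(good, bad)
```

To choose the number of clusters, one would run the clustering for several candidate values of C and keep the one with the lowest index.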
Learning Distance Functions from a suite of similarity measures
A suite of similarity functions is designed to determine whether the attributes and relationships in a pair of entity profiles match: SoftTFIDF, JC Semantic Similarity, rule-based metrics, etc. The authors treat each similarity function as a specialist that computes the similarity of one particular type of relationship. They use a specialist ensemble learning framework (SEG) to combine these component similarities into the relation strength used for clustering. A specialist is awakened for prediction only when the same type of relationship is present in both chained entities. A specialist can also choose not to make a prediction if it is not confident enough about an instance. In addition to their predictions, the specialists carry different weights that determine their influence on the final relation strength.
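The combination of awake specialists can be sketched as a weighted average restricted to the specialists that made a prediction, followed by a multiplicative weight update. This is a hedged illustration of the general specialist idea, not the paper's exact SEG formulation: the squared-loss gradient, the learning rate `eta`, the renormalization over awake specialists, and all names are assumptions.

```python
import math


def predict(weights, preds):
    """Weighted average over awake specialists; None marks a sleeping one."""
    awake = {k: p for k, p in preds.items() if p is not None}
    if not awake:
        return None
    total = sum(weights[k] for k in awake)
    return sum(weights[k] * p for k, p in awake.items()) / total


def eg_update(weights, preds, label, eta=0.5):
    """Exponentiated-gradient step that only touches awake specialists."""
    y_hat = predict(weights, preds)
    if y_hat is None:
        return dict(weights)
    awake = {k: p for k, p in preds.items() if p is not None}
    mass = sum(weights[k] for k in awake)
    # Multiplicative update using the gradient of the squared loss.
    raw = {k: weights[k] * math.exp(-2.0 * eta * (y_hat - label) * p)
           for k, p in awake.items()}
    # Rescale so the awake specialists keep the same total weight as before.
    scale = mass / sum(raw.values())
    new = dict(weights)
    for k, v in raw.items():
        new[k] = v * scale
    return new


weights = {"name": 1 / 3, "employment": 1 / 3, "family": 1 / 3}
# "family" is asleep: the relationship is missing from one of the profiles.
preds = {"name": 0.9, "employment": 0.2, "family": None}

before = predict(weights, preds)
new_weights = eg_update(weights, preds, label=1.0)
after = predict(new_weights, preds)
print(before, after)
```

Because the pair is labeled as coreferent, the specialist that predicted a high similarity gains weight, while the sleeping specialist is left unchanged.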
Experiments and Evaluation
The evaluation uses the data of the ACL SemEval-2007 Web People Search task (WePS), and the authors report the standard purity and inverse purity clustering metrics, as in the WePS evaluation. The test collection consists of three sets of 10 different names, sampled from ambiguous names in English Wikipedia (famous people), participants of the ACL 2006 conference (computer scientists) and common names from the US Census data, respectively. For each name, the top 100 documents retrieved from the Yahoo! Search API are used.
The authors report a macro-averaged purity of 0.657, an inverse purity of 0.795 and an F score of 0.740. This compares favorably with the results of the first-tier systems in the WePS 2007 official evaluation.
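For reference, purity and inverse purity for a single name can be computed as below. This is a minimal sketch that assumes hard, non-overlapping clusters given as sets of document ids and an equally weighted harmonic mean for F; the function names and toy data are illustrative, and the numbers are not taken from the paper.

```python
def purity(clusters, gold):
    """Fraction of documents covered by each cluster's best-matching gold class."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in gold) for c in clusters) / n


def inverse_purity(clusters, gold):
    """Purity with the roles of system clusters and gold classes swapped."""
    return purity(gold, clusters)


def f_measure(p, ip, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity."""
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)


# One system cluster that lumps two real people together.
clusters = [{1, 2, 3, 4, 5}]
gold = [{1, 2}, {3, 4, 5}]

p = purity(clusters, gold)
ip = inverse_purity(clusters, gold)
print(p, ip, f_measure(p, ip))
```

Purity penalizes clusters that mix several identities, while inverse purity penalizes splitting one identity across several clusters, so lumping everything together scores perfectly on inverse purity only.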
Conclusion
The authors present interesting learning (SEG) and clustering (KARC) methods to solve the problem of Cross Document Coreference (CDC).