Huang et al, ACL 2009: Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering

== Citation ==
  
Jian Huang, Sarah M. Taylor, Jonathan L. Smith, Konstantinos A. Fotiadis and C. Lee Giles. 2009. Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 414–422.
  
 
== Online version ==

An online version of this paper is available [1].
 
== Summary ==

This [[Category::paper]] addresses the problem of [[AddressesProblem::Cross Document Coreference (CDC)]] by using an Information Extraction tool to build profiles of entities, measuring the distance between profiles with a learned distance function, and finally clustering the profiles with kernelized fuzzy relational clustering.
  
== Constructing entity profiles using IE and WDC ==
  
An information extraction tool first extracts named entities (NEs) and their relationships. For the NEs of interest, a [[AddressesProblem::Within Document Coreference (WDC)]] module then links the mentions deemed to refer to the same underlying identity into a WDC chain. The authors use the AeroText information extraction tool for this purpose. AeroText extracts two types of information for an entity: attribute information about the person named entity (first/middle/last name, gender, mentions, etc.) and relationship information between named entities (such as Family, List, Employment, Ownership, and Citizen-Resident-Religion-Ethnicity), as specified in the ACE evaluation. AeroText resolves references to entities within a document and produces the entity profiles used as input to the CDC system. Each entity is thus represented as a profile containing the NE, its attributes, and its associated relationships.
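
A profile of this kind can be pictured as a small record; the following sketch is only an illustration (the field names and example values are hypothetical, not AeroText's actual output format):

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Entity profile from IE + within-document coreference (hypothetical field names)."""
    name: str                                           # canonical NE mention
    attributes: dict = field(default_factory=dict)      # e.g. {"gender": "male", "first": "John"}
    relationships: list = field(default_factory=list)   # e.g. [("Employment", "Acme Corp")]

p = Profile("John Smith",
            attributes={"gender": "male", "first": "John", "last": "Smith"},
            relationships=[("Employment", "Acme Corp"), ("Family", "Jane Smith")])
```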
== Kernelized Fuzzy Relational Clustering ==
  
For clustering the entities, the authors use the Kernelized Fuzzy Relational Clustering algorithm (KARC). KARC is based on the Any Relation Clustering Algorithm (ARCA), which represents relational data as object data via mutual relation strengths and clusters it with Fuzzy C-Means: each chained entity is represented as a vector of its relation strengths with all the entities, and fuzzy clusters are then obtained by grouping closely related patterns with an object clustering algorithm.
  
KARC works as follows. The chained entities E are first objectified into a relation strength matrix R using Specialist Exponentiated Gradient (SEG). A Gram matrix K is then computed from the relation strength vectors using the kernel function. For a given number of clusters C, initialization randomly picks C patterns as cluster centers. The kernel distance matrix D is initialized, and KARC then alternately updates the membership matrix U and the kernel distance matrix D until convergence or until a maximum number of iterations is reached. Finally, a soft partition is generated from the membership matrix U; this soft partition is the desired cross-document coreference result.
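
The alternating update at the heart of this procedure can be sketched as kernelized fuzzy c-means over the relation-strength vectors. This is a minimal illustration, not the paper's exact algorithm: the Gaussian kernel, fuzzifier m, and convergence threshold are assumptions.

```python
import numpy as np

def rbf_kernel(R, gamma=1.0):
    """Gram matrix over relation-strength row vectors (Gaussian kernel is an assumption)."""
    sq = np.sum(R ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * R @ R.T
    return np.exp(-gamma * d2)

def karc_sketch(K, C, m=2.0, n_iter=50, seed=0, eps=1e-9):
    """Alternately update memberships U and kernel distances (kernelized fuzzy c-means)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((C, n))
    U /= U.sum(axis=0, keepdims=True)          # columns sum to 1: fuzzy memberships
    diag_k = np.diag(K)
    for _ in range(n_iter):
        W = U ** m
        W /= W.sum(axis=1, keepdims=True)      # normalized fuzzy weights per cluster
        # squared feature-space distance to each implicit center v_j = sum_k W[j,k] phi(x_k)
        D2 = diag_k[None, :] - 2.0 * (W @ K) + np.sum((W @ K) * W, axis=1, keepdims=True)
        D2 = np.maximum(D2, eps)
        U_new = 1.0 / D2 ** (1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.max(np.abs(U_new - U)) < 1e-6:   # converged
            return U_new
        U = U_new
    return U                                   # soft partition: U[j, i] = membership of entity i in cluster j
```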
  
The number of true underlying identities varies with the entities' level of ambiguity (e.g. name frequency). To select the number of clusters, the authors adopt the Xie-Beni Index (XBI) (Xie and Beni, 1991), as in ARCA; XBI is one of the most popular cluster validity measures for fuzzy clustering algorithms.
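
For reference, the standard Xie-Beni index (given here from its usual definition, not reproduced from the paper) compares within-cluster compactness to between-cluster separation over memberships <math>u_{ji}</math>, patterns <math>x_i</math> and centers <math>v_j</math>; a smaller value indicates a better partition:

<math>XBI = \frac{\sum_{j=1}^{C}\sum_{i=1}^{n} u_{ji}^{2} \, \lVert x_i - v_j \rVert^2}{n \cdot \min_{j \neq k} \lVert v_j - v_k \rVert^2}</math>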
  
== Learning Distance Functions from a suite of similarity measures ==
  
A suite of similarity functions is designed to determine whether the attributes and relationships in a pair of entity profiles match: SoftTFIDF, JC semantic similarity, rule-based metrics, etc. The authors treat each similarity function as a specialist that computes the similarity of one particular type of relationship, and they use a specialist ensemble learning framework (SEG) to combine the component similarities into the relation strength used for clustering. A specialist is awakened for prediction only when the same type of relationship is present in both chained entities, and it may abstain when it is not confident enough about an instance. Specialists also carry different weights (in addition to their predictions) in the final relation strength.
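
The specialist-ensemble idea can be sketched as below: awake specialists are averaged by weight, and weights are then multiplicatively adjusted toward the label. This is a schematic illustration of specialist-style combination, not the paper's exact SEG update; the function names, learning rate, and abstention fallback are assumptions.

```python
import math

def combine_specialists(preds, weights):
    """Weighted average over awake specialists; preds[i] is None if specialist i abstains."""
    awake = [i for i, p in enumerate(preds) if p is not None]
    if not awake:
        return 0.5  # fallback prior when every specialist abstains (assumption)
    total = sum(weights[i] for i in awake)
    return sum(weights[i] * preds[i] for i in awake) / total

def seg_update(preds, weights, y, eta=0.5):
    """Multiplicatively reweight awake specialists toward the true label y (schematic)."""
    y_hat = combine_specialists(preds, weights)
    awake = [i for i, p in enumerate(preds) if p is not None]
    awake_mass = sum(weights[i] for i in awake)
    new = list(weights)                       # sleeping specialists keep their weight
    for i in awake:
        new[i] = weights[i] * math.exp(-2.0 * eta * preds[i] * (y_hat - y))
    z = sum(new[i] for i in awake)
    for i in awake:                           # renormalize: awake specialists keep their total mass
        new[i] = new[i] * awake_mass / z
    return new, y_hat
```

For example, with predictions `[0.9, None, 0.2]` and weights `[0.4, 0.3, 0.3]`, only the first and third specialists vote, and a true label of 1 shifts weight toward the first.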
  
== Experiments and Evaluation ==
The authors evaluate on the SemEval-2007 Web People Search task ([[UsesDataset::WePS]]), using the standard purity and inverse purity clustering metrics from the WePS evaluation. The test collection consists of three sets of 10 names each, sampled respectively from ambiguous names in English Wikipedia (famous people), participants of the ACL 2006 conference (computer scientists), and common names from US Census data. For each name, the top 100 documents retrieved with the Yahoo! Search API are used.
  
The authors report a macro-averaged purity of 0.657, an inverse purity of 0.795, and an F score of 0.740, which compares favorably with the first-tier systems in the official WePS 2007 evaluation.
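
The purity and inverse purity metrics can be sketched as follows, assuming the standard WePS-style definitions (harmonic mean with alpha = 0.5 for the F score); clusterings are represented as lists of sets of document ids:

```python
def purity(clusters, gold):
    """Each cluster votes for its best-matching gold class; average over all items."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in gold) for c in clusters) / n

def f_score(clusters, gold, alpha=0.5):
    """Harmonic mean of purity and inverse purity (inverse purity swaps the roles)."""
    p = purity(clusters, gold)
    ip = purity(gold, clusters)
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)
```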
 
 
 
== Conclusion ==
The authors present interesting learning (SEG) and clustering (KARC) methods to solve the problem of [[AddressesProblem::Cross Document Coreference (CDC)]].
 
 
== Relevant Papers ==
