J. Artiles et al. EMNLP 2009

From Cohen Courses

Latest revision as of 05:47, 23 November 2010

Citation

Javier Artiles, Enrique Amigó & Julio Gonzalo, The role of named entities in web people search, in EMNLP 2009

Online version

The role of named entities in web people search

Summary

This paper tries to determine the role of a number of features in solving the Web People Search clustering problem, focusing on the role of Named Entities (NEs) in this task. In order to compare different features, the authors reformulated the clustering problem as a classification problem in which each pair of documents is classified as coreferent if the two documents share the same cluster, and as not coreferent otherwise.
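The pairwise reformulation described above can be illustrated with a minimal sketch. It assumes the gold clustering is available as a mapping from document id to person cluster; the function name `pairwise_labels` and the ids in the example are hypothetical and do not come from the paper.

```python
from itertools import combinations


def pairwise_labels(cluster_of):
    """Label every unordered document pair as coreferent (True) or not (False).

    cluster_of: dict mapping a document id to its gold cluster id
    (an assumed input format, chosen only for illustration).
    """
    return {
        (a, b): cluster_of[a] == cluster_of[b]
        for a, b in combinations(sorted(cluster_of), 2)
    }


# Hypothetical gold clustering for three pages returned for one person name.
gold = {"doc1": "person_A", "doc2": "person_A", "doc3": "person_B"}
labels = pairwise_labels(gold)
```

Here `doc1` and `doc2` form the only coreferent pair; the two pairs involving `doc3` are labeled not coreferent.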

The major contribution of this paper is the Maximal Pairwise Accuracy (MPA) measure, an upper bound on the score achievable with a combination of features, regardless of the underlying machine learning algorithm and parameter settings.

For experiments, they used two standard datasets for Web People Search systems: WePS-1 and WePS-2. NEs are extracted using the Stanford and OAK NER systems. They concluded that:

  1. NEs do not improve the clustering when compared with a combination of simpler features such as local, global and snippet tokens, n-grams, etc.
  2. Results are sensitive to the NER system used.

The counter-intuitive results tell us linguistic features do not necessarily lead to better results in some NLP tasks.

MPA

Given a feature set <math> X = \{x_{1}, x_{2}, \dots, x_{n}\} </math>, a perfect algorithm would always choose the features that give the correct information and ignore the ones that are misleading. In other words, if at least one feature gives correct information, then the perfect algorithm would produce a correct output. MPA is thus an estimate of the upper bound for any ML algorithm using the feature set <math> X </math>:

<math>
\text{MaxPWA}(X) = \text{Prob}(\exists x \in X, x(a, a^{\prime}) > x(c,d))
</math>

where <math> x(a, a^{\prime}) </math> measures the similarity between two pages referring to the same person and <math> x(c, d) </math> is the similarity between two pages referring to different people.
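As a rough illustration of the definition, the probability can be estimated empirically by counting, over every combination of a coreferent pair and a non-coreferent pair, how often at least one feature scores the coreferent pair strictly higher. This is a minimal sketch under that assumption, not the authors' implementation; the function name `max_pwa`, the input format, and the similarity values are hypothetical.

```python
from itertools import product


def max_pwa(coref_sims, noncoref_sims):
    """Empirical estimate of MaxPWA(X).

    coref_sims:    list of tuples, one per coreferent page pair (a, a'),
                   holding the similarity x(a, a') for each feature x in X.
    noncoref_sims: list of tuples, one per non-coreferent page pair (c, d),
                   holding x(c, d) for the same features in the same order.
    """
    total = 0
    hits = 0
    for pos, neg in product(coref_sims, noncoref_sims):
        total += 1
        # A hit: some feature ranks the coreferent pair above the other pair.
        if any(p > n for p, n in zip(pos, neg)):
            hits += 1
    return hits / total if total else 0.0


# Hypothetical similarities for two features.
coref = [(0.9, 0.2), (0.3, 0.4)]
noncoref = [(0.5, 0.5), (0.1, 0.6)]
score = max_pwa(coref, noncoref)  # 3 of the 4 combinations are hits: 0.75
```

Because the test is existential over features, adding a feature to the set can never lower this estimate, which is consistent with its use as an upper bound.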