Difference between revisions of "E. Minkov et al. HLT/EMNLP 2005"

From Cohen Courses
 

Latest revision as of 15:48, 23 October 2010

Citation

Einat Minkov, Richard C. Wang & William W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, in HLT/EMNLP 2005

Online version

Extracting Personal Names from Emails

Summary

This paper is about extracting personal names from emails. The authors treat named entity recognition (NER) as a tagging problem and use a Conditional Random Fields (CRF) model for the task.
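To make the tagging view concrete, here is a minimal sketch of how name annotations can be turned into one label per token, which is the form a sequence tagger such as a CRF is trained on. The two-label scheme (`NAME` / `O`), the helper `spans_to_tags`, and the example sentence are illustrative assumptions, not the label set or data used in the paper.

```python
def spans_to_tags(tokens, name_spans):
    """Label each token NAME if it falls inside a name span, else O.

    name_spans holds (start, end) token indices, with end exclusive.
    The two-label scheme is an illustrative assumption.
    """
    tags = ["O"] * len(tokens)
    for start, end in name_spans:
        for i in range(start, end):
            tags[i] = "NAME"
    return tags


# Hypothetical example sentence with one two-token name.
tokens = ["Please", "ask", "Jane", "Doe", "about", "it"]
tags = spans_to_tags(tokens, [(2, 4)])
```

A tagger then learns to predict the `tags` sequence from features of `tokens`.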

The method is evaluated on four corpora: two drawn from the Enron email corpus and two from the CSpace email corpus.

The two major contributions of this paper are a set of email-specific features and new recall-enhancing methods. The authors found that repetitions within a single document are more common in newswire text, whereas repetitions across multiple documents are more common in emails. Based on this finding, they introduced new email-specific recall-enhancing methods.

The recall-enhancing techniques are:

  • single document repetition (SDR): mark tokens that are repeated within a single document as names.
  • multiple document repetition (MDR): mark tokens that are repeated across multiple documents as names.
  • inferred dictionaries: build a dictionary from the preliminary names predicted by an extractor learned from the training data, then filter it by predicted frequency (PF) and inverse document frequency (IDF). Words with low PF.IDF scores are either highly ambiguous in the corpus or common words that the extractor inaccurately predicted as names.
  • PF: the ratio of the number of times a word is predicted as part of a name to the total number of occurrences of that word.
  • IDF: measures how rare a word is across documents; a word that occurs in many documents receives a low IDF.
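The inferred-dictionary filter described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: PF follows the definition given in the list, but the IDF formula is a textbook variant that may differ from the one in the paper, and the function names, the threshold value, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def pf_idf_scores(documents, predicted):
    """Score each word that was predicted as part of a name at least once.

    documents: list of token lists.
    predicted: parallel list of boolean lists, True where a preliminary
        extractor marked the token as part of a name.
    """
    occurrences = Counter()  # total occurrences of each word
    as_name = Counter()      # occurrences predicted as part of a name
    doc_freq = Counter()     # number of documents containing the word
    for tokens, flags in zip(documents, predicted):
        for token, flag in zip(tokens, flags):
            word = token.lower()
            occurrences[word] += 1
            if flag:
                as_name[word] += 1
        doc_freq.update({t.lower() for t in tokens})

    n_docs = len(documents)
    scores = {}
    for word, hits in as_name.items():
        pf = hits / occurrences[word]
        # Assumption: a textbook IDF; the paper's exact formula may differ.
        idf = math.log((n_docs + 1) / doc_freq[word])
        scores[word] = pf * idf
    return scores


def infer_dictionary(scores, threshold):
    """Keep words whose PF.IDF score reaches the (hypothetical) threshold."""
    return {word for word, score in scores.items() if score >= threshold}


# Toy data: "will" is common and once mispredicted, "Jane" is a real name.
docs = [["will", "call", "Jane"], ["Jane", "will", "reply"], ["thanks", "will"]]
pred = [[True, False, True], [True, False, False], [False, False]]
scores = pf_idf_scores(docs, pred)
names = infer_dictionary(scores, threshold=0.5)
```

In this toy data, "will" appears in every document and is predicted as a name only once out of three occurrences, so it scores lower than "jane" and is dropped at this threshold.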

Related Papers

Culotta, CEAS 04