User talk:Xxiong

Citation

Einat Minkov, Richard C. Wang & William W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, in HLT/EMNLP 2005

Online version

Extracting Personal Names from Emails

Summary

Task: NER (extracting personal names) from emails

Techniques: NER is treated as a sequence tagging problem, and a conditional random field (CRF) model is used for the task.

Contribution:

  • an email-specific feature set
  • the observation that repeated mentions of a name tend to occur within a single document in newswire text, but across multiple documents (messages) in email
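
To make the summary concrete, here is a minimal Python sketch of what email-specific token features for a CRF name tagger could look like. The particular features (header flag, greeting/sign-off cues, toy name dictionary) and function names are illustrative assumptions, not the paper's actual feature set.

# Hypothetical sketch of per-token features for a CRF name tagger on email text.
# The concrete features below are illustrative assumptions, not the paper's
# actual feature set.

COMMON_FIRST_NAMES = {"william", "richard", "einat"}   # toy dictionary

def token_features(tokens, i, in_header):
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "in_name_dictionary": tok.lower() in COMMON_FIRST_NAMES,
        # email-specific cues
        "in_header": in_header,                        # e.g. a From:/To:/Cc: line
        "follows_greeting": i > 0 and tokens[i - 1].lower() in {"hi", "dear", "hello"},
        "follows_signoff": i > 0 and tokens[i - 1].lower().rstrip(",") in {"thanks", "regards", "best"},
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i + 1 < len(tokens):
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

if __name__ == "__main__":
    body = "Hi William , the draft is attached .".split()
    print(token_features(body, 1, in_header=False))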

Example SEARN Usage

Sequence Labeling

Tagging

  • Task is to produce a label sequence from an input sequence.
  • Search is framed as greedy left-to-right decoding.
  • Loss function: Hamming loss
  • Optimal Policy:

[Figure: Op-tagging.png (optimal policy for tagging)]
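
To make the bullets above concrete, here is a minimal Python sketch of left-to-right greedy tagging with the Hamming-loss optimal policy, which simply emits the gold label at each position. The function names and the toy example are assumptions for illustration, not the authors' implementation.

# Minimal sketch of left-to-right greedy tagging in the SEARN setting.

def optimal_policy(gold_labels, i, partial_output):
    # Under Hamming loss, the optimal policy ignores earlier mistakes and simply
    # emits the gold label for position i.
    return gold_labels[i]

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def greedy_decode(tokens, decide):
    # Left-to-right greedy search: each decision conditions on the input tokens
    # and on the labels already produced.
    output = []
    for i in range(len(tokens)):
        output.append(decide(tokens, i, output))
    return output

if __name__ == "__main__":
    tokens = ["Hi", "William", ",", "thanks"]
    gold = ["O", "PERSON", "O", "O"]
    # Using the optimal policy itself as the decision function gives zero loss.
    pred = greedy_decode(tokens, lambda toks, i, out: optimal_policy(gold, i, out))
    print(pred, hamming_loss(pred, gold))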


NP Chunking

  • Chunking is a joint segmentation and labeling problem.
  • Loss function: F1 measure
  • Optimal Policy:

[Figure: Op-chunking.png (optimal policy for NP chunking)]
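
As a rough illustration of the chunking loss, the sketch below computes chunk-level F1 from BIO tag sequences; SEARN would use 1 - F1 as the loss on a complete output. The BIO encoding and the helper names are assumptions for illustration.

# Sketch of chunk-level F1 from BIO tag sequences.

def extract_chunks(tags):
    # Return the set of (start, end, type) spans encoded by a BIO tag sequence.
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):       # sentinel closes a trailing chunk
        inside = tag.startswith("I-") and tag[2:] == ctype and start is not None
        if not inside and start is not None:           # current chunk ends here
            chunks.add((start, i, ctype))
            start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]
    return chunks

def chunk_f1(pred_tags, gold_tags):
    pred, gold = extract_chunks(pred_tags), extract_chunks(gold_tags)
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

if __name__ == "__main__":
    gold = ["B-NP", "I-NP", "O", "B-NP"]
    pred = ["B-NP", "I-NP", "O", "O"]
    print(chunk_f1(pred, gold))                        # 0.666..., so the loss is 1 - F1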

Parsing

  • Looked at dependency parsing with a shift-reduce framework.
  • Loss function: Hamming loss over dependencies.
  • Decisions: shift/reduce
  • Optimal Policy:

[Figure: Op-parsing.png (optimal policy for dependency parsing)]
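
Below is a minimal shift-reduce sketch in the arc-standard style, together with Hamming loss over dependency heads. The exact transition system, the state representation, and the toy sentence are assumptions for illustration and may differ from the manuscript's setup.

# Minimal arc-standard-style shift-reduce sketch. State = (stack, buffer, arcs),
# where an arc (h, d) means word d has head h.

def step(state, action):
    stack, buffer, arcs = state
    if action == "SHIFT":
        return (stack + [buffer[0]], buffer[1:], arcs)
    if action == "LEFT-ARC":       # second-from-top takes the top as its head
        head, dep = stack[-1], stack[-2]
        return (stack[:-2] + [head], buffer, arcs | {(head, dep)})
    if action == "RIGHT-ARC":      # top takes the second-from-top as its head
        head, dep = stack[-2], stack[-1]
        return (stack[:-1], buffer, arcs | {(head, dep)})
    raise ValueError(action)

def hamming_loss_over_deps(pred_arcs, gold_arcs, n_words):
    # Number of words whose predicted head differs from the gold head.
    pred_head = {d: h for h, d in pred_arcs}
    gold_head = {d: h for h, d in gold_arcs}
    return sum(pred_head.get(w) != gold_head.get(w) for w in range(1, n_words + 1))

if __name__ == "__main__":
    # Words are indexed 1..n; 0 is the root. Toy sentence: "dogs bark"
    state = ([0], [1, 2], set())
    for action in ["SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]:
        state = step(state, action)
    pred = state[2]                                    # {(2, 1), (0, 2)}
    gold = {(2, 1), (0, 2)}
    print(pred, hamming_loss_over_deps(pred, gold, n_words=2))   # loss 0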

Machine Translation

  • Framed task as a left-to-right translation problem.
  • Search space over prefixes of translations.
  • Actions add a word (or phrase) to the end of the existing translation.
  • Loss function: 1 - BLEU or 1 - NIST
  • Optimal policy: given a set of reference translations R and an English translation prefix e_1, ..., e_{i-1}, decide which word (or phrase) should be produced next, or whether the translation is finished.
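
The sketch below illustrates this left-to-right setup with a crude heuristic policy that extends the prefix by copying the next word of the best-matching reference. It is only an illustrative approximation (the names and data are assumptions), since the truly optimal policy with respect to BLEU/NIST is not available in closed form; computing 1 - BLEU on the finished output is omitted for brevity.

# Toy sketch of the left-to-right MT setup: the state is an English prefix, and an
# action appends a word or declares the translation finished. The heuristic policy
# below is an illustrative approximation, not the optimal policy for BLEU.

EOS = "</s>"

def next_word_policy(prefix, references):
    # Pick the reference that best matches the prefix so far, then copy its next
    # word; declare the translation finished when that reference is exhausted.
    def overlap(ref):
        return sum(p == r for p, r in zip(prefix, ref))
    best = max(references, key=overlap)
    return best[len(prefix)] if len(prefix) < len(best) else EOS

def greedy_translate(references, max_len=20):
    prefix = []
    for _ in range(max_len):
        word = next_word_policy(prefix, references)
        if word == EOS:
            break
        prefix.append(word)
    return prefix

if __name__ == "__main__":
    refs = [["the", "cat", "sat"], ["a", "cat", "sat", "down"]]
    print(greedy_translate(refs))   # follows one of the references word by word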

Related papers

  • Search-based Structured Prediction: This is the journal version of the paper that introduces the SEARN algorithm - Daume_et_al,_ML_2009.