Lacoste-Julien et al, NAACL 2006

 
== Citation ==
  
Simon Lacoste-Julien, Ben Taskar, Dan Klein, and Michael I. Jordan. Word alignment via quadratic assignment. In Proceedings of HLT-NAACL, New York, NY, 2006. [http://acl.ldc.upenn.edu/N/N06/N06-1015.pdf On-line]
  
 
== Summary ==
  
In this [[Category::paper]] the authors address two major limitations of the discriminative [[AddressesProblem::Word Alignments|word alignment]] model of Taskar et al. (2005): the lack of fertility modeling and the lack of first-order interactions between alignment decisions.
  
The authors show how to extend a discriminative approach to word alignment to allow fertility modeling and to capture first-order interactions between alignments of consecutive words, enhancing the expressive power of the discriminative approach. This allows the model to capture monotonicity, local inversion, and contiguous fertility trends, phenomena that are highly informative for alignment, while remaining computationally efficient.
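A schematic view of the first-order extension (our notation here, not the paper's exact formulation): on top of the basic matching scores, interaction scores are added for pairs of alignment edges that involve consecutive source words,

<math>
\max_{z} \; \sum_{jk \in \varepsilon} s_{jk} z_{jk} \; + \sum_{jk,\,(j+1)k' \in \varepsilon} s_{jk,(j+1)k'} \, z_{jk} \, z_{(j+1)k'}.
</math>

The quadratic terms can reward, for instance, monotone (<math>k' = k+1</math>) or locally inverted (<math>k' = k-1</math>) alignments of consecutive words; optimizing such an objective over matchings is a quadratic assignment problem, hence the paper's title. Fertility is handled by relaxing the one-to-one matching constraints so that a word may participate in more than one alignment edge, at a feature-dependent cost.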
  
Their best model achieves a relative AER reduction of 25% over the basic matching formulation, beating intersected IBM Model 4 without the use of any compute-intensive features. The authors claim their contribution is particularly promising for phrase-based systems, since their models achieve higher recall on the task.
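For reference, AER is the standard alignment error rate of Och and Ney, which compares a predicted alignment <math>A</math> against sure links <math>S</math> and possible links <math>P</math> (with <math>S \subseteq P</math>):

<math>
AER(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}.
</math>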
  
  
 
== Brief Description of the Method ==
  
The authors base their work on a previous discriminative approach to word alignment by Taskar et al. (2005). In this model, nodes <math>V^s</math> and <math>V^t</math> correspond to words in the source and target language, respectively, and edges <math>\varepsilon = \{ jk : j \in V^s, k \in V^t\}</math> correspond to alignments between words. The edge weights <math>s_{jk}</math> represent the degree to which word <math>j</math> in one sentence can be translated using word <math>k</math> in the other sentence. The predicted alignment is then chosen by maximizing the sum of the edge scores, which can be formulated as the following linear program:
 
 
<math>
\max_{0 \le z \le 1} \sum_{jk \in \varepsilon} s_{jk} z_{jk}
</math>

<math>
\mbox{s.t.} \quad \sum_{j \in V^s} z_{jk} \le 1, \; \forall k \in V^t;
</math>
  
<math>
\sum_{k \in V^t} z_{jk} \le 1, \; \forall j \in V^s,
</math>
  
where <math>z_{jk}</math> are relaxations of the binary variables that indicate whether source word <math>j</math> is aligned to target word <math>k</math>.
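As a concrete illustration of this baseline (a minimal sketch, not the authors' implementation), the relaxed matching LP can be solved directly with an off-the-shelf solver; the score matrix below is a hypothetical stand-in for the learned edge scores <math>s_{jk}</math>.

<pre>
# A minimal sketch of the relaxed bipartite matching LP of Taskar et al. (2005),
# not the authors' code.  The scores are hypothetical stand-ins for learned s_jk.
import numpy as np
from scipy.optimize import linprog

def align_by_matching_lp(s):
    """Maximize sum_jk s_jk * z_jk  s.t. row/column sums <= 1 and 0 <= z <= 1."""
    n_src, n_tgt = s.shape
    n_vars = n_src * n_tgt  # one variable z_jk per candidate alignment edge

    # Each source word j aligns to at most one target word.
    row_con = np.zeros((n_src, n_vars))
    for j in range(n_src):
        row_con[j, j * n_tgt:(j + 1) * n_tgt] = 1.0

    # Each target word k is aligned by at most one source word.
    col_con = np.zeros((n_tgt, n_vars))
    for k in range(n_tgt):
        col_con[k, k::n_tgt] = 1.0

    # linprog minimizes, so negate the scores to maximize the total edge score.
    res = linprog(-s.ravel(),
                  A_ub=np.vstack([row_con, col_con]),
                  b_ub=np.ones(n_src + n_tgt),
                  bounds=[(0.0, 1.0)] * n_vars,
                  method="highs")
    return res.x.reshape(n_src, n_tgt)

# Toy example: 2 source words, 3 target words.  The solution is integral because
# the bipartite matching polytope has integer vertices.
scores = np.array([[2.0, -1.0, 0.5],
                   [0.1,  1.5, -0.3]])
print(np.round(align_by_matching_lp(scores), 2))
</pre>

Note that with this formulation a word is simply left unaligned when all of its candidate edge scores are negative.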
  
 
 
 
 
  
 
== Experimental Results ==

