Elworthy, 1994 Does Baum-Welch re-estimation help taggers

Revision as of 20:12, 2 November 2011

Citation

David Elworthy. Does Baum-Welch Re-estimation Help Taggers? In Proceedings of the Fourth Conference on Applied Natural Language Processing, pp. 53-58, 1994.

Online version

ACL anthology

Summary

This paper originated in the mid-nineties, when there was increasing interest in hidden Markov models for POS tagging. It built on the work done at Xerox on supervised learning, and it analyses the effect of Baum-Welch re-estimation on the quality of the resulting tag annotations.

The paper describes two experiments:

  1. The effect of initial conditions on Baum-Welch re-estimation
  2. The behavior of Baum-Welch re-estimation over successive iterations on a given dataset

Another important feature of the paper is its method of evaluation. Accuracy is measured as the proportion of ambiguous words that receive the correct tag; a word is ambiguous if the lexicon hypothesizes more than one tag for it.
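This evaluation metric can be sketched in a few lines of Python (the lexicon, words, and tags below are toy examples, not the paper's data):

```python
# Sketch of the paper's evaluation metric: accuracy over ambiguous words only.
def ambiguous_word_accuracy(words, gold_tags, predicted_tags, lexicon):
    """A word is ambiguous if the lexicon hypothesizes more than one tag for it;
    unambiguous words are excluded from the score."""
    correct = total = 0
    for word, gold, pred in zip(words, gold_tags, predicted_tags):
        if len(lexicon.get(word, ())) > 1:  # ambiguous word
            total += 1
            correct += (gold == pred)
    return correct / total if total else 0.0

lexicon = {"the": {"DT"}, "dog": {"NN"}, "can": {"MD", "VB"}, "run": {"VB", "NN"}}
words = ["the", "dog", "can", "run"]
gold = ["DT", "NN", "MD", "VB"]
pred = ["DT", "NN", "MD", "NN"]
# "can" and "run" are ambiguous; only "can" is tagged correctly.
print(ambiguous_word_accuracy(words, gold, pred, lexicon))  # prints 0.5
```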


Method

Effect of initial conditions

Dataset

For the first experiment, the author constructs four corpora from the LOB corpus

  1. LOB-B from part B
  2. LOB-L from part L
  3. LOB-B-G from parts B to G inclusive
  4. LOB-B-J from parts B to J inclusive

The last corpus (LOB-B-J) is used to train the model; the other three are used as untagged data.

Design

The experiment was designed to observe the effect of the quality of the training data on unsupervised learning. The hand-tagged corpus was stripped of its annotations step-wise to simulate progressively poorer training data:

Lexicon

D0 Undegraded lexical probabilities, calculated from the tagged training corpus.

D1 Lexical probabilities are correctly ordered, so that the most frequent tag has highest lexical probability, but the absolute values are unreliable.

D2 Lexical probabilities are proportional to the overall tag frequencies, independent of the words the tags actually occur with.

D3 Uniform distribution over the lexical probabilities, i.e. no prior assumptions about the data.


Transitions

T0 Undegraded transition probabilities, calculated from the tagged training corpus.

T1 Uniform distribution over the transition probabilities.
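As a concrete illustration, the four lexicon degradation levels might be constructed roughly as follows. This is a hypothetical sketch, not the paper's code: the `counts` structure, tag list, and function name are all assumptions.

```python
import numpy as np

def degrade_lexicon(counts, tags, level):
    """counts: {word: {tag: count}} from a tagged corpus; tags: fixed tag list.
    Returns {word: array of P(tag|word)} under degradation level D0..D3."""
    # Overall frequency of each tag across the whole corpus (used by D2).
    total = np.array([sum(c.get(t, 0) for c in counts.values()) for t in tags],
                     dtype=float)
    probs = {}
    for w, c in counts.items():
        cv = np.array([c.get(t, 0) for t in tags], dtype=float)
        mask = cv > 0                       # tags hypothesized for this word
        if level == "D0":                   # undegraded relative frequencies
            p = cv
        elif level == "D1":                 # ordering preserved, values unreliable
            p = np.where(mask, cv.argsort().argsort() + 1.0, 0.0)
        elif level == "D2":                 # overall tag frequency, not word-specific
            p = np.where(mask, total, 0.0)
        else:                               # D3: uniform over hypothesized tags
            p = mask.astype(float)
        probs[w] = p / p.sum()
    return probs

counts = {"run": {"VB": 3, "NN": 1}, "the": {"DT": 10}}
tags = ["DT", "NN", "VB"]
print(degrade_lexicon(counts, tags, "D3")["run"])  # uniform over NN and VB
```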

Results

The accuracies are ordered LOB-L <= LOB-B <= LOB-B-G, except for D3+T1, where LOB-L >= LOB-B, LOB-B-G. In fact, D3+T1 does better on LOB-L than D0+T1 and D1+T1 do, which suggests that the test data might not overlap with the training data; this reduces the effect of better training annotations.

D2+T1 and D3+T1 give poor performance on various other corpora as well (including the Penn Treebank). This suggests that to use Baum-Welch one must have good prior data about either the lexicon or the transitions.


Patterns of re-estimation

The second experiment used the Penn Treebank to observe whether accuracy showed any patterns as a function of the number of training iterations. Three patterns emerged:

  1. Classical, where the accuracy increases with every iteration
  2. Initial maximum, where the accuracy is highest at the first iteration and subsequently degrades
  3. Early maximum, where the maximum accuracy is reached within a few iterations (2-4) and accuracy subsequently degrades
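The three patterns amount to asking where the peak of the per-iteration accuracy curve sits. A hypothetical helper to label a curve (the function name and decision rule are illustrative assumptions, not the paper's procedure):

```python
# Label a per-iteration accuracy curve with one of the paper's three patterns.
def classify_pattern(acc):
    """acc: ambiguous-word accuracy after each Baum-Welch iteration."""
    best = max(range(len(acc)), key=lambda i: acc[i])  # first peak index
    if best == len(acc) - 1:
        return "classical"        # still improving at the last iteration
    if best == 0:
        return "initial maximum"  # best before any re-estimation
    return "early maximum"        # peaks after a few iterations, then degrades

print(classify_pattern([0.70, 0.72, 0.74, 0.75]))  # prints classical
print(classify_pattern([0.80, 0.78, 0.75]))        # prints initial maximum
print(classify_pattern([0.70, 0.76, 0.74, 0.71]))  # prints early maximum
```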

The experiment was performed with four degradation settings: D0+T0, D2+T0, D0+T1 and D2+T1. The trained models were tested on three types of test corpora: the same as the training data, similar to it, and different from it.

When the model is trained on undegraded data, the pattern of re-estimation is either initial maximum or early maximum. For D2+*, the pattern is more classical.

Conclusion

  1. Baum-Welch does not necessarily increase accuracy as the number of iterations grows.
  2. If a hand-tagged corpus is available, use it.
  3. If a hand-tagged corpus or the lexicon frequencies are available, use Baum-Welch for only a few iterations.
  4. If no prior data is available, use Baum-Welch in the prescribed manner, iterating until it converges (over some metric like perplexity).
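Conclusion 4's stopping rule (iterate until a likelihood-based metric converges) can be sketched as a standard Baum-Welch loop for a discrete HMM. This is a minimal, self-contained illustration of the technique, not the paper's implementation:

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """Scaled forward-backward for a discrete HMM.
    A: (S,S) transitions, B: (S,V) emissions, pi: (S,) initial distribution."""
    T, S = len(obs), len(pi)
    alpha, beta, scale = np.zeros((T, S)), np.zeros((T, S)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale

def baum_welch(obs, A, B, pi, max_iter=50, tol=1e-4):
    """Re-estimate (A, B, pi), stopping when the per-symbol log-likelihood
    (a convergence metric equivalent to perplexity) stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        alpha, beta, scale = forward_backward(obs, A, B, pi)
        ll = np.log(scale).sum() / len(obs)   # per-symbol log-likelihood
        if ll - prev_ll < tol:                # converged: stop iterating
            break
        prev_ll = ll
        gamma = alpha * beta                  # state posteriors, per time step
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros_like(A)                 # expected transition counts
        for t in range(len(obs) - 1):
            x = np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
            xi += x / x.sum()
        pi = gamma[0]
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.zeros_like(B)                  # expected emission counts
        for t, o in enumerate(obs):
            B[:, o] += gamma[t]
        B /= B.sum(axis=1, keepdims=True)
    return A, B, pi, ll

obs = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])
A0 = np.array([[0.6, 0.4], [0.3, 0.7]])
B0 = np.array([[0.7, 0.3], [0.2, 0.8]])
pi0 = np.array([0.5, 0.5])
A, B, pi, ll = baum_welch(obs, A0.copy(), B0.copy(), pi0.copy())
```

Since EM is guaranteed to increase likelihood but (as the paper shows) not tagging accuracy, the `max_iter` cap doubles as the "few iterations" limit of conclusion 3.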


Under construction by User:dkulkarn