Comparative Study of Discriminative Models in SMT

From Cohen Courses

Summary

This page compares and contrasts two discriminative methods for machine translation, proposed in An_End-to-End_Discriminative_Approach_to_Machine_Translation and A_Discriminative_Latent_Variable_Model_for_SMT. The main difference between these methods is the approach taken to build the translation model. In the former, the feature values are estimated from parallel data to maximize the likelihood of the data, and the weight vector is trained in a separate phase using a weighted perceptron method. The latter work, on the other hand, employs a log-linear model in which the feature set and the weights are trained jointly to maximize the likelihood of the data.

We will refer to the work in An_End-to-End_Discriminative_Approach_to_Machine_Translation as "A" and the work in A_Discriminative_Latent_Variable_Model_for_SMT as "B".
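To make the contrast in the Summary concrete, the two models can be sketched in formulas. This is a reconstruction in assumed notation, not the papers' exact equations: the feature vector \Phi, the feature functions H_k, the parameters \lambda_k, the prior variance \sigma^2 and the target pair (t^{*}, h^{*}) are symbols introduced here for illustration. The update shown for A is the plain perceptron update, standing in for the weighted perceptron mentioned above, and the translation probability in B is written in conditional form as an assumption.

```latex
% Work A: linear model over features, decoded jointly over translation t and derivation h
(\hat{t}, \hat{h}) = \arg\max_{t,\,h} \; w \cdot \Phi(s, t, h)

% Work A: perceptron-style update toward a target pair (t^*, h^*)
w \leftarrow w + \Phi(s, t^{*}, h^{*}) - \Phi(s, \hat{t}, \hat{h})

% Work B: log-linear model over derivations, marginalized over h
p_{\lambda}(t, h \mid s) = \frac{\exp\big(\sum_k \lambda_k H_k(s, t, h)\big)}{Z_{\lambda}(s)},
\qquad
p_{\lambda}(t \mid s) = \sum_{h} p_{\lambda}(t, h \mid s)

% Work B: MAP objective with a zero-mean Gaussian prior on the parameters
L(\lambda) = \sum_{(s,t)} \log p_{\lambda}(t \mid s) - \sum_k \frac{\lambda_k^2}{2\sigma^2}
```

Under these assumptions, A separates feature estimation from weight training, while B optimizes all parameters of a single regularized objective.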

Model Differences

  • The model detailed in A scores each candidate with a linear combination of features, i.e. the dot product of a weight vector w with a feature vector. The features are extracted from parallel data, and w is learned with a weighted perceptron approach so that the model fits the training data. The model in B, on the other hand, is a log-linear (maximum entropy) model. There we want the MAP estimate of the parameters, i.e. the parameter values that maximize the regularized objective L: the likelihood built from the translation probability p(s,t), penalized by a Gaussian prior on the parameters. The maximization is performed using L-BFGS.

  • B states, and supports with experiments, that the maximum likelihood translation probability feature (one of the Blanket features) in A tends to overfit the training data, and that the regularized maximum a posteriori model (using the Gaussian prior) proposed in B is less prone to this overfitting. The work in A also addresses this problem, indirectly, by using additional features and training the weight vector on held-out data. Nonetheless, overfitting seems to be better handled in the max entropy model in B, since it is addressed directly in the maximization step. A similar approach that could have been applied in A is to smooth the translation probability (using Kneser-Ney smoothing, for instance).
  • Both papers have to address the problem of derivational ambiguity, where multiple hypotheses (derivations) h can be considered for each (s,t) pair. The question both works ask is whether it is better to use only the derivation h with the highest score, or to optimize using all derivations of the reference translation. Experimental results in both indicate that the latter yields better results. This makes sense since, in general, higher-scoring derivations tend to use larger phrase pairs and thus do not generalize well to unseen sentence pairs. Work A also shows that using only derivations that produce the reference translation is not optimal, and that choosing derivations based on locality and translation quality produces significant improvements.
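The difference between using only the best derivation and summing over all derivations, together with the Gaussian-prior penalty, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scores, the two candidate translations, the weights and the function names are made up, and the code is not taken from either paper.

```python
import math

# Toy unnormalized log-scores (weights dotted with features) of each
# derivation h, grouped by the translation t it yields. Made-up numbers.
derivation_scores = {
    "the house": [1.0, 0.8, 0.7],  # three derivations of the reference
    "a house": [1.2],              # one high-scoring competing derivation
}

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(scores):
    """Log of the normalizer over every derivation of every translation."""
    return logsumexp([v for vs in scores.values() for v in vs])

def log_prob_all_derivations(scores, translation):
    """log p(t | s), marginalizing over all derivations of the translation."""
    return logsumexp(scores[translation]) - log_partition(scores)

def log_prob_best_derivation(scores, translation):
    """Log-probability of only the single best derivation of the translation."""
    return max(scores[translation]) - log_partition(scores)

def map_objective(log_likelihood, weights, sigma=1.0):
    """Log-likelihood plus a zero-mean Gaussian log-prior (up to a constant)."""
    penalty = sum(w * w for w in weights) / (2.0 * sigma ** 2)
    return log_likelihood - penalty

ll_all = log_prob_all_derivations(derivation_scores, "the house")
ll_best = log_prob_best_derivation(derivation_scores, "the house")
print("all derivations:", ll_all)
print("best derivation:", ll_best)
print("penalized objective:", map_objective(ll_all, weights=[0.5, -1.0]))
```

In this toy setting the single best derivation belongs to the competing translation, while the reference wins once the probability mass of all its derivations is summed, which is the effect the bullet above describes.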

Minor Differences

  • The baseline models used in these two papers are different statistical machine translation models. The work in A uses the phrase-based models proposed in Koehn_et_al,_ACL_2003, while the work in B uses Hierarchical_phrase-based_translation. In terms of translation quality, hierarchical models tend to work better for language pairs with strong reordering, such as Chinese to English.
  • In terms of datasets, both use the EUROPARL corpora, but choose different language pairs and different training, held-out and test sets. This makes a quantitative comparison of the model results unreliable. Furthermore, the training sets used in both methods are relatively small: 67K sentence pairs in A and 170K sentence pairs in B. Thus, the results presented in these papers might not hold for large-scale corpora with billions of sentence pairs.

Additional Questions

1. How much time did you spend reading the (new, non-wikified) paper you summarized? 30 mins

2. How much time did you spend reading the old wikified paper? 10 mins

3. How much time did you spend reading the summary of the old paper? 5 mins

4. How much time did you spend reading background material? 0 mins, I am familiar with these papers. I had never read them before, but already had an idea of what they were doing.

5. Was there a study plan for the old paper? No, I think the old paper was from Structured Prediction.

6. Give us any additional feedback you might have about this assignment. I think it would be interesting to read papers using discriminative approaches for reordering rather than just translation, e.g. Discriminative Word Alignment with a Function Word Reordering Model.