Cohen Courses - User contributions [en] (MediaWiki 1.33.1)

User:Asaluja (2011-11-29)
<hr />
<div>I'm Avneesh, a second-year PhD student in ECE & LTI. I'm taking this class because it's quite pertinent to my research interests. <br />
<br />
[http://www.cs.cmu.edu/~avneesh My External Page]<br />
<br />
You can find more info on my page, including a picture.<br />
<br />
<br />
== Class Project ==<br />
Proposed project: Training SMT Systems with the Latent Structured SVM<br />
<br />
[[Training_SMT_Systems_with_the_Latent_Structured_SVM | Project Page]]<br />
<br />
== Wiki Write-Ups ==<br />
Please grade write-ups under Papers and Methods. <br />
<br />
=== Papers ===<br />
* [[SoftSupervisedTextClassification | A.Subramanya & J.Bilmes, EMNLP 2008: Soft-Supervised Learning for Text Classification]] (Sept 30 deadline)<br />
* [[Structured Output Learning with Indirect Supervision | M.W. Chang, V. Srikumar, D. Goldwasser, D. Roth, ICML 2010: Structured Output Learning with Indirect Supervision]] (Sept 30 deadline)<br />
* [[A Discriminative Latent Variable Model for SMT | P. Blunsom, T. Cohn, M. Osborne, ACL 2008: A Discriminative Latent Variable Model for Statistical Machine Translation]] (Oct 31 deadline)<br />
* [[Hitting the Right Paraphrases in Good Time | S.Kok, C.Brockett, NAACL/HLT 2010: Hitting the Right Paraphrases in Good Time]] (Nov 30 deadline)<br />
* [[Unsupervised Word Alignment with Arbitrary Features | C.Dyer, J.Clark, A.Lavie, N.Smith, ACL 2011: Unsupervised Word Alignment with Arbitrary Features]] (Nov 30 deadline)<br />
<br />
=== Methods ===<br />
* [[Alternating Minimization | Alternating Minimization]] (Sept 30 deadline)<br />
* [[Measure Propagation | Measure Propagation]] (Sept 30 deadline)<br />
* [[L-BFGS | L-BFGS]] (Oct 31 deadline)<br />
<br />
=== Datasets ===<br />
* [[Reuters_21578 | Reuters 21578 Dataset]]</div>

Unsupervised Word Alignment with Arbitrary Features (2011-11-25)
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively trained, globally normalized log-linear model of lexical translation (i.e., word alignment) in a statistical machine translation (SMT) system. Unlike previous discriminative approaches, theirs needs no supervised or gold-standard word alignments, only supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use [[Expectation_Maximization | EM ]] for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR, and TER, finding improvements.<br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e., the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, decomposing the likelihood into a translation model and a length model. A latent variable for the alignment is introduced in the translation model: <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al.'s (1993) version, where <math>p(t, a | s, n)</math> is further decomposed, the authors model it directly with a log-linear model: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector depending on the alignment, the source and target sentences, and the target length, and <math>Z_{\theta} (s,n)</math> is the partition function, which is finite under reasonable assumptions. The length model is irrelevant here, since the model is used for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over the cliques of the graph of random variables, similar to the original [[Conditional_Random_Fields | CRF ]] formulation ([[Lafferty_2001_Conditional_Random_Fields | wiki writeup link]]).<br />
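To make the globally normalized model concrete, here is a minimal sketch (the sentence pair and the toy feature function are hypothetical, not from the paper) that enumerates every alignment of a tiny sentence pair and computes <math>\exp(\theta^T H)/Z</math> by brute force, i.e., the model restricted to a fixed observed target sentence:<br />

```python
import math
from itertools import product

def loglinear_alignment_probs(src, tgt, feats, theta):
    """Enumerate all alignment vectors a (one source position, possibly NULL,
    per target word) and normalize exp(theta . H) by the partition function Z."""
    positions = range(-1, len(src))  # -1 stands for a NULL alignment link
    scores = {}
    for a in product(positions, repeat=len(tgt)):
        h = feats(src, tgt, a)  # feature vector H(t, a, s, n)
        scores[a] = math.exp(sum(w * f for w, f in zip(theta, h)))
    z = sum(scores.values())    # partition function Z(s, n)
    return {a: sc / z for a, sc in scores.items()}

# Toy feature function (made up): [number of NULL links, number of diagonal links]
def toy_feats(src, tgt, a):
    return [sum(1 for j in a if j == -1),
            sum(1 for i, j in enumerate(a) if j == i)]

probs = loglinear_alignment_probs(["le", "chat"], ["the", "cat"],
                                  toy_feats, theta=[-0.5, 1.0])
```

With NULL links penalized and diagonal links rewarded, the fully diagonal alignment gets the highest probability, and the probabilities sum to one by construction.<br />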
<br />
To learn parameters, we add an <math>\ell_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math>. Training is expensive, however, because it involves discriminating against the entire discriminative neighborhood; although this neighborhood can be pruned (see next paragraph), other techniques can also keep it manageable, e.g., [[Contrastive_Estimation|contrastive estimation]] ([[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data |wiki link]]). <br />
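The two expectations in the gradient can be sanity-checked by brute force on a space small enough to enumerate (the features, targets, and neighborhood below are invented for illustration): the first, "clamped" term fixes the observed target and varies only <math>a</math>; the second, "free" term varies both <math>t</math> and <math>a</math>:<br />

```python
import math

def expected_feats(dist, feat_of):
    """E[H] under a distribution given as {outcome: probability}."""
    dim = len(feat_of(next(iter(dist))))
    ex = [0.0] * dim
    for x, p in dist.items():
        for k, f in enumerate(feat_of(x)):
            ex[k] += p * f
    return ex

def normalize(scores):
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

# Tiny toy neighborhood (hypothetical): two target strings, two alignments.
targets, aligns = ["t1", "t2"], [0, 1]
theta = [0.3, -0.2]
def H(x):  # x = (t, a); two arbitrary indicator features
    t, a = x
    return [1.0 if t == "t1" else 0.0, 1.0 if a == 0 else 0.0]

score = lambda x: math.exp(sum(w * f for w, f in zip(theta, H(x))))

# Clamped expectation: a varies, t fixed to the observed "t1".
p_clamped = normalize({("t1", a): score(("t1", a)) for a in aligns})
# Free expectation: both t and a vary over the discriminative neighborhood.
p_free = normalize({(t, a): score((t, a)) for t in targets for a in aligns})

grad = [c - f for c, f in zip(expected_feats(p_clamped, H),
                              expected_feats(p_free, H))]
```

The cost of the free term is exactly why the discriminative neighborhood must be kept small: it grows with every target string the model is asked to discriminate against.<br />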
<br />
Inference is done by modeling the translation search space as a WFSA. The [[Forward-Backward|forward backward]] algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially for the expectation in which the translation output varies. The number of edges in the WFSA therefore needs to be reduced, and the authors try four different ways of doing this, ranging from restricting the set of possible translation outputs to the target words co-occurring in sentence pairs containing the source sentence, to pruning with the empirical Bayes method and variational inference. <br />
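A generic textbook sketch of how forward-backward yields edge posteriors (and hence feature expectations) on a chain-structured automaton; this is not the paper's lattice code, and in the real model each `edge_score` would be <math>\exp(\theta^T h)</math> over local features:<br />

```python
def forward_backward(n_steps, states, edge_score):
    """Edge posteriors in a chain whose path weight is the product of edge
    scores; summing features weighted by these posteriors gives E[H]."""
    # Forward: alpha[i][s] = total score of all prefixes ending in s at step i.
    alpha = [{s: edge_score(0, None, s) for s in states}]
    for i in range(1, n_steps):
        alpha.append({s: sum(alpha[i - 1][r] * edge_score(i, r, s)
                             for r in states) for s in states})
    # Backward: beta[i][s] = total score of all suffixes leaving s at step i.
    beta = [dict() for _ in range(n_steps)]
    beta[n_steps - 1] = {s: 1.0 for s in states}
    for i in range(n_steps - 2, -1, -1):
        beta[i] = {r: sum(edge_score(i + 1, r, s) * beta[i + 1][s]
                          for s in states) for r in states}
    z = sum(alpha[n_steps - 1][s] for s in states)  # partition function
    # Posterior probability of traversing edge r -> s at step i.
    post = {(i, r, s): alpha[i - 1][r] * edge_score(i, r, s) * beta[i][s] / z
            for i in range(1, n_steps) for r in states for s in states}
    return z, post
```

The edge-pruning strategies in the paper all amount to shrinking the `states`-by-`states` transition sets this loop iterates over.<br />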
<br />
A significant component of this paper is the set of features the authors include to elicit good word alignments. Beyond word association features (indicator functions for word pairs), they use orthographic features (of limited use for Chinese-English, one of their language pairs), positional features (how far alignment links fall from the alignment matrix diagonal), source features (capturing source-language words that are purely functional and have no lexical content), source path features (assessing the goodness of the alignment path through the source sentence, e.g., measuring jumps in the sequential alignment), and target string features (e.g., whether a word translates to itself). Lastly, the authors include high-level, coarse features such as the Model 1 probabilities and the Dice coefficient.<br />
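As an example of the coarse features, the Dice coefficient for a word pair can be computed from sentence-level co-occurrence counts; a small sketch (the toy bitext is made up):<br />

```python
from collections import Counter

def dice_scores(bitext):
    """Dice coefficient 2*C(s,t)/(C(s)+C(t)) for every source/target word
    pair, counting sentence-level (co-)occurrences in a parallel corpus."""
    c_src, c_tgt, c_pair = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        s_set, t_set = set(src), set(tgt)
        c_src.update(s_set)
        c_tgt.update(t_set)
        c_pair.update((s, t) for s in s_set for t in t_set)
    return {(s, t): 2.0 * c / (c_src[s] + c_tgt[t])
            for (s, t), c in c_pair.items()}

bitext = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
scores = dice_scores(bitext)
```

Pairs that always co-occur score 1.0; pairs that co-occur only sometimes score lower, which is exactly the kind of graded association signal a coarse feature supplies.<br />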
<br />
=== Baseline & Results ===<br />
<br />
The authors tested their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation), each of which poses distinct issues for SMT. English was treated as both source and target in the experiments. For the baseline, Giza++ was used to learn Model 4, and the alignments were symmetrized with the grow-diag-final-and heuristic. Gold-standard word alignments were available only for Czech-English. The authors also proposed two additional intrinsic measures: the average alignment fertility of source words that occur only once in the training data (Model 4 tends to align many target words to such words), and the number of rule types learned in grammar induction that match the translation test sets, which suggests better coverage. <br />
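The first intrinsic measure is easy to state in code; a hypothetical sketch, assuming alignments are given as sets of (source index, target index) links:<br />

```python
from collections import Counter

def avg_singleton_fertility(bitext, alignments):
    """Average number of target words linked to each source word that occurs
    exactly once in the training data (Model 4 tends to over-align these)."""
    counts = Counter(w for src, _ in bitext for w in src)
    singletons = {w for w, c in counts.items() if c == 1}
    fertilities = []
    for (src, _), links in zip(bitext, alignments):
        for i, w in enumerate(src):
            if w in singletons:
                fertilities.append(sum(1 for (si, _) in links if si == i))
    return sum(fertilities) / len(fertilities) if fertilities else 0.0
```

A lower value indicates that rare source words are not being used as garbage collectors for hard-to-align target words.<br />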
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
Across the board, the proposed model improves upon Model 4 in terms of AER (where available), average fertility of singleton source words, and the number of rule types learned that match the test sets. It also improves upon Model 4 on the extrinsic measures, with the most noticeable gains coming when Model 4 and the proposed model are combined for translation. An analysis of the highest feature weights for each language pair shows that they reflect the characteristics of the individual language pairs. <br />
<br />
=== Related Work ===<br />
<br />
Generative approaches to word alignment primarily rely on the IBM models of Brown et al., CL 1993. These models make a host of independence assumptions, which limits how features and additional information can be incorporated. Discriminative word alignment approaches, e.g., Taskar et al., HLT/EMNLP 2005, need gold-standard word alignments for training, which are very difficult to obtain in practice.</div>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use [[Expectation_Maximization | EM ]] for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements.<br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further broken down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph, similar to the original [[Conditional_Random_Fields | CRF ]] formulation ([[Lafferty_2001_Conditional_Random_Fields | wiki writeup link]])<br />
<br />
To learn parameters, we add an <math>\ell_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math>. Training is an expensive process though, and involves discrimination against the discriminative neighborhood, which although can be pruned (see next paragraph), other techniques can be used to keep the discriminative neighborhood manageable, e.g., [[Contrastive_Estimation|constrastive estimation]] ([[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data |wiki link]] here). <br />
<br />
Inference is done by modeling the translation search space as a WFSA. The [[Forward-Backward|forward backward]] algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself). Lastly, the authors also include high-level, coarse features for example the model 1 probabilities and the Dice coefficient. <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
Overall one sees that across the board, the proposed model improves upon model 4 in terms of AER (where available), average fertility of singleton source words, and the number of rule types learned that match the test sets. The proposed model also improves upon model 4 when it comes to extrinsic measures, but the most noticeable results are when model 4 and the proposed model are combined for translation purposes. An analysis is also done in terms of the highest feature weights for the language pairs, and these weights definitely reflect the characteristics of the individual language pairs. <br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10297Unsupervised Word Alignment with Arbitrary Features2011-11-25T20:09:47Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM [[Expectation_Maximization | EM ]]for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further broken down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph, similar to the original [[Conditional_Random_Fields | CRF ]] formulation ([[Lafferty_2001_Conditional_Random_Fields | wiki writeup link]])<br />
<br />
To learn parameters, we add an <math>\ell_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math>. Training is an expensive process though, and involves discrimination against the discriminative neighborhood, which although can be pruned (see next paragraph), other techniques can be used to keep the discriminative neighborhood manageable, e.g., [[Contrastive_Estimation|constrastive estimation]] ([[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data |wiki link]] here). <br />
<br />
Inference is done by modeling the translation search space as a WFSA. The [[Forward-Backward|forward backward]] algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself). Lastly, the authors also include high-level, coarse features for example the model 1 probabilities and the Dice coefficient. <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
Overall one sees that across the board, the proposed model improves upon model 4 in terms of AER (where available), average fertility of singleton source words, and the number of rule types learned that match the test sets. The proposed model also improves upon model 4 when it comes to extrinsic measures, but the most noticeable results are when model 4 and the proposed model are combined for translation purposes. An analysis is also done in terms of the highest feature weights for the language pairs, and these weights definitely reflect the characteristics of the individual language pairs. <br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10296Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:55:12Z<p>Asaluja: /* Baseline & Results */</p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further broken down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph (similar to the original CRF formulation, link). <br />
<br />
To learn parameters, we add an <math>\ell_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient, like in CRFs, boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math><br />
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself).<br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
Across the board, the proposed model improves on Model 4 in AER (where gold alignments are available), in the average fertility of singleton source words, and in the number of learned rule types matching the test sets. It also improves on Model 4 on the extrinsic measures, with the most noticeable gains coming when Model 4 and the proposed model are combined for translation.<br />
<br />
=== Related Work ===<br />
<br />
Generative approaches to word alignment rely primarily on the IBM Models of Brown et al. (CL 1993). These models make a host of independence assumptions, which limit how features and additional information can be incorporated. Discriminative approaches, e.g., Taskar et al. (HLT/EMNLP 2005), require gold-standard word alignments for training, which are very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10295Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:53:06Z<p>Asaluja: /* Approach */</p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively trained, globally normalized log-linear model of lexical translation (i.e., word alignment) for statistical machine translation. Unlike previous discriminative approaches, theirs requires no gold-standard word alignments, only supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR, and TER, finding improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e., the probability of a target sentence given a source sentence. The authors introduce a random variable for the length of the target sentence and write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, decomposing the likelihood into a translation model and a length model. A latent variable for the alignment is introduced in the translation model: <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al.'s (1993) formulation, where <math>p(t, a | s, n)</math> is further decomposed, the authors model it directly with a log-linear model: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector depending on the alignment, the source and target sentences, and the target length, and <math>Z_{\theta} (s,n)</math> is the partition function, which is finite under reasonable assumptions. The length model is irrelevant here, since the model is used for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over the cliques of the graph formed by the random variables (as in the original CRF formulation). <br />
<br />
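As a toy instantiation of this model (the vocabulary, feature map, and weights are invented for illustration, and a real system never enumerates the partition function like this), one can check that the marginal <math>p(t | s, n) = \sum_a p(t, a | s, n)</math> is properly normalized:<br />

```python
import math
import itertools
from collections import Counter

def H(t, a, s):
    """Toy feature vector: word-association counts. a[j] is the source index
    aligned to target word j."""
    f = Counter()
    for j, i in enumerate(a):
        f["assoc:%s|%s" % (s[i], t[j])] += 1.0
    return f

def marginal(theta, s, n, vocab):
    """p(t | s, n) for every length-n target string over vocab, marginalizing
    the latent alignment by brute-force enumeration."""
    def weight(t, a):
        return math.exp(sum(theta.get(k, 0.0) * v for k, v in H(t, a, s).items()))
    cands = list(itertools.product(vocab, repeat=n))
    aligns = list(itertools.product(range(len(s)), repeat=n))
    Z = sum(weight(t, a) for t in cands for a in aligns)   # partition function
    return {t: sum(weight(t, a) for a in aligns) / Z for t in cands}
```

Raising the weight of an association feature shifts probability toward translations using that word pair, while global normalization keeps the distribution summing to one.<br />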
To learn parameters, the authors add an <math>\ell_1</math> regularization term and maximize the conditional log-likelihood of the training data. As with CRFs, the gradient is a difference of two expected feature vectors: in the first, the alignment vector <math>a</math> varies over all possible alignments with the translation fixed; in the second, both the translation output <math>t</math> and <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math><br />
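This identity can be checked numerically on a toy model (the same invented association features as in the sketch above; brute-force enumeration stands in for the forward-backward computations the paper actually uses):<br />

```python
import math
import itertools
from collections import Counter

def H(t, a, s):
    f = Counter()
    for j, i in enumerate(a):                  # a[j]: source index for target word j
        f["assoc:%s|%s" % (s[i], t[j])] += 1.0
    return f

def gradient(theta, s, t, vocab):
    """Per-pair gradient of the conditional log-likelihood:
    E[H | s, t] (alignment varies) minus E[H | s] (translation and alignment vary)."""
    n = len(t)
    def weight(tt, aa):
        return math.exp(sum(theta.get(k, 0.0) * v for k, v in H(tt, aa, s).items()))
    aligns = list(itertools.product(range(len(s)), repeat=n))
    # clamped expectation: translation fixed, alignment varies
    num, clamped = 0.0, Counter()
    for aa in aligns:
        p = weight(t, aa)
        num += p
        for k, v in H(t, aa, s).items():
            clamped[k] += p * v
    # free expectation: both translation and alignment vary
    Z, free = 0.0, Counter()
    for tt in itertools.product(vocab, repeat=n):
        for aa in aligns:
            p = weight(tt, aa)
            Z += p
            for k, v in H(tt, aa, s).items():
                free[k] += p * v
    return {k: clamped[k] / num - free[k] / Z for k in set(clamped) | set(free)}
```

At the uniform initialization the gradient pushes up features of the observed translation and pushes down features of the competing outputs, which is the intuition behind the expectation difference above.<br />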
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself).<br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10294Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:50:53Z<p>Asaluja: /* Citation */</p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further broken down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n))}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph (similar to the original CRF formulation, link). <br />
<br />
To learn parameters, we add an <math>\ell_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient, like in CRFs, boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_{p_{\theta}} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta}} (t, a | s, n)} [H(\cdot)]</math><br />
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself). <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10293Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:48:46Z<p>Asaluja: /* Citation */</p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further brokwn down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n)}{Z_\theta (s,n)}</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph (similar to the original CRF formulation, link). <br />
<br />
To learn parameters, we add an <math>\mathcal{l}_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient, like in CRFs, boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math><br />
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself). <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10292Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:48:08Z<p>Asaluja: /* Citation */</p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL 2011 | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further brokwn down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n)}{Z_\theta (s,n)</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph (similar to the original CRF formulation, link). <br />
<br />
To learn parameters, we add an <math>\mathcal{l}_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient, like in CRFs, boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math><br />
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the type of features the authors included to help elicit good word alignments. On top of using word association features (indicator functions for word pairs), they also use orthographic features (which is obviously limited in the case of Chinese-English, one of the language pairs that they used). Positional features (i.e., looking at how far words are to the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical component), source path features (features that assess the goodness of the alignment path through the source sentence, i.e., measuring jumps, etc. in the sequential alignment of the source sentence), and target string features (e.g., if a word translates to itself). <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested out their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). The three language pairs each offer up distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn model 4, and alignments are symmetrized using grow-diag-final-and heuristic. Only in the CZ-EN case did we have gold standard word alignments. The authors also proposed two additional intrinsic measures: average alignment fertility of source words that occur only once in the training data (as there is a tendency in model 4 for these words to have many target words aligned to them), and the number of rule types learned in grammar induction matching the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al, CL 1993. These models make a host of independence assumptions, which limits the way we can incorporate features and additional information in these models. Discriminative model-based word alignment approaches, e.g., Taskar et al, HLT/EMNLP 2005 need gold-standard word alignments for training, something which is very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:Uren.png&diff=10291File:Uren.png2011-11-25T19:47:41Z<p>Asaluja: </p>
<hr />
<div></div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:Zhen.png&diff=10290File:Zhen.png2011-11-25T19:47:25Z<p>Asaluja: </p>
<hr />
<div></div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:Czen.png&diff=10289File:Czen.png2011-11-25T19:47:09Z<p>Asaluja: </p>
<hr />
<div></div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Unsupervised_Word_Alignment_with_Arbitrary_Features&diff=10288Unsupervised Word Alignment with Arbitrary Features2011-11-25T19:46:01Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL 2011 | coauthors = J. Clark, A. Lavie, N.A. Smith | date = 2011| first = C.| last = Dyer | title = Unsupervised Word Alignment with Arbitrary Features | url =http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf}<br />
<br />
This [[Category::Paper]] is available online [http://www.cs.cmu.edu/~jhclark/pubs/alignment.pdf].<br />
<br />
=== Summary ===<br />
<br />
The authors present a discriminatively-trained, globally normalized log-linear model to model lexical translation in a statistical machine translation system (i.e., word alignment). Unlike previous discriminative approaches, the authors do not need supervised or gold standard word alignments in their approach, just supervision in the form of bilingual parallel sentence corpora. And unlike generative approaches that use EM for unsupervised word alignment, the proposed method can incorporate arbitrary, overlapping features that can help improve word alignment. The authors compare their model to IBM Model 4: they propose two intrinsic metrics, and also evaluate the end-to-end performance of their system with BLEU, METEOR and TER and find improvements. <br />
<br />
=== Approach ===<br />
<br />
The proposed model aims to maximize the conditional likelihood <math>p(t | s)</math>, i.e, the probability of a target sentence given a source sentence. The authors include a random variable for the length of the target sentence, and thus write <math>p(t|s) = p(t, n | s) = p(t | s, n) \times p (n|s)</math>, i.e., decomposing the likelihood into two models, a translation model and a length model. We introduce a latent variable for the alignment in the translation model, i.e., <math>p(t | s, n) = \sum_a p(t, a | s, n)</math>. Unlike Brown et al's (1993) version, where <math>p(t, a | s, n)</math> is further brokwn down, the authors use a log-linear model to model this directly: <math>p_{\theta} (t, a | s, n) = \frac{\exp(\theta^T H(t, a, s, n)}{Z_\theta (s,n)</math>, where <math>H</math> is a feature function vector dependent on the alignment, source and target sentences, as well as length of target sentence, and <math>Z_{\theta} (s,n)</math> is the partition function, which under reasonable assumptions is finite. The length model is irrelevant here, since we are using this model for alignment, where both source and target lengths are observed and completely determined. To make things tractable, the feature function vector is assumed to decompose linearly over all the cliques, if one views the random variables in the form of a graph (similar to the original CRF formulation, link). <br />
<br />
To learn parameters, we add an <math>\mathcal{l}_1</math> regularization term and maximize the conditional log likelihood of the training set data. The gradient, like in CRFs, boils down to the expected difference between feature function values, where in one instance we let the alignment random variable vector <math>a</math> vary over all possible alignments, and in the other case we let the translation output <math>t</math> as well as <math>a</math> vary over all possible values. <br />
<math>\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\langle s, t \rangle \in \mathcal{T}} \mathbb{E}_p_{\theta} (a | s, t, n)} [H(\cdot)] - \mathbb{E}_{p_{\theta} (t, a | s, n)} [H(\cdot)]</math><br />
<br />
Inference is done by modeling the translation search space as a WFSA. The forward-backward algorithm is used to compute the expectation terms in the gradient. However, the WFSA is very large, especially considering one of the expectations is taken when the translation output varies. Thus, the number of edges in the WFSA needs to be reduced, and the authors try 4 different ways of doing this, ranging from restricting the set of all possible translation outputs to be only the set of target words co-occurring in sentence pairs containing the source sentence, to using the empirical Bayes method and variational inference to prune. <br />
<br />
A significant component of this paper is the set of features the authors included to help elicit good word alignments. On top of word association features (indicator functions for word pairs), they use orthographic features (obviously of limited use for Chinese-English, one of the language pairs they evaluate), positional features (i.e., how far alignment links are from the alignment matrix diagonal), source features (to capture words in the source language that are purely functional and have no lexical content), source path features (which assess the goodness of the alignment path through the source sentence, e.g., by measuring jumps in the sequential alignment), and target string features (e.g., whether a word translates to itself). <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors tested their model on three language pairs: Chinese-English (travel/tourism domain), Czech-English (news commentary), and Urdu-English (NIST 2009 OpenMT evaluation). Each language pair poses distinct issues for SMT. English was treated as both source and target in their experiments. For the baseline, Giza++ was used to learn Model 4, and alignments were symmetrized with the grow-diag-final-and heuristic. Only in the Czech-English case were gold-standard word alignments available. The authors also proposed two additional intrinsic measures: the average alignment fertility of source words that occur only once in the training data (since Model 4 tends to align many target words to such words), and the number of rule types learned in grammar induction that match the translation test sets, which suggests better coverage. <br />
<br />
Czech-English results:<br />
<br />
[[File:czen.png]]<br />
<br />
Chinese-English results:<br />
<br />
[[File:zhen.png]]<br />
<br />
Urdu-English results:<br />
<br />
[[File:uren.png]]<br />
<br />
=== Related Work ===<br />
<br />
Generative model-based approaches to word alignment primarily rely on the IBM Models of Brown et al., CL 1993. These models make a host of independence assumptions, which limits how features and additional information can be incorporated into them. Discriminative model-based word alignment approaches, e.g., Taskar et al., HLT/EMNLP 2005, need gold-standard word alignments for training, which are very difficult to obtain in practice.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hitting_the_Right_Paraphrases_in_Good_Time&diff=10287Hitting the Right Paraphrases in Good Time2011-11-25T00:01:33Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of NAACL/HLT 2010 | coauthors = C.Brockett | date = 2010| first = S.| last = Kok| title = Hitting the Right Paraphrases in Good Time | url =http://www.aclweb.org/anthology-new/N/N10/N10-1017.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://www.aclweb.org/anthology-new/N/N10/N10-1017.pdf].<br />
=== Summary ===<br />
<br />
The authors present an approach to paraphrasing based on [[RandomWalk | random walks]] and bilingual parallel phrase corpora. They first construct a graph where the nodes are phrases in multiple languages, and edges connect phrases that have been aligned to each other by some phrase alignment model or system. Next, they break a particular sentence down into phrases, and then compute the hitting time between each phrase and all of the other phrases in the graph. The hitting time is the expected time for a random walk starting at node i (phrase i) to reach node j (phrase j), based on the transition probabilities of the graph (more on how this is computed in their particular case later). They sort phrases by hitting time, lowest first, to come up with the paraphrases that are most probable as defined by random walks on the graph. The authors discuss a couple of methods to prune the graph and to minimize spurious word alignments, and show how they can incorporate domain knowledge into the graph-based paraphrase model. <br />
<br />
[[LabelPropagation| Label Propagation]] and [[GraphPropagation | Graph Propagation]] are strongly related to random walks: labels in a graph propagate as if undergoing a random walk from a labeled node to an unlabeled node (or vice versa), and the label of an unlabeled node is decided by summing the probabilities of random walks from the various labeled nodes to it (or vice versa) and choosing the labeled node with the lowest hitting time (highest sum of transition probabilities) as the label. <br />
<br />
=== Proposed Approach ===<br />
Because the hitting time technically considers paths of length up to infinity, computing it exactly can be prohibitively expensive for large graphs. The authors thus use a "truncated hitting time", where they limit the random walks on their graph to have at most <math>T</math> steps. <br />
<br />
The proposed algorithm takes as input a query phrase as well as a phrase table between various languages. It also has a number of parameters, explained below. <br />
<br />
First, a graph is created between aligned phrases. Nodes are phrases, and edge weights are computed as the probability of a foreign phrase given an English phrase, i.e., the count of the foreign and English phrases co-occurring divided by the total count of the English phrase. Unlike prior approaches, which model this graph as a simple bipartite graph between foreign and English words of one particular language pair and therefore only look at random walks of length two (from English to foreign, and then from the foreign phrase to another English phrase), the authors present a general graph approach here, where nodes can be from different languages and the graph is not necessarily bipartite. While one can create a full graph for small phrase tables, for large phrase tables the authors use a graph approximation similar to k nearest neighbors. They approximate the full graph <math>H</math> with a graph <math>H'</math>, specifically by performing breadth-first search starting at the query node up to a depth <math>d</math>. They approximate edge connections at the periphery by collapsing all edges between nodes at the periphery and nodes outside the periphery (but still in the full graph <math>H</math>) into one edge. <br />
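A minimal sketch of this edge-weight computation, using a made-up toy phrase table (the phrase pairs and counts are invented): the weight of the edge from English phrase e to foreign phrase f is count(e, f) / count(e), as described above.

```python
from collections import Counter, defaultdict

# Invented aligned phrase pairs (English, foreign), with repetition as counts.
pairs = [("in time", "a tiempo"), ("in time", "a tiempo"),
         ("in time", "con tiempo"), ("on time", "a tiempo")]

pair_counts = Counter(pairs)
eng_counts = Counter(e for e, _ in pairs)

graph = defaultdict(dict)
for (e, f), c in pair_counts.items():
    graph[e][f] = c / eng_counts[e]       # edge weight p(f | e)

print(graph["in time"])
```

In the paper's general (non-bipartite) graph, foreign phrases get analogous outgoing edges back to English phrases and to phrases in other languages; this sketch shows only one direction.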
<br />
Once the graph is created, the authors sample <math>M</math> independent walks of length <math>T</math> starting from the query node. Using these samples, they estimate the truncated hitting time between the query node and all other nodes in the approximate graph <math>H'</math>. Because a node may have a large out-degree, and many of these edges may correspond to spurious word alignments, the authors limit the out-degree of each node to <math>l</math>. Using the hitting times, one obtains a list of all other nodes in the graph, sorted by hitting time with respect to the query node. Nodes with hitting time <math>T</math> are additionally pruned.<br />
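The sampling step can be sketched as follows (the toy graph, node names, and parameter values are all invented): a walk that never reaches a node contributes the maximum value T to that node's estimate, so unreachable nodes end up with hitting time T and would be pruned.

```python
import random

# Invented toy graph: query phrase "q", foreign phrases "f1"/"f2",
# candidate paraphrases "p1"/"p2"; edges carry transition probabilities.
graph = {
    "q":  [("f1", 0.7), ("f2", 0.3)],
    "f1": [("p1", 0.6), ("q", 0.4)],
    "f2": [("p2", 1.0)],
    "p1": [("f1", 1.0)],
    "p2": [("f2", 1.0)],
}

def truncated_hitting_times(graph, query, M=5000, T=10, seed=0):
    """Estimate truncated hitting times from M sampled walks of length T."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in graph if v != query}
    for _ in range(M):
        node, first_hit = query, {}
        for step in range(1, T + 1):
            nbrs, probs = zip(*graph[node])
            node = rng.choices(nbrs, weights=probs)[0]
            if node != query and node not in first_hit:
                first_hit[node] = step
        for v in totals:
            totals[v] += first_hit.get(v, T)   # unreached: contribute T
    return {v: s / M for v, s in totals.items()}

h = truncated_hitting_times(graph, "q")
ranked = sorted(h, key=h.get)                  # lowest hitting time first
print(ranked)
```

With these made-up probabilities, "f1" is reached in one step most of the time and so ranks first, ahead of the paraphrase nodes that can only be reached through it.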
<br />
Lastly, the authors can add domain knowledge to their representation via three types of features: n-gram features, syntactic features, and substring/superstring features. For each feature they add a node (for example, a node representing the n-gram "where is"), and an edge is generated between a phrase node and a feature node whenever the phrase contains the feature that the node represents. This brings nodes (phrases) that share features closer together by reducing the hitting time between them. <br />
<br />
=== Baseline and Results ===<br />
The authors used the [[EUROPARL | Europarl]] dataset. Phrases were aligned using Giza++, and then further refinements to the alignments were made using higher-order aligners and additional techniques. <br />
<br />
The primary method the authors compare their paraphrasing approach to is the Bannard and Callison-Burch (BCB) approach from 2005 and 2008 (they actually compare to an extension of the basic BCB approach called syntactic bilingual phrases, SBP). The authors evaluated the quality of paraphrases on a 0-2 scale using human evaluators (with relatively high Cohen's kappa agreement). The comparison with the BCB approach was complicated by the fact that, for a given query sentence, the two systems return different numbers of paraphrases; to make the comparison fair, precision and recall were computed at both the minimum and the maximum of the number of paraphrases returned by the two systems. Their system always generated more paraphrases than BCB/SBP. <br />
<br />
The authors tried two variants: they experimented with including and excluding the domain-knowledge feature nodes, and they also tried constraining their graph to be bipartite. In general, eliminating the bipartite constraint (thereby allowing random walks of more than two steps, which generates more paraphrases) improved performance, as did incorporating domain knowledge. <br />
<br />
[[File:vssbp.png]]<br />
<br />
[[File:bipartite.png]]<br />
<br />
=== Relevant Work ===<br />
<br />
The authors talk about two sets of work, one that uses the concept of hitting times in interesting areas (for example computer vision, collaborative filtering, etc.), and the other is the corpus of related work in paraphrasing. They note that earlier work in paraphrasing was primarily based on the availability of parallel monolingual phrase corpora, which isn't widespread.</div>Asaluja
* [[Measure Propagation | Measure Propagation]]<br />
* [[L-BFGS | L-BFGS]]<br />
<br />
=== Datasets ===<br />
* [[Reuters_21578 | Reuters 21578 Dataset ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9613L-BFGS2011-10-31T21:55:48Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | coauthors = | date = 1985| first = D.F.| last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited-memory sense, not that the method is necessarily limited), and "BFGS" comes from the initials of the individuals who came up with the original (non-limited-memory) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas bundle methods such as the cutting-plane algorithm essentially lower-bound the objective function using linear approximations built from gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of second-order partial derivatives; think of it as the vector generalization of the second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS method approximates the objective function with a local quadratic model:<br />
<math> m_t (w) = J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In quasi-Newton methods, we use a Newton update for the weights; in other words, the Hessian and the gradient together determine the update direction:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, usually found/tuned by a line search. The line search needs to satisfy the Wolfe conditions, which intuitively ensure a sufficient decrease in the objective function while respecting the curvature condition (of convexity). <br />
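The Wolfe conditions can be checked directly for a candidate step size. A minimal sketch — the quadratic test function, the constants <math>c_1, c_2</math>, and the candidate step sizes are illustrative choices, not part of any particular solver:

```python
import numpy as np

def satisfies_wolfe(J, gradJ, w, d, eta, c1=1e-4, c2=0.9):
    """Check the two (weak) Wolfe conditions for step eta along direction d:
    sufficient decrease (Armijo) and the curvature condition."""
    g_dot_d = gradJ(w) @ d
    armijo = J(w + eta * d) <= J(w) + c1 * eta * g_dot_d
    curvature = gradJ(w + eta * d) @ d >= c2 * g_dot_d
    return bool(armijo and curvature)

# 1-D quadratic J(w) = w^2, steepest-descent direction at w = 1
J = lambda w: float(w @ w)
gradJ = lambda w: 2.0 * w
w = np.array([1.0])
d = -gradJ(w)

ok = satisfies_wolfe(J, gradJ, w, d, eta=0.5)         # both conditions hold
too_small = satisfies_wolfe(J, gradJ, w, d, eta=0.01)  # curvature condition fails
```

The tiny step fails the curvature condition, which is exactly what that condition is for: it rules out steps so small that the gradient along the search direction has barely changed.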
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. Instead, we build an iterative approximation to the inverse Hessian; call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B \| B - B_t\|_W \textrm{ s.t. } s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - \frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - \frac{y_t s_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
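As a quick numeric sanity check — the random <math>B_t</math>, <math>s_t</math>, and <math>y_t</math> below are synthetic, not iterates of any real optimization — the BFGS inverse-Hessian update, with left factor <math>(I - s_t y_t^T/\langle s_t, y_t \rangle)</math> and right factor its transpose, satisfies the secant equation <math>s_t = B_{t+1} y_t</math> and preserves symmetry:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Synthetic data: B_t is a symmetric positive-definite inverse-Hessian
# estimate; s and y are parameter/gradient differences with <s, y> > 0.
A = rng.standard_normal((n, n))
B_t = A @ A.T + n * np.eye(n)
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)  # keeps the curvature condition <s, y> > 0

rho = 1.0 / (s @ y)
I = np.eye(n)
B_next = (I - rho * np.outer(s, y)) @ B_t @ (I - rho * np.outer(y, s)) \
         + rho * np.outer(s, s)

# The update satisfies the secant equation s = B_{t+1} y and stays symmetric.
assert np.allclose(B_next @ y, s)
assert np.allclose(B_next, B_next.T)
```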
<br />
The above explains the BFGS method. ''' L-BFGS is essentially the same, except that we never store <math>B</math> explicitly: we keep only the last few <math>(s_t, y_t)</math> pairs and apply the resulting low-rank approximation to <math>B</math> implicitly. ''' <br />
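The limited-memory trick is usually implemented with the "two-loop recursion", which computes <math>B_t \nabla J(w_t)</math> from the stored pairs without ever forming a matrix. A minimal sketch — the toy quadratic problem, the fixed step size, and the memory size <math>m = 5</math> are illustrative choices; a real implementation would use a Wolfe line search:

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: computes -B_t g from the last m (s, y) pairs,
    where B_t is the implicit inverse-Hessian approximation."""
    q = g.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(s_hist, y_hist)]
    alphas = []
    for s, y, rho in zip(reversed(s_hist), reversed(y_hist), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial scaling B_0 = gamma * I, a common heuristic choice
    s_last, y_last = s_hist[-1], y_hist[-1]
    q *= (s_last @ y_last) / (y_last @ y_last)
    for s, y, rho, a in zip(s_hist, y_hist, rhos, reversed(alphas)):
        b = rho * (y @ q)
        q += (a - b) * s
    return -q

# Toy convex quadratic J(w) = 0.5 w^T A w - b^T w, with gradient A w - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda w: A @ w - b

w = np.zeros(2)
g = grad(w)
s_hist, y_hist = [], []
for _ in range(20):
    d = -g if not s_hist else lbfgs_direction(g, s_hist, y_hist)
    w_new = w + 0.5 * d            # fixed step; a real solver uses a line search
    g_new = grad(w_new)
    s_hist.append(w_new - w)
    y_hist.append(g_new - g)
    s_hist, y_hist = s_hist[-5:], y_hist[-5:]  # limited memory: keep m = 5 pairs
    w, g = w_new, g_new
# w is now close to the minimizer A^{-1} b
```

Note that only <math>m</math> vector pairs are ever stored, which is the whole point: for high-dimensional <math>w</math>, a dense <math>B</math> would be quadratic in the dimension.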
<br />
For non-smooth objective functions, there are a few modifications we need to make. First, our local quadratic approximation is no longer well defined. Second, since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math>) is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Notice that we sample ''k'' subgradients from the set of subgradients (which is a convex set by the way). <br />
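The constrained subproblem above can be handed to any off-the-shelf constrained solver. A minimal one-dimensional sketch for <math>J(w) = |w|</math> at the kink <math>w_t = 0</math> — the choices <math>H_t = 1</math> and the three sampled subgradients are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# J(w) = |w| is non-smooth at w_t = 0; its subdifferential there is [-1, 1].
w_t, H_t = 0.0, 1.0
J_wt = abs(w_t)
subgrads = [-1.0, 0.3, 1.0]  # k = 3 sampled subgradients from [-1, 1]

def objective(x):  # x = (w, xi)
    w, xi = x
    return 0.5 * H_t * (w - w_t) ** 2 + xi

# One linear constraint per sampled subgradient: J(w_t) + s*(w - w_t) <= xi
constraints = [
    {"type": "ineq", "fun": lambda x, s=s: x[1] - (J_wt + s * (x[0] - w_t))}
    for s in subgrads
]

res = minimize(objective, x0=np.array([0.2, 1.0]), constraints=constraints)
w_next, xi = res.x
```

Here the constraints force <math>\xi \geq \max_i s_i (w - w_t)</math>, so the solver recovers the kink <math>w = 0</math>, which is indeed the minimizer of <math>|w|</math>.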
<br />
=== Related Work ===<br />
<br />
There are a lot of papers that use this method. Here are some examples:<br />
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al, ACL 2008: A Discriminative Latent Variable Model for SMT ]]<br />
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Pereira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]<br />
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9612L-BFGS2011-10-31T21:54:40Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | coauthors = | date = 1985| first = D.F.| last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited memory sense, not that the method is necessarily limited), and BFGS are the individuals who came up with the original (non-limited memory variant) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas optimization methods like the cutting planes algorithm, which are in the category of "bundle methods", essentially lower bound the objective function using linear gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of the second-order partial derivatives, think of it as the vector generalization of a second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS approximates the objective function with a locally quadratic optimization:<br />
<math> m_t (w) J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In Quasi-Newton methods, we use a Newton update to update the weight, in other words we use the Hessian and the gradient to come up with a direction in which to update:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, which is usually found/tuned by a line search. The line search needs to match the Wolfe Conditions, which intuitively make sure there is a sufficient decrease in objective function while respecting the curvature condition (of convexity). <br />
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. We wish to come up with an iterative approximation to the inverse Hessian, let us call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B || B - B_t||_w \textrm{s.t.} s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
<br />
The above explains the BFGS method. L-BFGS is essentially the same, except we use a low-rank approximation to B. <br />
<br />
For non-smooth optimization functions, there are a few modifications we need to make. Firstly, our local quadratic approximation is no longer well defined. Second, note that since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math> is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Notice that we sample ''k'' subgradients from the set of subgradients (which is a convex set by the way). <br />
<br />
=== Related Work ===<br />
<br />
There are a lot of papers that use this method. Here are some examples:<br />
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al, ACL 2008: A Discriminative Latent Variable Model for SMT ]]<br />
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Perreira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]<br />
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9611L-BFGS2011-10-31T21:53:55Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | coauthors = | date = 1985| first = D.F.| last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited memory sense, not that the method is necessarily limited), and BFGS are the individuals who came up with the original (non-limited memory variant) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas optimization methods like the cutting planes algorithm, which are in the category of "bundle methods", essentially lower bound the objective function using linear gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of the second-order partial derivatives, think of it as the vector generalization of a second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS approximates the objective function with a locally quadratic optimization:<br />
<math> m_t (w) J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t /rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In Quasi-Newton methods, we use a Newton update to update the weight, in other words we use the Hessian and the gradient to come up with a direction in which to update:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, which is usually found/tuned by a line search. The line search needs to match the Wolfe Conditions, which intuitively make sure there is a sufficient decrease in objective function while respecting the curvature condition (of convexity). <br />
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. We wish to come up with an iterative approximation to the inverse Hessian, let us call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B || B - B_t||_w \textrm{s.t.} s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
<br />
The above explains the BFGS method. L-BFGS is essentially the same, except we use a low-rank approximation to B. <br />
<br />
For non-smooth optimization functions, there are a few modifications we need to make. Firstly, our local quadratic approximation is no longer well defined. Second, note that since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math> is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Notice that we sample ''k'' subgradients from the set of subgradients (which is a convex set by the way). <br />
<br />
=== Related Work ===<br />
<br />
There are a lot of papers that use this method. Here are some examples:<br />
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al, ACL 2008: A Discriminative Latent Variable Model for SMT ]]<br />
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Perreira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]<br />
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9610L-BFGS2011-10-31T21:53:33Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | coauthors = | date = 1986=5| first = D.F.| last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited memory sense, not that the method is necessarily limited), and BFGS are the individuals who came up with the original (non-limited memory variant) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas optimization methods like the cutting planes algorithm, which are in the category of "bundle methods", essentially lower bound the objective function using linear gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of the second-order partial derivatives, think of it as the vector generalization of a second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS approximates the objective function with a locally quadratic optimization:<br />
<math> m_t (w) J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t /rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In Quasi-Newton methods, we use a Newton update to update the weight, in other words we use the Hessian and the gradient to come up with a direction in which to update:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, which is usually found/tuned by a line search. The line search needs to match the Wolfe Conditions, which intuitively make sure there is a sufficient decrease in objective function while respecting the curvature condition (of convexity). <br />
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. We wish to come up with an iterative approximation to the inverse Hessian, let us call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B || B - B_t||_w \textrm{s.t.} s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
<br />
The above explains the BFGS method. L-BFGS is essentially the same, except we use a low-rank approximation to B. <br />
<br />
For non-smooth optimization functions, there are a few modifications we need to make. Firstly, our local quadratic approximation is no longer well defined. Second, note that since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math> is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Notice that we sample ''k'' subgradients from the set of subgradients (which is a convex set by the way). <br />
<br />
=== Related Work ===<br />
<br />
There are a lot of papers that use this method. Here are some examples:<br />
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al, ACL 2008: A Discriminative Latent Variable Model for SMT ]]<br />
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Perreira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]<br />
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9609L-BFGS2011-10-31T21:53:07Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | date = 1986=5| first = D.F.| last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited memory sense, not that the method is necessarily limited), and BFGS are the individuals who came up with the original (non-limited memory variant) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas optimization methods like the cutting planes algorithm, which are in the category of "bundle methods", essentially lower bound the objective function using linear gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of the second-order partial derivatives, think of it as the vector generalization of a second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS approximates the objective function with a locally quadratic optimization:<br />
<math> m_t (w) J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t /rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In Quasi-Newton methods, we use a Newton update to update the weight, in other words we use the Hessian and the gradient to come up with a direction in which to update:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, which is usually found/tuned by a line search. The line search needs to match the Wolfe Conditions, which intuitively make sure there is a sufficient decrease in objective function while respecting the curvature condition (of convexity). <br />
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. We wish to come up with an iterative approximation to the inverse Hessian, let us call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B || B - B_t||_w \textrm{s.t.} s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
<br />
The above explains the BFGS method. L-BFGS is essentially the same, except we use a low-rank approximation to B. <br />
<br />
For non-smooth optimization functions, there are a few modifications we need to make. Firstly, our local quadratic approximation is no longer well defined. Second, note that since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math> is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Notice that we sample ''k'' subgradients from the set of subgradients (which is a convex set by the way). <br />
<br />
=== Related Work ===<br />
<br />
There are a lot of papers that use this method. Here are some examples:<br />
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al, ACL 2008: A Discriminative Latent Variable Model for SMT ]]<br />
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Perreira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]<br />
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=L-BFGS&diff=9608L-BFGS2011-10-31T21:43:57Z<p>Asaluja: </p>
<hr />
<div>=== Citation ===<br />
<br />
=== Summary ===<br />
<br />
The L-BFGS algorithm is an optimization [[Category::Method | method]] that falls under the group of techniques known as "quasi-Newton" optimization methods. The "L" stands for "Limited" (in the limited memory sense, not that the method is necessarily limited), and BFGS are the individuals who came up with the original (non-limited memory variant) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:<br />
<br />
[[File:bfgscharacters.png]]<br />
<br />
In general, whereas optimization methods like the cutting planes algorithm, which are in the category of "bundle methods", essentially lower bound the objective function using linear gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of the second-order partial derivatives, think of it as the vector generalization of a second derivative). With the gradient and an approximated Hessian, we can build a quadratic approximation of the objective function. <br />
<br />
=== Details ===<br />
<br />
The standard BFGS approximates the objective function with a locally quadratic optimization:<br />
<math> m_t (w) J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t(w-w_t) </math><br />
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian. <br />
<br />
We update parameters with the weight that minimizes the quadratic approximation:<br />
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t /rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math><br />
<br />
In Quasi-Newton methods, we use a Newton update to update the weight, in other words we use the Hessian and the gradient to come up with a direction in which to update:<br />
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, which is usually found/tuned by a line search. The line search needs to match the Wolfe Conditions, which intuitively make sure there is a sufficient decrease in objective function while respecting the curvature condition (of convexity). <br />
<br />
The exact Hessian is very expensive to calculate, and the full exact Hessian is very expensive to invert. We wish to come up with an iterative approximation to the inverse Hessian, let us call it <math>B</math>. Then, <br />
<math> B_{t+1} = \arg \min_B || B - B_t||_w \textrm{s.t.} s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations, and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations. After a bit of math, this results in the following update (<math>I</math> is the identity matrix):<br />
<math>B_{t+1} = \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math><br />
<br />
The above explains the BFGS method. L-BFGS is essentially the same, except we use a low-rank approximation to B. <br />
<br />
For non-smooth objective functions, there are a few modifications we need to make. First, our local quadratic approximation is no longer well defined. Second, since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math>) is also not well defined. We adjust our quadratic approximation as:<br />
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We deal with this supremum by shifting a term from the objective function to a constraint and creating a constrained optimization problem for the weight update: <br />
<br />
<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \ \textrm{for}\ s_1, \dots, s_k \in \partial J(w_t) </math><br />
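The claim above that not every subgradient is a descent direction can be checked numerically; in this toy example (our own, for illustration), stepping along the negative of a perfectly valid subgradient increases the objective. <br />

```python
def f(w):
    return abs(w[0]) + 10 * abs(w[1])  # a non-smooth convex function

# At w = (1, 0), the subdifferential is {(1, g): -10 <= g <= 10}.
w = (1.0, 0.0)
g = (1.0, 10.0)          # a valid subgradient of f at w
t = 1e-3
w_step = (w[0] - t * g[0], w[1] - t * g[1])
increase = f(w_step) - f(w)   # positive: the step along -g goes uphill
```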
<br />
=== Related Work ===</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:Bfgscharacters.png&diff=9607File:Bfgscharacters.png2011-10-31T21:13:10Z<p>Asaluja: </p>
<hr />
<div></div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Asaluja&diff=9603User:Asaluja2011-10-31T20:24:08Z<p>Asaluja: </p>
<hr />
<div>I'm Avneesh, a second year PhD student in ECE & LTI. I'm taking this class because it's quite pertinent to my research interests. <br />
<br />
[http://www.cs.cmu.edu/~avneesh My External Page]<br />
<br />
You can find more info on my page, including a picture.<br />
<br />
<br />
== Class Project ==<br />
Proposed project: Training SMT Systems with Latent Structural SVM<br />
<br />
[[Training_SMT_Systems_with_the_Latent_Structured_SVM | Project Page]]<br />
<br />
== Wiki Write-Ups ==<br />
Please grade write-ups under Papers and Methods. <br />
<br />
=== Papers ===<br />
* [[SoftSupervisedTextClassification | A.Subramanya & J.Bilmes, EMNLP 2008: Soft-Supervised Learning for Text Classification]]<br />
* [[Structured Output Learning with Indirect Supervision | M.W. Chang, V. Srikumar, D. Goldwasser, D. Roth, ICML 2010: Structured Output Learning with Indirect Supervision]]<br />
* [[A Discriminative Latent Variable Model for SMT | P. Blunsom, T. Cohn, M. Osborne, ACL 2008: A Discriminative Latent Variable Model for Statistical Machine Translation]]<br />
* [[Hitting the Right Paraphrases in Good Time | S. Kok, C. Brockett, NAACL/HLT 2010: Hitting the Right Paraphrases in Good Time]]<br />
<br />
=== Methods ===<br />
* [[Alternating Minimization | Alternating Minimization]]<br />
* [[Measure Propagation | Measure Propagation]]<br />
* [[L-BFGS | L-BFGS]]<br />
<br />
=== Datasets ===<br />
* [[Reuters_21578 | Reuters 21578 Dataset ]]</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=A_Discriminative_Latent_Variable_Model_for_SMT&diff=9602A Discriminative Latent Variable Model for SMT2011-10-31T20:21:11Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL-08:HLT | coauthors = T.Cohn, M.Osborne | date = 2008| first = P.| last = Blunsom| title = A Discriminative Latent Variable Model for Statistical Machine Translation | url =http://aclweb.org/anthology-new/P/P08/P08-1024.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://aclweb.org/anthology-new/P/P08/P08-1024.pdf].<br />
<br />
=== Background ===<br />
<br />
Current state-of-the-art approaches in statistical machine translation are often phrase-based (e.g., the [http://statmt.org/moses/ moses decoder]) and/or syntactically motivated (e.g., hierarchical MT with the [http://sourceforge.net/projects/joshua/ joshua decoder]). While these models have achieved a lot, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och, 2003 presented MERT, a discriminative technique to tune the weights of the language, phrase, reordering, and word penalty models with respect to each other, but this technique only works well on a limited number (10 or fewer) of non-overlapping features. In addition, the features in this model are still built in a generative manner, using frequency counts and maximizing likelihood. Approaches with an arbitrary number of overlapping features have been limited in MT (Liang et al., ACL 2006 is one prior work that does do this). <br />
<br />
Discriminative models in machine translation are especially difficult because of the "multiple derivation" aspect to the problem, namely there are many ways to go from a source sentence to a target sentence (the terminology of "derivation" is because in an SCFG, to produce an output sentence we go through a sequence of SCFG rule applications). Since we do not have any "reference derivations" to update against, one would ideally like to marginalize out all derivations when coming up with the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or just treat the best derivation as the best translation (i.e. do not marginalize over all possible derivations). <br />
<br />
=== Summary ===<br />
In this work, the authors manage to incorporate a large number of overlapping features in a hierarchical machine translation system. Concretely, they featurize each rule in a synchronous context-free grammar (SCFG), i.e., a synchronous [[PCFGs | PCFG]], something that the authors call "Discriminative Synchronous Transduction". The authors also model the derivation as a latent/hidden variable, which they manage to marginalize out in training and decoding. <br />
<br />
=== Main Approach ===<br />
<br />
The authors focus on the translation model. They come up with a log-linear translation model, which defines the conditional probability distribution over target translations of a given source sentence. Derivations are modeled as a latent variable. In particular, we can express the conditional probability of a derivation as:<br />
<br />
<math>p_{\Lambda} (\textbf{d, e} | \textbf {f}) = \frac{\exp \sum_k \lambda_k H_k (\textbf {d, e, f})}{Z_{\Lambda}(\textbf {f})}</math><br />
where '''d''' is the derivation, '''e''' is the target sentence, and '''f''' is the source sentence. <math>k</math> is indexed over the model's features, and <math>H_k</math> are the feature functions defined over rules <math>r: H_k (\textbf {d, e, f}) = \sum_{r \in \textbf {d}} h_k (\textbf {f}, r)</math>. Also, <math>Z_{\Lambda} (\textbf{f})</math> is the partition function that globally normalizes the conditional probabilities. Then, the conditional probability of a target sentence given a source is when we marginalize over derivations:<br />
<br />
<math> p_{\Lambda} (\textbf{e} | \textbf{f}) = \sum_{\textbf{d} \in \Delta(\textbf{e, f})} p_{\Lambda} (\textbf{d, e} | \textbf{f})</math><br />
where <math>\Delta(\textbf{e, f})</math> is the set of all derivations from source to target. <br />
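The two equations above can be illustrated with a toy numeric sketch (the derivations, scores, and sentences are invented for illustration and are not from the paper's system): each derivation '''d''' gets a log-linear score <math>\sum_k \lambda_k H_k</math>, a softmax over derivations gives <math>p(\textbf{d, e}|\textbf{f})</math>, and summing over the derivations that yield the same '''e''' gives <math>p(\textbf{e}|\textbf{f})</math>. <br />

```python
import math

# Invented (target sentence, log-linear score) pairs; two derivations share "e1",
# illustrating the derivational ambiguity discussed in the text.
derivations = [("e1", 1.2), ("e1", 0.7), ("e2", 0.5)]

Z = sum(math.exp(score) for _, score in derivations)   # partition function Z(f)
p_e = {}
for e, score in derivations:
    p_e[e] = p_e.get(e, 0.0) + math.exp(score) / Z     # marginalize out d
```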
<br />
To train, the authors rely on MAP estimation. MAP uses a prior, which can be thought of as a regularization term. In this paper, they use a zero-mean Gaussian prior, <math> p_0 (\lambda_k) \propto \exp(-\lambda_k^2 / 2\sigma^2)</math>, and maximize the log of the posterior <math>\Lambda^* = \arg \max_{\Lambda} p_{\Lambda} (\mathcal{D})\,p(\Lambda)</math> using L-BFGS, where <math>\mathcal{D}</math> is the training data (parallel sentence corpus). L-BFGS requires efficient computation of the objective value and the gradient, which is achieved by inside-outside inference over the SCFG parse chart of the input sentence '''f'''. The full derivation chart is produced using [[CYK_Parsing | CYK Parsing]]. Their formulation is similar to previous work on parsing with a log-linear model and latent variables. Note that in training, if the reference translation for a training instance is not contained in the model's hypothesis space, we discard the unreachable portion of that training sample. <br />
<br />
For decoding, the authors use beam search to approximate the sum over all derivations, in order to handle the exponential number of derivations for a given source-target pair. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model. <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors evaluated their model on four fronts: 1) maximizing translations (marginalizing derivations) vs. maximizing derivations in training and decoding; 2) regularization vs. an unregularized maximum likelihood model; 3) comparison with frequency count-based systems; and 4) translation performance as the number of training examples is scaled up. All experiments were done on Europarl V2 (French-English), and the training corpus consisted of 170k sentence pairs. The tuning set had 315 sentence pairs, and the test set 1164. <br />
<br />
First off, we see that there is a huge amount of derivational ambiguity in the data - the number of derivations is exponential in the source sentence length (y-axis is a log-scale):<br />
<br />
[[File:derivationscaling.png]]<br />
<br />
Next, the following table shows that training with all derivations over choosing the 1-best derivation ("All Derivations" vs. "Single Derivation"), and decoding to optimize translations over derivations ("translation" vs. "derivation") gives best results. The authors also give results on how beam width affects translation quality, and show that even with a relatively tight beam width we get decent results. The table below also shows the effects of regularization (or rather lack thereof, which the last line indicates). Unregularized model performance lags well behind regularized model performance. <br />
<br />
[[File:translationderivation.png]]<br />
<br />
The authors also test on their held-out test set. In the table below, the first and the third approaches are what the authors proposed in this work. The second is a full Hiero system, but without reverse translation probabilities and reverse lexical probabilities. This is a fairer comparison to the proposed method, as these two systems have the same parameter space, differing only in the manner of estimation. The additional Hiero scores are achieved with MERT training on the full set of Hiero features (the last line is with a language model, the second to last is without). The authors point out that one cannot really compare their work with these methods, since they would need to incorporate the reverse features and a language model in their approach. <br />
<br />
[[File:finalresults.png]]<br />
<br />
Lastly, the authors show scalability of their model, in terms of accuracy (i.e. the learning curve):<br />
<br />
[[File:scalability.png]]<br />
<br />
=== Related Work ===<br />
* Percy Liang's work [http://www.seas.upenn.edu/~taskar/pubs/acl06.pdf end-to-end approach to discriminative MT] was one of the first works that used a lot of overlapping features in MT. They used a phrase-based system (Pharaoh, the predecessor to Moses). <br />
* The most popular way to tune feature weights (non-overlapping) is Minimum Error Rate Training [http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf MERT paper]<br />
* The Hierarchical MT model was first proposed by David Chiang [http://www.isi.edu/~chiang/papers/chiang-acl05.pdf Hierarchical MT]. This paper won the ACL best paper award in 2005.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=A_Discriminative_Latent_Variable_Model_for_SMT&diff=9601A Discriminative Latent Variable Model for SMT2011-10-31T20:20:29Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL-08:HLT | coauthors = T.Cohn, M.Osborne | date = 2008| first = P.| last = Blunsom| title = A Discriminative Latent Variable Model for Statistical Machine Translation | url =http://aclweb.org/anthology-new/P/P08/P08-1024.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://aclweb.org/anthology-new/P/P08/P08-1024.pdf].<br />
<br />
=== Background ===<br />
<br />
Current state-of-the-art approaches in statistical machine translation are often phrase-based (.e.g. the [http://statmt.org/moses/ moses decoder]) and/or syntactically motivated (e.g., hierarchical MT with the [http://sourceforge.net/projects/joshua/ joshua decoder]). While these models have achieved a lot, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och, 2003 presented MERT, which is a discriminative technique to tune the weights of the language, phrase, reordering and word penalty models with respect to each other, but this technique only works well on a limited number (10 or less) of non-overlapping features. In addition, the features in this model are still built in a generative manner, using frequency counts and maximizing likelihood. Approaches with an arbitrary number of overlapping features have been limited in MT (Liang et al, ACL 2006 is one prior work that does do this). <br />
<br />
Discriminative models in machine translation are especially difficult because of the "multiple derivation" aspect to the problem, namely there are many ways to go from a source sentence to a target sentence (the terminology of "derivation" is because in an SCFG, to produce an output sentence we go through a sequence of SCFG rule applications). Since we do not have any "reference derivations" to update against, one would ideally like to marginalize out all derivations when coming up with the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or just treat the best derivation as the best translation (i.e. do not marginalize over all possible derivations). <br />
<br />
=== Summary ===<br />
In this work, the authors manage to incorporate a large number of overlapping features in a hierarchical machine translation system. What that means is they featurize each rule in a synchronous context free grammar (SCFG), a type of ynchronous [[PCFGs | PCFG]], something that the authors call "Discriminative Synchronous Transduction". The authors also model the derivation as a latent/hidden variable which they manage to marginalize out in training and decoding. <br />
<br />
=== Main Approach ===<br />
<br />
The authors focus on the translation model. They come up with a log-linear translation model, which defines the conditional probability distribution over target translations of a given source sentence. Derivations are modeled as a latent variable. In particular, we can express the conditional probability of a derivation as:<br />
<br />
<math>p_{\Lambda} (\textbf{d, e} | \textbf {f}) = \frac{\exp \sum_k \lambda_k H_k (\textbf {d, e, f})}{Z_{\Lambda}(\textbf {f})}</math><br />
where '''d''' is the derivation, '''e''' is the target sentence, and '''f''' is the source sentence. <math>k</math> is indexed over the model's features, and <math>H_k</math> are the feature functions defined over rules <math>r: H_k (\textbf {d, e, f}) = \sum_{r in \textbf {d}} h_k (\textbf {f}, r)</math>. Also, <math>Z_{\Lambda} (\textbf{f})</math> is the partition function that globally normalizes the conditional probabilities. Then, the conditional probability of a target sentence given a source is when we marginalize over derivations:<br />
<br />
<math> p_{\Lambda} (\textbf{e} | \textbf{f}) = \sum_{\textbf{d} \in \Delta(\textbf{e, f})} p_{\Lambda} (\textbf{d, e} | \textbf{f})</math><br />
where <math>\Delta(\textbf{e, f})</math> is the set of all derivations from source to target. <br />
<br />
To train, the authors rely on MAP estimation. MAP uses a prior, which can be thought of as a regularization term. In this paper, they use a zero-mean Gaussian prior: <math> p_0 (\lambda_k) \propto \arg \max_{\Lambda} p_{\Lambda} (\mathcal{D})p(\Lambda)</math>, where <math>\mathcal{D}</math> is the training data (parallel sentence corpus). We maximize the log likelihood of the posterior using L-BFGS. L-BFGS requires efficient computation of the objective value and the gradient, which is achieved by inside-outside inference over the SCFG parse chart of the input sentence '''f'''. The full derivation chart is produced using [[CYK_Parsing | CYK Parsing]]. Their formulation is similar to previous work on parsing with a log-linear model and latent variables. Note that in training, if the reference translation for a training instance is not contained in the model's hypothesis space, we discard the unreachable portion of that training sample. <br />
<br />
For decoding, the authors use beam search to approximate the sum over all derivations, in order to handle the exponential number of derivations for a given source-target pair. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model. <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors evaluated their model on 4 fronts: 1) maximizing translations (marginalizing derivations) vs. maximizing derivations in training and decoding 2) regularization vs. maximum likelihood unregularized model 3) comparison with frequency count-based systems and 4) performance of translation as we scale up the number of training examples. All experiments were done on Europarl V2 (French-English), and the training corpus consisted of170k sentence pairs. Tuning set had 315 sentence pairs, and the test set 1164. <br />
<br />
First off, we see that there is a huge amount of derivational ambiguity in the data - the number of derivations is exponential in the source sentence length (y-axis is a log-scale):<br />
<br />
[[File:derivationscaling.png]]<br />
<br />
Next, the following table shows that training with all derivations over choosing the 1-best derivation ("All Derivations" vs. "Single Derivation"), and decoding to optimize translations over derivations ("translation" vs. "derivation") gives best results. The authors also give results on how beam width affects translation quality, and show that even with a relatively tight beam width we get decent results. The table below also shows the effects of regularization (or rather lack thereof, which the last line indicates). Unregularized model performance lags well behind regularized model performance. <br />
<br />
[[File:translationderivation.png]]<br />
<br />
The authors also test on their held-out test set. In the table below, the first and the third approaches are what the authors proposed in this work. The second is a full Hiero system but without reverse translation probabilities and reverse lexical probabilities. This is a more fair comparison to the method they have proposed in this work, as these two systems have the same parameter space, differing only in the matter of estimation. The additional Hiero scores are achieved with MERT training on the full set of Hiero features, (the last line is with a language model, the second last is without). The authors point out that one cannot really compare their work with these methods, since they would need to incorporate the reverse features and a language model in their approach. <br />
<br />
[[File:finalresults.png]]<br />
<br />
Lastly, the authors show scalability of their model, in terms of accuracy (i.e. the learning curve):<br />
<br />
[[File:scalability.png]]<br />
<br />
=== Related Work ===<br />
* Percy Liang's work [http://www.seas.upenn.edu/~taskar/pubs/acl06.pdf end-to-end approach to discriminative MT] was one of the first works that used a lot of overlapping features in MT. They used a phrase-based system (Pharaoh, the predecessor to Moses). <br />
* The most popular way to tune feature weights (non-overlapping) is Minimum Error Rate Training [http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf MERT paper]<br />
* The Hierarchical MT model was first proposed by David Chiang [http://www.isi.edu/~chiang/papers/chiang-acl05.pdf Hierarchical MT]. This paper won the ACL best paper award in 2005.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=A_Discriminative_Latent_Variable_Model_for_SMT&diff=9600A Discriminative Latent Variable Model for SMT2011-10-31T20:19:57Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL-08:HLT | coauthors = T.Cohn, M.Osborne | date = 2008| first = P.| last = Blunsom| title = A Discriminative Latent Variable Model for Statistical Machine Translation | url =http://aclweb.org/anthology-new/P/P08/P08-1024.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://aclweb.org/anthology-new/P/P08/P08-1024.pdf].<br />
<br />
=== Background ===<br />
<br />
Current state-of-the-art approaches in statistical machine translation are often phrase-based (.e.g. the [http://statmt.org/moses/ moses decoder]) and/or syntactically motivated (e.g., hierarchical MT with the [http://sourceforge.net/projects/joshua/ joshua decoder]). While these models have achieved a lot, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och, 2003 presented MERT, which is a discriminative technique to tune the weights of the language, phrase, reordering and word penalty models with respect to each other, but this technique only works well on a limited number (10 or less) of non-overlapping features. In addition, the features in this model are still built in a generative manner, using frequency counts and maximizing likelihood. Approaches with an arbitrary number of overlapping features have been limited in MT (Liang et al, ACL 2006 is one prior work that does do this). <br />
<br />
Discriminative models in machine translation are especially difficult because of the "multiple derivation" aspect to the problem, namely there are many ways to go from a source sentence to a target sentence (the terminology of "derivation" is because in an SCFG, to produce an output sentence we go through a sequence of SCFG rule applications). Since we do not have any "reference derivations" to update against, one would ideally like to marginalize out all derivations when coming up with the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or just treat the best derivation as the best translation (i.e. do not marginalize over all possible derivations). <br />
<br />
=== Summary ===<br />
In this work, the authors manage to incorporate a large number of overlapping features in a hierarchical machine translation system. What that means is they featurize each rule in a synchronous context free grammar (SCFG) a synchronous [[PCFGs | PCFG]], something that the authors call "Discriminative Synchronous Transduction". The authors also model the derivation as a latent/hidden variable which they manage to marginalize out in training and decoding. <br />
<br />
=== Main Approach ===<br />
<br />
The authors focus on the translation model. They come up with a log-linear translation model, which defines the conditional probability distribution over target translations of a given source sentence. Derivations are modeled as a latent variable. In particular, we can express the conditional probability of a derivation as:<br />
<br />
<math>p_{\Lambda} (\textbf{d, e} | \textbf {f}) = \frac{\exp \sum_k \lambda_k H_k (\textbf {d, e, f})}{Z_{\Lambda}(\textbf {f})}</math><br />
where '''d''' is the derivation, '''e''' is the target sentence, and '''f''' is the source sentence. <math>k</math> is indexed over the model's features, and <math>H_k</math> are the feature functions defined over rules <math>r: H_k (\textbf {d, e, f}) = \sum_{r in \textbf {d}} h_k (\textbf {f}, r)</math>. Also, <math>Z_{\Lambda} (\textbf{f})</math> is the partition function that globally normalizes the conditional probabilities. Then, the conditional probability of a target sentence given a source is when we marginalize over derivations:<br />
<br />
<math> p_{\Lambda} (\textbf{e} | \textbf{f}) = \sum_{\textbf{d} \in \Delta(\textbf{e, f})} p_{\Lambda} (\textbf{d, e} | \textbf{f})</math><br />
where <math>\Delta(\textbf{e, f})</math> is the set of all derivations from source to target. <br />
<br />
To train, the authors rely on MAP estimation. MAP uses a prior, which can be thought of as a regularization term. In this paper, they use a zero-mean Gaussian prior: <math> p_0 (\lambda_k) \propto \arg \max_{\Lambda} p_{\Lambda} (\mathcal{D})p(\Lambda)</math>, where <math>\mathcal{D}</math> is the training data (parallel sentence corpus). We maximize the log likelihood of the posterior using L-BFGS. L-BFGS requires efficient computation of the objective value and the gradient, which is achieved by inside-outside inference over the SCFG parse chart of the input sentence '''f'''. The full derivation chart is produced using [[CYK_Parsing | CYK Parsing]]. Their formulation is similar to previous work on parsing with a log-linear model and latent variables. Note that in training, if the reference translation for a training instance is not contained in the model's hypothesis space, we discard the unreachable portion of that training sample. <br />
<br />
For decoding, the authors use beam search to approximate the sum over all derivations, in order to handle the exponential number of derivations for a given source-target pair. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model. <br />
<br />
=== Baseline & Results ===<br />
<br />
The authors evaluated their model on 4 fronts: 1) maximizing translations (marginalizing derivations) vs. maximizing derivations in training and decoding 2) regularization vs. maximum likelihood unregularized model 3) comparison with frequency count-based systems and 4) performance of translation as we scale up the number of training examples. All experiments were done on Europarl V2 (French-English), and the training corpus consisted of170k sentence pairs. Tuning set had 315 sentence pairs, and the test set 1164. <br />
<br />
First off, we see that there is a huge amount of derivational ambiguity in the data - the number of derivations is exponential in the source sentence length (y-axis is a log-scale):<br />
<br />
[[File:derivationscaling.png]]<br />
<br />
Next, the following table shows that training with all derivations over choosing the 1-best derivation ("All Derivations" vs. "Single Derivation"), and decoding to optimize translations over derivations ("translation" vs. "derivation") gives best results. The authors also give results on how beam width affects translation quality, and show that even with a relatively tight beam width we get decent results. The table below also shows the effects of regularization (or rather lack thereof, which the last line indicates). Unregularized model performance lags well behind regularized model performance. <br />
<br />
[[File:translationderivation.png]]<br />
<br />
The authors also test on their held-out test set. In the table below, the first and the third approaches are what the authors proposed in this work. The second is a full Hiero system but without reverse translation probabilities and reverse lexical probabilities. This is a more fair comparison to the method they have proposed in this work, as these two systems have the same parameter space, differing only in the matter of estimation. The additional Hiero scores are achieved with MERT training on the full set of Hiero features, (the last line is with a language model, the second last is without). The authors point out that one cannot really compare their work with these methods, since they would need to incorporate the reverse features and a language model in their approach. <br />
<br />
[[File:finalresults.png]]<br />
<br />
Lastly, the authors show the scalability of their model in terms of accuracy (i.e., the learning curve):<br />
<br />
[[File:scalability.png]]<br />
<br />
=== Related Work ===<br />
* Percy Liang's [http://www.seas.upenn.edu/~taskar/pubs/acl06.pdf end-to-end approach to discriminative MT] was one of the first works to use many overlapping features in MT. It used a phrase-based system (Pharaoh, the predecessor to Moses). <br />
* The most popular way to tune a small set of non-overlapping feature weights is Minimum Error Rate Training: [http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf MERT paper]<br />
* The Hierarchical MT model was first proposed by David Chiang [http://www.isi.edu/~chiang/papers/chiang-acl05.pdf Hierarchical MT]. This paper won the ACL best paper award in 2005.</div>Asalujahttp://curtis.ml.cmu.edu/w/courses/index.php?title=A_Discriminative_Latent_Variable_Model_for_SMT&diff=9596A Discriminative Latent Variable Model for SMT2011-10-31T19:59:46Z<p>Asaluja: </p>
<hr />
<div>== Citation ==<br />
{{MyCiteconference | booktitle = Proceedings of ACL-08:HLT | coauthors = T.Cohn, M.Osborne | date = 2008| first = P.| last = Blunsom| title = A Discriminative Latent Variable Model for Statistical Machine Translation | url =http://aclweb.org/anthology-new/P/P08/P08-1024.pdf}}<br />
<br />
This [[Category::Paper]] is available online [http://aclweb.org/anthology-new/P/P08/P08-1024.pdf].<br />
<br />
=== Background ===<br />
<br />
Current state-of-the-art approaches in statistical machine translation are often phrase-based (e.g., the [http://statmt.org/moses/ Moses decoder]) and/or syntactically motivated (e.g., hierarchical MT with the [http://sourceforge.net/projects/joshua/ Joshua decoder]). While these models have achieved a lot, they are still limited in a number of ways. In particular, the notion of "features" is very limited in existing MT work. Och (2003) presented MERT, a discriminative technique for tuning the weights of the language, phrase, reordering and word penalty models with respect to each other, but it only works well with a small number (10 or fewer) of non-overlapping features. In addition, the features in such models are still built in a generative manner, using frequency counts and maximum likelihood. MT approaches with an arbitrary number of overlapping features have been limited (Liang et al., ACL 2006 is one prior work that does this). <br />
<br />
Discriminative models in machine translation are especially difficult because of the "multiple derivation" aspect of the problem: there are many ways to go from a source sentence to a target sentence (the term "derivation" arises because, in an SCFG, producing an output sentence involves a sequence of SCFG rule applications). Since we do not have any "reference derivations" to update against, one would ideally marginalize over all derivations when finding the best translation, but doing so exactly is NP-complete. Previous approaches side-step this problem by choosing a simple model with simple features, or by treating the best derivation as the best translation (i.e., not marginalizing over all possible derivations). <br />
<br />
=== Summary ===<br />
In this work, the authors manage to incorporate a large number of overlapping features in a hierarchical machine translation system: they featurize each rule in a synchronous context-free grammar (SCFG), something the authors call "Discriminative Synchronous Transduction". They also model the derivation as a latent/hidden variable, which they marginalize out in both training and decoding. <br />
<br />
=== Main Approach ===<br />
<br />
The authors focus on the translation model. They come up with a log-linear translation model, which defines the conditional probability distribution over target translations of a given source sentence. Derivations are modeled as a latent variable. In particular, we can express the conditional probability of a derivation as:<br />
<br />
<math>p_{\Lambda} (\textbf{d, e} | \textbf {f}) = \frac{\exp \sum_k \lambda_k H_k (\textbf {d, e, f})}{Z_{\Lambda}(\textbf {f})}</math><br />
where '''d''' is the derivation, '''e''' is the target sentence, and '''f''' is the source sentence. <math>k</math> indexes the model's features, and the <math>H_k</math> are feature functions that decompose over the rules <math>r</math> of the derivation: <math>H_k (\textbf {d, e, f}) = \sum_{r \in \textbf {d}} h_k (\textbf {f}, r)</math>. Also, <math>Z_{\Lambda} (\textbf{f})</math> is the partition function that globally normalizes the conditional probabilities. The conditional probability of a target sentence given a source sentence is then obtained by marginalizing over derivations:<br />
<br />
<math> p_{\Lambda} (\textbf{e} | \textbf{f}) = \sum_{\textbf{d} \in \Delta(\textbf{e, f})} p_{\Lambda} (\textbf{d, e} | \textbf{f})</math><br />
where <math>\Delta(\textbf{e, f})</math> is the set of all derivations from source to target. <br />
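The two equations above can be sketched for a toy instance as follows; every rule name, feature, and weight here is made up for illustration. Each derivation is a list of rules, each rule carries its local features <math>h_k(\textbf{f}, r)</math>, and <math>p(\textbf{e}|\textbf{f})</math> sums over the derivations yielding '''e'''. <br />

```python
import math

def derivation_score(weights, derivation):
    """exp of sum_k lambda_k * H_k(d, e, f), with H_k summed over the rules r in d."""
    total = 0.0
    for rule_feats in derivation:              # one feature dict h(f, r) per rule
        for k, v in rule_feats.items():
            total += weights.get(k, 0.0) * v
    return math.exp(total)

def p_target_given_source(weights, derivations_by_target):
    """p(e | f) for each candidate target e, marginalizing over its derivations."""
    unnorm = {e: sum(derivation_score(weights, d) for d in ds)
              for e, ds in derivations_by_target.items()}
    z = sum(unnorm.values())                   # partition function Z_Lambda(f)
    return {e: s / z for e, s in unnorm.items()}

# "good day" is reachable via two derivations, "hello" via only one.
derivs = {
    "good day": [[{"bon>good": 1.0}, {"jour>day": 1.0}],
                 [{"bonjour>good day": 1.0}]],
    "hello":    [[{"bonjour>hello": 1.0}]],
}
weights = {"bonjour>hello": 0.5, "bon>good": 0.2}
probs = p_target_given_source(weights, derivs)
```

Note how "good day" can accumulate mass from several derivations even when no single derivation of it outscores the best derivation of a rival translation. <br />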
<br />
To train, the authors rely on MAP estimation, whose prior can be thought of as a regularization term. In this paper, they use a zero-mean Gaussian prior, <math> p_0 (\lambda_k) \propto \exp(-\lambda_k^2 / 2\sigma^2)</math>, and find <math> \Lambda^* = \arg \max_{\Lambda} p_{\Lambda} (\mathcal{D})\, p_0(\Lambda)</math>, where <math>\mathcal{D}</math> is the training data (a parallel sentence corpus). The log of this posterior is maximized with L-BFGS, which requires efficient computation of the objective value and the gradient; this is achieved by inside-outside inference over the SCFG parse chart of the input sentence '''f'''. The full derivation chart is produced using [[CYK_Parsing | CYK Parsing]]. Their formulation is similar to previous work on parsing with a log-linear model and latent variables. Note that in training, if the reference translation for a training instance is not contained in the model's hypothesis space, the unreachable portion of that training sample is discarded. <br />
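As a rough sketch (not the authors' implementation), the MAP objective can be optimized for a toy instance with SciPy's L-BFGS-B, enumerating derivations explicitly in place of the inside-outside dynamic program; the feature vectors and reference indices below are invented for illustration. <br />

```python
import numpy as np
from scipy.optimize import minimize

SIGMA2 = 1.0                                    # variance of the Gaussian prior

# Feature vectors H(d, e, f) for every derivation of one source sentence;
# rows 0 and 2 are the derivations that yield the reference translation.
all_derivs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
ref_rows = [0, 2]

def neg_log_posterior(lam):
    scores = all_derivs @ lam                   # sum_k lambda_k * H_k per derivation
    log_z = np.logaddexp.reduce(scores)         # log Z_Lambda(f)
    log_p_ref = np.logaddexp.reduce(scores[ref_rows]) - log_z
    return -(log_p_ref - lam @ lam / (2 * SIGMA2))

result = minimize(neg_log_posterior, x0=np.zeros(2), method="L-BFGS-B")
lam_map = result.x                              # MAP estimate of the weights
```

Working in log space (via logaddexp) mirrors why the real system needs inside-outside inference: the sums over derivations must be computed stably and efficiently, which brute-force enumeration cannot do at scale. <br />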
<br />
For decoding, the authors use beam search to approximate the sum over all derivations, in order to handle the exponential number of derivations for a given source-target pair. This is similar to previous work on decoding with an SCFG intersected with an n-gram language model. <br />
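The decoding approximation can be sketched as follows (a hypothetical toy, not the paper's decoder): expand partial derivations span by span, prune to the beam, and approximate <math>p(\textbf{e}|\textbf{f})</math> by summing the scores of surviving derivations that produce the same target string. All rules, features, and weights below are invented. <br />

```python
import math
from heapq import nlargest

def beam_decode(options, weights, beam=2):
    """options[i]: list of (target_word, feature_dict) rule choices for span i."""
    hyps = [("", 0.0)]                         # (partial target string, log score)
    for choices in options:
        expanded = []
        for target, logp in hyps:
            for word, feats in choices:        # one expansion per applicable rule
                score = logp + sum(weights.get(k, 0.0) * v
                                   for k, v in feats.items())
                expanded.append(((target + " " + word).strip(), score))
        hyps = nlargest(beam, expanded, key=lambda h: h[1])   # prune to beam width
    totals = {}
    for target, logp in hyps:                  # sum surviving derivations per e
        totals[target] = totals.get(target, 0.0) + math.exp(logp)
    return max(totals, key=totals.get)

# "good" is derivable two ways, "fine" only one; marginalizing over surviving
# derivations prefers "good day" even though "fine day" has the single best one.
options = [[("good", {"g1": 1.0}), ("good", {"g2": 1.0}), ("fine", {"f": 1.0})],
           [("day", {"d": 1.0})]]
weights = {"g1": 1.0, "g2": 0.9, "f": 1.3, "d": 1.0}
```

With a beam of 3, both derivations of "good" survive and their summed mass beats "fine day"; with a beam of 1 only the single best derivation survives, so the decoder degenerates to max-derivation decoding and returns "fine day". This is exactly the translation-vs-derivation distinction evaluated in the results below. <br />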
<br />
=== Baseline & Results ===<br />
<br />
The authors evaluated their model on four fronts: 1) maximizing translations (marginalizing over derivations) vs. maximizing derivations in training and decoding; 2) the regularized model vs. an unregularized maximum-likelihood model; 3) comparison with frequency-count-based systems; and 4) translation performance as the number of training examples is scaled up. All experiments were done on Europarl V2 (French-English); the training corpus consisted of 170k sentence pairs, the tuning set of 315 sentence pairs, and the test set of 1,164. <br />
<br />
First off, we see that there is a huge amount of derivational ambiguity in the data: the number of derivations grows exponentially with the source sentence length (the y-axis is log-scaled):<br />
<br />
[[File:derivationscaling.png]]<br />
<br />
Next, the following table shows that training with all derivations rather than with only the 1-best derivation ("All Derivations" vs. "Single Derivation"), and decoding to optimize translations rather than derivations ("translation" vs. "derivation"), gives the best results. The authors also report how beam width affects translation quality, showing that even a relatively tight beam yields decent results. The table also shows the effect of regularization, or rather the lack thereof, which the last line indicates: the unregularized model lags well behind the regularized one. <br />
<br />
[[File:translationderivation.png]]<br />
<br />
The authors also test on their held-out test set. In the table below, the first and third approaches are the ones proposed in this work. The second is a full Hiero system but without the reverse translation probabilities and reverse lexical probabilities; this is a fairer comparison to the proposed method, since the two systems then have the same parameter space and differ only in how the parameters are estimated. The additional Hiero scores are obtained with MERT training on the full set of Hiero features (the last line is with a language model, the second-to-last without). The authors point out that their approach cannot really be compared with these systems, since that would require incorporating the reverse features and a language model into their model. <br />
<br />
[[File:finalresults.png]]<br />
<br />
Lastly, the authors show the scalability of their model in terms of accuracy, i.e. the learning curve:<br />
<br />
[[File:scalability.png]]<br />
<br />
=== Related Work ===<br />
* Percy Liang's [http://www.seas.upenn.edu/~taskar/pubs/acl06.pdf end-to-end approach to discriminative MT] was one of the first works to use a large number of overlapping features in MT. It used a phrase-based system (Pharaoh, the predecessor to Moses). <br />
* The most popular way to tune feature weights (non-overlapping) is Minimum Error Rate Training [http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf MERT paper]<br />
* The Hierarchical MT model was first proposed by David Chiang [http://www.isi.edu/~chiang/papers/chiang-acl05.pdf Hierarchical MT]. This paper won the ACL best paper award in 2005.</div>