Bayesian grammar induction for language modeling, Chen 1995
Bayesian grammar induction for language modeling. By Stanley F. Chen, . In Proceedings of the 33rd Annual Meeting of the ACL, vol. (), 1995.
The Paper addresses the problem of Parsing using unsupervised learning of probabilistic context free grammar. It views the problem of grammar induction as a search problem, with the search space spanning over all the possible grammars. The objective function that is optimized is the a-posterior probability where represents the grammar and represents the training examples. The algorithm prefers smaller grammars by using a universal a-priori probability over the grammar , where is the length of the description of the grammar in bits. The paper uses heuristics to constrain the search space as well as calculate the objective function. The method is compared with smoothed n-grams and the inside outside algorithm where it outperforms inside outside algorithm.
The algorithm starts with a grammar powerful enough to generate any string to ensure that the generation spans the training data. It then takes a Hill Climbing approach by adding a rule that increases the value of the objective function. This is done until there doesn't exist any rule that increases the value of the objective function.
The search space is made tractable by restricting the rules that can be added to the forms and . is always created as a new symbol.
The evaluation of the objective function is not done directly. is calculated directly. However, is not calculated. Instead, the change in is approximated. The method uses training data incrementally. Starting from an initial hypothesis grammar, an optimal grammar is searched for using the first instance of the training data. The optimal grammar is used as a hypothesis grammar for second instance and the process continues incrementally until all the training data is exhausted. At the end of each iteration, the probability of a grammar rule is distributed uniformly over all the newly generated rules. To address the effect of the constraints, the author suggests the use of post-pass as described in the Lari and Young 1990.
Being computationally expensive, the data is constrained to three sets each of 4500 sentences (with 500 for smoothing and 500 for test). The data is constructed with a grammar having 10 non-terminal, 20 terminal and 30 rules. The data for the first two domains was created artifically and for the third, the non-terminals were extracted from the 20 most frequent occurring symbols from Penn Treebank. The size of the vocabulary was reduced by mapping the lexicons to the corresponding POS tag.
The method is compared against smoothed n-grams and the inside outside algorithm. The method outperforms moderately for the first two domains, while for the third domain, it outperforms significantly to inside outside algorithm, but is outperformed by n-gram model.