Ristad and Yianilos 1997 Learning String Edit Distance

From Cohen Courses

Citation

Ristad, E.S. and Yianilos, P.N. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20: 522--532, 1998.

Online

Summary

In this paper, the authors describe a stochastic HMM model for learning a String Edit Distance function, together with an efficient EM variant for learning the edit costs.

The String Edit Distance model is then used to provide a stochastic solution to the problem of String Classification, in which classification is based on the similarity of an observed string to an underlying prototype string drawn from a class. The paper describes an efficient algorithm for inducing a joint probability model over a dictionary of prototype strings and the stochastic transducer from a corpus of labeled strings, and this model is used to classify new strings.
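The classification scheme above can be sketched as follows. Note that `joint_log_prob` is a hypothetical stand-in for the learned transducer's joint probability; here it is faked with a simple overlap score just to make the example runnable.

```python
def joint_log_prob(x, y):
    # Stand-in scorer (NOT the paper's model): rewards matching positions
    # and penalizes length mismatch. A real implementation would use the
    # trained stochastic transducer's joint probability p(x, y).
    shared = sum(1 for a, b in zip(x, y) if a == b)
    return shared - 0.5 * abs(len(x) - len(y))

def classify(x, dictionary):
    """dictionary maps class label -> list of prototype strings.
    Returns the class whose best-scoring prototype maximizes the joint score."""
    return max(dictionary,
               key=lambda c: max(joint_log_prob(x, y) for y in dictionary[c]))

dictionary = {"cat": ["kat", "cat"], "dog": ["dawg", "dog"]}
print(classify("catt", dictionary))  # -> cat
```

A full version would also weight prototypes by their prior probabilities within each class, as implied by a joint model over strings and prototypes.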

Finally, the authors address the problem of learning Word Pronunciation in conversational speech by reducing it to a classification problem and applying the String Edit Distance model. In this problem, the classes are the set of syntactic words in a language, and similarity is measured between the phonological segment sequences of words (these are the prototype strings) and the observed phonemes.

A Stochastic Model for String Edit Distance

The distance between two strings is modeled as a memoryless stochastic transduction (using a Pair HMM) composed of edit operations, including deletions, insertions, and substitutions (with letter-specific parameters for every alphabet letter). Using this model, two distance functions are defined: (1) the Viterbi edit distance, defined by the most likely transduction between the two strings, and (2) the stochastic edit distance, defined by summing over all transductions between them. The paper describes variants of the Forward and Backward algorithms, as well as an EM variant for learning the parameters of the HMM, which represent the costs of the edit operations.
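A minimal sketch of the stochastic edit distance under a memoryless transducer, assuming a tied parameterization (one probability each for identity, substitution, insertion, and deletion, spread uniformly over the alphabet); all names and the particular parameter values are illustrative assumptions, not the paper's code.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
P_ID, P_SUB, P_INS, P_DEL = 0.9, 0.05, 0.025, 0.025  # assumed tied parameters

def op_prob(a, b):
    """Probability of emitting the pair (a, b); None denotes epsilon."""
    n = len(ALPHABET)
    if a is None:                 # insertion of b
        return P_INS / n
    if b is None:                 # deletion of a
        return P_DEL / n
    if a == b:                    # identity substitution
        return P_ID / n
    return P_SUB / (n * (n - 1))  # non-identity substitution

def stochastic_edit_distance(x, y):
    """-log of the probability summed over all transductions (forward pass)."""
    T, V = len(x), len(y)
    f = [[0.0] * (V + 1) for _ in range(T + 1)]
    f[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if t == 0 and v == 0:
                continue
            p = 0.0
            if t > 0:
                p += f[t - 1][v] * op_prob(x[t - 1], None)          # deletion
            if v > 0:
                p += f[t][v - 1] * op_prob(None, y[v - 1])          # insertion
            if t > 0 and v > 0:
                p += f[t - 1][v - 1] * op_prob(x[t - 1], y[v - 1])  # substitution
            f[t][v] = p
    return -math.log(f[T][V])

print(stochastic_edit_distance("abc", "abc") < stochastic_edit_distance("abc", "abd"))  # True
```

Replacing the sums with a max over the three predecessors yields the Viterbi edit distance; the backward pass and the EM accumulation of expected operation counts follow the same dynamic-programming lattice.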

Three types of HMMs are considered:

  • Tied: The model is simplified to contain only four distinct edit costs, for insertion, deletion, identity, and substitution. Using these parameter equivalence classes is termed "parameter tying".
  • Mixed: A K-component mixed transducer is a linear combination of K HMMs, each using a different model (e.g., each model may use a different degree of parameter tying).
  • Memory: This is a model using a higher-order Markov chain, that is, the probability of future events depends on the previous k events, for some fixed k.
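The mixed model above amounts to a convex combination of component transducers. A minimal sketch, where the two component scorers are toy placeholders (not the paper's trained models):

```python
def component_a(x, y):
    # Placeholder for e.g. a heavily tied transducer (hypothetical scorer).
    return 0.5 if x == y else 0.1

def component_b(x, y):
    # Placeholder for e.g. an untied transducer (hypothetical scorer).
    return 0.4 if len(x) == len(y) else 0.05

def mixed_prob(x, y, components, lambdas):
    """p(x, y) = sum_k lambda_k * p_k(x, y), with mixture weights summing to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p(x, y) for lam, p in zip(lambdas, components))

print(mixed_prob("abc", "abc", [component_a, component_b], [0.7, 0.3]))  # ~0.47
```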

Baseline

The model is compared with the Levenshtein distance, which is the minimum number of edit operations required to transduce one string into the other. Effectively, this is the transduction model in which identity substitutions cost zero and all other edit operations cost one.
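The baseline can be computed with the classic dynamic program, here in a two-row form:

```python
def levenshtein(x, y):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn x into y; identity substitutions are free."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # (identity) substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```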

Results

The experimental evaluations test this approach with all of the suggested models (Tied, Untied, and Mixed, with the distance set to either the Viterbi or the stochastic distance for each model). The error rates of the stochastic models range from one half to one sixth of the error of the baseline Levenshtein distance. Over 87 percent of the unseen pronunciations are correctly recognized, which is within 4 percent of the maximum success rate achievable by any classifier.