Florencia Reali and Thomas L. Griffiths, Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift, Proceedings of The Royal Society of London. Series B, Biological Sciences 2010


Citation

Florencia Reali and Thomas L. Griffiths. Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift. Proceedings of the Royal Society B: Biological Sciences, 2010.

Online version

Link to paper

Summary

This paper addresses the problem of language evolution by relating it to models of genetic drift in biological evolution. Although the mechanisms of biological evolution and language evolution are very different (biological traits are transmitted via genes, while language is transmitted via learning), the paper shows that the two can give rise to the same dynamics. Specifically, it demonstrates that the transmission of frequency distributions over linguistic variants by Bayesian learners produces the same results as the Wright-Fisher model of genetic drift.

Description of the method

The focus of this paper is to model how language changes as a consequence of being passed from one learner to another. The learning problem considered is the estimation of the frequencies of a set of linguistic variants. Learning is modeled as statistical inference, with learners using Bayes' rule to estimate a probability distribution over the set of variants.

Assume that a learner is exposed to $N$ occurrences of a linguistic token such as a word, sound, or grammatical construction, partitioned over $K$ different variants. Let the vectors $x = [x_1, x_2, \ldots, x_K]$ and $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$ denote the observed frequencies and the estimated probabilities of the $K$ variants, respectively. The learner's expectations are expressed in a prior probability distribution $P(\theta)$. After seeing the data $x$, the learner computes the posterior probability of $\theta$ using Bayes' rule:

    $P(\theta \mid x) = \dfrac{P(x \mid \theta)\, P(\theta)}{\int P(x \mid \theta)\, P(\theta)\, d\theta}$
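
Here the likelihood term is fixed by the sampling setup: with $N$ tokens drawn independently from the distribution $\theta$, the counts $x$ are multinomial (a standard step, spelled out here for completeness):

    $P(x \mid \theta) = \dfrac{N!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k}, \qquad \textstyle\sum_{k} x_k = N$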

Although the model is neutral over the linguistic variants, with no variant favored a priori over the others, learners can differ in their expectations about the amount of probabilistic variation in a language. For example, one learner may expect the language to be more deterministic, and thus assign a very small probability to any unobserved variant, while another learner may assign it a much higher probability, indicating a willingness to consider unobserved variants part of the language. This is analogous to smoothing in language modeling.

A way to capture such expectations while still maintaining neutrality between variants is to use, just as in smoothing, a symmetric $K$-dimensional Dirichlet distribution as the prior. Specifically, if the prior is a symmetric $K$-dimensional Dirichlet distribution with parameters $\alpha/K$, the probability that a learner assigns to the next observation being variant $k$, after seeing $x_k$ instances of that variant out of $N$ total observations, is $\frac{x_k + \alpha/K}{N + \alpha}$.
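
This follows from Dirichlet-multinomial conjugacy; the intermediate step, left implicit above, is that the posterior is itself a Dirichlet distribution, and the predictive probability is its mean:

    $P(\theta \mid x) = \mathrm{Dirichlet}\left(x_1 + \tfrac{\alpha}{K}, \ldots, x_K + \tfrac{\alpha}{K}\right), \qquad P(\text{next} = k \mid x) = \mathbb{E}[\theta_k \mid x] = \dfrac{x_k + \alpha/K}{N + \alpha}$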

Thus, when $\alpha < 1$, the learner assigns a small probability to unseen variants, reducing the probabilistic variation of the language. When $\alpha > 1$, on the other hand, the learner assigns a higher probability to unseen variants.
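
A minimal sketch of this predictive rule in Python (the function name and example counts are illustrative, not taken from the paper):

    import numpy as np

    def predictive_probs(counts, alpha):
        # Posterior predictive under a symmetric Dirichlet(alpha/K) prior:
        # P(next = k | x) = (x_k + alpha/K) / (N + alpha)
        counts = np.asarray(counts, dtype=float)
        K, N = len(counts), counts.sum()
        return (counts + alpha / K) / (N + alpha)

    # K = 3 variants; the third is unobserved among N = 10 tokens.
    counts = [7, 3, 0]
    print(predictive_probs(counts, alpha=0.1))   # unseen variant gets ~0.003
    print(predictive_probs(counts, alpha=10.0))  # unseen variant gets ~0.167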

Using this framework, language evolution can be formulated as follows: a learner estimates $\theta$ from a sample of $N$ tokens produced by the previous speaker, then generates utterances $x$ by sampling from the distribution associated with that estimate of $\theta$. The generated utterances are in turn presented to the next learner, and so on, forming an iterated learning process:

[Figure: the iterated learning process, in which each learner estimates $\theta$ from the previous learner's output and generates the data seen by the next learner.]

Since the frequencies generated by a learner depend only on the frequencies generated by the previous learner, this iterated learning process is a Markov chain. The dynamics of the process (i.e. how the language changes across generations) can be analyzed by computing a transition matrix giving the probability of moving from one frequency vector to another; its asymptotic consequences can be characterized by computing the stationary distribution to which the chain converges.
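
The chain is easy to simulate, and for $K = 2$ its transition matrix can be written down exactly; below is a minimal sketch, assuming learners sample $\theta$ from their posterior (function names are illustrative; SciPy's beta-binomial distribution supplies the $K = 2$ Dirichlet-multinomial transition probabilities):

    import numpy as np
    from scipy.stats import betabinom

    rng = np.random.default_rng(0)

    def next_generation(x, N, alpha, K):
        # One step of iterated learning: observe counts x, sample theta from
        # the Dirichlet posterior, then produce N tokens for the next learner.
        theta = rng.dirichlet(x + alpha / K)
        return rng.multinomial(N, theta)

    def transition_matrix(N, alpha):
        # Exact transition matrix for K = 2: the state is x_1 in {0, ..., N},
        # and x_1' | x_1 is beta-binomial (Dirichlet-multinomial with K = 2).
        a = alpha / 2
        ks = np.arange(N + 1)
        return np.array([betabinom.pmf(ks, N, x1 + a, (N - x1) + a)
                         for x1 in range(N + 1)])

    # Simulate one chain of 5 generations from an initial 50/50 state.
    x = np.array([5, 5])
    for _ in range(5):
        x = next_generation(x, N=10, alpha=0.5, K=2)
    print(x)

    # The stationary distribution is the left eigenvector of T with eigenvalue 1.
    T = transition_matrix(N=10, alpha=0.5)
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    print(pi.round(3))  # for alpha < 1, mass piles up at x_1 = 0 and x_1 = N

With $\alpha < 1$ the stationary distribution concentrates on near-deterministic languages dominated by a single variant, matching the regularization behavior described above.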


Relation to the Wright-Fisher model

The model developed in this paper is related to the Wright-Fisher model in biology, which describes the behavior of alleles evolving in the absence of selection. The Wright-Fisher model is thus 'neutral' over allele variants, just as the model in this paper is 'neutral' over linguistic variants. By drawing out this correspondence, the paper can use results from population genetics to characterize the dynamics and stationary distribution of the Markov chain defined by iterated learning, indicating the kinds of languages that will emerge over time.
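
The correspondence can be made concrete with a little algebra: if a learner produces tokens from the posterior-mean estimate (one natural production rule under this model), the next generation's counts are multinomial with parameters $p_k = \frac{x_k + \alpha/K}{N + \alpha} = \frac{x_k}{N}(1 - u) + \frac{u}{K}$ where $u = \frac{\alpha}{N + \alpha}$, which is exactly Wright-Fisher sampling with symmetric mutation rate $u$. A small numerical check of this identity (illustrative code, with posterior-mean production assumed):

    import numpy as np

    def posterior_mean_probs(x, N, alpha, K):
        # Production probabilities for a learner using the posterior mean.
        return (np.asarray(x) + alpha / K) / (N + alpha)

    def wright_fisher_probs(x, N, u, K):
        # Offspring sampling probabilities under Wright-Fisher with
        # symmetric mutation: copy a random parent, mutate with rate u.
        return (np.asarray(x) / N) * (1 - u) + u / K

    x, N, alpha, K = np.array([7, 3, 0]), 10, 0.5, 3
    u = alpha / (N + alpha)  # mutation rate implied by the prior strength
    print(np.allclose(posterior_mean_probs(x, N, alpha, K),
                      wright_fisher_probs(x, N, u, K)))  # True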



Related Papers

Unlike this paper, which defines a 'neutral' model of how languages evolve in the absence of selection at the level of linguistic variants (i.e. the language changes only as a consequence of being transmitted from one learner to another), other recent computational work has focused on the role of selective forces or directed mutation at the level of linguistic variants.