Latest revision as of 12:01, 31 March 2011
Citation
Florencia Reali and Thomas L. Griffiths, Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift, Proceedings of The Royal Society of London. Series B, Biological Sciences 2010
Online version
Summary
This paper addresses the problem of language evolution by relating it to models of genetic drift in biological evolution. The mechanisms of biological and language evolution are very different (biological traits are transmitted via genes, while language is transmitted via learning), yet the paper shows that these mechanisms can arrive at the same results. Specifically, the paper demonstrates that the transmission of frequency distributions over linguistic variants by Bayesian learners arrives at the same results as the Wright-Fisher model of genetic drift.
Description of the method
The focus of this paper is to model how language changes as a consequence of being passed from one learner to another. The learning problem considered is the estimation of the frequencies of a set of linguistic variants. Learners are modeled as using Bayes' Law to estimate a probability distribution over the set of variants.
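Assume that a learner is exposed to N occurrences of a linguistic token such as a word, sound, or grammatical construction, partitioned over K different variants. Let the vectors <math>x = [x_1, x_2, \ldots, x_K]</math> and <math>\theta = [\theta_1, \theta_2, \ldots, \theta_K]</math> denote the observed frequencies and the estimated probabilities of the K variants, respectively. The learner's expectations are expressed in a prior probability distribution <math>P(\theta)</math>. After seeing the data <math>x</math>, the learner computes the posterior probability of <math>\theta</math>, <math>P(\theta|x)</math>, using Bayes' Law:
<math>P(\theta|x) = \frac{P(x|\theta)P(\theta)}{\int P(x|\theta')P(\theta') \, d\theta'}</math>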
Although the model is neutral over the linguistic variants with no variant being favored a priori over the others, learners can differ in their expectations about the amount of probabilistic variation in a language. For example, a learner may expect the language to be more deterministic thus assigning a very small probability to any unobserved variant, while another learner may assign it a much higher probability, indicating the willingness to consider the unobserved variants as part of the language. This is related to the process of smoothing in language modeling.
A way to capture such expectations while still maintaining neutrality between variants is to use, just as in smoothing, a K-dimensional Dirichlet distribution as the prior. Specifically, if the prior <math>P(\theta)</math> is a symmetric K-dimensional Dirichlet distribution with parameters <math>\alpha/K</math>, the probability a learner will assign to the next observation being variant k, after seeing <math>x_k</math> instances of that variant out of a total of N observations, is equal to <math>(x_k + \alpha/K)/(N + \alpha)</math>.
Thus, when <math>\alpha/K < 1</math>, a learner assigns a small probability to unseen variants, reducing the probabilistic variation of the language and favoring 'regularization' of languages towards deterministic rules. When <math>\alpha/K > 1</math>, on the other hand, a learner assigns a higher probability to unseen variants.
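The effect of the prior on unseen variants can be checked numerically. The following is a minimal sketch of the predictive probability given above; the function name `predictive_probs` and the example counts are illustrative assumptions, not taken from the paper.

```python
def predictive_probs(counts, alpha):
    """Probability of each variant being observed next, under a symmetric
    Dirichlet prior with parameter alpha / K on each of the K variants."""
    K = len(counts)          # number of variants
    N = sum(counts)          # total number of observed tokens
    return [(x_k + alpha / K) / (N + alpha) for x_k in counts]


# Hypothetical data: three variants, the last one never observed.
counts = [3, 1, 0]

# alpha / K = 0.1 < 1: the unseen variant gets very little probability.
regularizing = predictive_probs(counts, alpha=0.3)

# alpha / K = 2 > 1: the unseen variant gets noticeably more probability.
smoothing = predictive_probs(counts, alpha=6.0)

print(regularizing[2], smoothing[2])
```

The two calls differ only in the value of alpha, so the gap between the two printed numbers is due to the prior alone.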
Using this framework, the evolution of a language can be formulated as follows. The learner estimates <math>\theta</math> from a sample of N tokens produced by a speaker, and then generates utterances x of their own by sampling from the distribution <math>P(x|\theta)</math> associated with that estimate of <math>\theta</math>. The generated utterances are then presented to the next learner, and so on, forming an iterated learning process.
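The chain of learners described above can be sketched as a short simulation. This is only an illustration under stated assumptions: each learner is assumed to use the posterior mean (the predictive probability given earlier) as the estimate of the variant probabilities, and the function name `iterate_learning`, the seed, and the parameter values are hypothetical choices, not the paper's experimental setup.

```python
import random


def iterate_learning(initial_counts, alpha, generations, seed=0):
    """Simulate a chain of learners, each hearing N tokens from the previous one.

    Assumption: a learner's estimate of theta is the posterior mean under a
    symmetric Dirichlet prior with parameter alpha / K.
    """
    rng = random.Random(seed)
    K = len(initial_counts)
    N = sum(initial_counts)
    counts = list(initial_counts)
    history = [list(counts)]
    for _ in range(generations):
        # Learning step: estimate theta from the observed frequencies.
        theta = [(x_k + alpha / K) / (N + alpha) for x_k in counts]
        # Production step: generate N tokens by sampling from P(x | theta).
        tokens = rng.choices(range(K), weights=theta, k=N)
        counts = [tokens.count(k) for k in range(K)]
        history.append(counts)
    return history


# Two variants, 10 tokens per generation, a prior favoring regularization.
history = iterate_learning([5, 5], alpha=0.2, generations=50)
print(history[-1])
```

Each entry of `history` is the frequency vector passed on to the next learner, so plotting one coordinate over the generations gives a trajectory of a variant's frequency over time.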
Since the frequencies generated by a learner depend only on the frequencies generated by the previous learner, this iterated learning process of how language evolves is a Markov chain. It is possible to analyze the dynamics of the process (i.e. the dynamics of how a language changes) by computing a transition matrix indicating the probability of moving from one frequency value to another across generations. Its asymptotic consequences can be characterized by computing the stationary distribution to which the Markov chain converges.
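For a small case, the transition matrix and stationary distribution can be computed directly. The sketch below assumes K = 2 variants, so that the state is simply the count of the first variant, and again assumes that learners use the posterior mean as their estimate; the function names and parameter values are illustrative, and the stationary distribution is approximated by repeated multiplication rather than by the analytical results the paper draws from population genetics.

```python
from math import comb


def transition_matrix(N, alpha):
    """P[i][j]: probability that the next learner produces j tokens of
    variant 1, given that the current learner heard i of them (K = 2)."""
    rows = []
    for i in range(N + 1):
        # Assumed estimate: posterior mean with Dirichlet parameter alpha / 2.
        p = (i + alpha / 2) / (N + alpha)
        # The next generation's count is Binomial(N, p).
        rows.append([comb(N, j) * p**j * (1 - p)**(N - j) for j in range(N + 1)])
    return rows


def stationary_distribution(P, iterations=5000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


P = transition_matrix(N=10, alpha=0.2)
pi_regularizing = stationary_distribution(P)                                # alpha / K < 1
pi_variable = stationary_distribution(transition_matrix(N=10, alpha=20.0))  # alpha / K > 1

print(pi_regularizing[0], pi_regularizing[5])
print(pi_variable[0], pi_variable[5])
```

In this sketch, the regularizing prior puts more stationary mass on the state where only one variant survives than on the evenly mixed state, while the larger alpha does the opposite.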
Datasets used
Aside from using simulated data in its experiments, the paper also uses a corpus of child-directed speech. The experiments involve simulating the process of language evolution via iterated learning.
Experimental Results
The model developed in this paper is related to the Wright-Fisher model in biology, which describes the behavior of alleles evolving in the absence of selection. The Wright-Fisher model is thus 'neutral' over the allele variants, just as the model in this paper is 'neutral' over the linguistic variants. Having established the correspondence between the Wright-Fisher model and the proposed model, the paper can use results from population genetics to characterize the dynamics and stationary distribution of the Markov chain defined by iterated learning, indicating the kind of languages that will emerge over time. In particular, the equivalence between these models can account for three basic regularities in the form and evolution of languages:
- S-shaped curves in language change: When old linguistic variants are replaced by new ones, an S-shaped curve is typically observed in plots of frequency over time. Using the proposed model, the paper shows that such a curve can emerge in the plot of frequency over time provided that learners have priors favoring regularization, i.e. <math>\alpha/K < 1</math>.
- Emergence of power-law distribution: One of the interesting properties of human languages is that word frequencies follow a power-law distribution. The paper shows that such a distribution can be produced via simulation with select parameters using the proposed 'neutral' model. Comparing the distribution over frequencies produced by the simulation with that computed from the child-directed speech corpus shows that the model produces a power law with exponent <math>\gamma = 1.74</math>, a close match to the exponent estimated from the child-directed speech corpus (<math>\gamma = 1.7</math>).
- Frequency effects in lexical replacement rates: Another property of human languages is that frequently used words are replaced much more slowly than less frequent ones. In other words, there is an inverse power-law relationship between frequency of use and replacement rate. Using simulations based on the proposed model, the paper shows that the replacement rate follows an inverse power-law relationship with frequency.
Discussion
The novelty of the paper lies in the correspondence it draws between a simple iterated learning mechanism based on Bayesian inference and a model of genetic drift (the Wright-Fisher model). This justifies using models of genetic drift to account for language change over time in the absence of selection among linguistic variants. By manipulating the parameters of the prior, the paper is able to model learners' different expectations about the variability of the language while still maintaining neutrality between its variants. The similarity between the proposed model and the Wright-Fisher model from biology is also interesting because it may shed more light on the nature of language evolution and how closely it is related to biological evolution.
The drawback of this paper is that it is essentially a simulation of unigram language modeling with smoothing, conducted iteratively over several 'generations'. These 'generations' of iterated learning are not anchored to a real timeline or dataset. In the future, a mapping from 'generations' to actual time periods could be explored.
Related Papers
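This paper defines a 'neutral' model of how languages evolve in the absence of selection at the level of linguistic variants: it models language evolution only as a result of being transmitted from one learner to another. In contrast, other recent computational work has focused on the role of selective forces or directed mutation at the level of linguistic variants:
- Komarova, N. L. & Nowak, M. A. 2001 Natural selection of the critical period for language acquisition. Proc. R. Soc. Lond. B. 268, 1189–1196
- Christiansen, M. H. & Chater, N. 2008 Language as shaped by the brain. Behav. Brain Sci. 31, 489–558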