Midterm Report Nitin Yandong Ming Yanbo

Team members

Nitin Agarwal

Yandong Liu

Yanbo Xu

Ming Sun

LDA results

We used the ACL 2008 corpus for experimentation.
For exploratory analysis of the corpus, we ran the LDA model.
Parameters of the LDA model
- Number of topics : 100
- Gibbs iterations : 2000
- Beta prior : 0.5
- Alpha prior : 1.0
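
The report does not say which LDA implementation was used. As a minimal sketch only, a collapsed Gibbs sampler with symmetric priors could look like the following. The function name `lda_gibbs`, its arguments, and the integer word-id input format are assumptions made for illustration; the defaults mirror the parameters listed above.

```python
import random

def lda_gibbs(docs, vocab_size, num_topics=100, iterations=2000,
              alpha=1.0, beta=0.5, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (topic_word, doc_topic) count tables.
    """
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_total = [0] * num_topics
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            topic_word[z][w] += 1
            doc_topic[d][z] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current token from all counts.
                topic_word[z][w] -= 1
                doc_topic[d][z] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional P(z_i = t | all other assignments).
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = z
                topic_word[z][w] += 1
                doc_topic[d][z] += 1
                topic_total[z] += 1
    return topic_word, doc_topic
```

On a real corpus this pure-Python loop would be slow; it is meant only to show where the four parameters enter the sampler.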

Some of the topics obtained post training

Error Detection (Topic 6)
- errors, error, correct, rate, correction, spelling, detection, based, detect, types, detecting
Evaluation (Topic 10)
- evaluation, human, performance, automatic, quality, evaluate, study, results, task, metrics
Entity Coreference (Topic 13)
- names, entity, named, entities, person, coreference, task, ne, recognition, proper, location
Parsing (Topic 18)
- parsing, parser, parse, grammar, parsers, parses, input, chart, partial, syntactic, parsed, algorithm
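
Word lists like the ones above are commonly read off the topic-word counts after training. The sketch below assumes the words were ranked by raw count within each topic, which the report does not state; `top_words`, `topic_word`, and `vocab` are hypothetical names.

```python
def top_words(topic_word, vocab, k=10):
    """Return the k highest-count words for each topic (illustrative sketch).

    topic_word[z][v]: number of tokens of word v assigned to topic z.
    vocab[v]: the string form of word id v.
    """
    result = []
    for counts in topic_word:
        # Sort word ids by descending count; ties keep vocabulary order.
        ranked = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        result.append([vocab[v] for v in ranked[:k]])
    return result
```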

ATM results

Gibbs Sampling for Collaboration Influence Model

We want $P(Z,X,R|W)$, the posterior distribution over the topics Z, the (author, collaborator) pairs X, and the indicators R of whether collaboration is favored over influence, given the words W in the corpus:

$P(Z,X,R|W)={\frac {P(Z,X,R,W)}{\sum _{Z,X,R}P(Z,X,R,W)}}$

We begin by calculating $P(W|Z,X,R)$ and $P(Z,X,R)$ :

$P(W|Z,X,R)=P(W|Z)=\prod _{z=1}^{T}({\frac {\Gamma (\sum _{v=1}^{V}\beta _{v})}{\prod _{v=1}^{V}\Gamma (\beta _{v})}}({\frac {\prod _{v=1}^{V}\Gamma (n_{z}^{w_{v}}+\beta _{v})}{\Gamma (\sum _{v=1}^{V}\beta _{v}+\sum _{v=1}^{V}n_{z}^{w_{v}})}}))$
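
Because the Gamma functions overflow quickly, this product is usually evaluated in log space. The following is a minimal sketch under that assumption; `topic_word[z][v]` stands for the count $n_{z}^{w_{v}}$ and `beta[v]` for $\beta _{v}$, and the function name is hypothetical.

```python
from math import lgamma

def log_p_words_given_topics(topic_word, beta):
    """log P(W|Z) from topic-word counts and a prior vector (illustrative sketch)."""
    beta_sum = sum(beta)
    # log of Gamma(sum_v beta_v) / prod_v Gamma(beta_v), shared by all topics.
    log_norm = lgamma(beta_sum) - sum(lgamma(b) for b in beta)
    total = 0.0
    for counts in topic_word:
        total += log_norm
        total += sum(lgamma(n + b) for n, b in zip(counts, beta))
        total -= lgamma(beta_sum + sum(counts))
    return total
```

With all counts at zero every factor cancels and the log-probability is 0, which is a quick sanity check for an implementation.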

$P(Z,X,R)=(\prod _{i_{w}=1}^{W}{\frac {1}{n_{r_{i_{w}}}(a_{i_{w}})+\eta _{r_{i_{w}}}}})\prod _{p=1}^{P}({\frac {\Gamma (\sum _{z}\alpha _{z})}{\prod _{z=1}^{T}\Gamma (\alpha _{z})}}{\frac {\prod _{z}\Gamma (n_{p}^{z}+\alpha _{z})}{\Gamma (\sum _{z}\alpha _{z}+\sum _{z}n_{p}^{z})}})$ ,

where P is the number of distinct (author, collaborator, collaboration-favor) combinations (a,a',r).

The Gibbs sampling conditional for token i then follows from the ratio of the joint probability with and without that token:

$P(z_{i},x_{i},r_{i},w_{i}|Z_{-i},X_{-i},R_{-i},W_{-i})$

$={\frac {P(Z,X,R,W)}{P(Z_{-i},X_{-i},R_{-i},W_{-i})}}$

$={\frac {1}{n_{r_{i}}+\eta _{r_{i}}}}{\frac {n_{p,-i}^{t}+\alpha _{t}}{\sum _{z}n_{p,-i}^{z}+\sum _{z}\alpha _{z}}}{\frac {n_{t,-i}^{w_{v}}+\beta _{v}}{\sum _{v}n_{t,-i}^{w_{v}}+\sum _{v}\beta _{v}}}$

Further manipulation turns the above equation into update equations for the topic and for the (author, collaborator) pair with its collaboration indicator, for each corpus token:

$P(z_{i}|Z_{-i},X,W,R)\propto {\frac {n_{z_{i}}^{w_{v}}+\beta _{v}}{\sum _{v}(n_{z_{i}}^{w_{v}}+\beta _{v})}}{\frac {n_{x_{i}}^{z_{i}}+\alpha _{z_{i}}}{\sum _{z'}(n_{x_{i}}^{z'}+\alpha _{z'})}}{\frac {n_{r_{i}}+\eta _{r_{i}}}{\sum _{r'}(n_{r'}+\eta _{r'})}}$

$P(x_{i},r_{i}|Z,X_{-i},W,R_{-i})\propto {\frac {n_{x_{i},r_{i}}^{z_{i}}+\alpha _{z_{i}}}{\sum _{z'}(n_{x_{i},r_{i}}^{z'}+\alpha _{z'})}}{\frac {n_{r_{i}}+\eta _{r_{i}}}{\sum _{r'}(n_{r'}+\eta _{r'})}}$
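
As a rough illustration of how the second update could be applied to one token, the sketch below computes the unnormalised weight of each candidate $(x_{i},r_{i})$ and draws one in proportion. The data layout (`pair_topic`, `r_count`, `candidates`) and both function names are assumptions made for this example and are not taken from the report's implementation; all counts are assumed to already exclude the current token.

```python
import random

def pair_weights(z, candidates, pair_topic, r_count, alpha, eta):
    """Unnormalised weights for each candidate (x, r) of one token (sketch).

    z: current topic of the token.
    candidates: list of (x, r) options; x is an (author, collaborator) key, r an index.
    pair_topic[(x, r)][t]: tokens of topic t assigned to (x, r).
    r_count[r]: tokens assigned to indicator value r.
    alpha[t], eta[r]: prior pseudo-counts.
    """
    r_norm = sum(r_count[r] + eta[r] for r in range(len(eta)))
    alpha_sum = sum(alpha)
    weights = []
    for x, r in candidates:
        counts = pair_topic[(x, r)]
        topic_term = (counts[z] + alpha[z]) / (sum(counts) + alpha_sum)
        favor_term = (r_count[r] + eta[r]) / r_norm
        weights.append(topic_term * favor_term)
    return weights

def sample_pair(z, candidates, pair_topic, r_count, alpha, eta, rng):
    """Draw one (x, r) in proportion to the weights above."""
    weights = pair_weights(z, candidates, pair_topic, r_count, alpha, eta)
    return rng.choices(candidates, weights=weights)[0]
```

The topic update can be sketched the same way, with the word term from the first equation multiplied in and the loop running over topics instead of candidate pairs.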