Jurgens and Lu ICWSM 2012

== Citation ==

 @inproceedings{DBLP:conf/icwsm/JurgensL12,
  author = {David Jurgens and Tsai-Ching Lu},
  title = {Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia},
  booktitle = {ICWSM},
  year = {2012}
 }

== Online version ==

Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia

== Summary ==
 
The growth of Wikipedia relies on the cooperative, and sometimes combative, interactions among editors working on the same page.

Most research on Wikipedia editor interactions, however, focuses on cooperative behaviors, which calls for a full [[AddressesProblem::Wikipedia Analysis| analysis of editing behaviors in Wikipedia]], covering both cooperative and combative interactions.

To investigate editor interactions in Wikipedia in this context, this [[Category::Paper|paper]] proposes to represent Wikipedia's revision history as a temporal, bipartite network with multiple node and edge types for users and revisions.
 
From this representation, they identify author interactions as network motifs and show how the motif types capture editing behaviors.  
 
They demonstrate the usefulness of motifs on two tasks: (1) classification of pages as combative or cooperative and (2) analysis of the dynamics of editor behavior to explain Wikipedia's content growth.
  
== Proposed analysis method ==
  
=== Network representation ===
  
They view editor interactions in Wikipedia as a bipartite graph from authors to pages.

They expand this representation to encode three additional features: (1) the type of author who made the change, (2) the time at which the change was made, and (3) the magnitude and effect of the change to the page.

To do so, they define the bipartite graph of Wikipedia revisions as follows.
  
[[File:Jurgens_2.png]]
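
For concreteness, here is a minimal Python sketch of what such a temporal, bipartite revision network might look like. All class and field names here are illustrative assumptions, not the paper's actual schema.

<pre>
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Author:
    name: str
    author_type: str       # e.g. "registered", "anonymous", "admin", "bot"

@dataclass(frozen=True)
class Page:
    title: str

@dataclass(frozen=True)
class Revision:
    author: Author
    page: Page
    timestamp: int         # when the change was made
    revision_class: str    # e.g. "MajorAdd", "MinorEdit", "Revert"

@dataclass
class RevisionNetwork:
    """Bipartite graph: every edge runs from an author to a page,
    one edge per revision, annotated with time and revision class."""
    edges: list = field(default_factory=list)

    def add_revision(self, rev):
        self.edges.append(rev)

    def page_history(self, page):
        """All revisions of a page in temporal order."""
        return sorted((r for r in self.edges if r.page == page),
                      key=lambda r: r.timestamp)
</pre>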
  
  
The figure below illustrates a subset of a page's history as a sequence of classified revisions.
  
[[File:Jurgens_1.png]]
 
  
=== Network derivation from Wikipedia dataset ===
'''Data''':
* The [[UsesDataset::Wikipedia_revision_dataset|Wikipedia revision dataset]] is derived from the complete revision history of Wikipedia, ending on April 5, 2011.
* After extracting article pages that have at least 10 revisions, the resulting dataset contained 2,715,123 articles and 227,034,806 revisions.

* Though not the same data, another Wikipedia edit-history dataset is available from [http://snap.stanford.edu/data/wiki-meta.html SNAP].
  
'''Revision classes''':
* They selected four high-level categories for revisions: adding, deleting, editing, and reverting.
* Using (1) the revising author's comment and (2) the MD5 hash of the article text, a revision can be identified as a revert or not.

* To classify a revision into one of the other three revision classes, they used two parameters: (1) the number of whitespace-delimited tokens added or removed from the page, <math>\delta</math>, i.e., its change in size, and (2) the number of tokens whose content was changed, <math>\theta</math>.

* The classification rule is as follows.
  
[[File:Jurgens_3.png]]
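
The exact decision rule appears in the figure above; the Python sketch below only approximates it, using the revision comment and an MD5 match against earlier versions for revert detection, and the signs of <math>\delta</math> and <math>\theta</math> for the remaining classes. The function names and thresholds are illustrative assumptions, not the paper's.

<pre>
import hashlib

def is_revert(comment, page_text, previous_hashes):
    """A revision counts as a revert if the author's comment says so, or if
    the resulting page text is byte-identical (same MD5) to an earlier version."""
    digest = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    return "revert" in comment.lower() or digest in previous_hashes

def classify_revision(delta, theta):
    """Approximate mapping from (delta, theta) to a revision class.
    delta: change in page size in tokens; theta: number of changed tokens.
    The paper's actual rule is given in its figure; this is illustrative."""
    if delta > 0:
        return "Add"      # the page grew
    if delta < 0:
        return "Delete"   # the page shrank
    return "Edit" if theta > 0 else "NoChange"
</pre>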
* To further distinguish edits based on the magnitude of their effect in addition to the type, they partition each class into major and minor subcategories, with the exception of Revert.
* Based on the shape of the effect distributions, the threshold between major and minor was selected using the Pareto principle, or "80/20 rule" ([http://people.physics.anu.edu.au/~tas110/Teaching/Lectures/L4/Material/Newman05.pdf Newman, M. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46(5):323–351.]).

* The intuition here is that the revisions with small effects account for the majority of the cumulative effect on the content.

* The figure below shows distributions of the effects for the Add, Delete, and Edit types. Vertical lines indicate the division between major and minor revisions based on the 80/20 rule, where 80% of a type's cumulative effects are due to those to the left of the line.
  
[[File:Jurgens_4.png]]
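
Here is a small Python sketch of how such an 80/20 cutoff could be computed from an empirical effect distribution; the function name and the synthetic heavy-tailed sample are our own illustration, not the paper's.

<pre>
import numpy as np

def pareto_threshold(effects, mass=0.80):
    """Find the effect size below which `mass` (default 80%) of a revision
    type's cumulative effect is accounted for; revisions at or below the
    threshold would be labelled minor, the rest major."""
    effects = np.sort(np.asarray(effects, dtype=float))
    cumulative = np.cumsum(effects) / effects.sum()
    # first index where the cumulative share reaches the target mass
    idx = np.searchsorted(cumulative, mass)
    return effects[min(idx, len(effects) - 1)]

# Example with a synthetic heavy-tailed sample of edit sizes
sizes = np.random.pareto(a=2.0, size=10_000) + 1
print("major/minor cutoff:", pareto_threshold(sizes))
</pre>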
  
=== Network motifs ===
The set of candidate motifs was selected from all subgraphs made of three contiguous edits on a single page.
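
A possible Python sketch of this extraction, reusing the hypothetical Revision objects from the network sketch above; the way motifs are keyed here (anonymized author indices plus revision classes) is our assumption about a reasonable canonical form, not the paper's exact scheme.

<pre>
from collections import Counter

def extract_motifs(page_history):
    """Slide a window of three contiguous edits over a page's temporally
    ordered revisions and abstract each window into a motif key. Authors
    are anonymized to pattern indices (first distinct author in the window
    is 0, the next is 1, ...), so a motif captures the shape of the
    interaction rather than editor identities."""
    motifs = Counter()
    for window in zip(page_history, page_history[1:], page_history[2:]):
        index = {}
        labels = []
        for rev in window:
            who = index.setdefault(rev.author, len(index))
            labels.append((who, rev.revision_class))
        motifs[tuple(labels)] += 1
    return motifs
</pre>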
== Demonstration of the usefulness of the motifs ==
=== Classification of pages as combative or cooperative ===
  
'''Identifying cooperative/combative pages''':
  
To identify cooperative and combative pages, they used established categories of pages.

Combative pages are the 720 pages listed in Wikipedia:List of Controversial Articles, and cooperative pages are the 10,149 pages in Wikipedia:Good Articles and Wikipedia:Featured Articles, with the assumption that high-quality pages will have more cooperative interactions.

All other pages are classified as neutral.
  
'''Experimental setting''':
The classification algorithm used here is an [[SVM]].

They compared the results obtained using motifs as features against those obtained using author-edit types as features.

As a classification performance measure, they used the F-score for each page class.

When using motifs as features, they used only the <math>k</math> most frequent motif types, varying the value of <math>k</math>.
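
A hedged sketch of this experimental setup using scikit-learn; the paper does not state which SVM implementation it used, and all helper names are ours. It assumes one motif-count dictionary per page, as produced by the extraction sketch above.

<pre>
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def top_k_motif_features(page_motif_counts, k):
    """Keep only the k globally most frequent motif types as features."""
    total = Counter()
    for counts in page_motif_counts:
        total.update(counts)
    keep = {m for m, _ in total.most_common(k)}
    return [{str(m): c for m, c in counts.items() if m in keep}
            for counts in page_motif_counts]

def per_class_f_scores(page_motif_counts, labels, k=100):
    """Train a linear SVM on top-k motif counts and report per-class F-scores
    (computed on the training data here for brevity; a real evaluation
    would use held-out pages)."""
    X = DictVectorizer().fit_transform(top_k_motif_features(page_motif_counts, k))
    clf = LinearSVC().fit(X, labels)
    return f1_score(labels, clf.predict(X), average=None)
</pre>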
  
'''Result''':
The table below shows the F-scores for each page class.

It shows that, given a sufficiently large number of motifs, motif features improve classification accuracy, especially for combative and cooperative pages.
  
[[File:Jurgens_6.png]]
=== Analysis of content growth ===
To see how content is created as a result of the interactions, they applied [[LDA]], where motifs correspond to tokens and behaviors to topics.

Here, they introduced a new concept, the behavior: each page has a probability distribution over behaviors, and the number of behaviors is fixed at 20.
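
A minimal sketch of this setup with scikit-learn's LDA implementation (the paper does not state which implementation was used); pages play the role of documents and motif counts the role of word counts.

<pre>
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import DictVectorizer

def fit_behaviors(page_motif_counts, n_behaviors=20):
    """Treat each page as a document of motif 'tokens' and fit LDA;
    the 20 latent topics play the role of behaviors."""
    docs = [{str(m): c for m, c in counts.items()} for counts in page_motif_counts]
    X = DictVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_behaviors, random_state=0)
    page_behavior_dist = lda.fit_transform(X)  # one distribution over behaviors per page
    return lda, page_behavior_dist
</pre>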
The figure below visualizes the changes in probability mass for all 20 behaviors.
[[File:Jurgens_7.png]]
The four behaviors whose relative probability mass changed the most are depicted in the figure below.

During 2002-2007, the probability mass of Behaviors B and C decreased to 24.6% and 32.4% of their earlier respective levels, while A and D more than doubled.
[[File:Jurgens_8.png]]
From these figures, we can observe that the early growth of Wikipedia was fueled by content addition from single authors or collaboration between two authors (B) and by contributions from administrators (C).

These early behaviors have since given way to increases in the behaviors associated with editing (A) and with maintaining quality or detecting vandalism (D).
== Review ==
=== Strengths and weaknesses ===
;Strength:

By introducing a new unit for Wikipedia analysis, the motif, they were able to analyze various types of editing behaviors that previous research had not yet covered.

In doing so, they gave some explanation of how content is created as a result of these interactions, a kind of analysis that was new.


;Weakness:

The analysis involves many heuristics. For example, they defined a network motif as a subgraph of exactly three edits, and they fixed the number of topics in LDA at 20.

It is not clear how these heuristics affect the results of the analysis.
=== Possible impact ===
The idea of using motifs as a unit of analysis may be promising, since it enables us to deal with various patterns of editing behaviors.

But the authors did not propose a methodology for working with these motifs; they simply used motifs as features.

As a result, they had to limit the expressiveness of motifs, probably because of combinatorial explosion: the number of edits per motif is fixed at three.

Devising an efficient method for handling motifs would enable deeper analyses of Wikipedia editors' interactions.
 
=== Recommendation for whether or not to assign the paper as required/optional reading in later classes ===
 
No. The method used here is very simple, so if someone becomes interested in this topic, it may be enough to read this summary.
== Related Papers ==
* [[RelatedPaper::Brandes, U. et al. WWW 2009]] : [http://dl.acm.org/citation.cfm?id=1526808 Brandes, U.; Kenis, P.; Lerner, J.; and Van Raaij, D. 2009. Network analysis of collaboration structure in wikipedia. In Proceedings of the 18th international conference on World Wide Web (WWW), 731–740. ACM.]
** This paper studied the structure of editing behaviors in Wikipedia from a network-analytic perspective. Their study does not use motifs, but it is exhaustive, so we may be able to see how the proposed motifs can contribute to the interaction analysis.
== Study Plan ==
* [http://www.psychology.adelaide.edu.au/personalpages/staff/simondennis/LexicalSemantics/BleiNgJordan03.pdf Blei, D.; Ng, A.; and Jordan, M. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.] : This is the original paper on LDA.
** [http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf Probabilistic topic models]: This is a general introduction to topic modeling by the first author of the above paper, so it may be helpful for understanding LDA.
