Jurgens and Lu ICWSM 2012


Citation

@inproceedings{DBLP:conf/icwsm/JurgensL12,

 author = {David Jurgens and Tsai-Ching Lu},
 title = {Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia},
 booktitle = {ICWSM},
 year = {2012}
}

Online version

Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia


Summary

The growth of Wikipedia relies on the cooperative, and sometimes combative, interactions among editors working on the same page. Most research on Wikipedia editor interactions, however, focuses on cooperative behaviors, which motivates a fuller analysis of editing behavior that covers both cooperative and combative interactions. To this end, the paper represents Wikipedia's revision history as a temporal, bipartite network with multiple node and edge types for users and revisions. From this representation, the authors identify author interactions as network motifs and show how the motif types capture editing behaviors. They demonstrate the usefulness of motifs on two tasks: (1) classifying pages as combative or cooperative and (2) analyzing the dynamics of editor behavior to explain Wikipedia’s content growth.

Proposed analysis method

Network representation

They view editor interactions in Wikipedia as a bipartite graph from authors to the pages. They expand this representation to encode three additional features: (1) the type of author who made the change, (2) the time at which the change was made, and (3) the magnitude and effect of the change to the page. To do so, they define the bipartite graph of Wikipedia revisions as follows.

Jurgens 2.png


The figure below illustrates a subset of a page’s history as a sequence of classified revisions.

Jurgens 1.png
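The representation described above can be sketched as a list of time-stamped author-to-page edges. This is a minimal illustration, not the authors' implementation: the `Revision` record, the `build_page_histories` helper, and the label strings such as "registered" or "minor_edit" are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One author-to-page edge in the bipartite revision graph."""
    page: str
    author: str
    author_type: str     # hypothetical label, e.g. "registered" or "admin"
    timestamp: int       # time of the change, e.g. seconds since epoch
    revision_class: str  # hypothetical label, e.g. "minor_edit" or "revert"


def build_page_histories(revisions):
    """Group edges by page and order each page's edges by time."""
    histories = defaultdict(list)
    for rev in revisions:
        histories[rev.page].append(rev)
    for revs in histories.values():
        revs.sort(key=lambda r: r.timestamp)
    return dict(histories)


revisions = [
    Revision("Page A", "alice", "registered", 200, "minor_edit"),
    Revision("Page A", "bob", "admin", 100, "major_add"),
    Revision("Page B", "alice", "registered", 150, "revert"),
]
histories = build_page_histories(revisions)
```

Each page history is then a time-ordered sequence of classified revisions, which is the form the motif analysis works on.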


Network derivation from Wikipedia dataset

Data:

  • The Wikipedia revision dataset is derived from a complete revision history of Wikipedia, ending on April 5, 2011.
  • After extracting article pages that have at least 10 revisions, the resulting dataset contained 2,715,123 articles and 227,034,806 revisions.
  • Although it is not the same data, another dataset of Wikipedia edit history is available from SNAP.

Revision classes:

  • They selected four high-level categories for revisions: adding, deleting, editing, and reverting.
  • Using (1) the revising author’s comment and (2) an MD5 hash of the article content, a revision can be identified as a revert or not.
  • To classify a revision into one of the other three revision classes, they used two parameters: (1) the number of whitespace-delimited tokens added to or removed from the page, i.e., its change in size, and (2) the number of tokens whose content was changed.
  • The classification rule is as follows.

Jurgens 3.png
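The revert check described in the list above can be sketched as follows. This is one plausible reading rather than the paper's exact procedure: a revision is flagged when its MD5 hash matches an earlier version of the page, or when its comment contains a revert keyword. The function name `find_reverts` and the keyword list are assumptions made for illustration.

```python
import hashlib


def find_reverts(texts, comments=None, keywords=("revert", "undid")):
    """Return one boolean per revision: True if it looks like a revert.

    texts: page content after each revision, in time order.
    comments: optional edit comments, aligned with texts.
    keywords: hypothetical comment markers, not taken from the paper.
    """
    seen = set()
    flags = []
    for i, text in enumerate(texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        comment = (comments[i] if comments else "").lower()
        # Same hash as an earlier version: the page returned to an old state.
        by_hash = digest in seen
        by_comment = any(k in comment for k in keywords)
        flags.append(by_hash or by_comment)
        seen.add(digest)
    return flags


flags = find_reverts(["a b", "a b c", "a b"])
```

Here the third revision restores the text of the first, so only it is flagged.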

  • To further distinguish edits based on the magnitude of their effect in addition to the type, they partition each class into major and minor subcategories, with the exception of Revert.
  • Based on the shape of the effect distributions, the boundary between major and minor was selected using the Pareto principle, or “80/20 rule” (Newman, M. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46(5):323–351).
  • The intuition is that revisions with small effects account for the majority of the cumulative effect on the content.
  • The figure below shows distributions of the effects for Add, Delete, and Edit types. Vertical lines indicate the division between major and minor revisions based on the 80/20 rule, where 80% of a type’s cumulative effects are due to those to the left of the line.

Jurgens 4.png
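The major/minor boundary described above can be sketched as a cumulative-sum threshold. This assumes that effect sizes are non-negative numbers and that revisions at or below the threshold (left of the line) are the minor ones; the summary does not state the labeling direction explicitly. The function name `pareto_threshold` is hypothetical.

```python
def pareto_threshold(effects, share=0.8):
    """Smallest effect size t such that revisions with effect <= t
    account for at least `share` of the total effect."""
    ordered = sorted(effects)
    total = sum(ordered)
    running = 0
    for value in ordered:
        running += value
        if running >= share * total:
            return value
    raise ValueError("effects must be non-empty")


effects = [1] * 10 + [2]          # illustrative values, not real data
threshold = pareto_threshold(effects)
is_major = [e > threshold for e in effects]
```

With these illustrative values, the ten size-1 revisions already cover more than 80% of the total effect, so only the size-2 revision falls to the right of the line.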

Network motifs

The set of candidate motifs was selected from all subgraphs made of three contiguous edits on a single page.
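Enumerating candidate motifs can be sketched as a sliding window of three contiguous edits over a page history. The encoding below is an assumption made for illustration: each edit is reduced to the author's type and the revision class, and authors are relabeled by order of appearance within the window so that a motif records who repeats without naming specific editors.

```python
from collections import Counter


def extract_motifs(history):
    """Count motifs over every window of three contiguous edits.

    history: time-ordered list of (author, author_type, revision_class).
    """
    counts = Counter()
    for i in range(len(history) - 2):
        roles = {}
        motif = []
        for author, author_type, revision_class in history[i:i + 3]:
            # First author in the window becomes 0, the next new one 1, ...
            role = roles.setdefault(author, len(roles))
            motif.append((role, author_type, revision_class))
        counts[tuple(motif)] += 1
    return counts


history = [
    ("alice", "registered", "major_add"),
    ("bob", "admin", "revert"),
    ("alice", "registered", "major_add"),
    ("bob", "admin", "revert"),
]
motif_counts = extract_motifs(history)
```

A back-and-forth between two editors, as in this example, yields motifs in which author 0 appears in the first and third positions.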

Demonstration of the usefulness of the motifs

Classification of pages as combative or cooperative

Identifying cooperative/combative pages:

To identify cooperative and combative pages, they used established categories of pages. Combative pages are the 720 pages listed in Wikipedia:List of Controversial Articles, and cooperative pages are the 10,149 pages in Wikipedia:Good Articles and Wikipedia:Featured articles, on the assumption that high-quality pages will have more cooperative interactions. All other pages are classified as neutral.

Experimental setting:

The classification algorithm is an SVM. They compared results using motifs as features against results using author-edit types as features, measuring classification performance with the F-score for each page class. When using motifs as features, they used only the most frequent motif types and varied how many were included.
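Restricting the features to the most frequent motif types can be sketched as follows. The helper names and the normalization by a page's total motif count are assumptions for illustration; the summary does not say how the feature values were scaled. The resulting vectors could then be passed to any SVM implementation.

```python
from collections import Counter


def top_k_motifs(page_counts, k):
    """Return the k motif types with the highest total count over all pages."""
    total = Counter()
    for counts in page_counts:
        total.update(counts)
    return [motif for motif, _ in total.most_common(k)]


def to_feature_vector(counts, vocabulary):
    """Relative frequency of each selected motif on one page."""
    n = sum(counts.values()) or 1  # avoid division by zero on empty pages
    return [counts.get(motif, 0) / n for motif in vocabulary]


# Illustrative motif counts for two pages (placeholder motif names).
page_counts = [Counter({"m1": 5, "m2": 1}), Counter({"m1": 2, "m3": 4})]
vocabulary = top_k_motifs(page_counts, k=2)
features = [to_feature_vector(c, vocabulary) for c in page_counts]
```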

Result:

The table below shows F-scores for each page class. Given a sufficient number of motifs, motif features improve classification performance, especially for combative and cooperative pages.

Jurgens 6.png
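For reference, a per-class F-score can be computed as in the sketch below. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the summary does not state which variant was used. The labels in the example are illustrative, not results from the paper.

```python
def f_score(true_labels, predicted_labels, target):
    """F1 score for one class, treating `target` as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true_labels = ["combative", "cooperative", "neutral", "combative"]
predicted = ["combative", "combative", "neutral", "neutral"]
combative_f = f_score(true_labels, predicted, "combative")
```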

Analysis of content growth

To see how content is created as a result of the interactions, they applied LDA, treating motifs as tokens and behaviors as topics. A behavior is a new concept introduced here: each page has a probability distribution over behaviors. The number of behaviors is fixed at 20.
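The mapping of motifs to tokens and behaviors to topics can be sketched with a small LDA sampler. This uses collapsed Gibbs sampling, which is an assumption: the summary does not say which inference method or hyperparameters the authors used. The function name `lda_gibbs` and the default values of `alpha` and `beta` are illustrative.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over motif tokens.

    docs: list of pages, each a list of motif identifiers.
    Returns one behavior distribution (list of floats) per page.
    """
    rng = random.Random(seed)
    vocab_size = len({m for doc in docs for m in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Randomly assign a behavior to every motif occurrence.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample in proportion to the conditional probability.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed per-page behavior distributions.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Two toy pages with placeholder motif names; the paper fixes 20 behaviors.
pages = [["m1", "m1", "m2"], ["m3", "m3", "m3", "m2"]]
behavior_dist = lda_gibbs(pages, num_topics=2, iters=20)
```

Each row of `behavior_dist` is a page's probability distribution over behaviors and sums to 1.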

The figure below visualizes the changes in mass for all 20 behaviors.

Jurgens 7.png

The four behaviors whose relative probability mass changed most are depicted in the figure below. Between 2002 and 2007, the probability mass of Behaviors B and C decreased to 24.6% and 32.4% of their respective earlier levels, while that of A and D more than doubled.

Jurgens 8.png

From these figures, we can observe that the early growth of Wikipedia was fueled by content addition from single authors or collaboration between two authors (B) and by contributions from administrators (C). These early behaviors then gave way to increases in behaviors associated with editing (A) and with maintaining quality or detecting vandalism (D).

Review

Strengths and weaknesses

Strength

By introducing the motif as a new unit for Wikipedia analysis, the authors analyzed various types of editing behaviors, which other research had not yet done. In doing so, they offered an explanation of how content is created as a result of editor interactions, which was a new kind of analysis.

Weakness

The data analysis relies on many heuristics. For example, a network motif is defined as a subgraph of exactly three edits, and the number of topics in LDA is fixed at 20. It is not clear how these choices affect the results of the analysis.

Possible impact

The idea of using the motif as a unit of analysis may be promising, since it makes it possible to handle various patterns of editing behavior. However, the authors did not propose a methodology for working with these motifs; they only proposed using motifs as features. As a result, they had to limit the expressiveness of motifs, perhaps because of combinatorial explosion: the number of edits in a motif is fixed at three. Devising an efficient method for handling motifs would enable deeper analyses of Wikipedia editors' interactions.

Recommendation for whether or not to assign the paper as required/optional reading in later classes.

No. The method used here is very simple, so for anyone interested in this topic, reading this summary may be enough.

Related Papers

Study Plan