Bamman et. al., FIRST MONDAY 2012

From Cohen Courses
Revision as of 20:03, 3 October 2012 by Lingwang (talk | contribs)

Citation

David Bamman, Brendan O'Connor and Noah A. Smith. 2012. Censorship and deletion practices in Chinese social media. In First Monday.

Online version

Censorship and Content Deletion in Chinese Social Media

Summary

This paper attempts to characterize the practices of censorship and message deletion on Sina Weibo (the Chinese counterpart of Twitter). The paper takes three different approaches to analysing the issue:

  • The first analyses term deletion rates.
  • The second compares the term distributions of Weibo and Twitter.
  • The third conditions the deletion rate on the province where the post was made.

Term Deletion Rate

To build a corpus of messages annotated with whether each message was deleted, Weibo messages were collected over a period of three months. Each message was later checked to see whether it still existed; if it did not, it was marked as deleted.

To analyse which topics are likely to be deleted, the authors calculate the term deletion rate $\delta_w$ for each term $w$, defined as follows.

$\delta_w = \frac{d_w}{n_w}$, where $d_w$ is the number of times a message with the term $w$ was deleted and $n_w$ is the number of messages with $w$.

Furthermore, a statistical test is performed (using the one–tailed binomial p-value) to find the terms whose deletion rates are abnormally high. These terms are then analysed manually.
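
The deletion rate and the significance test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy corpus, the function names, and the choice of the corpus-wide deletion rate as the null-hypothesis rate are all assumptions made for the example.

```python
from collections import Counter
from math import comb


def term_deletion_counts(messages):
    """Count, per term, how many messages contain it and how many of those were deleted.

    `messages` is an iterable of (terms, was_deleted) pairs (a hypothetical format).
    """
    total = Counter()
    deleted = Counter()
    for terms, was_deleted in messages:
        for w in set(terms):  # count each term once per message
            total[w] += 1
            if was_deleted:
                deleted[w] += 1
    return deleted, total


def binomial_upper_tail(d, n, p0):
    """One-tailed binomial p-value: P(X >= d) for X ~ Binomial(n, p0).

    Exact summation; fine for small counts, but large n needs a
    numerically stable routine from a statistics library.
    """
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(d, n + 1))


# Toy corpus (made up for illustration only).
messages = [
    ({"a", "b"}, True),
    ({"a"}, True),
    ({"a", "c"}, False),
    ({"b", "c"}, False),
    ({"c"}, False),
    ({"c"}, False),
]

deleted, total = term_deletion_counts(messages)
rates = {w: deleted[w] / total[w] for w in total}

# Assumption: the null-hypothesis rate is the overall deletion rate of the corpus.
overall_rate = sum(1 for _, was_deleted in messages if was_deleted) / len(messages)
p_values = {w: binomial_upper_tail(deleted[w], total[w], overall_rate) for w in total}

# Terms sorted from most to least surprising under the null hypothesis.
ranked = sorted(p_values, key=p_values.get)
```

Terms with a small p-value are deleted more often than the baseline rate would explain, which makes them the candidates for manual inspection.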

From these terms, the authors conclude the following. Messages containing politically sensitive terms are likely to be deleted. Another group consists of terms such as "asked to resign", which are sensitive due to real-world events. Finally, terms that occurred in false rumors also have a high deletion rate.

Comparing Twitter with Weibo

Because messages on Twitter are not deleted the way they are on Weibo, terms that are likely to be deleted on Weibo are expected to have a much higher relative frequency on Twitter. Thus, the authors propose a metric that compares each term's relative frequency on Twitter with its relative frequency on Weibo.
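
One way to realise such a comparison is a ratio of relative frequencies, sketched below. The exact form of the authors' metric is not reproduced on this page, so the ratio, the add-one smoothing, the function name, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter


def twitter_weibo_ratio(twitter_counts, weibo_counts, smoothing=1.0):
    """Score each term by P_twitter(w) / P_weibo(w).

    Additive smoothing (an assumption of this sketch) keeps terms that are
    missing from one corpus from producing a division by zero.
    """
    vocab = set(twitter_counts) | set(weibo_counts)
    t_total = sum(twitter_counts.values()) + smoothing * len(vocab)
    w_total = sum(weibo_counts.values()) + smoothing * len(vocab)
    scores = {}
    for w in vocab:
        p_twitter = (twitter_counts.get(w, 0) + smoothing) / t_total
        p_weibo = (weibo_counts.get(w, 0) + smoothing) / w_total
        scores[w] = p_twitter / p_weibo
    return scores


# Toy term counts (made up for illustration only).
twitter_counts = Counter({"x": 8, "y": 2})
weibo_counts = Counter({"x": 1, "y": 9})

scores = twitter_weibo_ratio(twitter_counts, weibo_counts)

# Highest-scoring terms are the candidates to test for blocking.
candidates = sorted(scores, key=scores.get, reverse=True)
```

A score well above 1 means the term is over-represented on Twitter relative to Weibo, which is the pattern expected for a censored term.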


The terms with the highest scores are then tested in Weibo's search engine to check whether they are blocked. Results show that among the top 20 terms, 70% were blocked. Precision drops as terms with lower scores are added: among the top 2000 terms, 136 censored terms were found.

While the results are not precise, this provides a framework for automatically detecting terms that are censored.

Geographic Distribution

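In this analysis, the geographic data in each message's metadata is used, and the probability of a message being deleted given its province is calculated as:

$P(\text{deleted} \mid \text{province}) = \frac{\text{number of deleted messages from the province}}{\text{number of messages from the province}}$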

Using this metric, it is shown that deletion rates are highest in Tibet, where 53% of the messages are deleted, while in other provinces such as Beijing and Shanghai only 12% and 11.4% are deleted, respectively. A further study was conducted to find which terms are more likely to cause a message to be deleted in each province. This is done by calculating the pointwise mutual information (PMI) between a term $w$ and a province $q$, which in its standard form is:

$\mathrm{PMI}(w; q) = \log \frac{P(w, q)}{P(w)\,P(q)}$

Then, for each province, the terms with the highest PMI are analysed. The study shows that politically sensitive terms are not highly correlated with the province. In fact, the most characteristic terms for each province tend to be locations within that province.
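
The per-province deletion rate and the term-province PMI can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the message format and function names are hypothetical, the data is made up, and estimating the PMI probabilities over deleted messages only is an assumption of this sketch.

```python
from collections import Counter
from math import log


def province_deletion_rates(messages):
    """Fraction of messages deleted, per province.

    `messages` is an iterable of (province, terms, was_deleted) triples
    (a hypothetical format).
    """
    total, deleted = Counter(), Counter()
    for province, _terms, was_deleted in messages:
        total[province] += 1
        deleted[province] += int(was_deleted)
    return {p: deleted[p] / total[p] for p in total}


def term_province_pmi(messages):
    """Standard PMI, log P(w, q) / (P(w) P(q)), between terms and provinces.

    Assumption: probabilities are estimated from deleted messages only.
    Only (term, province) pairs that co-occur at least once are returned.
    """
    joint, term_n, prov_n = Counter(), Counter(), Counter()
    n = 0
    for province, terms, was_deleted in messages:
        if not was_deleted:
            continue
        n += 1
        prov_n[province] += 1
        for w in set(terms):
            term_n[w] += 1
            joint[(w, province)] += 1
    return {
        (w, q): log((c / n) / ((term_n[w] / n) * (prov_n[q] / n)))
        for (w, q), c in joint.items()
    }


# Toy data (made up for illustration only).
messages = [
    ("Tibet", {"lhasa", "news"}, True),
    ("Tibet", {"lhasa"}, True),
    ("Beijing", {"news"}, True),
    ("Beijing", {"haidian"}, True),
    ("Beijing", {"news"}, False),
    ("Beijing", {"food"}, False),
]

rates = province_deletion_rates(messages)
pmi = term_province_pmi(messages)

# Terms most characteristic of one province, highest PMI first.
top_tibet = sorted(
    (w for (w, q) in pmi if q == "Tibet"),
    key=lambda w: pmi[(w, "Tibet")],
    reverse=True,
)
```

In the toy data the place name scores highest for its own province while the generic term scores zero, which mirrors the finding that the most characteristic terms tend to be local place names.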

Related Work

Previous studies on censorship include Rebecca MacKinnon, 2009, Yu et al SNA–KDD 2011 and Xu et al PAM 2011. While these do not explicitly model censorship with statistical methods as this work does, they are used as sources of information on censorship practices in Weibo.

Study plan

Most of the concepts presented in this work are basic. However, some terms might require some reviewing: