What VS What? Detect Controversial Topics in Online Community


Comments

Interesting idea. One issue you will have is evaluation - it's not clear how you can detect controversies. You might want to look over the wikipedia edit war data and see if any of it is available, and if so, if there's an easy way of using it as training/test/eval data. Wikipedia edit history data is out there - also, DBPedia has a number of nice cleaned-up versions of Wikipedia that can be used for link analysis and such.

There is another 3-person team Controversial events detection. I recommend the four of you get together and discuss joint evaluation/labeling plan, even if you work on different techniques. You might even regroup as teams - two 2-person groups are likely to be more productive than a 3 and a 1. --Wcohen 18:40, 10 October 2012 (UTC)

Team members

Teammate wanted! Feel free to contact me!

Motivation

In online communities, some topics are consistently more controversial than others and attract much more user enthusiasm and attention. For example, in geek news communities such as Slashdot, an article on the Apple vs. Android debate usually draws a much higher volume of comments. The same happens with other controversial pairings such as Windows vs. Linux and open source vs. commercial software.

The goal of this project is to automatically discover pairs of topics in an online community that, when they appear together, raise the level of controversy.

Project idea

We are given a series of documents $D = \{d_1, d_2, \dots, d_n\}$ and, for each document $d_i$, the number of comments associated with it, denoted $c(d_i)$.

By running a topic model such as LDA on the document collection $D$, we obtain $k$ topics, denoted $T = \{t_1, t_2, \dots, t_k\}$.

In LDA, a particular document $d_i$ has a representation in the topic space, $\theta_{d_i} = \big(p(t_1 \mid d_i), \dots, p(t_k \mid d_i)\big)$.
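As a concrete sketch of these two steps (topic extraction and per-document topic distributions), assuming Python with the gensim library and a placeholder list of tokenized posts (the original proposal does not name a specific tool):

    # Sketch: fit LDA with gensim and read off theta_d = p(t_j | d) per document.
    # `documents` and `num_topics` below are illustrative placeholders.
    from gensim import corpora
    from gensim.models import LdaModel

    documents = [
        ["apple", "iphone", "release", "keynote"],
        ["android", "google", "update", "kernel"],
        # ... one token list per crawled article
    ]
    num_topics = 20

    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)

    # theta[i][j] = p(t_j | d_i)
    theta = []
    for bow in corpus:
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        theta.append([dist.get(j, 0.0) for j in range(num_topics)])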

Then we can estimate the number of comments that a particular topic $t_j$ generates: $c(t_j) = \sum_{i} p(t_j \mid d_i)\, c(d_i)$.
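Given `theta` and the per-document comment counts, this quantity is a direct weighted sum; a minimal sketch (the `comment_counts` values are placeholders):

    # Sketch: c(t_j) = sum_i p(t_j | d_i) * c(d_i)
    comment_counts = [342, 57]  # placeholder: number of comments per document

    comments_per_topic = [
        sum(theta[i][j] * comment_counts[i] for i in range(len(theta)))
        for j in range(num_topics)
    ]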

By using sentiment analysis techniques, we hope to detect the sentiment towards a topic in a given document. Specifically, given a topic $t_j$, we hope to find the documents that hold a positive sentiment towards it, defined as $D^{+}(t_j)$. Thus we can calculate the number of comments a topic generates when the documents' sentiment towards it is positive: $c^{+}(t_j) = \sum_{d_i \in D^{+}(t_j)} p(t_j \mid d_i)\, c(d_i)$.
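Continuing the sketch, with `positive_docs[j]` standing in for the indices of the documents in $D^{+}(t_j)$ (an assumption, since that sentiment step is only described later in this proposal):

    # Sketch: c+(t_j) = sum over d in D+(t_j) of p(t_j | d) * c(d)
    # `positive_docs[j]` (indices of documents positive towards topic t_j) is an
    # assumed output of the document-level polarity detection described below.
    def positive_comments(j_topic, theta, comment_counts, positive_docs):
        return sum(theta[d][j_topic] * comment_counts[d]
                   for d in positive_docs[j_topic])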

Then we can define the degree of controversy between two topics $t_i$ and $t_j$ in terms of these comment counts.
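The exact scoring function is still an open design choice; the sketch below is only one hypothetical option: count the comments drawn by documents that are positive towards one topic of the pair while the other topic is also prominent in them.

    # Hypothetical controversy score (not a fixed definition): comments on
    # documents that are positive towards one topic while the other topic is
    # also present, summed over both directions of the pair.
    def controversy(i_topic, j_topic, theta, comment_counts, positive_docs):
        score = 0.0
        for a, b in ((i_topic, j_topic), (j_topic, i_topic)):
            for d in positive_docs[a]:
                score += theta[d][b] * comment_counts[d]
        return score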

Document Level Topic Polarity Detection

First, following the method described in Ramnath Balasubramanyan et al., ICWSM 2012, and using tools like SentiWordNet, we can construct two vocabularies: one of positive words and one of negative words.
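One way to build these two vocabularies, assuming NLTK's SentiWordNet corpus reader and an arbitrary score threshold of 0.5:

    # Sketch: positive / negative vocabularies from SentiWordNet via NLTK.
    # Requires nltk.download("sentiwordnet") and nltk.download("wordnet");
    # the 0.5 score threshold is an arbitrary assumption.
    from nltk.corpus import sentiwordnet as swn

    positive_vocab, negative_vocab = set(), set()
    for senti_synset in swn.all_senti_synsets():
        word = senti_synset.synset.name().split(".")[0]
        if senti_synset.pos_score() >= 0.5:
            positive_vocab.add(word)
        elif senti_synset.neg_score() >= 0.5:
            negative_vocab.add(word)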

To measure the sentiment towards a topic in a specific document, we can leverage two kinds of features:

- Distance feature: for each word in a topic, we calculate its mean distance to the sentiment words in the document. The polarity of the whole topic is then decided by the weighted average distance over all topic words, i.e., whether the topic lies closer to the positive or the negative vocabulary.

- Co-occurrence feature: we can treat a document as a collection of sentences and compute sentence-level PMI between each topic word and the positive and negative vocabularies. Each topic word then votes on the polarity of the whole topic (both features are sketched after this list).
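A minimal sketch of both features, assuming tokenized documents, sentences given as token lists, and the SentiWordNet-derived vocabularies from the previous step; the exact averaging and voting schemes are assumptions:

    # Sketch of the two document-level features.
    import math

    def distance_feature(tokens, topic_words, positive_vocab, negative_vocab):
        """Mean token distance from topic-word occurrences to the nearest
        positive vs. negative word; a negative value means closer to positive."""
        pos_idx = [i for i, w in enumerate(tokens) if w in positive_vocab]
        neg_idx = [i for i, w in enumerate(tokens) if w in negative_vocab]
        diffs = []
        for i, w in enumerate(tokens):
            if w in topic_words and pos_idx and neg_idx:
                d_pos = min(abs(i - j) for j in pos_idx)
                d_neg = min(abs(i - j) for j in neg_idx)
                diffs.append(d_pos - d_neg)
        return sum(diffs) / len(diffs) if diffs else 0.0

    def pmi_vote(sentences, topic_words, positive_vocab, negative_vocab):
        """Each topic word votes by its sentence-level PMI with the positive vs.
        negative vocabulary; the sign of the vote total gives the topic polarity."""
        n = len(sentences)

        def p(predicate):
            return sum(1 for s in sentences if predicate(s)) / n

        votes = 0
        for w in topic_words:
            p_w = p(lambda s: w in s)
            p_pos = p(lambda s: any(x in positive_vocab for x in s))
            p_neg = p(lambda s: any(x in negative_vocab for x in s))
            p_w_pos = p(lambda s: w in s and any(x in positive_vocab for x in s))
            p_w_neg = p(lambda s: w in s and any(x in negative_vocab for x in s))
            if min(p_w, p_pos, p_neg, p_w_pos, p_w_neg) == 0:
                continue
            pmi_pos = math.log(p_w_pos / (p_w * p_pos))
            pmi_neg = math.log(p_w_neg / (p_w * p_neg))
            votes += 1 if pmi_pos > pmi_neg else -1
        return "positive" if votes > 0 else "negative" if votes < 0 else "neutral"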

Dataset

We plan to crawl data from online tech news communities such as Slashdot, The Verge, and Engadget. For each post, we collect the content of the article and the comments associated with it.
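A minimal crawling sketch, assuming Python with requests and BeautifulSoup; the URL and CSS selectors are placeholders that would have to be adapted to each site's actual markup:

    # Sketch: fetch an article page and pull out its body text and comments.
    # The selectors below are hypothetical; each site (Slashdot, The Verge,
    # Engadget) needs its own selectors.
    import requests
    from bs4 import BeautifulSoup

    def fetch_post(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        article = soup.select_one("div.article-body")          # placeholder selector
        comments = [c.get_text(" ", strip=True)
                    for c in soup.select("div.comment-text")]  # placeholder selector
        return article.get_text(" ", strip=True) if article else "", comments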

There are also existing datasets we can use, such as the political blog data (blog posts and their comments) described in one of the reference papers.

References