What VS What? Detecting Controversial Topics in Online Communities
Comments
Interesting idea. One issue you will have is evaluation - it's not clear how you can detect controversies. You might want to look over the wikipedia edit war data and see if any of it is available, and if so, if there's an easy way of using it as training/test/eval data. Wikipedia edit history data is out there - also, DBPedia has a number of nice cleaned-up versions of Wikipedia that can be used for link analysis and such.
There is another 3-person team Controversial events detection. I recommend the four of you get together and discuss joint evaluation/labeling plan, even if you work on different techniques. You might even regroup as teams - two 2-person groups are likely to be more productive than a 3 and a 1. --Wcohen 18:40, 10 October 2012 (UTC)
Team members
Teammate wanted! Feel free to contact me!
Motivation
In online communities, some topics are consistently more controversial than others and attract a great deal of user enthusiasm and attention. For example, in geek news communities such as Slashdot, news articles on the Apple VS Android topic usually draw a much higher volume of comments. The same happens with other controversial topics such as Windows VS Linux and Open Source VS Commercial Software.
The goal of this project is to automatically discover those topics inside an online community that, when put together, raise the level of controversy.
Project idea
Suppose we are given a set of documents $D = \{d_1, d_2, \dots, d_n\}$ and, for each document $d_i$, the number of comments associated with it, denoted $C(d_i)$.

By running a topic model such as LDA on the document collection $D$, we obtain $k$ topics, denoted $T = \{t_1, t_2, \dots, t_k\}$.

In LDA, a particular document $d_i$ has a representation in the topic space, $\theta_{d_i} = \big(p(t_1 \mid d_i), \dots, p(t_k \mid d_i)\big)$.

Then we can estimate the number of comments that a particular topic generates as $C(t_j) = \sum_{d_i \in D} p(t_j \mid d_i)\, C(d_i)$.

By using sentiment analysis techniques, we hope to detect the sentiment towards a topic in a given document. Specifically, given a topic $t_j$, we hope to find the set of documents that hold a positive sentiment towards it, denoted $D^{+}(t_j)$. Thus we can calculate the number of comments a topic generates when the sentiment in those documents is positive: $C^{+}(t_j) = \sum_{d_i \in D^{+}(t_j)} p(t_j \mid d_i)\, C(d_i)$.

Then we can define the degree of controversy between two topics $t_a$ and $t_b$ in terms of these quantities, for example by comparing the comment volume drawn by documents in which the two topics co-occur with opposing sentiments against the volume each topic draws on its own; one possible formulation is given in the sketch below.
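The following is a minimal sketch of how these quantities could be computed from an LDA document-topic matrix. The array names, the toy values, and the `controversy` function are illustrative assumptions for this sketch, not fixed definitions from the proposal.

```python
import numpy as np

# Assumed inputs (hypothetical names):
#   theta[i, j]    = p(topic j | document i), e.g. from an LDA model (n_docs x k)
#   comments[i]    = number of comments on document i
#   positive[i, j] = 1 if document i expresses positive sentiment towards topic j, else 0
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.5, 0.1]])
comments = np.array([120, 30, 300])
positive = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [1, 0, 0]])

# C(t_j): expected comment volume attributed to each topic
C_topic = theta.T @ comments                   # shape (k,)

# C+(t_j): comment volume from documents that are positive towards topic j
C_topic_pos = (theta * positive).T @ comments  # shape (k,)

def controversy(a, b):
    """One possible controversy score between topics a and b (an assumption,
    not the proposal's final definition): comment volume of documents where
    the two topics co-occur with opposing sentiments, normalised by the
    volume each topic attracts on its own."""
    co_occur = (theta[:, a] > 0.1) & (theta[:, b] > 0.1)  # both topics present
    opposing = positive[:, a] != positive[:, b]           # opposite sentiment
    joint = comments[co_occur & opposing].sum()
    return joint / (C_topic[a] + C_topic[b] + 1e-9)

print(C_topic, C_topic_pos, controversy(0, 1))
```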
Document Level Topic Polarity Detection
First, following the method described in Ramnath Balasubramanyan et al., ICWSM 2012, we can use tools like SentiWordNet to construct two vocabularies, one of positive words and one of negative words.
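A minimal sketch of how such vocabularies could be built with NLTK's SentiWordNet corpus reader; the 0.5 score threshold and the toy word list are assumptions, and in practice the words would come from the corpus vocabulary.

```python
# Requires nltk.download('sentiwordnet') and nltk.download('wordnet').
from nltk.corpus import sentiwordnet as swn

def build_sentiment_vocabularies(words, threshold=0.5):
    positive_vocab, negative_vocab = set(), set()
    for word in words:
        synsets = list(swn.senti_synsets(word))
        if not synsets:
            continue
        # Average the positive/negative scores over all senses of the word.
        pos = sum(s.pos_score() for s in synsets) / len(synsets)
        neg = sum(s.neg_score() for s in synsets) / len(synsets)
        if pos >= threshold:
            positive_vocab.add(word)
        elif neg >= threshold:
            negative_vocab.add(word)
    return positive_vocab, negative_vocab

# Example usage on a toy vocabulary.
pos_vocab, neg_vocab = build_sentiment_vocabularies(["excellent", "terrible", "keyboard"])
print(pos_vocab, neg_vocab)
```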
In order to measure the sentiment towards a topic in a specific document, we can leverage two kinds of features:
- Distance Feature: For each word in a topic, we can calculate its mean distance (within the document) to the sentiment words. The polarity of the whole topic is then decided by the weighted average distance of all the words in the topic, i.e., whether the topic words lie closer to the positive or to the negative vocabulary.
- Co-occurrence Feature: We can treat a document as a collection of sentences. From this point of view, we can calculate sentence-level PMI between each word in the topic and the two vocabularies, and each word then votes on the polarity of the whole topic (see the sketch after this list).
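A rough sketch of the co-occurrence feature, assuming the positive/negative vocabularies from above and a document represented as a list of tokenised sentences; the PMI handling of zero counts, the vote rule, and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def sentence_pmi(topic_word, seed_words, sentences):
    """Sentence-level PMI between a topic word and a set of seed sentiment words.
    `sentences` is a list of token sets."""
    n = len(sentences)
    p_topic = sum(topic_word in s for s in sentences) / n
    p_seed = sum(bool(seed_words & s) for s in sentences) / n
    p_joint = sum(topic_word in s and bool(seed_words & s) for s in sentences) / n
    if p_topic == 0 or p_seed == 0 or p_joint == 0:
        return 0.0
    return math.log(p_joint / (p_topic * p_seed))

def topic_polarity(topic_words, pos_vocab, neg_vocab, sentences):
    """Each topic word votes positive or negative depending on which
    vocabulary it has the higher sentence-level PMI with."""
    votes = Counter()
    for w in topic_words:
        pos_pmi = sentence_pmi(w, pos_vocab, sentences)
        neg_pmi = sentence_pmi(w, neg_vocab, sentences)
        votes["positive" if pos_pmi >= neg_pmi else "negative"] += 1
    return votes.most_common(1)[0][0]

# Example usage with toy data (all names here are hypothetical):
sentences = [set("the new android phone is excellent".split()),
             set("apple fans call it terrible".split())]
print(topic_polarity({"android", "apple"}, {"excellent"}, {"terrible"}, sentences))
```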
Dataset
We plan to crawl data from online tech news communities such as Slashdot, The Verge, and Engadget. For each post, we collect the content of the article and the comments associated with it.
There are also existing datasets we can use, such as the political blog dataset (blog posts with their comments) described in Balasubramanyan et al. in the references.
References
- Ramnath Balasubramanyan et al., ICWSM 2012 (this paper provides the political blog dataset)
- Roja Bandari et al., The Pulse of News in Social Media: Forecasting Popularity, ICWSM 2012
- Shmueli et al., Care to Comment? Recommendations for Commenting on News Stories, WWW 2012