Difference between revisions of "Project Second Draft-Subhodeep Manaj"

From Cohen Courses

Revision as of 17:29, 15 February 2011

Project Proposal

Predicting the proportion of users that like a YouTube video from the comments on the video

Team Members

Subhodeep Moitra (smoitra@cs.cmu.edu), Manaj Srivastava (manajs@cs.cmu.edu)

Goal of the Project

YouTube has a large user base (nearly 48.2 million users in early 2010*) who take part in discussion by posting comments on the videos they watch. People praise, condemn, or sometimes just neutrally discuss the content of a video. Along with posting comments, users also express their preferences by “liking” or “disliking” a video.

Our goal is to predict, from the comments, what proportion of users like or dislike a video. Because labeled comments are not available, we use the actual “like” and “dislike” counts to evaluate the prediction. There is a positive correlation between the number of comments and the number of ratings (likes/dislikes) for a particular video. By a large-sample argument, we assume that the number of people “liking” a video and/or commenting favorably about it is an accurate representation of users' preference for that video. The same holds for “disliking” the video or commenting negatively about it.

We want to limit our domain to a few predefined categories. Within a category such as "music video" or "politics", we will take the 50 most-discussed videos and keep those with a high #ratings/#comments ratio. We would also like to choose videos on inherently debatable topics, so that the variance in opinion is captured in the numbers of likes and dislikes.
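The selection step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Video` record, the use of raw comment count as a proxy for "most discussed", and the thresholds `min_ratio` and `max_skew` are all hypothetical choices, not values fixed by this proposal.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record; field names are illustrative only.
    video_id: str
    category: str
    likes: int
    dislikes: int
    comments: int

    @property
    def ratings(self) -> int:
        return self.likes + self.dislikes

    @property
    def ratings_per_comment(self) -> float:
        return self.ratings / self.comments if self.comments else 0.0

    @property
    def like_fraction(self) -> float:
        return self.likes / self.ratings if self.ratings else 0.0


def select_videos(videos, category, top_n=50, min_ratio=1.0, max_skew=0.8):
    """Pick most-discussed videos in a category that are well rated and contested."""
    in_category = [v for v in videos if v.category == category]
    # Assumption: "most discussed" is approximated by raw comment count.
    most_discussed = sorted(in_category, key=lambda v: v.comments, reverse=True)[:top_n]
    selected = []
    for v in most_discussed:
        # Keep videos with many ratings relative to comments.
        if v.ratings_per_comment < min_ratio:
            continue
        # Keep contested videos: neither likes nor dislikes dominate too strongly.
        if not (1 - max_skew) <= v.like_fraction <= max_skew:
            continue
        selected.append(v)
    return selected
```

For example, with `min_ratio=1.0` and `max_skew=0.8`, a video with 60 likes, 40 dislikes and 50 comments is kept, while one with 99 likes and 1 dislike is dropped as too one-sided.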

Methods

We are considering the following methods for detecting bias in comments.

  • Using the Internet Slang repository to identify slang words and their meanings - http://www.internetslang.com/
  • Making use of adjectives (and SentiWordNet)
  • Making use of certain polar words
  • When comments are long and use words of both polarities, the collocation of certain “keywords” with the polar terms can be considered. These keywords could come from a frequency count over all the comments, and also from the tags of the video
  • Applying Latent Semantic Analysis to the comment set
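As a rough sketch of how the slang, polar-word and keyword-collocation ideas above might fit together, consider the following Python fragment. The `SLANG` and `POLAR` dictionaries are tiny hypothetical stand-ins for the Internet Slang repository and a real polarity lexicon, and the `window` size is an assumed parameter. SentiWordNet and Latent Semantic Analysis are not covered here.

```python
import re

# Hypothetical, tiny lexicons for illustration only.
SLANG = {"gr8": "great", "luv": "love", "h8": "hate"}
POLAR = {"great": 1, "love": 1, "awesome": 1,
         "hate": -1, "terrible": -1, "boring": -1}


def tokenize(comment):
    """Lowercase the comment and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", comment.lower())


def normalize(tokens):
    """Replace known slang tokens with their standard form."""
    return [SLANG.get(t, t) for t in tokens]


def comment_polarity(comment, keywords=frozenset(), window=3):
    """Return +1, -1 or 0 for a comment.

    If any polar word occurs within `window` tokens of a keyword,
    only those collocated polar words are counted; otherwise all
    polar words in the comment are counted.
    """
    tokens = normalize(tokenize(comment))
    polar = [(i, POLAR[t]) for i, t in enumerate(tokens) if t in POLAR]
    keyword_idx = [i for i, t in enumerate(tokens) if t in keywords]
    near = [(i, s) for i, s in polar
            if any(abs(i - k) <= window for k in keyword_idx)]
    total = sum(s for _, s in (near or polar))
    return (total > 0) - (total < 0)
```

The `keywords` argument is where frequent comment terms or video tags would be passed in. In a long comment that criticises the audio twice but says "I love the song", passing `keywords={"song"}` lets the collocated positive term decide the label.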

Data Set

We will collect data from YouTube through an API, extracting comments and other metadata such as the number of likes, related video titles, and the number of views for a predefined genre of videos such as "music videos".

Evaluation Metric

We will evaluate our predictions against the actual numbers of likes and dislikes for a particular video.
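One way this comparison could be made concrete is sketched below. The choice of absolute error between the predicted and actual like fractions is an assumption made for illustration; the proposal does not fix a specific error measure. Comment labels are assumed to be +1 (positive), -1 (negative) or 0 (neutral).

```python
def like_fraction(positive, negative):
    """Fraction of positive votes, or None when there are no votes."""
    total = positive + negative
    return positive / total if total else None


def absolute_error(comment_labels, likes, dislikes):
    """Compare the like fraction predicted from comments with the actual one.

    Neutral comments (label 0) are ignored when forming the prediction.
    Returns None if either fraction is undefined.
    """
    pos = sum(1 for label in comment_labels if label > 0)
    neg = sum(1 for label in comment_labels if label < 0)
    predicted = like_fraction(pos, neg)
    actual = like_fraction(likes, dislikes)
    if predicted is None or actual is None:
        return None
    return abs(predicted - actual)
```

For instance, three positive comments and one negative comment predict a like fraction of 0.75, which would be compared against the fraction computed from the video's actual like and dislike counts.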

Filtering junk comments

An important part of our approach will be preprocessing the set of comments so as to filter out comments that are not relevant to the topic. A number of users also post spam comments such as links to their websites. We plan to incorporate a model that can classify comments as spam and reject them.
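Until a learned spam model is in place, a simple rule-based stand-in could be used to drop comments that contain links. The sketch below is only such a placeholder: the pattern and function names are hypothetical, it does not address topical relevance, and it is not the classifier planned above.

```python
import re

# Assumption: a comment containing a URL-like string is treated as spam.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def looks_like_spam(comment):
    """Return True if the comment contains something that looks like a link."""
    return bool(URL_PATTERN.search(comment))


def filter_comments(comments):
    """Keep only the comments that do not look like spam."""
    return [c for c in comments if not looks_like_spam(c)]
```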

Drawbacks

While we may be able to get a global estimate of the general viewer preference for a video, it is not possible to get a label for each individual comment. A supervised learning approach may therefore not suffice.

