TED comment analysis

From Cohen Courses
Revision as of 06:45, 10 May 2010 by Apappu (talk | contribs)

Authors: Aasish Pappu and Gopala Krishna

Presentation: [1]

Problem

In this work we address the following problems:

This task uses methods that are used in the following problems (for analysis only):

Dataset

The data we have used for this task is from TED.com.

Motivation

We have observed that trending news topics and trending Google queries are aligned.

image here with the trend

TED Network

Although the dataset does not provide any links between its users, we want to tap into the network of the dataset, if one exists. Analyzing any network requires some form of adjacency matrix. We therefore rest on a simple hypothesis: if two users comment on the same talk, we create an edge between them. Although this leads to the obvious pitfall of having an edge between singletons, having a unit-weight edge between them does no harm. As two users comment on a talk more often, the edge weight between them increases. We did this for all users in the active users list.
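The edge construction can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name `build_cocomment_edges`, the toy data, and the choice to weight an edge by the number of talks two users both commented on are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_cocomment_edges(talk_commenters):
    """Map each user pair to the number of talks both users commented on.

    talk_commenters: dict mapping a talk id to the list of users who
    commented on it (hypothetical input format).
    """
    edges = Counter()
    for commenters in talk_commenters.values():
        # Deduplicate so repeated comments by one user on a talk count once,
        # and sort so each undirected pair has a single canonical key.
        for user_a, user_b in combinations(sorted(set(commenters)), 2):
            edges[(user_a, user_b)] += 1
    return edges


# Hypothetical toy data: talk id -> commenting users.
talks = {
    "talk_1": ["ann", "bob", "cy"],
    "talk_2": ["ann", "bob"],
    "talk_3": ["cy"],
}
edges = build_cocomment_edges(talks)
```

The resulting dictionary is a sparse form of the weighted adjacency matrix: pairs that never co-comment are simply absent.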

Since we have hypothesized that there is a latent network in TED, we would like to explore whether the TED network has some form of community structure.

We used a modularity-optimization-based method to detect communities in the TED network.
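For reference, the quantity that such methods try to maximize can be computed as below. This sketch only scores a given partition with the standard Newman modularity Q for a weighted, undirected graph; it does not perform the optimization, and the source does not say which optimizer was used. The function name and the toy graph are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    community_of: dict mapping node -> community label.
    """
    total_weight = sum(edges.values())
    if total_weight == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for (u, v), weight in edges.items():
        degree[community_of[u]] += weight
        degree[community_of[v]] += weight
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += weight
    # Q = sum over communities of (internal share) - (expected share).
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Hypothetical toy graph: two triangles joined by one bridge edge.
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
```

On this toy graph the two-triangle partition scores higher than putting every node in one community, which is the behavior a modularity optimizer exploits.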

Comment Prediction

This task aims at discovering a user or a group of users who might be interested in commenting on a talk. We found that we can tell which users are likely to comment on a particular talk with 61% prediction accuracy. We tested our system on the transcripts of the 501 talks mentioned in the previous section. In this experiment, we treated the problem as labeling each talk transcript with a user-cluster label learned previously (as explained in the earlier sections). We chose the top three labels (user clusters) from the inference results for each talk and checked how many of the actual commenters belong to these three clusters.
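One way to read the check described above is as a per-talk coverage score: the fraction of actual commenters whose cluster is among the three predicted clusters. The sketch below follows that reading. The function name, the data layout, and the exact definition of the score are assumptions, since the source does not spell out how the 61% figure was computed.

```python
def top_k_coverage(predicted_clusters, cluster_of_user, actual_commenters):
    """Fraction of actual commenters whose cluster is among the predicted ones.

    predicted_clusters: the top-ranked cluster labels inferred for one talk.
    cluster_of_user: dict mapping user -> cluster label learned earlier.
    actual_commenters: users who really commented on the talk.
    """
    if not actual_commenters:
        return 0.0
    predicted = set(predicted_clusters)
    # Users with no known cluster map to None and therefore count as misses.
    hits = sum(
        1 for user in actual_commenters
        if cluster_of_user.get(user) in predicted
    )
    return hits / len(actual_commenters)


# Hypothetical toy data for a single talk.
cluster_of_user = {"ann": 0, "bob": 1, "cy": 2, "dee": 3}
top_three = [0, 1, 2]
coverage = top_k_coverage(top_three, cluster_of_user, ["ann", "bob", "dee", "eve"])
```

Averaging this score over all test talks would give a single accuracy-style number, under the assumptions stated above.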


Talk Topic Prediction

This task aims at classifying a new talk under one of the known classes, given the speech transcript of the talk. The transcripts of the talks form the test dataset for our work. We ran our LDA-trained model over these 501 transcripts and inferred the topic label for each talk. We do not have true labels for these talks, and a new talk would not have a label to work with either. Therefore, as reference labels we used the topic labels assigned to the talks when we ran our inference method on each talk's comments (treating all the comments on a talk as one document), and compared them with the labels now inferred from the speech transcript of the talk.
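The comparison between comment-derived and transcript-derived labels can be sketched as a simple agreement rate. The LDA inference itself is not shown; the function name, the dictionary layout, and the toy labels are assumptions made for illustration.

```python
def label_agreement(comment_labels, transcript_labels):
    """Share of talks whose two inferred topic labels match.

    Both arguments map a talk id to a topic label; only talks present
    in both dictionaries are compared.
    """
    shared_talks = comment_labels.keys() & transcript_labels.keys()
    if not shared_talks:
        return 0.0
    matches = sum(
        1 for talk in shared_talks
        if comment_labels[talk] == transcript_labels[talk]
    )
    return matches / len(shared_talks)


# Hypothetical toy labels for three talks.
from_comments = {"talk_1": "science", "talk_2": "design", "talk_3": "health"}
from_transcripts = {"talk_1": "science", "talk_2": "design", "talk_3": "science"}
agreement = label_agreement(from_comments, from_transcripts)
```

Because the comment-derived labels are themselves model output rather than ground truth, a score like this measures consistency between the two views of a talk, not correctness.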

Conclusion

In this work, we treat user comments on technical talks as a social community. We present the database we built for this purpose and describe our analyses of it. We use LDA to infer clusters among talks and users. The results are encouraging, as illustrated by our commenter prediction task, which achieved over 60% prediction accuracy. This work can be extended to automatic tagging of talks or to a user recommendation system.