TED comment analysis
Latest revision as of 18:44, 15 September 2012
Authors: Aasish Pappu and Gopala Krishna
Presentation: [1]
Problem
In this work we are addressing the following problems:
- Comment Prediction using topic modeling
- Topic Prediction for a new Talk or Blog using topic modeling
This task draws on methods used in the following problems (for analysis only):
- Modularity optimization method for Community Detection
- Latent Dirichlet Allocation (LDA) for topic modeling of users and comments
Dataset
The data used for this task is from TED.com. Download the dataset: https://bitbucket.org/aasish/ted-comments-dataset/get/8bd2d65ec8df.tar.gz
Motivation
We have observed that trending news topics and trending Google queries are aligned.
[Image placeholder: the trend]
TED Network
Although the dataset does not provide any links between its users, we want to tap into the network (if one exists) underlying the dataset. To analyze any network we need some form of adjacency matrix. We therefore adopted a simple hypothesis: if two users comment on the same talk, we create an edge between them. This leads to the obvious pitfall of creating edges between singletons, but a unit-weight edge between them does no harm. As two users comment together on talks more often, the edge weight between them increases. We did this for all users in the active users list.
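The edge construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: it assumes the edge weight is the number of talks on which both users commented, and the talk and user names are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations


def build_cocomment_edges(talk_commenters):
    """Map each unordered user pair to the number of talks both commented on.

    Assumption: one unit of weight per shared talk, regardless of how many
    comments each user left on that talk.
    """
    weights = Counter()
    for users in talk_commenters.values():
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(users)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy data: talk id -> users who commented on it.
talk_commenters = {
    "talk_1": ["alice", "bob", "carol"],
    "talk_2": ["alice", "bob"],
    "talk_3": ["carol"],
}
edges = build_cocomment_edges(talk_commenters)
```

The resulting pair-to-weight mapping is a sparse form of the weighted adjacency matrix; a talk with a single commenter contributes no edges.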
Since we hypothesize that there is a latent network in TED, we would like to explore whether it has some form of community structure.
We used a modularity-optimization-based method to detect communities in the TED network.
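For reference, the quantity that modularity-optimization methods try to maximize can be computed as in the sketch below. The page does not say which optimizer was used, so this shows only the modularity score of a given partition, not the search procedure. The function name and the toy graph (two triangles joined by one edge) are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edge_weights, community_of):
    """Modularity Q of a partition of an undirected weighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2m))**2, where m is
    the total edge weight, L_c the weight of edges inside c, and d_c the
    summed weighted degree of the nodes in c.
    Assumes each undirected edge appears once and there are no self-loops.
    """
    m = sum(edge_weights.values())
    intra = defaultdict(float)
    degree = defaultdict(float)
    for (a, b), w in edge_weights.items():
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            intra[community_of[a]] += w
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy graph: two triangles connected by a single bridge edge.
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes in one community, which is the kind of partition a modularity optimizer would prefer.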
Comment Prediction
This task aims at discovering a user or group of users who might be interested in commenting on a talk. We found that we can tell which users are likely to comment on a particular talk with 61% prediction accuracy. We tested our system on the transcripts of the 501 talks mentioned in the previous section. In this experiment we treated the problem as labeling each talk transcript with a user-cluster label learnt previously (as explained in the earlier sections). We chose the top three labels (user clusters) from the inference results on each talk and checked how many of the actual commenters belong to these three clusters.
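The top-three check can be sketched for a single talk as below. This is an assumption-laden illustration: the cluster scores, user names, and the user-to-cluster mapping are hypothetical, and the page does not state how per-talk results were aggregated into the reported accuracy.

```python
def top_k_commenter_coverage(cluster_scores, commenters, cluster_of, k=3):
    """Fraction of a talk's actual commenters that fall in its top-k clusters.

    cluster_scores: inferred score per user cluster for this talk.
    commenters:     users who actually commented on the talk.
    cluster_of:     mapping from user to the cluster learnt earlier.
    """
    ranked = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
    top = set(ranked[:k])
    if not commenters:
        return 0.0
    # Users with no known cluster count as misses.
    hits = sum(1 for user in commenters if cluster_of.get(user) in top)
    return hits / len(commenters)


# Hypothetical inference output for one talk transcript.
cluster_scores = {"c0": 0.50, "c1": 0.25, "c2": 0.15, "c3": 0.10}
cluster_of = {"u1": "c0", "u2": "c1", "u3": "c3", "u4": "c2"}
coverage = top_k_commenter_coverage(
    cluster_scores, ["u1", "u2", "u3", "u4"], cluster_of
)
```

Here three of the four commenters belong to the three highest-scoring clusters, so the coverage for this toy talk is 0.75.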
Talk topic Prediction
This task aims at classifying a new talk under one of the known classes given the speech transcript of the talk. The transcripts of the talks form the test dataset for our work. We ran our LDA-trained model over these 501 transcripts and inferred the topic label for each talk. We do not have true labels for these talks, and a new talk would not have a label to work with. We therefore took as reference the topic labels assigned to the talks when we ran our inference method on each talk's comments (treating all the comments on a talk as one document), and compared them with the labels now inferred from the speech transcript of the talk.
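One simple way to make this comparison concrete is sketched below. It assumes each talk is labeled with its single most probable topic and that agreement is measured as the fraction of talks whose two labels match. Both choices, along with the topic distributions shown, are illustrative assumptions and not details given on this page.

```python
def dominant_topic(topic_probs):
    """Index of the most probable topic in a per-document distribution."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def label_agreement(comment_topics, transcript_topics):
    """Share of talks whose comment-based and transcript-based labels match."""
    shared = comment_topics.keys() & transcript_topics.keys()
    if not shared:
        return 0.0
    matches = sum(
        dominant_topic(comment_topics[t]) == dominant_topic(transcript_topics[t])
        for t in shared
    )
    return matches / len(shared)


# Hypothetical topic distributions inferred for three talks.
comment_topics = {
    "t1": [0.7, 0.2, 0.1],
    "t2": [0.1, 0.8, 0.1],
    "t3": [0.3, 0.3, 0.4],
}
transcript_topics = {
    "t1": [0.6, 0.3, 0.1],
    "t2": [0.5, 0.4, 0.1],
    "t3": [0.2, 0.2, 0.6],
}
agreement = label_agreement(comment_topics, transcript_topics)
```

In this toy example two of the three talks receive the same dominant topic from both sources.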
Conclusion
In this work, we treat user comments on technical talks as a social community. We present the database we built for this purpose and describe our analyses of it. We use LDA to infer the clusters among talks and users. The results are encouraging, as illustrated by our commenter prediction task, which achieved over 60% prediction accuracy. This work can be easily extended to automatic tagging of talks or to a user recommendation system.