Citation

Jeff Huang, Katherine M. Thornton, and Efthimis N. Efthimiadis. 2010. Conversational Tagging in Twitter. In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT '10).

Online version

An online version of this paper is available at [1].

Summary

This paper presents a study of Twitter hashtags in comparison with tags in other Web 2.0 systems. The authors report several findings on the differences and similarities between the two, and argue that Twitter tags serve more to filter and direct content so that it appears in particular streams.

Key Contributions

The paper's key contribution is its findings on the differences between Twitter tags and tags in earlier systems. It characterizes old-style tags as a posteriori (attached to existing content in order to organize it for later retrieval) and Twitter-style tags as a priori (chosen while composing a message to direct it into an ongoing conversation). The authors claim this is the first large-scale study of Twitter tags.

Dataset

The authors created their own dataset from two sources: Twitter and Delicious. They collected a sample of 42 million hashtags used in the microblogging service Twitter, inserted in messages posted by users, and a sample of 378 million tags from the online bookmarking service Delicious, created by users to organize their bookmarks. Both datasets record each tag along with the timestamp of when it was attached, which enables temporal analysis.
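
The paper does not describe its extraction code; as a rough illustration, a minimal Python sketch of pulling (tag, timestamp) pairs out of tweet records might look like the following (the regex, the record format, and the timestamp format are all assumptions):

    import re
    from datetime import datetime

    HASHTAG_RE = re.compile(r"#(\w+)")

    def extract_hashtags(tweets):
        """Yield (tag, timestamp) pairs from (text, created_at) tweet records."""
        for text, created_at in tweets:
            # The timestamp format is an assumption, not Twitter's actual field.
            ts = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
            for tag in HASHTAG_RE.findall(text):
                yield tag.lower(), ts  # lowercasing as normalization is an assumption

    sample = [("Watching the game #superbowl #fb", "2010-02-07 23:15:00")]
    print(list(extract_hashtags(sample)))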

Corpus Analysis

The authors analyzed the word frequency distribution of the corpus and showed that it follows Zipf's law. They also performed part-of-speech (POS) tagging on all the Twitter messages and examined how the POS tag distributions vary across classes.
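
One way to sanity-check the Zipf claim is to fit a line to log(frequency) versus log(rank); a slope near -1 is consistent with Zipf's law. A minimal sketch of that check (not the authors' code):

    import math
    from collections import Counter

    def zipf_slope(tokens, top_n=1000):
        """Least-squares slope of log(freq) vs. log(rank) over the top_n words."""
        freqs = [count for _, count in Counter(tokens).most_common(top_n)]
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(freq) for freq in freqs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den  # roughly -1.0 for a Zipfian distribution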

  • Positive vs. Negative Tags

The authors show that one indicator of a positive text is superlative adverbs (RBS), such as "most" and "best". Positive texts are also characterized by use of the possessive ending (POS). The negative set more often contains verbs in the past tense (VBN, VBD), because many authors express negative sentiment about a loss or disappointment. (A sketch of this kind of per-class POS comparison follows the next item.)

  • Subjective vs. Objective Tags

The authors observe that objective texts tend to contain more common and proper nouns (NPS, NP, NNS), while authors of subjective texts more often use personal pronouns (PP, PP$).
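
Both comparisons amount to contrasting relative POS tag frequencies between two sets of texts. A minimal sketch of that computation, using NLTK's tagger as a stand-in (the paper's tagger is not specified here; note that NLTK emits Penn Treebank tags such as NNP where the list above reads NP):

    from collections import Counter

    import nltk  # requires nltk.download("punkt") and the default POS tagger model

    def pos_profile(texts):
        """Relative frequency of each POS tag across a set of texts."""
        counts = Counter()
        for text in texts:
            counts.update(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))
        total = sum(counts.values())
        return {tag: count / total for tag, count in counts.items()}

    def tag_ratio(texts_a, texts_b, tag):
        """How much more frequent a tag is in set A than in set B."""
        a, b = pos_profile(texts_a), pos_profile(texts_b)
        return a.get(tag, 0.0) / max(b.get(tag, 0.0), 1e-9)

    # e.g. tag_ratio(positive_texts, negative_texts, "RBS") > 1 would match the
    # reported finding that superlative adverbs indicate positive texts.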

Sentiment Classification and Results

The authors present details of their sentiment classification experiments, including feature extraction, classifier building, and the experimental results.

For feature extraction, the authors present a four-step approach: (1) filtering out URLs, Twitter user names, and other non-informative tokens; (2) tokenizing the text on punctuation marks and spaces; (3) removing stopwords (such as articles); (4) constructing n-grams.
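
A minimal sketch of that four-step pipeline (the regexes, the stopword list, and the lowercasing are assumptions, not the authors' implementation):

    import re

    STOPWORDS = {"a", "an", "the"}  # the paper removes articles; a fuller list is an assumption

    def features(text, n=2):
        """Four-step feature extraction: filter, tokenize, remove stopwords, n-grams."""
        # (1) drop URLs and @usernames as non-informative tokens
        text = re.sub(r"(https?://\S+|@\w+)", " ", text)
        # (2) tokenize on punctuation marks and whitespace
        tokens = [t for t in re.split(r"[\W\s]+", text.lower()) if t]
        # (3) remove stopwords
        tokens = [t for t in tokens if t not in STOPWORDS]
        # (4) construct n-grams (bigrams by default)
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(features("Just read @bob's post http://t.co/x - the best day ever!"))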

For classifier building, the authors report trying Naive Bayes, SVM, and CRF classifiers; the Naive Bayes classifier performed best and was therefore chosen.
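
As an illustration of that final setup, a minimal sketch of a multinomial Naive Bayes classifier over bigram counts, using scikit-learn as a stand-in (the library choice and the toy data are assumptions; the paper's own implementation is not described here):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy examples standing in for the real training corpus.
    texts = ["best day ever", "so happy with this", "worst day ever", "lost and disappointed"]
    labels = ["positive", "positive", "negative", "negative"]

    model = make_pipeline(
        CountVectorizer(ngram_range=(2, 2)),  # bigram features, as in the paper
        MultinomialNB(),
    )
    model.fit(texts, labels)
    print(model.predict(["best day with this"]))  # -> ['positive']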

In the final results, the authors present several comparisons between systems with different settings and conclude that the Naive Bayes classifier with bigram features works best, because bigrams strike a good balance between coverage and the ability to capture sentiment expression patterns.

Discussion

This paper is closely related to our proposed course project on automatically clustering Twitter messages based on hashtags.