Difference between revisions of "Brendan 2010 TweetMotif: Exploratory Search and Topic Summarization on Twitter"

Revision as of 12:32, 3 February 2011

Citation

Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. In Proceedings of AAAI ICWSM.

Online version

An online version of this paper is available at [1].

Summary

This paper presents TweetMotif, an exploratory search application for Twitter. Its topic extraction system is based on syntactic filtering, language modeling, near-duplicate detection, and set cover heuristics. The authors also provide a demo of TweetMotif and its source code at http://tweetmotif.com.

Key Contributions

The paper addresses an interesting research problem: search on Twitter. First, the demo system is quite novel from a user's perspective. Second, there are several novel findings and experiments, including tokenization of Twitter text, language modeling for Twitter messages, and topic merging. These are all new research areas for further exploration.

Novel Experiments and Findings

The authors present a set of four interesting experiments and findings, covering tokenization of Twitter text, topic phrase extraction, topic merging, and near-duplicate grouping. They claim that traditional tokenization methods work poorly on Twitter messages, that traditional IR paradigms need some rethinking, and that near-duplicate messages need to be grouped together.

Tokenization and Syntactic Filtering

The authors note that standard tokenizers, usually designed for newspapers or scientific publications, perform poorly on social media text, especially Twitter messages. They built a regex-based tokenizer that treats hashtags, @-replies, abbreviations, strings of punctuation, emoticons, and Unicode glyphs (e.g. musical notes) as tokens. They also apply syntactic filtering: they gather all n-grams up to trigrams but discard those that cross syntactic boundaries.
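
The tokenization and n-gram filtering described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the regular expressions, the helper names (`tokenize`, `is_boundary`, `candidate_ngrams`), and the rule that any token without letters or digits counts as a syntactic boundary are all assumptions made for this example.

```python
import re

# Hypothetical patterns, tried in order; the paper's actual regexes differ.
_PATTERNS = [
    r"https?://\S+",            # URLs
    r"[@#]\w+",                 # @-replies and hashtags
    r"[:;=][-']?[)(DPp/\\|]",   # a handful of Western emoticons
    r"(?:[A-Za-z]\.){2,}",      # dotted abbreviations such as U.S.
    r"\w+(?:'\w+)*",            # words, allowing internal apostrophes
    r"[^\w\s]+",                # runs of punctuation or other glyphs
]
_TOKEN_RE = re.compile("|".join(_PATTERNS))


def tokenize(text):
    """Split a tweet into tokens using the patterns above."""
    return [m.group(0) for m in _TOKEN_RE.finditer(text)]


def is_boundary(token):
    # Assumption: a token with no letters or digits (punctuation,
    # emoticons, glyphs) marks a syntactic boundary.
    return not any(ch.isalnum() for ch in token)


def candidate_ngrams(tokens, max_n=3):
    """Return all n-grams (n <= max_n) that do not cross a boundary token."""
    segments, current = [], []
    for tok in tokens:
        if is_boundary(tok):
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)

    ngrams = []
    for seg in segments:
        for n in range(1, max_n + 1):
            for i in range(len(seg) - n + 1):
                ngrams.append(tuple(seg[i:i + n]))
    return ngrams


if __name__ == "__main__":
    toks = tokenize("I love #nlp, really!")
    print(toks)
    print(candidate_ngrams(toks))
```

In this sketch the comma splits the tweet into two segments, so a bigram such as ("#nlp", "really") is never produced, which is the effect the boundary filter is meant to have.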

Score and Filter Topic Phrase Candidates

The author presents a novel approach to combining the attribute values extracted for one person across documents. Two alternatives are considered: picking the most probable value, or voting for the most confident result. The voting method outperforms the first by almost 10% in each of the models tested.

Merge Similar Topics

The author then proposes to bootstrap across fields and use knowledge of one relationship to improve performance on the extraction of another. For example, to extract birth year given knowledge of the birthday, the author argues that in training we could mark up each hook corpus with the known birthday b : birthday(x, b) and the target birth year y : birthyear(x, y) and add an additional feature to the CRF that indicates whether the birthday has been seen in the sentence. In testing, for each hook, we could first find the birthday using the methods presented in the previous sections, annotate the corpus with the extracted birthday, and then apply the birth year CRF.

Group Near-duplicate Messages

dadda

Discussion

This system can serve as one of the baseline systems for our proposed course project on automatic Twitter message clustering based on hashtags.

