Brendan 2010 TweetMotif: Exploratory Search and Topic Summarization on Twitter

Citation

Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. In Proceedings of AAAI ICWSM.

Online version

An online version of this paper is available at [1].

Summary

This paper presents TweetMotif, an exploratory search application for Twitter. Its topic extraction system is based on syntactic filtering, language modeling, near-duplicate detection, and set cover heuristics. The authors also provide a demo of TweetMotif and its source code at http://tweetmotif.com.

Key Contributions

The paper addresses an interesting research problem: search on Twitter. First, the demo system is novel from a user's perspective. Second, there are several novel findings and experiments, covering tokenization of Twitter text, language modeling for Twitter messages, and topic merging. Each of these is a new research area open to further exploration.

Novel Experiments and Findings

The authors present a set of four experiments and findings, ranging from tokenization of Twitter text, to topic phrase extraction, to topic merging. They claim that traditional tokenization methods work poorly on Twitter messages, that traditional IR paradigms need some reformulation, and that near-duplicate messages need to be grouped together.

  • Tokenization and Syntactic Filtering

The authors note that standard tokenizers, usually designed for newspapers or scientific publications, perform poorly on social media text, and especially on Twitter messages. They built a regex-based tokenizer that treats hashtags, @-replies, abbreviations, strings of punctuation, emoticons, and Unicode glyphs (e.g. musical notes) as tokens. They also apply syntactic filtering: they gather all n-grams up to trigrams but discard those that cross syntactic boundaries.
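
The tokenization and filtering step can be illustrated with a minimal Python sketch. The regular expression below is not the authors' actual pattern; it is an assumed, simplified one that covers the token types listed above. The names `TOKEN_RE`, `tokenize`, `is_boundary`, and `candidate_ngrams` are hypothetical, and the rule that a punctuation run, emoticon, or glyph marks a syntactic boundary is an assumption made only for this example.

```python
import re

# Illustrative pattern only (an assumption, not the paper's regex).
# Alternatives are tried in order, so more specific ones come first.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                      # URLs
    | [@\#]\w+                          # @-replies and hashtags
    | [:;=][\-o']?[()\[\]DPp/\\|]       # a few common emoticons
    | (?:[A-Za-z]\.){2,}                # abbreviations such as U.S.
    | \w+(?:'\w+)*                      # words, allowing internal apostrophes
    | [^\w\s]+                          # runs of punctuation or symbol glyphs
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a message into tokens using the illustrative pattern."""
    return TOKEN_RE.findall(text)


def is_boundary(token):
    """Assumed rule: a token that does not start with a letter, digit,
    '@' or '#' (punctuation run, emoticon, glyph) is a boundary."""
    return not (token[0].isalnum() or token[0] in "@#")


def candidate_ngrams(tokens, max_n=3):
    """Collect all n-grams up to max_n that do not cross a boundary."""
    # Split the token sequence into boundary-free segments.
    segments, current = [], []
    for token in tokens:
        if is_boundary(token):
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)

    # Enumerate n-grams inside each segment only.
    ngrams = []
    for segment in segments:
        for n in range(1, max_n + 1):
            for i in range(len(segment) - n + 1):
                ngrams.append(tuple(segment[i:i + n]))
    return ngrams


if __name__ == "__main__":
    tokens = tokenize("so excited, new #iphone :) yay")
    print(tokens)
    print(candidate_ngrams(tokens))
```

Splitting into segments before enumerating n-grams means that no candidate phrase can span a comma or an emoticon, which is one simple reading of "discard those that cross syntactic boundaries".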

  • Score and Filter Topic Phrase Candidates

The author presents a novel approach to combining the attribute values extracted for one person across documents. Two alternatives are considered: picking the most probable value, or voting for the most confident result. The voting method outperforms the first by almost 10% in each of the models tested.

  • Merge Similar Topics

The author then proposes bootstrapping across fields, using knowledge of one relationship to improve extraction of another. For example, to extract the birth year given knowledge of the birthday, the author argues that in training we could mark up each hook corpus with the known birthday b : birthday(x, b) and the target birth year y : birthyear(x, y), and add a feature to the CRF indicating whether the birthday has been seen in the sentence. In testing, for each hook, we could first find the birthday using the methods presented in the previous sections, annotate the corpus with the extracted birthday, and then apply the birth year CRF.

  • Group Near-duplicate Messages

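The authors argue that near-duplicate messages need to be grouped together, and the system applies near-duplicate detection to do so.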
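
One common way to group near-duplicate messages is sketched below in Python. This is not taken from the paper: it assumes that two messages are near-duplicates when the Jaccard similarity of their token trigram sets reaches a threshold, and that groups are the connected components of the resulting pairwise graph. The function names and the default `threshold=0.6` are hypothetical values chosen only for illustration.

```python
def trigram_set(text):
    """Set of token trigrams for a message (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(messages, threshold=0.6):
    """Group messages whose trigram sets are similar (assumed criterion).

    Messages are linked when their Jaccard similarity is at least
    `threshold`; groups are the connected components of those links,
    found with a small union-find structure.
    """
    parent = list(range(len(messages)))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [trigram_set(m) for m in messages]
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, message in enumerate(messages):
        groups.setdefault(find(i), []).append(message)
    return list(groups.values())


if __name__ == "__main__":
    msgs = [
        "RT @a: the new iphone is out today wow",
        "the new iphone is out today wow",
        "completely different message about lunch plans",
    ]
    for group in group_near_duplicates(msgs):
        print(group)
```

The pairwise comparison is quadratic in the number of messages, which is acceptable for a single page of search results but would need an index or hashing scheme for larger collections.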

Discussion

This system can serve as a baseline for our proposed course project on automatic Twitter message clustering based on hashtags.