Brendan 2010 TweetMotif: Exploratory Search and Topic Summarization for Twitter
Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. In Proceedings of AAAI ICWSM.
An online version of this paper is available at .
This paper presents TweetMotif, an exploratory search application for Twitter. Its topic extraction system is based on syntactic filtering, language modeling, near-duplicate detection, and set cover heuristics. The authors also provide a demo of TweetMotif and its source code at http://tweetmotif.com.
The paper addresses a very interesting research problem: search on Twitter. First, the demo system is quite novel from a user's perspective. Second, there are several novel findings and experiments, including tokenization of Twitter text, language modeling for Twitter messages, and topic merging. These are all new research areas open to further exploration.
Novel Experiments and Findings
The authors present four interesting experiments and findings, covering tokenization of Twitter messages, topic phrase extraction, topic merging, and near-duplicate grouping. They claim that traditional tokenization methods work poorly on Twitter messages, that traditional IR paradigms need some adaptation, and that near-duplicate messages need to be grouped together.
- Tokenization and Syntactic Filtering
The authors note that standard tokenizers, usually designed for newspaper text or scientific publications, perform poorly on social media, especially on Twitter messages. They built a regex-based tokenizer that treats hashtags, @-replies, abbreviations, strings of punctuation, emoticons, and Unicode glyphs (e.g., musical notes) as tokens. They also apply syntactic filtering: they gather all n-grams up to trigrams but discard those that cross syntactic boundaries.
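The tokenization and filtering step can be illustrated with a minimal Python sketch. The patterns below are illustrative assumptions, not the regexes used in TweetMotif, and the names `TOKEN_RE`, `tokenize`, `is_boundary`, and `candidate_ngrams` are hypothetical. The sketch also assumes that a token with no letter or digit marks a syntactic boundary, which is a crude stand-in for the paper's filtering rules.

```python
import re

# Illustrative patterns only; these are NOT the TweetMotif regexes.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                    # URLs
    | [@\#]\w+                      # @-replies and hashtags
    | [:;=][\-o']?[()\[\]DPp/\\|]   # a few common emoticons
    | (?:[A-Za-z]\.){2,}            # abbreviations such as U.S.
    | \w+(?:'\w+)*                  # words, including contractions
    | [^\w\s]+                      # runs of punctuation or symbol glyphs
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a tweet into tokens, keeping Twitter-specific items whole."""
    return TOKEN_RE.findall(text)


def is_boundary(token):
    # Assumption: a token without any letter or digit acts as a boundary.
    return not any(ch.isalnum() for ch in token)


def candidate_ngrams(tokens, max_n=3):
    """Collect unigrams up to trigrams that do not contain a boundary token."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if not any(is_boundary(t) for t in window):
                grams.append(" ".join(window))
    return grams
```

For example, `candidate_ngrams(tokenize("good morning! coffee"))` keeps "good morning" but drops any n-gram spanning the exclamation mark.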
- Score and Filter Topic Phrase Candidates
TweetMotif takes a simple language modeling approach to identifying the topic phrases that are most distinctive for a tweet result set, scoring each phrase by the likelihood ratio of its probability in the result set to its probability in a background tweet language model. The intuition is close to that of TF-IDF in IR. However, Twitter has some distinctive properties: for example, because messages are so short, the TF of a particular word is essentially the document (message) frequency of that word. The authors therefore suggest that IR paradigms need some adaptation for this new context.
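A rough sketch of the likelihood-ratio scoring might look like the following. The function name `phrase_scores`, its parameters, and the additive smoothing of the background estimate are assumptions made for illustration; the paper's exact estimators may differ.

```python
def phrase_scores(result_counts, result_total,
                  background_counts, background_total, smoothing=1.0):
    """Score each phrase by P(phrase | result set) / P(phrase | background).

    Counts are message frequencies. The additive smoothing on the background
    estimate is an assumption that avoids division by zero for unseen phrases.
    """
    scores = {}
    for phrase, count in result_counts.items():
        p_result = count / result_total
        p_background = (
            (background_counts.get(phrase, 0) + smoothing)
            / (background_total + smoothing)
        )
        scores[phrase] = p_result / p_background
    return scores
```

Under this scoring, a common word such as "the" gets a ratio near 1, while a phrase that is frequent in the result set but rare in the background gets a much larger ratio.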
- Merge Similar Topics
Every candidate phrase defines a topic, that is, the set of messages that contain that phrase. Many phrases, however, occur in roughly the same set of messages, so their topics are repetitive. The authors use two methods to merge similar topics. First, they merge overlapping (in fact, subsuming) topic phrases. Second, they compare the message sets directly and merge topics whose message sets have a Jaccard similarity above 90%.
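The second merging method can be sketched as below. This is a greedy single-pass approximation written for illustration, not the authors' implementation; `jaccard` and `merge_similar_topics` are hypothetical names, and comparing each topic against the accumulated message set of an earlier group is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_topics(topics, threshold=0.9):
    """Greedily merge topics whose message sets overlap above the threshold.

    `topics` maps a phrase to the set of message ids containing it.
    Returns a list of (phrases, message_ids) pairs.
    """
    merged = []
    for phrase, ids in topics.items():
        for phrases, merged_ids in merged:
            if jaccard(ids, merged_ids) > threshold:
                phrases.append(phrase)
                merged_ids |= ids  # in-place update of the group's set
                break
        else:
            # No sufficiently similar group was found; start a new one.
            merged.append(([phrase], set(ids)))
    return merged
```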
- Group Near-duplicate Messages
The authors point out the massive amount of message duplication on Twitter, including forwarded messages, repetitive advertisements, spam, and news feeds. Their algorithm is therefore designed to group messages by the Jaccard similarity of their trigram sets: messages are grouped if their pairwise similarity exceeds 65%.
This system can serve as one of the baseline systems for our proposed course project on automatic Twitter message clustering based on hashtags.
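A minimal sketch of trigram-based near-duplicate grouping follows. It assumes whitespace tokenization and lowercasing, and it compares each message only against the first member of each existing group, which is a simplification of true pairwise grouping. The names `trigrams` and `group_near_duplicates` are hypothetical.

```python
def trigrams(tokens):
    """Return the set of token trigrams in a message."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def group_near_duplicates(messages, threshold=0.65):
    """Group messages whose trigram sets have Jaccard similarity above threshold.

    Simplification: each message is compared only with the representative
    (first member) of each existing group.
    """
    groups = []  # each entry: (representative trigram set, member messages)
    for msg in messages:
        grams = trigrams(msg.lower().split())
        for rep_grams, members in groups:
            union = grams | rep_grams
            if union and len(grams & rep_grams) / len(union) > threshold:
                members.append(msg)
                break
        else:
            groups.append((grams, [msg]))
    return [members for _, members in groups]
```

With this sketch, a message and its "RT"-prefixed copy land in the same group, while an unrelated advertisement starts a group of its own.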