Brendan 2010 TweetMotif: Exploratory Search and Topic Summarization on Twitter
Revision as of 12:32, 3 February 2011
Citation
Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. In Proceedings of AAAI ICWSM.
Online version
An online version of this paper is available at [1].
Summary
This paper presents TweetMotif, an exploratory search application for Twitter. Its topic extraction system is based on syntactic filtering, language modeling, near-duplicate detection, and set cover heuristics. The authors also provide a demo of TweetMotif and its source code at http://tweetmotif.com.
Key Contributions
The paper addresses an interesting research problem: search on Twitter. First, the demo system is quite novel from a user's perspective. Second, there are several novel findings and experiments, including tokenization for Twitter, language modeling for Twitter messages, and topic merging. These are all new research areas open to further exploration.
Novel Experiments and Findings
The authors present a set of four experiments and findings, covering tokenization on Twitter, topic phrase extraction, topic merging, and the grouping of near-duplicate messages. They claim that traditional tokenization methods work poorly on Twitter messages, that traditional IR paradigms need some reworking, and that near-duplicate messages need to be grouped together.
- Tokenization and Syntactic Filtering
The authors note that standard tokenizers, usually designed for newspapers or scientific publications, perform poorly on social media text, especially Twitter messages. They built a regex-based tokenizer that treats hashtags, @-replies, abbreviations, strings of punctuation, emoticons, and Unicode glyphs (e.g. musical notes) as tokens. They also apply syntactic filtering: they gather all n-grams up to trigrams but discard those that cross syntactic boundaries.
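The tokenization and filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern `TOKEN_RE`, the helper names `tokenize`, `is_boundary`, and `candidate_ngrams`, and the rule that any token without a letter or digit counts as a syntactic boundary are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern (not the paper's actual regex).
# Alternatives are tried left to right, so specific forms come first.
TOKEN_RE = re.compile(
    r"""
    [@\#]\w+                        # @-replies and hashtags
  | [:;=][\-']?[()\[\]DPp/\\|]      # a few western-style emoticons
  | (?:[A-Za-z]\.){2,}              # abbreviations such as U.S.
  | \w+(?:'\w+)?                    # words, optionally with an apostrophe
  | [^\w\s]+                        # runs of punctuation or other glyphs
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a message into tokens using the pattern above."""
    return TOKEN_RE.findall(text)


def is_boundary(token):
    # Assumption: a token with no letter, digit, or underscore
    # (punctuation run, emoticon, glyph) marks a syntactic boundary.
    return re.search(r"\w", token) is None


def candidate_ngrams(tokens, max_n=3):
    """Collect unigrams to trigrams that do not cross a boundary token."""
    ngrams = []
    segment = []
    # The trailing None is a sentinel that flushes the last segment.
    for tok in list(tokens) + [None]:
        if tok is None or is_boundary(tok):
            for n in range(1, max_n + 1):
                for i in range(len(segment) - n + 1):
                    ngrams.append(tuple(segment[i:i + n]))
            segment = []
        else:
            segment.append(tok)
    return ngrams


if __name__ == "__main__":
    tokens = tokenize("good morning, twitter")
    print(tokens)
    print(candidate_ngrams(tokens))
```

In this sketch the comma splits "good morning, twitter" into two segments, so the bigram ("morning", "twitter") is never produced. A real system would likely need a richer boundary definition and explicit handling of URLs, which this pattern does not attempt.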
- Score and Filter Topic Phrase Candidates
The author presents a novel approach to combining the attribute values extracted for one person across documents. Two alternatives are considered: one picks the most probable value, and the other votes for the most confident result. The voting method outperforms the first by almost 10% in each of the models tested.
- Merge Similar Topics
The author then proposes to bootstrap across fields and use knowledge of one relationship to improve performance on the extraction of another. For example, to extract birth year given knowledge of the birthday, the author argues that in training we could mark up each hook corpus with the known birthday b : birthday(x, b) and the target birth year y : birthyear(x, y) and add an additional feature to the CRF that indicates whether the birthday has been seen in the sentence. In testing, for each hook, we could first find the birthday using the methods presented in the previous sections, annotate the corpus with the extracted birthday, and then apply the birth year CRF.
- Group Near-duplicate Messages
Discussion
This system can serve as one of the baseline systems for our proposed course project on automatic Twitter message clustering based on hashtags.