Adaptive Real-time Filtering in Twitter
Comments
Looks like a nice well-defined project. Can you say a little bit about how this is different from your research - I know you're working on similar stuff with Jamie. --Wcohen 14:42, 10 October 2012 (UTC)
My research with Jamie focused on ad-hoc search when I was working with Twitter. The filtering task is a new problem for me, although I admit that I'm hoping to reuse some of the Tweet processing infrastructure I have set up for my ad-hoc search project. I also recently switched my main research project to federated search, so I was hoping to keep my fingers in the old stuff. --Yubink 17:24, 11 October 2012 (UTC)
Team members
Project Summary
This project will explore how to create a topic-based filter for tweets arriving in real time, assuming that user judgements (relevant vs. non-relevant) are available for the tweets shown by the system. The project will follow the framework of the Real-time Filtering task of the Microblog Track in the Text REtrieval Conference (TREC), a well-known competitive conference hosted by NIST each year [1]. The goal is to produce a competitive system for entry into the 2013 run of the track.
Dataset
The project will use the Microblog Track dataset, queries and relevance judgements. The tweet dataset contains 14 million tweets and 50 query topics with relevance judgements. Also available from a previous project is a web crawl of 1 million HTML documents that were linked from tweets.
Task
Given a topic query, a query time, and a corpus of tweets prior to the topic query time, the project aims to filter future tweets such that only tweets relevant to the topic are returned. Any future tweets shown to the "user" will receive feedback that can be incorporated back into the system.
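As a rough illustration of the task loop described above, here is a minimal sketch in Python. It is not the planned system: the term-overlap score, the threshold update rule, and all names (`AdaptiveFilter`, `run_filter`, `judge`) are hypothetical stand-ins chosen only to show where the feedback enters.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveFilter:
    """Toy filter: term-overlap score compared against a moving threshold."""
    query_terms: set
    threshold: float = 0.5   # assumed starting point, not from the proposal
    step: float = 0.05       # assumed adjustment size

    def score(self, text):
        # Fraction of query terms that appear in the tweet.
        tokens = set(text.lower().split())
        if not self.query_terms:
            return 0.0
        return len(tokens & self.query_terms) / len(self.query_terms)

    def update(self, relevant):
        # Feedback on a shown tweet: loosen the filter after a relevant
        # tweet, tighten it after a non-relevant one.
        if relevant:
            self.threshold = max(0.0, self.threshold - self.step)
        else:
            self.threshold = min(1.0, self.threshold + self.step)


def run_filter(flt, stream, query_time, judge):
    """Process (timestamp, text) pairs in arrival order.

    Only tweets after query_time are considered, and only tweets that
    are actually shown receive a judgement from `judge`.
    """
    shown = []
    for timestamp, text in stream:
        if timestamp <= query_time:
            continue
        if flt.score(text) >= flt.threshold:
            shown.append((timestamp, text))
            flt.update(judge(text))
    return shown
```

The point of the sketch is the control flow: tweets that are never shown never produce feedback, so an overly strict filter cannot learn that it is missing relevant tweets.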
Baseline
The baseline of the project will be the ranked list of tweets returned from a search engine queried with the topic. (Of course, the ranked list will be filtered so that only tweets posted after the topic query time are shown, and re-ordered so that those tweets appear in temporal order.)
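The baseline post-processing could be sketched as follows, assuming the search engine returns `(tweet_id, timestamp, score)` tuples in descending score order. The function name, the tuple layout, and the optional cutoff `k` are assumptions made for illustration; the proposal does not specify how the list is truncated.

```python
def baseline_filter(ranked, query_time, k=None):
    """Turn a ranked retrieval list into a temporally ordered baseline.

    ranked: list of (tweet_id, timestamp, score), best score first.
    query_time: only tweets strictly after this time are kept.
    k: optional cutoff on the number of top-ranked future tweets (assumed).
    """
    # Keep only tweets from the future of the topic query time.
    future = [r for r in ranked if r[1] > query_time]
    # Optionally keep only the k best-scoring survivors.
    if k is not None:
        future = future[:k]
    # Re-order by timestamp so the output reads like a filtered stream.
    return sorted(future, key=lambda r: r[1])
```

For example, calling `baseline_filter(ranked, query_time=5, k=2)` on a list whose two best-scoring future tweets have timestamps 10 and 7 returns them in the order 7, then 10.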
Challenges
- The current plan for the system involves modifying the Indri search engine, which will require heavy implementation work in C++.