Buza et al Scalable Event-based Clustering of Social Media via Record Linkage Techniques ICWSM 2011

From Cohen Courses

Citation

Krisztian Buza, Philipp Cimiano, Lucas Drumond, Timo Reuter, Lars Schmidt-Thieme. Scalable Event-based Clustering of Social Media via Record Linkage Techniques (ICWSM 2011)

Online version

ICWSM 2011

Summary

This paper tackles the problem of event detection in Flickr by clustering images in event-based clusters; that is, images of the same event form a single cluster. The authors' goals are to create a technique scalable to large volumes of data generated by social media outlets and to create clusters that are meaningful to users. In order to accomplish this, they use a variation of single-linkage clustering:

  1. Candidate pair generation
    • In order to avoid calculating similarity metrics for all possible document pairs, the authors use a "blocker" to generate candidate pairs. Because they are trying to detect events, they hypothesize that items that are temporally further apart are less likely to be about the same event, and thus less "similar." Therefore, they slide a window of fixed size over a temporally ordered set of the documents to generate candidate pairs.
  2. Pairwise feature extraction
    • For each candidate pair generated by the blocker, the authors extract features so that a similarity metric can be calculated. These are time, geographic location, and textual features (title, description, notes, "Where-On-Earth ID", and a concatenation of all the previous features) represented as TF-IDF weighted vector space models. When a feature is missing (e.g., not all images include a description), the corresponding similarity is set to 0.
  3. Pairwise decision model
    • A decision model combines the pairwise similarity scores to decide whether the two items in a candidate pair belong to the same event.
  4. Clustering
    • Given all of the above modules, the single-linkage clustering algorithm is run over the accepted candidate pairs to form the event clusters.
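The pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the window size, the document representation, and the `same_event` decision function are all placeholders for the paper's learned decision model.

```python
class UnionFind:
    """Disjoint-set structure used to merge accepted pairs into clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def candidate_pairs(docs, window=5):
    """Blocker: pair each document only with its temporal neighbours,
    instead of with every other document."""
    docs = sorted(docs, key=lambda d: d["time"])
    pairs = []
    for i, doc in enumerate(docs):
        for other in docs[i + 1 : i + 1 + window]:
            pairs.append((doc["id"], other["id"]))
    return pairs


def single_linkage(docs, same_event, window=5):
    """Single-linkage clustering over the blocker's candidate pairs:
    any accepted pair merges its two documents' clusters."""
    uf = UnionFind()
    for d in docs:
        uf.find(d["id"])
    for a, b in candidate_pairs(docs, window):
        if same_event(a, b):  # stands in for the pairwise decision model
            uf.union(a, b)
    clusters = {}
    for d in docs:
        clusters.setdefault(uf.find(d["id"]), []).append(d["id"])
    return list(clusters.values())
```

Because the blocker only generates O(n · window) pairs rather than O(n²), the expensive similarity computation runs on a much smaller set, which is the source of the paper's speedup.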

Experiments

For their experiments, the authors crawled last.fm and yahoo.upcoming.com for event IDs and then downloaded the Flickr images tagged with those IDs. This yielded a gold-standard dataset for system evaluation; for training and testing, the event IDs were removed from the images.

For their evaluation measures, they use normalized mutual information and the F-measure. The newly proposed method was found to be better than the state-of-the-art method in both NMI and F-measure. Furthermore, when the models were trained on the yahoo.upcoming.com dataset and tested on last.fm, they showed good transfer performance; that is, the model worked well cross-domain. Finally, the newly proposed method was significantly faster than the state of the art, due to performing fewer pairwise similarity calculations. The authors also showed that the algorithm can trade speed for accuracy by varying the temporal window size in the candidate pair generation module.
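For reference, normalized mutual information compares a predicted clustering against gold event labels. A small self-contained computation is sketched below, using the common 2·I(U;V)/(H(U)+H(V)) normalization; the paper's exact normalization variant may differ, and the inputs here are hypothetical.

```python
from collections import Counter
from math import log

def nmi(pred, gold):
    """Normalized mutual information between two labelings of the
    same items, normalized as 2*I(U;V) / (H(U) + H(V))."""
    n = len(pred)
    p_counts, g_counts = Counter(pred), Counter(gold)
    joint = Counter(zip(pred, gold))

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    mutual_info = sum(c / n * log(n * c / (p_counts[p] * g_counts[g]))
                      for (p, g), c in joint.items())
    denom = entropy(p_counts) + entropy(g_counts)
    return 2 * mutual_info / denom if denom else 1.0
```

NMI is 1 when the predicted clusters match the gold events exactly (up to relabeling) and 0 when they are independent, which makes it a natural fit for evaluating event clusterings whose cluster IDs carry no meaning on their own.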

Comments & Criticisms

This paper presents a method for event detection that is significantly faster than the state of the art while improving on quality at the same time. The fact that speed and accuracy can be traded off with a single parameter is also very useful. However, while the paper mentions that some of the documents are missing fields, no exact statistics are given. There is also no discussion of which features were weighted most heavily in the pairwise decision model; because the image dataset the authors used had a fairly rich representation, the results could differ substantially on a collection with less associated metadata, depending on the importance of those fields.

Study Plan

Some methods that may be useful to know: