Kashoob, Caverlee and Ding ICWSM 2009
This is a paper for Social Media Analysis 10-802 in Fall 2012.
Citation
Kashoob, Said, and Caverlee, James, and Ding, Ying, A Categorical Model for Discovering Latent Structure in Social Annotations, 2009, In Proceedings of International Conference on Weblogs and Social Media (ICWSM 2009)
Online version
A Categorical Model for Discovering Latent Structure in Social Annotations
Summary
This paper develops a latent Graphical Model for tagged content, such as Flickr images and YouTube videos. The authors model a community of users that selects from a finite set of categories and generates tags from them, much as Latent Dirichlet Allocation (LDA) selects topics and then generates the words of a document from those topics. In the authors' Community-based Categorical Annotation (CCA) Model, a category is a mixture of tags and a community is a mixture of categories. Each object consists of content and a social annotation document, where the social annotation document is a list of all tags applied to the object and their frequencies. The paper aims to provide a better model for understanding documents with associated tag data, using unsupervised learning techniques to uncover latent structure in large-scale tag annotations.
In the model, both the communities and categories are latent variables, whereas the content and social annotation documents are visible.
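The generative story described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a community is drawn for each tag, then a category from that community's mixture, then a tag from that category's mixture. The function and parameter names (`generate_annotations`, `community_mix`, and so on) are hypothetical.

```python
import random

def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate_annotations(community_mix, category_given_community,
                         tag_given_category, n_tags, rng):
    """Generate one social annotation document as a list of tag indices.

    community_mix: weights over communities for this object (assumed).
    category_given_community[c]: community c as a mixture of categories.
    tag_given_category[z]: category z as a mixture of tags.
    """
    tags = []
    for _ in range(n_tags):
        c = sample_index(community_mix, rng)                # latent community
        z = sample_index(category_given_community[c], rng)  # latent category
        t = sample_index(tag_given_category[z], rng)        # observed tag
        tags.append(t)
    return tags
```

Only the returned tags would be visible in real data; the community and category draws are the latent variables that inference has to recover.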
The authors applied their method to two popular social-tagging services, Flickr and Delicious. They also performed traditional Latent Dirichlet Allocation Topic modeling on the Delicious data set to understand the relationship between tag annotations and the content within those documents. Their data sets consisted of 92,000 Flickr images with 44,980 unique tags, and 27,572 Delicious web pages with 16,216 unique annotations.
To estimate the parameters of their graphical model, the authors use a form of Gibbs Sampling, a special case of Markov Chain Monte Carlo methods that remains tractable for large models with many hidden variables. To choose the number of categories, they trained the model on part of the data, computed the perplexity of the held-out remainder, and selected the number of categories that gave the minimum perplexity.
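The perplexity-based selection can be sketched as below. This is an illustrative outline under the standard definition of perplexity (the exponential of the negative average log-likelihood per held-out tag); the paper's exact evaluation procedure is not reproduced here, and the helper names are hypothetical.

```python
import math

def perplexity(doc_log_likelihoods, doc_tag_counts):
    """Perplexity of held-out data: exp(-total log-likelihood / total tags).

    doc_log_likelihoods: natural-log likelihood of each held-out document
    under the trained model. doc_tag_counts: number of tags in each document.
    """
    total_ll = sum(doc_log_likelihoods)
    total_tags = sum(doc_tag_counts)
    return math.exp(-total_ll / total_tags)

def pick_num_categories(perplexity_by_k):
    """Return the candidate number of categories with the lowest perplexity.

    perplexity_by_k maps each candidate K to its held-out perplexity.
    """
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

As a sanity check on the definition, a model that assigns uniform probability over V tags has perplexity V, so lower values indicate a model that is less "surprised" by the held-out annotations.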
Results
The authors discovered that semantically coherent categories could be uncovered from tag data using their graphical model, and that distinct communities of interest were also apparent. Assuming a single community, their unsupervised method found 70 categories in the Flickr data set and 40 categories in the Delicious data set. The authors leave estimating the number of communities in an unsupervised manner as future work. For the paper, they experimented with different numbers of communities, though they do not report how many.
Interestingly, the authors found that when their estimated category topics were compared to topics generated by running LDA on the Delicious data set (using Jensen-Shannon Divergence between vectorized topics as a measure of similarity), the set of categories was complementary, but not identical, to the topics generated by LDA. Furthermore, objects given similar tags did not necessarily have similar content. Therefore, the coherent tag categories could serve as a (partially) orthogonal set of features to use for querying similar documents. The authors showed that when a set of documents related to solving a Rubik's cube were queried for similar documents using only (content based) topical similarity, the documents returned were math documents (due to the puzzle's mathematical nature). When only (tag based) category similarity was used, documents related to games in general were retrieved. However, when both topic and category similarity were used simultaneously, documents specifically about the Rubik's cube were successfully retrieved.
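For reference, the Jensen-Shannon divergence used to compare categories with LDA topics can be computed as in the sketch below. It assumes both inputs are probability vectors over the same vocabulary, in the same order; how the paper aligned tag and word vocabularies is not shown here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors.

    Averages the KL divergence of p and of q from their midpoint m.
    The result is symmetric, 0 for identical vectors, and at most ln(2).
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one vector has zeros where the other does not, which makes it convenient for comparing sparse topic vectors.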
Discussion
This paper has some useful findings:
- Unsupervised methods of content modeling can be successfully applied to tag data for uncovering semantically coherent categories.
- Categories discovered by the CCA model tend to complement, rather than be identical to, topics generated from the content using LDA.
- Use of tag categories in combination with content topics can result in better query results when searching for documents based on similarity.
Related papers
- Blei, Ng and Jordan JMLR 2003 Seminal work on using Latent Dirichlet Allocation to uncover coherent topics in a set of documents.
- Golder and Huberman IS 2005 A study on social tagging within the Delicious community.
- Zhou et al. WWW 2008 Proposes a model to unify tag data with document content data for the purpose of information retrieval.
- Heinrich (Technical Report) 2004 Describes the method of Gibbs sampling for parameter estimation.
Study plan
Some concepts which may aid in understanding this paper:
- Markov-Chain Monte Carlo: the family of methods the authors use to estimate the parameters of their graphical model.
- Perplexity
- Jensen-Shannon Divergence