Law et al., ECML 2010
Learning to Tag from Open Vocabulary Labels. Edith Law, Burr Settles, and Tom Mitchell. In the proceedings of the ECML PKDD 2010 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
Online version
This paper uses Latent Dirichlet Allocation (LDA) topic modelling to address the problem of tag recommendation for music. The experiments use the TagATune dataset, a collection of 10,000 unique user-generated tags for 30,000 music clips.
The authors motivate the paper by noting that most approaches to classifying media assume a fixed vocabulary, and argue that machine learning techniques can exploit the open vocabularies generated by social tagging and crowd-sourcing communities. Obvious problems with an open vocabulary include noise from misspellings ("chello"), synonymy ("serene" and "mello"), compound phrases ("guitar plucking"), and the sheer size of the vocabulary.
For their experiments, the features extracted for each clip were those that typically perform best in the music tagging literature (see MIREX); they are not discussed in detail in the paper.
The authors' experiments show that their technique can reduce training time by 94% compared to learning the tags directly, while giving comparable or better performance in classification and retrieval of tags for music clips.
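To illustrate why working with topics can be cheaper than learning every tag directly, here is a minimal sketch of one plausible way to turn topic predictions into tag recommendations. It assumes a topic model has already been fitted over the tags and that some classifier has already predicted a topic distribution for a clip from its audio features. The function name `rank_tags`, the two toy topics, and all probabilities are hypothetical and are not taken from the paper.

```python
def rank_tags(topic_probs, topic_tag_probs, top_n=3):
    """Rank tags for one clip.

    Each tag is scored as sum_k P(topic k | clip) * P(tag | topic k).
    topic_probs:      list of P(topic k | clip), one entry per topic
    topic_tag_probs:  list of dicts, one per topic, mapping tag -> P(tag | topic k)
    """
    scores = {}
    for p_topic, tag_dist in zip(topic_probs, topic_tag_probs):
        for tag, p_tag in tag_dist.items():
            scores[tag] = scores.get(tag, 0.0) + p_topic * p_tag
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical toy topics (assumed numbers, for illustration only).
topic_tag_probs = [
    {"classical": 0.5, "cello": 0.3, "chello": 0.2},
    {"rock": 0.6, "guitar": 0.4},
]

# Assumed output of a classifier that maps audio features to topics.
clip_topic_probs = [0.9, 0.1]

top_tags = rank_tags(clip_topic_probs, topic_tag_probs)
print(top_tags)
```

Under these assumptions only one model per topic needs to be trained, rather than one per tag, and a misspelled tag such as "chello" can still be recommended if it shares a topic with related tags.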