R. Ghani. ICML 2002

From Cohen Courses

Revision as of 15:30, 30 November 2010

Citation

R. Ghani. Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In Proceedings of ICML, 2002.

Online version

ECOC and Co-training

Summary

This paper presents a new semi-supervised learning algorithm.

It decomposes a multi-class classification problem into n binary problems using error-correcting output codes (ECOC), and uses Co-training to learn each individual binary classifier.
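The decomposition step can be sketched as follows. This is a minimal illustration, assuming randomly generated codewords and Hamming-distance decoding; the function names (`make_code_matrix`, `binary_labels`, `decode`) are hypothetical, and this summary does not describe how the paper actually designs its codes. The binary learner itself is omitted: in the approach summarized here, each binary problem would be learned with Co-training.

```python
import random


def make_code_matrix(num_classes, num_bits, seed=0):
    """Return one distinct binary codeword (a tuple of 0/1) per class.

    Assumption: codewords are drawn at random. A real ECOC design would
    also try to maximize the distance between codewords and avoid
    constant columns.
    """
    if 2 ** num_bits < num_classes:
        raise ValueError("num_bits is too small for distinct codewords")
    rng = random.Random(seed)
    codes = set()
    while len(codes) < num_classes:
        codes.add(tuple(rng.randint(0, 1) for _ in range(num_bits)))
    return sorted(codes)


def binary_labels(class_labels, code_matrix, bit):
    """Relabel multi-class labels as 0/1 targets for binary problem `bit`."""
    return [code_matrix[c][bit] for c in class_labels]


def decode(bit_predictions, code_matrix):
    """Return the class whose codeword is closest in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    return min(
        range(len(code_matrix)),
        key=lambda c: hamming(code_matrix[c], bit_predictions),
    )
```

With n bits, `binary_labels` is called once per bit to produce the n binary training problems, and `decode` combines the n binary predictions for a document back into a single class.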

To evaluate this approach, the author compared it with Naive Bayes, EM and Co-training on two datasets, Hoovers and Jobs.

The Hoovers dataset contains over 108,000 web pages of different companies. Since there is no natural feature split, the author randomly split the vocabulary into two halves and treated them as two separate feature sets.
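A random vocabulary split of this kind could look like the sketch below. It assumes documents are already tokenized into lists of words; the helper names `split_vocabulary` and `project` and the fixed seed are illustrative choices, not details from the paper.

```python
import random


def split_vocabulary(vocabulary, seed=0):
    """Randomly partition the vocabulary into two disjoint halves (views)."""
    words = sorted(vocabulary)          # sort first so the split is reproducible
    random.Random(seed).shuffle(words)
    half = len(words) // 2
    return set(words[:half]), set(words[half:])


def project(tokens, view):
    """Keep only the tokens of a document that belong to the given view."""
    return [t for t in tokens if t in view]
```

Each document then yields two feature vectors, one per view, which Co-training can treat as if they came from two naturally separate feature sets.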

The other dataset used in the experiments is the Jobs dataset. Job titles and job descriptions are used as the two separate feature sets for Co-training.