R. Ghani. ICML 2002
Citation
R. Ghani. Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In Proceedings of ICML, 2002.
Online version
Summary
This paper presents a new semi-supervised learning algorithm.
It decomposes the multi-class classification problem into n binary problems using ECOC, and Co-training is used to learn each individual binary classifier (a rough sketch of this decomposition is given after the assumptions below).
The performance of this algorithm relies on two assumptions:
- ECOC can outperform Naive Bayes on multi-class problems.
- Co-training can improve over a single Naive Bayes classifier by using unlabeled data.
The joint effect of combining the two methods is expected to improve performance even further.
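As a concrete illustration, here is a minimal sketch (not the paper's actual implementation) of the ECOC + Co-training decomposition. It assumes two nonnegative count views X1 and X2 (e.g. from a random vocabulary split), labels y in {0..k-1} with -1 marking unlabeled rows, and uses scikit-learn's MultinomialNB as the base learner; the helper names (cotrain_binary, ecoc_cotrain_fit, ecoc_predict) are illustrative, not from the paper.

<pre>
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_binary(X1, X2, y_bin, rounds=5, grow=10):
    """Co-train one Naive Bayes classifier per view on a binary relabeling (-1 = unlabeled)."""
    labeled = y_bin != -1
    y_work = y_bin.copy()
    for _ in range(rounds):
        c1 = MultinomialNB().fit(X1[labeled], y_work[labeled])
        c2 = MultinomialNB().fit(X2[labeled], y_work[labeled])
        pool = np.where(~labeled)[0]
        if len(pool) == 0:
            break
        # Each view labels the unlabeled examples it is most confident about
        # (a simplification of the usual "top positive / top negative" selection).
        for clf, X in ((c1, X1), (c2, X2)):
            if len(pool) == 0:
                break
            conf = clf.predict_proba(X[pool]).max(axis=1)
            picked = pool[np.argsort(-conf)[:grow]]
            y_work[picked] = clf.predict(X[picked])
            labeled[picked] = True
            pool = np.where(~labeled)[0]
    return c1, c2

def ecoc_cotrain_fit(X1, X2, y, code):
    """code: (n_classes, n_bits) 0/1 matrix; learn one co-trained classifier pair per bit."""
    pairs = []
    for b in range(code.shape[1]):
        y_bin = np.full_like(y, -1)
        y_bin[y != -1] = code[y[y != -1], b]   # relabel each class by its ECOC bit
        pairs.append(cotrain_binary(X1, X2, y_bin))
    return pairs

def ecoc_predict(X1, X2, pairs, code):
    # Predict each bit by averaging the two views' probability of bit value 1
    # (assumes both bit values occur in the labeled data), then assign the class
    # whose codeword is nearest in Hamming distance.
    bits = np.column_stack([
        ((c1.predict_proba(X1)[:, -1] + c2.predict_proba(X2)[:, -1]) / 2 > 0.5).astype(int)
        for c1, c2 in pairs
    ])
    dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
</pre>

Any 0/1 code matrix with well-separated rows can be passed as code; by definition, ECOC uses an error-correcting code so that a few wrong bit predictions can still map to the correct class.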
To evaluate this approach, the author compared it with Naive Bayes, EM, and Co-training on two datasets, Hoovers and Jobs.
The Hoovers dataset contains over 108,000 web pages from different companies. Since there is no natural feature split, the author randomly split the vocabulary into two halves and treated them as two separate feature sets.
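A minimal sketch of such a random vocabulary split is shown below; the function name and the assumption that X is a (documents x vocabulary) count matrix are mine, not the paper's.

<pre>
import numpy as np

def random_feature_split(X, seed=0):
    # Shuffle the vocabulary indices and split them into two disjoint halves,
    # each treated as one "view" for Co-training.
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, order[:half]], X[:, order[half:]]
</pre>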
The other dataset used in the experiments is the Jobs dataset, where job titles and job descriptions are used as the two separate feature sets for Co-training.
On the Hoovers dataset, using only 10% of the data as labeled, Co-training performs worse than a supervised Naive Bayes (NB) classifier, while ECOC is superior to NB; the assumption that Co-training would improve over Naive Bayes did not hold here. One surprising result, to me, is that the combination of Co-training and ECOC still outperforms ECOC alone.
The author also experimented with a combination of ECOC and Co-EM. Co-EM is a hybrid algorithm that combines elements of EM and Co-training; it was first introduced in Nigam & Ghani, CIKM 2000. However, on both datasets the combination of ECOC and Co-training outperforms this new combination.
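For concreteness, below is a minimal sketch of a Co-EM-style loop under the usual description of the algorithm: unlike Co-training, each view's classifier labels all unlabeled examples for the other view on every iteration. It simplifies by using hard labels instead of the probabilistic (EM-style) labels of the original, and the function name is illustrative, not from the paper.

<pre>
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(X1, X2, y, iters=10):
    labeled = y != -1
    clf1 = MultinomialNB().fit(X1[labeled], y[labeled])   # bootstrap view 1 on labeled data
    for _ in range(iters):
        y2 = y.copy()
        y2[~labeled] = clf1.predict(X1[~labeled])          # view 1 labels the whole unlabeled pool
        clf2 = MultinomialNB().fit(X2, y2)                 # view 2 trains on everything
        y1 = y.copy()
        y1[~labeled] = clf2.predict(X2[~labeled])          # view 2 labels the pool back
        clf1 = MultinomialNB().fit(X1, y1)                 # view 1 retrains
    return clf1, clf2
</pre>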