Mota and Grishman, ACL-IJCNLP 2009
Citation
C. Mota and R. Grishman. 2009. Updating a Name Tagger Using Contemporary Unlabeled Data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers.
Online version
Summary
This paper investigates how well a semi-supervised named entity tagger trained on data from an earlier time period performs on contemporary data, and how it can be updated using contemporary unlabeled data. The authors use a co-training approach with seeds and unlabeled data. They experiment on the CETEMPublico data set, a Portuguese journalistic corpus covering 8 years of news text, organized in time spans of 6 months.
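To make the roles of "seeds" and "unlabeled data" concrete, here is a minimal sketch of a generic two-view co-training loop. It is not the authors' implementation: the summary does not say which views or learner they used, so the two views (`spelling` and `context`), the rule thresholds, and all function names and example data below are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def induce_rules(examples, labels, view, min_precision=0.9, min_count=2):
    """Learn feature -> label rules for one view from the examples labeled so far."""
    counts = defaultdict(Counter)
    for example, label in zip(examples, labels):
        if label is not None:
            counts[example[view]][label] += 1
    rules = {}
    for feature, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        # Keep only frequent, high-precision rules (thresholds are assumptions).
        if n >= min_count and n / sum(by_label.values()) >= min_precision:
            rules[feature] = label
    return rules


def apply_rules(examples, rules, view):
    """Label each example with the rule matching its feature in `view`, else None."""
    return [rules.get(example[view]) for example in examples]


def co_train(seed_rules, unlabeled, rounds=5):
    """Alternate between the two views, starting from seed spelling rules."""
    spelling_rules = dict(seed_rules)
    context_rules = {}
    for _ in range(rounds):
        # Spelling view labels the data; context view learns from those labels.
        labels = apply_rules(unlabeled, spelling_rules, "spelling")
        for feature, label in induce_rules(unlabeled, labels, "context").items():
            context_rules.setdefault(feature, label)
        # Context view labels the data; spelling view learns from those labels.
        labels = apply_rules(unlabeled, context_rules, "context")
        for feature, label in induce_rules(unlabeled, labels, "spelling").items():
            spelling_rules.setdefault(feature, label)  # seeds are never overwritten
    return spelling_rules, context_rules


# Hypothetical toy data: each example is a name plus its left context.
unlabeled = [
    {"spelling": "Lisboa", "context": "cidade de"},
    {"spelling": "Lisboa", "context": "cidade de"},
    {"spelling": "Porto", "context": "cidade de"},
    {"spelling": "Porto", "context": "cidade de"},
    {"spelling": "Sampaio", "context": "presidente"},
    {"spelling": "Sampaio", "context": "presidente"},
    {"spelling": "Soares", "context": "presidente"},
    {"spelling": "Soares", "context": "presidente"},
]
seeds = {"Lisboa": "LOC", "Sampaio": "PER"}

spelling_rules, context_rules = co_train(seeds, unlabeled)
```

In this sketch, "updating the seeds" would mean changing `seeds`, while "updating the unlabeled data" would mean changing `unlabeled`; these are the two options the experiments compare.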
The authors performed experiments in order to answer two questions:
- Is it better to update the seed or the unlabeled data?
The experiments showed that using seeds from the same time period as the test data does not help as much as using unlabeled data from the same period as the test data. A closer analysis showed that training with contemporary unlabeled data improves tag classification.
- Is it better to use large amounts of older unlabeled data?
The results showed that increasing the size of the unlabeled data does not always improve performance.
Using contemporary unlabeled data outperforms both using a larger amount of older unlabeled data and using contemporary seeds. This suggests that the tagger can be updated without labeling new data or adding more training data.