Mota and Grishman, ACL-IJCNLP 2009
C. Mota and R. Grishman. 2009. Updating a Name Tagger Using Contemporary Unlabeled Data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers.
This paper investigates how a semi-supervised named entity tagger that was trained on data from an earlier time period performs on contemporary text, and how contemporary unlabeled data can be used to update it. The authors use a co-training approach with seeds and unlabeled data. They experiment on the CETEMPublico data set, a Portuguese newspaper corpus covering 8 years of news text, divided into time spans of 6 months. The test set is fixed to the last 6 months, while the seeds and the unlabeled data are selected from the 16 available time spans. The authors either take both the seeds and the unlabeled data from each time span in turn, or fix one of the two to the last epoch and take the other from each epoch in turn, or iteratively augment the unlabeled data with older data.
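The experimental settings described above can be enumerated with a short sketch. This is only an illustration of the design, not the authors' code: the names `Config` and `build_configs` are hypothetical, the "both vary" setting is assumed to draw seeds and unlabeled data from the same epoch, and the seed epoch for the augmented setting is not stated in this summary, so it is fixed to the last epoch purely as an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_EPOCHS = 16  # 8 years of news text split into 6-month time spans


@dataclass(frozen=True)
class Config:
    """One training setting (hypothetical structure, for illustration only)."""
    name: str
    seed_epoch: int                    # epoch the seeds are drawn from
    unlabeled_epochs: Tuple[int, ...]  # epochs the unlabeled data is drawn from


def build_configs(n_epochs: int = N_EPOCHS) -> List[Config]:
    """Enumerate the settings; the test set is always the last epoch."""
    last = n_epochs
    configs: List[Config] = []
    for i in range(1, n_epochs + 1):
        # Seeds and unlabeled data both taken from epoch i
        # (assumption: the same epoch is used for both).
        configs.append(Config("both_vary", i, (i,)))
        # Unlabeled data fixed to the last epoch, seeds taken from epoch i.
        configs.append(Config("seeds_vary", i, (last,)))
        # Seeds fixed to the last epoch, unlabeled data taken from epoch i.
        configs.append(Config("unlabeled_vary", last, (i,)))
        # Unlabeled data from the last epoch augmented with older epochs
        # back to epoch i (assumption: seeds fixed to the last epoch).
        configs.append(
            Config("unlabeled_augmented", last, tuple(range(i, last + 1)))
        )
    return configs


if __name__ == "__main__":
    for cfg in build_configs():
        print(cfg)
```

Each `Config` would then be used to train one co-trained tagger, which is evaluated on the fixed test set from the last 6 months.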
The authors performed experiments in order to answer two questions:
- Is it better to update the seed or the unlabeled data?
The experiments showed that using seeds from the same time period as the test set does not help as much as using unlabeled data from that period. A closer analysis showed that training with contemporary unlabeled data improves the classification of tags.
- Is it better to use large amounts of older unlabeled data?
The authors observed that increasing the amount of unlabeled data does not always improve performance.
Using contemporary unlabeled data outperforms both using a larger amount of older unlabeled data and using contemporary seeds. The authors therefore conclude that there is no need to label new data or to use more training data.
To deal with contemporary data, other work has focused on decreasing the out-of-vocabulary (OOV) rate by adding new names from contemporary texts, or on adapting language models. References to these papers can be found in the paper.
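For readers unfamiliar with the term, the OOV rate is commonly computed as the fraction of test tokens that never occur in the training data. The sketch below is a minimal token-level version of that definition; the function name `oov_rate` and the toy token lists are hypothetical and are not taken from the paper or the works it cites.

```python
from typing import Iterable


def oov_rate(train_tokens: Iterable[str], test_tokens: Iterable[str]) -> float:
    """Fraction of test tokens that do not appear in the training vocabulary."""
    vocab = set(train_tokens)
    test = list(test_tokens)
    if not test:
        return 0.0  # convention chosen here for an empty test set
    unseen = sum(1 for tok in test if tok not in vocab)
    return unseen / len(test)


if __name__ == "__main__":
    # Toy data: names seen in older text vs. names in contemporary text.
    older = ["Lisboa", "Porto", "Cavaco"]
    contemporary = ["Lisboa", "Sócrates", "Porto", "Obama"]
    print(oov_rate(older, contemporary))  # 2 of 4 tokens are unseen
```

Adding new names from contemporary texts to the training vocabulary lowers this value, which is the effect the related work aims for.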