KeisukeKamataki writeup of Bellare 2009
This is a review of Bellare_2009_generalized_expectation_criteria_for_bootstrapping_extractors_using_record_text_alignment by user:KeisukeKamataki.
Summary: They propose an approach to training an information extraction model that makes use of unlabeled text and an existing database. Specifically, they train an alignment CRF (AlignCRF) on word alignments between database records and unlabeled text, so that each text token receives a label. The best alignment is computed with Baum-Welch and Viterbi as part of CRF inference. Using the training set labeled by AlignCRF, they then train ExtrCRF to extract citation-matching information, again with the Baum-Welch algorithm. The ExtrCRF parameters lambda are estimated by minimizing the KL divergence between the label distribution predicted by AlignCRF and the one predicted by ExtrCRF. For label prediction, AlignCRF significantly outperformed HMM and IBM Model 4. For the citation matching problem, ExtrCRF worked best among the machine-learning-based approaches (and came fairly close to manual labeling for some fields, such as author or pages).
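To make the KL-divergence objective in the summary concrete, here is a minimal sketch in Python. It assumes the two models are compared through per-token label distributions, which is a simplification for illustration and not the paper's actual training procedure. The function names `kl_divergence` and `mean_token_kl` and the toy probabilities are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same label set."""
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_token_kl(align_dists, extr_dists):
    """Average per-token KL between alignment-model and extractor label distributions."""
    # Assumption: both lists hold one distribution per token, in the same order.
    pairs = list(zip(align_dists, extr_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Toy example: two tokens, labels = (author, title). Values are made up.
align = [[0.9, 0.1], [0.2, 0.8]]   # stand-in for AlignCRF token marginals
extr = [[0.6, 0.4], [0.5, 0.5]]    # stand-in for ExtrCRF token marginals
loss = mean_token_kl(align, extr)
```

In this toy setting, a lower `loss` means the extractor's predictions sit closer to the labels induced by the alignment model, which is the quantity the extractor's parameters would be adjusted to reduce.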
I like: This paper tackles an interesting task: making use of unlabeled text to train the model. I'm interested in whether this approach also works for other domains (or open-domain information extraction), and in how many database records and how much text are needed to get good performance in practice.