Liuy writeup of Bellare and McCallum
This is a review of Bellare_2009_generalized_expectation_criteria_for_bootstrapping_extractors_using_record_text_alignment by user:Liuy.
The paper presents a CRF that aligns the tokens of an existing database record with the tokens of its actual representation in text. The goal is to train an extractor from the alignments between texts and database records: the alignments are used to derive annotations for the text, and those annotations are then used to train the extractor. The paper also discusses an extension to a multiple-state model.
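To make the idea of deriving annotations from alignments concrete, here is a minimal sketch in Python. It is not the paper's method: the learned CRF alignment is replaced by a greedy exact-match alignment on normalized tokens. The function names `normalize` and `project_labels`, the field names, and the example record and citation are all hypothetical and chosen only for illustration.

```python
def normalize(token):
    """Lowercase a token and strip surrounding punctuation (illustrative only)."""
    return token.lower().strip(".,;:()")


def project_labels(record, text_tokens, default="O"):
    """Label each text token with the record field it aligns to.

    Assumption: a greedy exact match stands in for the learned alignment.
    Each record token is aligned to the first unlabeled text token with the
    same normalized form; text tokens with no match keep the default label.
    """
    labels = [default] * len(text_tokens)
    normalized_text = [normalize(t) for t in text_tokens]
    for field, value in record.items():
        for record_token in value.split():
            target = normalize(record_token)
            for i, candidate in enumerate(normalized_text):
                if labels[i] == default and candidate == target:
                    labels[i] = field
                    break
    return labels


# Hypothetical database record and citation string.
record = {"author": "J. Smith", "title": "Learning to align", "year": "2009"}
tokens = "Smith, J. (2009). Learning to align records and text.".split()
labels = project_labels(record, tokens)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The resulting token/label pairs are the kind of automatically derived annotation that could serve as training data for an extractor. A greedy matcher like this fails on abbreviations, reordered fields, and noisy text, which is presumably part of the motivation for a learned alignment model.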
I have a few questions:
1. I am interested in the way they do the alignment, but I am not sure how and why a CRF is the right choice for this alignment problem, as opposed to other possible methods.
2. The alignment model does not explore Markov dependencies and correspondences in much depth. I think this is a limitation of their model.
3. Their evaluation is on citation extraction, where the extractor is trained on alignments between DBLP records and citation texts. The error reduction is not very convincing, as their approach very likely involves some preprocessing of the text.