Yandongl writeup of Bellare 2009
This is a review of the paper Bellare_2009_generalized_expectation_criteria_for_bootstrapping_extractors_using_record_text_alignment by user:Yandongl.
This paper studies extraction by inducing alignments between DBLP records and citation texts. The authors train a set of CRF models for different tasks, such as extraction with alignment and extraction without alignment (when no corresponding record exists). The alignment model (AlignCRF) is trained without labeled alignments by using generalized expectation criteria. Parameters are estimated by minimizing the divergence between model expectations and target expectations with the L-BFGS algorithm. The objective function is non-convex, but a local optimum suffices for this task. For the extractor (ExtrCRF), the objective function is convex, so a globally optimal solution is guaranteed.
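To make the idea of matching a model expectation to a target expectation concrete, here is a minimal toy sketch. It is not the paper's setup: it assumes a binary logistic model instead of a CRF, a squared distance as the divergence, and plain gradient descent instead of L-BFGS. All function names, the data, and the target value 0.9 are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expectation(theta, xs):
    # Model expectation of the constraint: mean P(y=1 | x) over unlabeled inputs.
    total = 0.0
    for x in xs:
        total += sigmoid(sum(t * v for t, v in zip(theta, x)))
    return total / len(xs)

def ge_loss(theta, xs, target):
    # Squared distance between target and model expectation (assumed divergence).
    return (target - expectation(theta, xs)) ** 2

def ge_grad(theta, xs, target):
    # Gradient of ge_loss with respect to theta, via the chain rule.
    diff = expectation(theta, xs) - target
    grad = [0.0] * len(theta)
    for x in xs:
        p = sigmoid(sum(t * v for t, v in zip(theta, x)))
        for j, v in enumerate(x):
            grad[j] += 2.0 * diff * p * (1.0 - p) * v / len(xs)
    return grad

def fit(xs, target, steps=2000, lr=1.0):
    # Plain gradient descent (a stand-in for L-BFGS), starting from zero weights.
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = ge_grad(theta, xs, target)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

if __name__ == "__main__":
    # Unlabeled inputs: a bias term plus one hypothetical feature value.
    xs = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.5]]
    theta = fit(xs, target=0.9)
    print(theta, expectation(theta, xs))
```

No labels appear anywhere in the sketch: the only supervision is the target expectation, which is the point of the generalized expectation approach.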
260 DBLP records are randomly sampled, and the corresponding citation texts are found by searching the web. Features for AlignCRF and ExtrCRF include character-based features, domain-specific patterns, regular expressions, etc. AlignCRF has some extra features for the alignment task.
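As a rough illustration of what character-based and regular-expression features for citation tokens might look like, here is a small sketch. The specific feature names and patterns below are my own guesses for illustration, not the feature set used in the paper.

```python
import re

# Hypothetical patterns; the paper's actual regular expressions are not given here.
YEAR_RE = re.compile(r"^(19|20)\d{2}$")
PAGES_RE = re.compile(r"^\d+-\d+$")
INITIAL_RE = re.compile(r"^[A-Z]\.$")

def token_features(token):
    """Return a set of binary feature names for one citation token."""
    feats = set()
    if token[:1].isupper():
        feats.add("INIT_CAP")        # character-based: starts with a capital
    if token.isdigit():
        feats.add("ALL_DIGITS")      # character-based: digits only
    if YEAR_RE.match(token):
        feats.add("YEAR")            # domain pattern: four-digit year
    if PAGES_RE.match(token):
        feats.add("PAGE_RANGE")      # domain pattern: e.g. 131-139
    if INITIAL_RE.match(token):
        feats.add("INITIAL")         # domain pattern: author initial
    feats.add("SUFFIX3=" + token[-3:].lower())  # character-based suffix
    return feats

if __name__ == "__main__":
    for tok in ["Bellare", "K.", "2009", "131-139"]:
        print(tok, sorted(token_features(tok)))
```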
In the alignment step, AlignCRF significantly outperforms traditional approaches such as HMM and Model4. ExtrCRF does likewise in the extraction step.
One question: it seems to me that the most popular alignment techniques in machine translation are still traditional methods such as HMMs and Giza++. Since CRFs work so well here, maybe it is time to switch to CRFs? Or is the alignment used in machine translation essentially different from the alignment used in this paper?