Siddharth writeup of Collective labeling

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

A review of Sutton_2004_collective_segmentation_and_labeling_of_distant_entities_in_information_extraction by user:sgopal1

  • CRF's represent a conditional distribution model over arbitrary graphs. In general CRF's have been applied to sequential chains, where a markov assumption is made. So to use a CRF, one does the following
    • Identify a set of features, the features denote some statistics of the corpus. ( As a side note, since the model assumes an exponential distribution, these features denote the sufficient statistic of the distribution ie one can throw away the training data after calculating the statistics )
    • Identify the conditional likelihood, differentiate, identify the weights of the features.
    • Given a new test instance, the best sequence label is now found using standard DP ( forward backward algorithm )
  • In a linear CRF, note that the feature functions cannot depend on arbitrary labels. Specifically, a feature function like 'the number of times a named entity has been seen so far' is an example of an invalid feature. This is because we would not know the value of this feature at test time. In order to incorporate such features, it is necessary to change the topology of the graph. ( Although it would be possible do inference in an exponential time by looping over all possible sequences ).
  • This paper adds a constraint to a linear CRF saying that "the same words belong the same label". They change the topology of the graph appropriately, by connecting the labels of the same words together. Because of the existence loops, it would no longer be possible to do inference using a forward backward method. They resort to the use of approximate inference using loopy belief propagation.
  • Criticism
    • Although it increases the F1 score on speakers, the performance on other fields decreases.
    • General question : Suppose I do not want the model to learn such dependencies, but however I will try incorporating this at the inference step. How will this change my results ? ( Some greedy approach similar in spirit to borkar et al )
      • The ILP approach of Roth & Yih is one way of doing this. There's also a paper by Finkel & Manning that uses a softer set of constraints and Gibbs sampling (cited in the 2-stage paper). Wcohen