Sgardine project status report

From Cohen Courses
  • Do you plan on looking at the same problem, or have you changed your plans?

The approach has not changed, but it has acquired more interesting theoretical grounding. I'm now following the theoretical framework of Conditional Graphical Models (Perez-Cruz & Ghahramani, 2007) to motivate using a multiclass classifier on the cliques of the graph to label a sequence. I'm still planning to use a perceptron instead of an SVM, which lets me treat the problem as an online-learning setting for evaluation purposes (i.e., I plan to plot the algorithm's approach to the baseline as it is exposed to the training data).
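As a rough illustration of the online setting, here is a minimal sketch of a plain multiclass perceptron that predicts before each update and records cumulative mistakes, which is one quantity that could be plotted against the amount of training data seen. It classifies a single feature vector at a time and is not the Perez-Cruz & Ghahramani formulation; the class name `OnlinePerceptronSketch`, the method names, and the string feature representation are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative multiclass perceptron over sparse binary features (hypothetical names). */
class OnlinePerceptronSketch {
    // One weight map per label: feature name -> weight.
    private final Map<String, Map<String, Double>> weights = new HashMap<>();
    private final List<String> labels;

    OnlinePerceptronSketch(List<String> labels) {
        this.labels = labels;
        for (String label : labels) {
            weights.put(label, new HashMap<>());
        }
    }

    /** Score of a label is the sum of its weights for the active features. */
    private double score(String label, List<String> features) {
        Map<String, Double> w = weights.get(label);
        double s = 0.0;
        for (String f : features) {
            s += w.getOrDefault(f, 0.0);
        }
        return s;
    }

    /** Highest-scoring label; ties go to the earliest label in the list. */
    String predict(List<String> features) {
        String best = labels.get(0);
        double bestScore = score(best, features);
        for (String label : labels) {
            double s = score(label, features);
            if (s > bestScore) {
                best = label;
                bestScore = s;
            }
        }
        return best;
    }

    /** Predicts first, then updates only on a mistake. Returns true if the prediction was wrong. */
    boolean learn(List<String> features, String gold) {
        String guess = predict(features);
        if (guess.equals(gold)) {
            return false;
        }
        for (String f : features) {
            weights.get(gold).merge(f, 1.0, Double::sum);   // promote the gold label
            weights.get(guess).merge(f, -1.0, Double::sum); // demote the wrong guess
        }
        return true;
    }

    /** Cumulative mistake count after each example, in presentation order. */
    static int[] mistakeCurve(OnlinePerceptronSketch model,
                              List<List<String>> xs, List<String> ys) {
        int[] curve = new int[xs.size()];
        int mistakes = 0;
        for (int i = 0; i < xs.size(); i++) {
            if (model.learn(xs.get(i), ys.get(i))) {
                mistakes++;
            }
            curve[i] = mistakes;
        }
        return curve;
    }
}
```

Because each example is scored before the model is updated on it, the curve needs no separate held-out set, which is what makes the online view convenient for evaluation.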

  • What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc)?

First, I plan to do NER (so that I can roughly compare, at least as a sanity check, to Mayfield, McNamee, and Piatko 2003) on the CoNLL 2002 data.

That data has two tasks, Dutch and Spanish, each with roughly 250K tokens of training data and roughly 15K entities, labelled with 9 labels.
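A quick way to check such token and entity counts is sketched below. It assumes the usual CoNLL layout: one token per line, the entity tag in the last whitespace-separated column, blank lines between sentences, and B-/I- prefixed tags with O for non-entities. The class name `ConllCountSketch` and its methods are hypothetical.

```java
import java.util.List;

/** Counts tokens and entities in CoNLL-style lines (illustrative sketch, hypothetical names). */
class ConllCountSketch {
    /** Returns {tokenCount, entityCount}. */
    static int[] count(List<String> lines) {
        int tokens = 0;
        int entities = 0;
        String prev = "O";
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                prev = "O"; // a sentence boundary ends any open entity
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            String tag = cols[cols.length - 1]; // assumption: tag is the last column
            tokens++;
            // An entity starts at any B- tag, or at an I- tag that does not
            // continue an entity of the same type.
            boolean startsEntity = tag.startsWith("B-")
                    || (tag.startsWith("I-") && !sameType(prev, tag));
            if (startsEntity) {
                entities++;
            }
            prev = tag;
        }
        return new int[] {tokens, entities};
    }

    /** True when both tags carry the same entity type after the two-character prefix. */
    private static boolean sameType(String prev, String tag) {
        return prev.length() > 2 && prev.substring(2).equals(tag.substring(2));
    }
}
```

With four entity types, B- and I- variants of each plus O would give nine labels, which matches the label count above.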

Second, I'd like to try labelling some entities in some HTML documents. I plan to use some of the data from here; it's labelled with slots and tuples, but I'm discarding the tuple data. I'm transforming the XHTML into a digestible training sequence, and in the meantime exploring possible representations and features that will help summarize the HTML structure above the tokens.
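One possible representation, sketched under assumptions rather than taken from the project, is to flatten the XHTML into one line per token and attach the path of enclosing element names as a feature. The example uses the JDK's built-in DOM parser, assumes well-formed input without an external DTD, and ignores the slot labels; the class name `XhtmlSequenceSketch` and the `path=` feature format are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Flattens well-formed XHTML into "token<TAB>path=..." lines (illustrative sketch). */
class XhtmlSequenceSketch {
    static List<String> toSequence(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    /** Depth-first walk in document order, carrying the element path down to text nodes. */
    private static void walk(Node node, String path, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            for (String tok : node.getNodeValue().trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    out.add(tok + "\tpath=" + path);
                }
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip comments, processing instructions, etc.
        }
        String childPath = path.isEmpty()
                ? node.getNodeName()
                : path + "/" + node.getNodeName();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), childPath, out);
        }
    }
}
```

Whitespace tokenization is the simplest choice here; a real pipeline would likely want a tokenizer consistent with the one used for the NER data so that the same features apply to both.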

  • If you plan on using off-the-shelf code, what have you installed, what experiences have you had with it?

I have installed the Mallet package and gotten a baseline CRF result for the Spanish NER task: around 0.63 F1. I expect that baseline to improve as I introduce some better features (I plan to use the same features with the online model as with the CRF for a more direct comparison).

  • If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?

I plan to implement the model in Java, using parts of Mallet if convenient.