Philgoo project status report

From Cohen Courses
  • What dataset will you be using? What does it look like?
    • I am using MUC-7 as in (Borthwick, 1998)
    • The dataset is divided into a training set, a dry-run set, and a formal set.
    • Each set contains multiple <DOC></DOC> elements, but since I am not using features based on position in the document, I merged them all into one.


  • What did you do to get acquainted with the data?
    • I built a parser for the training and test sets.
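
For illustration, a minimal sketch of what such a parser might look like. It assumes the MUC-7 named-entity markup uses inline ENAMEX, TIMEX, and NUMEX tags with a TYPE attribute, and that whitespace tokenization is good enough. The names parseNeText and appendTokens are hypothetical, not taken from the project code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One token paired with its entity label ("O" for tokens outside any entity).
using TaggedToken = std::pair<std::string, std::string>;

// Hypothetical helper: split text on whitespace and append each token with the label.
static void appendTokens(const std::string& text, const std::string& label,
                         std::vector<TaggedToken>& out) {
    assert(!label.empty());
    std::istringstream in(text);
    std::string tok;
    while (in >> tok) out.emplace_back(tok, label);
}

// Hypothetical parser for text with inline markup such as
// <ENAMEX TYPE="ORGANIZATION">CMU</ENAMEX>. All other tags (for example <DOC>)
// are skipped, which has the effect of merging documents into one sequence.
std::vector<TaggedToken> parseNeText(const std::string& s) {
    std::vector<TaggedToken> out;
    std::string label = "O";
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t lt = s.find('<', pos);
        if (lt == std::string::npos) {           // no more tags: flush the tail
            appendTokens(s.substr(pos), label, out);
            break;
        }
        appendTokens(s.substr(pos, lt - pos), label, out);
        std::size_t gt = s.find('>', lt);
        if (gt == std::string::npos) break;      // unterminated tag: drop the rest
        const std::string tag = s.substr(lt + 1, gt - lt - 1);
        const bool opensEntity = tag.rfind("ENAMEX", 0) == 0 ||
                                 tag.rfind("TIMEX", 0) == 0 ||
                                 tag.rfind("NUMEX", 0) == 0;
        if (opensEntity) {
            std::size_t q1 = tag.find("TYPE=\"");
            if (q1 != std::string::npos) {
                q1 += 6;                         // skip past TYPE="
                std::size_t q2 = tag.find('"', q1);
                std::string type = tag.substr(q1, q2 - q1);
                if (!type.empty()) label = type;
            }
        } else if (tag == "/ENAMEX" || tag == "/TIMEX" || tag == "/NUMEX") {
            label = "O";                         // entity closed
        }
        pos = gt + 1;
    }
    return out;
}
```

Calling parseNeText on a whole merged file returns one flat sequence of (token, label) pairs, which is the input shape a sequence tagger expects.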


  • Do you plan on looking at the same problem, or have you changed your plans?
    • I am still looking at the same problem. However, I am realizing that getting scores comparable to published models is not trivial.
    • Analyzing the model structure, training method, and data characteristics in depth will require stable output. I may consider using off-the-shelf code later, but not now.
    • Tasks other than implementing and running classifiers, such as morphological processing, feature selection, and gathering feature data, are taking much more effort than expected.


  • If you plan on writing code, what have you written so far, and in what languages?
    • Preprocessor for MUC7 NE data: Ruby
    • Parser for MUC7 NE data: C++
    • HMM with joint likelihood: C++
    • HMM with conditional likelihood: C++ (in progress)
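
As a rough sketch of the joint-likelihood HMM: maximizing the joint likelihood P(words, tags) reduces to counting transitions and emissions, and tagging is then done with Viterbi decoding. The struct below is a hypothetical illustration, not the project code. The add-one smoothing, the extra unit of mass reserved for unknown words, and all names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Sentence = std::vector<std::pair<std::string, std::string>>;  // (word, tag)

struct Hmm {
    std::vector<std::string> tags;
    std::map<std::string, std::size_t> tagIndex;
    std::set<std::string> vocab;
    std::vector<std::vector<double>> trans;           // trans[prev][cur]; row n is the start state
    std::vector<std::map<std::string, double>> emit;  // emit[tag][word]
    std::vector<double> tagCount;

    // Joint-likelihood training: collect transition and emission counts.
    void train(const std::vector<Sentence>& data) {
        for (const auto& sent : data)
            for (const auto& wt : sent) {
                if (!tagIndex.count(wt.second)) {
                    tagIndex[wt.second] = tags.size();
                    tags.push_back(wt.second);
                }
                vocab.insert(wt.first);
            }
        const std::size_t n = tags.size();
        trans.assign(n + 1, std::vector<double>(n, 0.0));
        emit.assign(n, std::map<std::string, double>());
        tagCount.assign(n, 0.0);
        for (const auto& sent : data) {
            std::size_t prev = n;  // start state
            for (const auto& wt : sent) {
                const std::size_t cur = tagIndex.at(wt.second);
                trans[prev][cur] += 1.0;
                emit[cur][wt.first] += 1.0;
                tagCount[cur] += 1.0;
                prev = cur;
            }
        }
    }

    // Add-one smoothed log probabilities (the smoothing scheme is an assumption).
    double logTrans(std::size_t prev, std::size_t cur) const {
        double total = 0.0;
        for (double c : trans[prev]) total += c;
        return std::log((trans[prev][cur] + 1.0) / (total + tags.size()));
    }
    double logEmit(std::size_t tag, const std::string& w) const {
        auto it = emit[tag].find(w);
        const double c = (it == emit[tag].end()) ? 0.0 : it->second;
        return std::log((c + 1.0) / (tagCount[tag] + vocab.size() + 1.0));
    }

    // Viterbi decoding: most probable tag sequence under the trained model.
    std::vector<std::string> decode(const std::vector<std::string>& words) const {
        const std::size_t n = tags.size(), T = words.size();
        assert(n > 0 && "train() must be called before decode()");
        if (T == 0) return {};
        std::vector<std::vector<double>> score(T, std::vector<double>(n, 0.0));
        std::vector<std::vector<std::size_t>> back(T, std::vector<std::size_t>(n, 0));
        for (std::size_t j = 0; j < n; ++j)
            score[0][j] = logTrans(n, j) + logEmit(j, words[0]);
        for (std::size_t t = 1; t < T; ++t)
            for (std::size_t j = 0; j < n; ++j) {
                std::size_t best = 0;
                double bestScore = score[t - 1][0] + logTrans(0, j);
                for (std::size_t i = 1; i < n; ++i) {
                    const double s = score[t - 1][i] + logTrans(i, j);
                    if (s > bestScore) { bestScore = s; best = i; }
                }
                score[t][j] = bestScore + logEmit(j, words[t]);
                back[t][j] = best;
            }
        std::size_t cur = 0;
        for (std::size_t j = 1; j < n; ++j)
            if (score[T - 1][j] > score[T - 1][cur]) cur = j;
        std::vector<std::string> out(T);
        for (std::size_t t = T; t-- > 0;) {       // follow back-pointers from the end
            out[t] = tags[cur];
            if (t > 0) cur = back[t][cur];
        }
        return out;
    }
};
```

A conditional-likelihood variant would keep the same decode step but fit the parameters by optimizing P(tags | words) instead of counting, which is why the two models can share most of this structure.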


  • What do you still need to do?
    • Implement more features
    • CRF with joint likelihood
    • CRF with conditional likelihood


  • If you've run a baseline system on the data and gotten some results, what are they?
    • HMM with joint likelihood will function as the baseline.
    • The resulting accuracy is near 30%, so there is much to improve.
    • Processing at the morphological level is required: CMU and CMUs should be treated as the same organization, which is not currently the case.
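
One simple way to address the CMU / CMUs mismatch is a normalization pass before feature extraction. The sketch below is only a heuristic under stated assumptions: it strips a possessive 's from any token and a plural s from all-uppercase acronyms. The function names are hypothetical, and real morphological processing would need to cover far more cases.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: true if s is non-empty and consists only of uppercase ASCII letters.
static bool isAllUpper(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isupper(c)) return false;
    return true;
}

// Hypothetical normalizer: maps "CMUs" and "CMU's" to "CMU" and leaves
// ordinary words such as "class" untouched.
std::string normalizeToken(const std::string& tok) {
    std::string t = tok;
    if (t.size() > 2 && t.compare(t.size() - 2, 2, "'s") == 0) {
        t.erase(t.size() - 2);                    // possessive: CMU's -> CMU
    } else if (t.size() > 2 && t.back() == 's' &&
               isAllUpper(t.substr(0, t.size() - 1))) {
        t.pop_back();                             // acronym plural: CMUs -> CMU
    }
    assert(t.size() <= tok.size());               // normalization never lengthens a token
    return t;
}
```

Applying normalizeToken to each token before counting emissions would make both surface forms share the same statistics.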