Tur et al, NLEJ 2003

From Cohen Courses
Revision as of 17:11, 2 October 2010 by PastStudents (talk | contribs)

Citation

Tür, G., Hakkani-Tür, D., Oflazer, K. 2003. A Statistical Information Extraction System for Turkish. Natural Language Engineering 9(2), 181–210

Online version

CiteSeerX

Summary

Turkish is an agglutinative language: thousands of word forms can be produced from a single root. This causes data sparseness, which ultimately reduces the effectiveness of statistical methods. To deal with this problem, researchers work with the morphological analysis of a word rather than its surface form.
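To make the sparseness point concrete, here is a minimal sketch that counts word types before and after mapping surface forms to a root. The `TOY_ROOTS` table is a hand-written assumption for illustration, not the output of a real morphological analyzer, and reducing a word to its root alone is a simplification of the morphological representation used in the paper.

```python
from collections import Counter

# Hypothetical hand-written lookup: inflected forms of "ev" (house).
# A real system would obtain this from a morphological analyzer.
TOY_ROOTS = {
    "ev": "ev",
    "evde": "ev",
    "evler": "ev",
    "evlerim": "ev",
    "evlerimde": "ev",
}

def count_types(tokens, normalize=lambda t: t):
    """Count tokens after applying a normalization function."""
    return Counter(normalize(t) for t in tokens)

tokens = ["ev", "evde", "evler", "evlerim", "evlerimde"]

# Surface forms: every token is its own type, each seen only once.
surface_counts = count_types(tokens)

# Roots: all tokens collapse to one type with a usable count.
root_counts = count_types(tokens, normalize=lambda t: TOY_ROOTS.get(t, t))

print(len(surface_counts), "surface types;", len(root_counts), "root type(s)")
```

With surface forms, a statistical model sees five events that each occur once; with the normalized form, it sees one event that occurs five times.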

This paper is important because it is the first work to use statistical methods for information extraction (IE) tasks in Turkish. The authors focus on three IE subtasks.

  • Sentence Segmentation

The authors reduce sentence segmentation to a boundary classification problem: each word is followed by a boundary flag that indicates whether a sentence boundary occurs at that point. Their input lacks punctuation and case information. They use a model that combines a language model (LM) over surface forms with an LM over the final inflectional form taken from each word's morphological analysis.
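The boundary-flag representation can be sketched as follows. This only illustrates the data format (lowercased tokens with no punctuation, each paired with a flag), not the authors' classifier; the function names `to_boundary_sequence` and `segment` are hypothetical.

```python
def to_boundary_sequence(sentences):
    """Flatten tokenized sentences into (token, is_boundary) pairs.

    Tokens are lowercased to mimic input without case information;
    the flag is True only after the last token of each sentence.
    """
    pairs = []
    for sentence in sentences:
        tokens = [tok.lower() for tok in sentence]
        for i, tok in enumerate(tokens):
            pairs.append((tok, i == len(tokens) - 1))
    return pairs

def segment(pairs):
    """Rebuild sentences from (token, is_boundary) pairs."""
    sentences, current = [], []
    for tok, is_boundary in pairs:
        current.append(tok)
        if is_boundary:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing boundary
        sentences.append(current)
    return sentences

# Toy example: "Ali geldi" (Ali came), "Ben gittim" (I went).
pairs = to_boundary_sequence([["Ali", "geldi"], ["Ben", "gittim"]])
print(pairs)
print(segment(pairs))
```

A segmentation model then only has to predict the flag for each position; `segment` turns those predictions back into sentences.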

  • Topic Segmentation

As with sentence boundaries, topic segmentation is treated as a boundary detection problem. LMs are created after clustering the topics. The authors start with a word-based model and generalize it to a stem-based model. The noun-based model, the most general of the three, achieves the best accuracy.
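A rough sketch of the idea of topic-specific LMs follows. It is a deliberate simplification and not the authors' method: it assumes add-alpha smoothed unigram models, labels each sentence greedily with its best-scoring topic, and places a boundary wherever the label changes. The cluster names and tokens are invented placeholders. Moving from a word-based to a stem-based or noun-based model would only change which tokens are fed in.

```python
import math
from collections import Counter

def train_unigram(docs, vocab_size, alpha=1.0):
    """Return a function giving add-alpha smoothed unigram log-probabilities."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())

    def logprob(tok):
        return math.log((counts[tok] + alpha) / (total + alpha * vocab_size))

    return logprob

def best_topic(sentence, topic_lms):
    """Pick the topic whose LM assigns the sentence the highest log-likelihood."""
    scores = {t: sum(lm(tok) for tok in sentence) for t, lm in topic_lms.items()}
    return max(scores, key=scores.get)

def topic_boundaries(sentences, topic_lms):
    """Indices i where sentence i starts a new topic (greedy, no smoothing over time)."""
    labels = [best_topic(s, topic_lms) for s in sentences]
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Invented toy clusters; in practice these would come from clustering training text.
clusters = {
    "sports": [["match", "goal", "team"], ["team", "coach", "goal"]],
    "economy": [["bank", "rate", "market"], ["market", "bank", "inflation"]],
}
vocab = {tok for docs in clusters.values() for doc in docs for tok in doc}
topic_lms = {t: train_unigram(docs, len(vocab)) for t, docs in clusters.items()}

stream = [["team", "goal"], ["coach", "match"], ["bank", "rate"], ["market", "inflation"]]
boundaries = topic_boundaries(stream, topic_lms)
print(boundaries)
```

The greedy labeling ignores how likely a topic switch is at each position; a sequence model that scores whole segmentations would be needed to capture that.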

  • Name Tagging

The authors combine four models for this task:

    • Lexical Model: an HMM over word/tag combinations that captures lexical information.
    • Contextual Model: helps tag unknown words by using the contextual clues around them.
    • Morphological Model: uses morphological information to identify proper nouns.
    • Name Tag Model: favors plausible tag sequences by using name tag information.