Tur et al, NLEJ 2003
Citation
Tür, G., Hakkani-Tür, D., Oflazer, K. 2003. A Statistical Information Extraction System for Turkish. Natural Language Engineering 9(2), 181–210
Online version
Summary
Turkish is an agglutinative language in which thousands of word forms can be produced from a single root. This leads to data sparseness, which in turn reduces the effectiveness of statistical methods. To deal with this problem, researchers work with the morphological analysis of a word rather than its surface form.
This paper is important because it is the first work to use statistical methods for IE tasks in Turkish. The authors focus on three IE subtasks: sentence segmentation, topic segmentation, and name tagging.
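The sparseness argument can be illustrated with a small sketch. The analyses below are hand-written for illustration and use simplified tag names; they are an assumption, not the output of the analyzer used in the paper. The helper `root_of` is likewise hypothetical.

```python
from collections import Counter

# Hand-written toy analyses (assumed for illustration): root + features.
ANALYSES = {
    "ev": "ev+Noun+A3sg+Pnon+Nom",           # house
    "evler": "ev+Noun+A3pl+Pnon+Nom",        # houses
    "evde": "ev+Noun+A3sg+Pnon+Loc",         # in the house
    "evlerimiz": "ev+Noun+A3pl+P1pl+Nom",    # our houses
    "evlerimizde": "ev+Noun+A3pl+P1pl+Loc",  # in our houses
}


def root_of(surface):
    """Return the root part of the toy analysis for a surface form."""
    return ANALYSES[surface].split("+")[0]


corpus = ["ev", "evler", "evde", "evlerimiz", "evlerimizde"]

# Each surface form is seen only once, so surface-level counts are sparse.
surface_counts = Counter(corpus)
# Mapping to the root pools the evidence into a single, better-estimated unit.
root_counts = Counter(root_of(w) for w in corpus)

print(len(surface_counts), "distinct surface forms")
print(len(root_counts), "distinct roots")
```

Five tokens yield five distinct surface forms but a single root, which is why statistics collected over morphological units are less sparse than statistics over surface forms.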
The authors reduce sentence segmentation to a boundary classification problem: each word is followed by a boundary flag that denotes whether or not a sentence boundary occurs there. Their input lacks punctuation and case information. They use a model that combines a language model (LM) over surface forms with an LM over the final inflectional form taken from each word's morphological analysis.
Topic segmentation uses a boundary-based approach similar to the one used for sentence boundaries. LMs are built after clustering the topics. The authors start with a word-based model and generalize it by creating a stem-based model. The noun-based model, which is the most general approach, achieves the best accuracy.
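A minimal sketch of boundary classification with two combined models is given below. It is not the paper's model: the per-word boundary estimate, the linear interpolation weight `lam`, the 0.5 threshold, the toy training data, and the simplified tags such as `Verb+Past` are all assumptions made for illustration.

```python
from collections import Counter


def train_counts(sentences, index):
    """Count how often a unit ends a sentence and how often it occurs.

    index 0 selects the surface form, index 1 the final inflectional form.
    """
    ends, total = Counter(), Counter()
    for sent in sentences:
        for i, token in enumerate(sent):
            total[token[index]] += 1
            if i == len(sent) - 1:
                ends[token[index]] += 1
    return ends, total


def p_boundary(unit, ends, total, alpha=1.0):
    """Add-alpha smoothed estimate of P(boundary | preceding unit)."""
    return (ends[unit] + alpha) / (total[unit] + 2 * alpha)


def segment(stream, word_model, morph_model, lam=0.5, threshold=0.5):
    """Place a sentence boundary after each token whose score exceeds threshold."""
    sentences, current = [], []
    for word, final_form in stream:
        current.append(word)
        score = (lam * p_boundary(word, *word_model)
                 + (1 - lam) * p_boundary(final_form, *morph_model))
        if score > threshold:
            sentences.append(current)
            current = []
    if current:  # keep trailing words that were not closed by a boundary
        sentences.append(current)
    return sentences


# Toy training data: (surface form, simplified final inflectional form).
train = [
    [("ali", "Noun+Nom"), ("eve", "Noun+Dat"), ("geldi", "Verb+Past")],
    [("ayşe", "Noun+Nom"), ("okula", "Noun+Dat"), ("gitti", "Verb+Past")],
]
word_model = train_counts(train, 0)
morph_model = train_counts(train, 1)

# Unpunctuated, lowercased input; "veli" and "koştu" are unseen surface forms.
stream = [("veli", "Noun+Nom"), ("okula", "Noun+Dat"), ("koştu", "Verb+Past"),
          ("ali", "Noun+Nom"), ("eve", "Noun+Dat"), ("gitti", "Verb+Past")]
print(segment(stream, word_model, morph_model))
```

In this toy setup the unseen verb "koştu" gets no support from the surface-form model, but its final inflectional form has been seen at sentence ends, so the combined score still crosses the threshold.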
- Name Tagging
- Lexical Model: An HMM over word/tag combinations, used to capture lexical information.
- Contextual Model: This model helps tag unknown words by using the context clues around them.
- Morphological Model: This model uses morphological information to identify proper nouns.
- Name Tag Model: This model favors the more probable tag sequences by using name tag information.
A combination of the above four models is used.
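One way to picture such a combination is a Viterbi decoder whose score multiplies the four sources of evidence. The sketch below is an assumption, not the paper's formulation: the tag set, all probability tables, the product-style combination, and the names `emission_logscore` and `viterbi` are hypothetical and chosen only to make the example run.

```python
import math

TAGS = ["O", "PERSON", "LOCATION"]
UNIFORM = {t: 1.0 / len(TAGS) for t in TAGS}

# Hypothetical toy parameters (not taken from the paper).
LEXICAL = {  # lexical model: P(tag | word) for known words
    "sayın": {"O": 0.98, "PERSON": 0.01, "LOCATION": 0.01},
    "ankaraya": {"O": 0.05, "PERSON": 0.05, "LOCATION": 0.90},
    "gitti": {"O": 0.98, "PERSON": 0.01, "LOCATION": 0.01},
}
CONTEXT = {  # contextual model: P(tag | previous word)
    "sayın": {"O": 0.10, "PERSON": 0.85, "LOCATION": 0.05},
}
MORPH = {  # morphological model: P(tag | proper-noun cue)
    "Prop": {"O": 0.10, "PERSON": 0.50, "LOCATION": 0.40},
    "Other": {"O": 0.90, "PERSON": 0.05, "LOCATION": 0.05},
}
TRANSITION = {  # name tag model: P(tag | previous tag)
    "<s>": {"O": 0.8, "PERSON": 0.1, "LOCATION": 0.1},
    "O": {"O": 0.8, "PERSON": 0.1, "LOCATION": 0.1},
    "PERSON": {"O": 0.5, "PERSON": 0.4, "LOCATION": 0.1},
    "LOCATION": {"O": 0.6, "PERSON": 0.1, "LOCATION": 0.3},
}


def emission_logscore(word, prev_word, cue, tag):
    """Sum the log scores of the lexical, contextual and morphological models."""
    return (math.log(LEXICAL.get(word, UNIFORM)[tag])
            + math.log(CONTEXT.get(prev_word, UNIFORM)[tag])
            + math.log(MORPH[cue][tag]))


def viterbi(tokens):
    """tokens: list of (word, morphological cue); returns the best tag sequence."""
    if not tokens:
        return []
    scores, backptr, prev_word = [], [], None
    for i, (word, cue) in enumerate(tokens):
        scores.append({})
        backptr.append({})
        for tag in TAGS:
            emit = emission_logscore(word, prev_word, cue, tag)
            if i == 0:
                scores[i][tag] = math.log(TRANSITION["<s>"][tag]) + emit
                backptr[i][tag] = None
            else:
                best = max(TAGS, key=lambda p: scores[i - 1][p]
                           + math.log(TRANSITION[p][tag]))
                scores[i][tag] = (scores[i - 1][best]
                                  + math.log(TRANSITION[best][tag]) + emit)
                backptr[i][tag] = best
        prev_word = word
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = backptr[i][tag]
        path.append(tag)
    return path[::-1]


tokens = [("sayın", "Other"), ("yılmaz", "Prop"),
          ("ankaraya", "Prop"), ("gitti", "Other")]
print(viterbi(tokens))
```

Here "yılmaz" is unknown to the lexical table, so its tag is decided by the contextual cue from the preceding word and by the proper-noun cue, while the transition table discourages implausible tag sequences.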