Reyyan project abstract
Revision as of 11:52, 8 October 2010
Team Members
Reyyan Yeniterzi [reyyan@cs.cmu.edu]
Summary
In this project, I am going to develop a named entity recognition (NER) system for Turkish. There are only a few previous works on NER for Turkish texts. The first one used a language-independent bootstrapping algorithm. The most recent one (Kucuk and Yazici, FQAS 2009) used rule-based methods, and the one before that (Tur et al, NLEJ 2003) used statistical methods. There is still room for applying more state-of-the-art machine learning methods.
My aim for this project is to apply more recent methods such as conditional random fields (CRF) to Turkish texts. To do that, we need plenty of tagged data. Therefore, I will initially focus on improving the current training data sets in both quality and size.
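To make the CRF setup concrete, here is a minimal sketch of how tagged sentences could be turned into per-token labels and features before training a sequence model. The span format, the feature names, and the example sentence are assumptions made for illustration; they are not the format of the existing training data. The suffix and apostrophe features are only a cheap stand-in for proper morphological analysis.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, entity_type)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into one BIO label per token."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for token i (illustrative feature set)."""
    tok = tokens[i]
    # Suffixes on Turkish proper nouns are written after an apostrophe
    # (e.g. Ankara'ya), so the part before it approximates the root.
    stem = tok.split("'")[0]
    return {
        # Note: str.lower() is not Turkish-aware (I/ı, İ/i); fine for a sketch.
        "lower": tok.lower(),
        "stem": stem.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_apostrophe": "'" in tok,
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# "Ahmet Yılmaz went to Ankara yesterday."
tokens = ["Ahmet", "Yılmaz", "dün", "Ankara'ya", "gitti", "."]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION")]

labels = spans_to_bio(len(tokens), spans)
features = [token_features(tokens, i) for i in range(len(tokens))]
```

The resulting `features` and `labels` lists are the kind of input that a CRF toolkit typically expects, one feature dictionary and one label per token.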
Data Set
I am going to use the same training data set that has been used in (Tur et al, NLEJ 2003). The data consists of news articles and contains person, location and organization tags.
In addition to this training data, I also have a parallel English-Turkish corpus of 50K sentences. This data mostly consists of EU meetings.
Tasks
- Improving the amount and quality of training data: I have the training data that was used in (Tur et al, NLEJ 2003). This data has only three types of tags (person, location and organization). We can add more tag types to this data.
I can use a bootstrapping method to tag this data. Another idea that may work is tagging the Turkish side of the parallel data by matching Turkish and English entities through their dependency parses. Turkish is an agglutinative language, which enables the production of thousands of word forms from a given root.
We can produce training data from the parallel corpus. I will apply a bootstrapping method to tag it.
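The bootstrapping idea can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm used in the previous work: the context definition (previous and next token), the `min_count` threshold, the apostrophe-based normalization, and the example sentences are all hypothetical. Starting from a seed list of known entities, it collects the contexts in which they occur and then accepts other capitalized tokens that occur in the same contexts.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Set, Tuple

Context = Tuple[str, str]  # (previous token, next token), lowercased


def normalize(token: str) -> str:
    """Drop an apostrophe-separated suffix: "Ankara'ya" -> "Ankara"."""
    return token.split("'")[0]


def token_contexts(sentence: List[str]) -> Iterator[Tuple[str, Context]]:
    """Yield (normalized token, context) for every token of a sentence."""
    padded = ["<BOS>"] + sentence + ["<EOS>"]
    for i in range(1, len(padded) - 1):
        yield normalize(padded[i]), (padded[i - 1].lower(), padded[i + 1].lower())


def bootstrap(sentences: List[List[str]], seeds: Iterable[str],
              rounds: int = 3, min_count: int = 2) -> Set[str]:
    """Grow an entity list from seeds by alternating pattern and entity discovery."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: count the contexts in which already-known entities appear.
        counts = Counter(
            ctx for s in sentences for tok, ctx in token_contexts(s) if tok in entities
        )
        patterns = {ctx for ctx, n in counts.items() if n >= min_count}
        # Step 2: capitalized tokens seen in a trusted context become candidates.
        candidates = {
            tok
            for s in sentences
            for tok, ctx in token_contexts(s)
            if ctx in patterns and tok[:1].isupper()
        }
        new_entities = candidates - entities
        if not new_entities:
            break  # nothing new was found, so further rounds would not help
        entities |= new_entities
    return entities


# Toy corpus: "<person> went to <city> yesterday."
sentences = [
    ["Ali", "dün", "Ankara'ya", "gitti"],
    ["Ayşe", "dün", "İzmir'e", "gitti"],
    ["Ayşe", "dün", "Ankara'ya", "gitti"],
    ["Ali", "dün", "Bursa'ya", "gitti"],
]
found = bootstrap(sentences, seeds={"Ankara"})
```

In this toy corpus the seed "Ankara" yields the context ("dün", "gitti"), which in turn picks up "İzmir" and "Bursa" as location candidates. A practical version would also score patterns for precision and merge inflected forms, since a single root produces many surface forms in Turkish.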
Motivation
As a Turkish student, I want to apply what I have learned in this course to Turkish texts. One encounters different challenges while working with Turkish. I want to see the effect of these on the NER task and try to overcome these issues by using deterministic and statistical methods.
Superpowers
I know Turkish, which I believe is a good starting point.