Difference between revisions of "Reyyan project abstract"

 
Latest revision as of 12:32, 8 October 2010

Team Members

Reyyan Yeniterzi [reyyan@cs.cmu.edu]

Summary

In this project, I am going to develop a named entity recognition (NER) system for Turkish. There are only a few previous works on NER for Turkish texts. The first one (Cucerzan and Yarowsky, SIGDAT 1999) used a language-independent bootstrapping algorithm. The most recent one (Kucuk and Yazici, FQAS 2009) used rule-based methods, and the one before that (Tur et al, NLEJ 2003) used statistical methods. There is still room for more state-of-the-art machine learning methods.

My aim for this project is to apply more recent methods, such as conditional random fields (CRF), to Turkish texts. Doing that requires plenty of tagged data. Therefore, I will initially focus on improving the current training data sets in both quality and size.

Data Set

I am going to use the same training data set that was used in (Tur et al, NLEJ 2003). The data consists of news articles and contains person, location and organization tags.

In addition to this training data, I have a parallel English-Turkish corpus of 50K sentences. This data mostly consists of EU meetings.

Tasks

  • I have the training data that was used in (Tur et al, NLEJ 2003). This data has only three types of tags (person, location and organization). Additional tags can be introduced to this data set.
  • I will apply NER tools to the English side of the parallel corpus and then use token matching and similarities to generate the tags on the Turkish side.
  • Depending on the accuracy of that method, I will explore ways of matching the Turkish and English entities by using their dependency parses.
  • I can also use a bootstrapping method. Cucerzan and Yarowsky, SIGDAT 1999 applied bootstrapping, but without using any language-dependent properties. In this project, a bootstrapping method that uses features from Turkish will be analyzed.
  • Turkish is an agglutinative language, which enables the production of thousands of word forms from a given root. This results in data sparseness problems in some cases. To deal with this problem, a morphological analyzer has to be applied. I will explore the effect of using a morphological analyzer in the NER task on Turkish.
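
The tag projection step described above (tagging the English side, then using token matching and similarity to tag the Turkish side) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method the project will necessarily use: the function names, the 0.8 threshold, and the trick of cutting a Turkish token at the apostrophe before comparing are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; lowercasing is a simplification
    that ignores the Turkish dotted/dotless i distinction."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project_tags(english_entities, turkish_tokens, threshold=0.8):
    """Copy entity tags from English tokens to similar Turkish tokens.

    english_entities: list of (token, tag) pairs produced by an English
    NER tool (hypothetical input format).
    Returns one tag per Turkish token; "O" means no entity.
    """
    tags = []
    for tr_tok in turkish_tokens:
        # Assumption: suffixes on proper nouns follow an apostrophe,
        # so the part before it approximates the bare name.
        stem = tr_tok.split("'")[0]
        best_tag, best_score = "O", threshold
        for en_tok, tag in english_entities:
            score = similarity(en_tok, stem)
            if score >= best_score:
                best_tag, best_score = tag, score
        tags.append(best_tag)
    return tags


english_entities = [("Ankara", "LOCATION"), ("Ahmet", "PERSON")]
turkish_tokens = ["Ahmet", "dün", "Ankara'da", "konuştu"]
projected = project_tags(english_entities, turkish_tokens)
```

Plain string similarity only works for names that are spelled alike in both languages, which is one reason the dependency-parse matching in the task list may be needed as a follow-up.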

Techniques

  • I will start by using available NER packages like Stanford's CRF-NER or CRF tools like CRF++, adding features as necessary.
  • For the morphological analyzer, I will use Kemal Oflazer's Turkish morphological analyzer.
  • I will use MaltParser as the dependency parsing tool. There is a pre-trained model for Turkish with reported accuracy of around 76%. I will try that parser on my data and analyze the results.
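
As a rough sketch of how tagged sentences might be prepared for a tool like CRF++, the snippet below writes one token per line with whitespace-separated feature columns, the label in the last column, and a blank line after each sentence. The feature columns chosen here (a root form and a capitalization flag) and the `naive_root` helper are assumptions made for illustration; in the project the root would come from the morphological analyzer rather than from this stand-in.

```python
def naive_root(token):
    """Hypothetical stand-in for a morphological analyzer:
    keeps only the part of the token before an apostrophe."""
    return token.split("'")[0]


def to_crf_rows(tokens, tags, root_of=naive_root):
    """Format one sentence as column-style CRF training rows.

    Columns: surface form, root, capitalization flag, label.
    The trailing blank line marks the end of the sentence.
    """
    rows = []
    for tok, tag in zip(tokens, tags):
        cap = "CAP" if tok[:1].isupper() else "LOW"
        rows.append(" ".join([tok, root_of(tok), cap, tag]))
    return "\n".join(rows) + "\n\n"


sentence = to_crf_rows(["Ankara'da", "konuştu"], ["LOCATION", "O"])
```

Putting the root in its own column lets the different inflected forms of a word share a feature, which is the intended remedy for the sparseness problem mentioned in the task list.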

Motivation

As a Turkish student, I want to apply what I have learned in this course to Turkish texts. Working with Turkish poses its own challenges. I want to see their effect on the NER task and try to overcome them by using deterministic and statistical methods.

Superpowers

I know Turkish, which I believe is a good starting point.