Difference between revisions of "Reyyan project abstract"

From Cohen Courses

Revision as of 08:32, 8 October 2010

Team Members

Reyyan Yeniterzi [reyyan@cs.cmu.edu]

Summary

In this project, I am going to develop an information extraction system for Turkish. Only a couple of studies have addressed named entity recognition (NER) on Turkish texts: a more recent one (Kucuk and Yazici, FQAS 2009) used rule-based methods, and another (Tur et al, NLEJ 2003) used statistical methods. I am planning to apply more recent methods, such as conditional random fields (CRFs).
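To make the CRF idea concrete, here is a minimal sketch of per-token feature extraction in the dict-per-token style that common CRF toolkits accept. The feature set and the function names (token_features, sentence_features) are illustrative assumptions, not a design fixed by this project. The apostrophe feature relies on the Turkish convention of separating suffixes from proper nouns with an apostrophe (for example "Ankara'da").

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    tok = tokens[i]
    # Suffixes on Turkish proper nouns follow an apostrophe: "Ankara'da".
    stem, sep, suffix = tok.partition("'")
    feats = {
        # Note: str.lower() is not Turkish-aware for dotted/dotless I;
        # a real system would need special handling for those letters.
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_apostrophe": bool(sep),
        "stem": stem.lower(),
        "suffix_after_apostrophe": suffix.lower(),
        "last3": tok[-3:].lower(),
        "is_first": i == 0,
    }
    # Context features from the neighbouring tokens, when they exist.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def sentence_features(tokens):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such dicts per sentence, paired with a list of tags such as person, location, and organization, is the usual input shape for training a linear-chain CRF.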

Tasks

I can use a bootstrapping method to tag the parallel English-Turkish corpus described under Data Set. Another idea that may work is tagging the Turkish side of the data by matching the Turkish and English entities using their dependency parses.
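The bootstrapping idea can be sketched as a self-training loop. This reading of "bootstrap" is an assumption, since the abstract does not specify the procedure. The names bootstrap, train_lookup, and predict_lookup, the confidence threshold, and the toy lookup tagger are all hypothetical stand-ins; in practice the train and predict callables would wrap a real tagger such as a CRF.

```python
def bootstrap(labeled, unlabeled, train, predict, threshold=0.9, rounds=3):
    """Self-training: repeatedly add confidently tagged sentences to the
    training set and retrain. Returns (model, labeled, remaining_pool)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        confident, remaining = [], []
        for sent in pool:
            tags, conf = predict(model, sent)
            if conf >= threshold:
                confident.append((sent, tags))
            else:
                remaining.append(sent)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = remaining
        model = train(labeled)
    return model, labeled, pool


def train_lookup(labeled):
    """Toy 'model': a token-to-tag lexicon (stand-in for a real tagger)."""
    lexicon = {}
    for tokens, tags in labeled:
        for tok, tag in zip(tokens, tags):
            lexicon[tok] = tag
    return lexicon


def predict_lookup(lexicon, tokens):
    """Tag known tokens from the lexicon; confidence is the known fraction."""
    tags = [lexicon.get(tok, "O") for tok in tokens]
    known = sum(tok in lexicon for tok in tokens)
    return tags, known / len(tokens)
```

For example, calling bootstrap(seed, sentences, train_lookup, predict_lookup, threshold=1.0) moves only fully recognized sentences into the labeled set and leaves the rest in the pool for manual review or a later round.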

Data Set

I am going to use the same training data set that was used in (Tur et al, NLEJ 2003). The data consists of news articles and contains person, location, and organization tags.

In addition to this training data, I also have a parallel English-Turkish corpus of 50K sentences. This data mostly consists of EU meetings.

Motivation

As a Turkish student, I want to apply what I have learned in this course to Turkish texts. Working with Turkish poses its own challenges. I want to see their effect on the NER task and try to overcome them using deterministic and statistical methods.

Superpowers

I know Turkish, which I believe is a good starting point.