Automatic Evaluation Method

From Cohen Courses

Latest revision as of 00:48, 23 October 2010

Summary

Automatic evaluation methods are typically used where a ground truth is hard to define (for example, in machine translation) or where labeled data is very expensive (for example, in co-reference resolution). The topic has developed considerably over the last decade and has shown its applicability in many different NLP tasks.

Main Ideas

The main idea behind automatic evaluation methods (or metrics) is that there is an alternative measure A to the original measure B. Measure B requires ground truth (or labeled data), whereas measure A usually requires no ground truth (or only a small set of it).

A few examples of alternative measures are:

  • Entropy: measures how good the mapping function between two sets is, and is useful in cases where the mapping function is a crucial indicator of overall performance.
  • N-gram statistics: measure how n-gram patterns are formed in the output text, and perform well as an alternative measure for text generation tasks.

Demonstrating that an automatic evaluation method is applicable to a given task usually starts with a hypothesis and requires a correlation analysis between the alternative measure and the original measure.
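
The correlation analysis can be sketched as follows. This is a minimal illustration, assuming the alternative and original measures have each been computed for the same set of systems; the function name `pearson` and the score lists are hypothetical and are not taken from any of the cited papers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five systems (made-up numbers for illustration).
automatic_scores = [0.21, 0.25, 0.30, 0.34, 0.40]  # alternative measure A
original_scores = [2.1, 2.4, 3.0, 3.2, 3.9]        # original measure B

r = pearson(automatic_scores, original_scores)
print(f"correlation between A and B: {r:.3f}")
```

A coefficient close to 1 would support the hypothesis that measure A can stand in for measure B on this task; a rank correlation could be used instead if only the ordering of systems matters.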

Applications in NLP Tasks

  • BLEU for Machine Translation [1]

The most famous application of automatic evaluation is the BLEU score for machine translation, which is computed from the n-gram statistics matched between the system output and human-generated output. The key observation behind BLEU is that machine translation has no well-defined ground truth, since many outputs are acceptable. BLEU removes the need for a human grader and allows large-scale testing, which in turn speeds up the development of the machine translation field. It is the groundbreaking work in this area and inspired the other three methods described here.
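
The matched n-gram statistic can be illustrated with a clipped n-gram precision. This is only a sketch of one ingredient: the full BLEU score in [1] also combines several n-gram orders and applies a brevity penalty, which are not shown here. The function names `ngrams` and `clipped_precision` are chosen for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each candidate n-gram count is clipped to the largest count of that
    n-gram in any single reference, so repeating a word cannot inflate
    the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(clipped_precision(candidate, references, 1))  # 2 of 7 unigrams match
```

Without clipping, the degenerate candidate above would receive a unigram precision of 7/7, which is why the clipped count is used.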

  • ROUGE for Text Summarization [2]

Text summarization is another text generation task, and its situation is very similar to that of machine translation: many different outputs are acceptable. ROUGE was therefore proposed, following the idea of BLEU, to measure the performance of text summarization systems by matched n-gram statistics.
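
A recall-oriented n-gram match in the spirit of ROUGE can be sketched as below. This is a simplified illustration, assuming whitespace-tokenized text and pooling counts over all references; it is not the complete metric family described in [2], and the function names are chosen for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def ngram_recall(summary, references, n):
    """Fraction of reference n-grams that are covered by the summary."""
    summary_counts = ngram_counts(summary, n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngram_counts(ref, n)
        # A reference n-gram is matched at most as often as the summary has it.
        matched += sum(min(count, summary_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


summary = "the cat sat on the mat".split()
references = ["the cat was on the mat".split()]
print(ngram_recall(summary, references, 1))  # 5 of 6 reference unigrams
print(ngram_recall(summary, references, 2))  # 3 of 5 reference bigrams
```

The denominator counts n-grams in the references rather than in the system output, which makes this a recall measure, in contrast to the precision-based matching used for machine translation.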

  • Alignment Entropy for Machine Transliteration [3]

Alignment entropy is a new idea among automatic evaluation methods. Alignment is a very important step in training a machine transliteration system, and experiments show that the performance of machine transliteration is largely determined by the performance of its alignment process. Because entropy is a good measure for mapping functions, it shows great applicability and scalability for this task.
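
How entropy can score a mapping is illustrated below with the conditional entropy of target units given source units over a list of aligned pairs. This is an assumption made for illustration and is not necessarily the exact formulation used in [3]; the function name and the toy alignments are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(aligned_pairs):
    """Entropy H(target | source) in bits over (source, target) pairs.

    A value of 0 means every source unit always maps to the same target
    unit; larger values mean the mapping is less consistent.
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    total = len(aligned_pairs)
    entropy = 0.0
    for (src, _tgt), count in pair_counts.items():
        p_pair = count / total
        p_tgt_given_src = count / source_counts[src]
        entropy -= p_pair * math.log2(p_tgt_given_src)
    return entropy


# Toy alignments between source and target units (made-up data).
consistent = [("a", "x"), ("b", "y"), ("a", "x")]
ambiguous = [("a", "x"), ("a", "y")]
print(conditional_entropy(consistent))  # every source unit has one target
print(conditional_entropy(ambiguous))   # "a" is split evenly over two targets
```

No labeled transliteration output is needed to compute this quantity; it depends only on the alignment itself.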

  • CONE for Co-reference Resolution [4]

Human annotation for co-reference resolution (CRR) requires semantic understanding and is very labor-intensive and costly. Only a handful of annotated corpora are currently available. An alternative set of metrics for Named Entity Co-reference Resolution (NE-CRR), collectively called CONE, was recently proposed and studied. It consists of the CONE-B3 and CONE-CEAF metrics, which are based on the traditional B3 and CEAF metrics for co-reference resolution. As an alternative measure, CONE requires only a soft gold standard, which can be obtained automatically by a robust, simple classifier. Possible extensions include semi-supervised learning or bootstrapping based on CONE.
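
For orientation, the traditional B3 metric on which CONE-B3 builds can be sketched as follows. This shows only the standard B3 precision and recall over mentions that appear in both clusterings; how CONE constructs its soft gold standard is described in [4] and is not reproduced here. The function name and the toy clusters are hypothetical.

```python
def b_cubed(key_clusters, response_clusters):
    """Return (precision, recall) of B3, averaged over shared mentions.

    key_clusters is the gold clustering and response_clusters is the
    system clustering; each is a list of sets of mention identifiers.
    """
    key_of = {m: frozenset(c) for c in key_clusters for m in c}
    resp_of = {m: frozenset(c) for c in response_clusters for m in c}
    mentions = [m for m in key_of if m in resp_of]
    if not mentions:
        return 0.0, 0.0
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall


# Toy example with four mentions (made-up data).
key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
precision, recall = b_cubed(key, response)
print(precision, recall)
```

Under the approach described above, the key clustering would come from the automatically obtained soft gold standard rather than from human annotation.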

References

[1] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL.

[2] Chin-Yew Lin and E.H. Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of NAACL.

[3] Vladimir Pervouchine, Haizhou Li and Bo Lin. 2009. Transliteration Alignment. In Proceedings of ACL.

[4] Bo Lin, Rushin Shah, Robert Frederking and Anatole Gershman. 2010. CONE: Metrics for Automatic Evaluation of Named Entity Co-reference Resolution. In Proceedings of ACL Named Entities Workshop.