Difference between revisions of "Automatic Evaluation Method"

Revision as of 23:55, 30 September 2010

Summary

Automatic evaluation methods are typically used where a ground truth is hard to define (for example, in machine translation) or where labeled data is very expensive (for example, in co-reference resolution). The area has developed considerably over the last decade and has proved applicable to many different NLP tasks.

Main Ideas

The main idea behind automatic evaluation methods (or metrics) is that an alternative measure A can stand in for the original measure B. Measure B requires ground truth (or labeled data), whereas measure A usually requires none, or only a small amount of it.

A few examples of alternative measures are:

Entropy: measures the quality of the mapping function between two sets. It is useful when the mapping function is a crucial indicator of overall performance.
N-gram statistics: measure how n-gram patterns are formed in the output text. They work well as an alternative measure for text generation tasks.
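
As a rough illustration of the entropy idea, the sketch below computes the conditional entropy of target units given source units over a list of observed (source, target) pairs. The function name `mapping_entropy` and the choice of conditional entropy are assumptions made for this example; the page does not specify which entropy formulation is intended. Under this formulation, a lower value means each source unit maps more consistently to a single target unit.

```python
import math
from collections import Counter


def mapping_entropy(pairs):
    """Conditional entropy H(target | source), in bits.

    `pairs` is a list of (source, target) tuples, e.g. aligned units.
    This is an illustrative formulation, not a specific published metric.
    """
    pair_counts = Counter(pairs)
    source_counts = Counter(source for source, _ in pairs)
    total = len(pairs)

    entropy = 0.0
    for (source, _target), count in pair_counts.items():
        p_pair = count / total                       # joint probability p(s, t)
        p_target_given_source = count / source_counts[source]  # p(t | s)
        entropy -= p_pair * math.log2(p_target_given_source)
    return entropy


# A fully consistent mapping: every source always maps to the same target.
consistent = [("a", "x"), ("b", "y"), ("a", "x"), ("b", "y")]
# An inconsistent mapping: each source maps to two targets equally often.
noisy = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "x")]

print(mapping_entropy(consistent))  # 0.0
print(mapping_entropy(noisy))       # 1.0
```

No labeled ground truth is needed to compute this value, which is what makes a measure of this kind usable as an alternative measure A.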

Establishing that an automatic evaluation method is applicable to a given task usually starts with a hypothesis and requires a correlation analysis between the alternative measure and the original measure.
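
A minimal sketch of such a correlation analysis, assuming that both measures have been computed for the same set of system outputs and that Pearson's correlation coefficient is the statistic of interest (the page does not name a specific coefficient). The function name `pearson` and the score lists are hypothetical and serve only to show the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five systems (made-up numbers for illustration).
alternative_measure = [0.21, 0.35, 0.40, 0.52, 0.60]  # measure A
original_measure = [2.0, 3.1, 3.0, 4.2, 4.8]          # measure B

print(pearson(alternative_measure, original_measure))
```

A coefficient close to 1 (or -1, for measures where lower is better) would support the hypothesis that measure A can substitute for measure B on that task.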

Applications in NLP Tasks

BLEU for Machine Translation

The best-known application of automatic evaluation is the BLEU score for machine translation, which is computed from the n-gram statistics of the system output that match human-generated output.
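
The n-gram matching step can be sketched as follows. This shows only the clipped (modified) n-gram precision, which is one component of BLEU; the full score also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names `ngrams` and `clipped_precision` are chosen for this example, and inputs are assumed to be already tokenized.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams matched in the references.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram occurs in any single reference.
    """
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0

    # Maximum count of each n-gram over all references.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)

    matched = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return matched / sum(candidate_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(clipped_precision(candidate, references, 1))  # 2/7, about 0.2857
```

Clipping is what prevents a degenerate output that repeats a common word from receiving a high score: only two of the seven occurrences of "the" are credited, because no single reference contains the word more than twice.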

Alignment Entropy for Machine Transliteration
ROUGE for Text Summarization
CONE for Co-reference Resolution

@@ Line 1: / Line 1: @@
 == Summary ==
-Automatic evaluation methods usually come into place where there is no ground truth or the labeled data is very expensive. This topic has been developed a lot since last decade, and showed its applicability in many different NLP tasks.
+Automatic evaluation methods usually come into place where there is hard to define a ground truth (for example in machine translation) or the labeled data is very expensive (for example in co-reference resolution). This topic has been developed a lot since last decade, and showed its applicability in many different NLP tasks.
 == Main Ideas ==
@@ Line 9: / Line 9: @@
 A few example of alternative measures are
 * '''Entropy''', it measures how well is the mapping function between two sets and serves in the cases that the mapping function is a crucial indicator of the overall performance.
-* '''N-gram statistics''', it measures how n-gram patterns are formed in the output text, which performs well as an alternative measure for text generation tasks.
+* '''N-gram statistics''', it measures how n-gram patterns in the output text are formed, which performs well as an alternative measure for text generation tasks.
 The proof of applicability for automatic evaluation methods in different task usually starts with a hypothesis and requires a correlation analysis of the alternative measure with original measure.
-== Automatic Evaluation For Machine Translation ==
+== Applications in NLP Tasks ==
+* BLEU for Machine Translation
+The most famous application of automatic evaluation is the BLEU score for machine translation which takes the matched n-gram statistics of the system output with human generated output
-The most famous application of automatic evaluation is the BLEU score for machine translation which takes the
+* Alignment Entropy for Machine Transliteration
+* ROUGE for Text Summarization
-== Automatic Evaluation For Machine Transliteration ==
+* CONE for Co-reference Resolution
-== Automatic Evaluation For Text Summarization ==
-== Automatic Evaluation For Coreference Resolution ==
