Reynar et al, A maximum entropy approach to identifying sentence boundaries. 1997
Revision as of 19:42, 27 September 2011
Citation
Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, March–April 1997.
Online Version
http://www.aclweb.org/anthology-new/A/A97/A97-1004.pdf
Summary
In this paper, the authors address the problem of finding sentence boundaries in raw text. Their system uses contextual information to decide whether an occurrence of '?', '.', or '!' (or any other annotated sentence-boundary character) is a valid sentence boundary. The features used are not domain specific, which means the model can easily be retrained for another domain.
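The candidate-identification step can be illustrated with a minimal sketch. It assumes whitespace tokenization and treats only '.', '?', and '!' as potential boundaries; the names `Candidate` and `find_candidates` are hypothetical and do not come from the paper. The sketch only collects each candidate with its surrounding context and does not include the maximum entropy classifier itself.

```python
from dataclasses import dataclass
from typing import List

# Punctuation marks treated as potential sentence boundaries (assumption:
# only the three marks named in the summary).
BOUNDARY_CHARS = ".?!"


@dataclass
class Candidate:
    token_index: int  # position of the token in the whitespace-split text
    prefix: str       # part of the token before the punctuation mark
    mark: str         # the punctuation mark itself
    suffix: str       # part of the token after the punctuation mark
    previous: str     # token to the left ("" at the start of the text)
    following: str    # token to the right ("" at the end of the text)


def find_candidates(text: str) -> List[Candidate]:
    """Collect every potential sentence boundary together with its context."""
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        for j, ch in enumerate(token):
            if ch in BOUNDARY_CHARS:
                candidates.append(Candidate(
                    token_index=i,
                    prefix=token[:j],
                    mark=ch,
                    suffix=token[j + 1:],
                    previous=tokens[i - 1] if i > 0 else "",
                    following=tokens[i + 1] if i + 1 < len(tokens) else "",
                ))
    return candidates


if __name__ == "__main__":
    for cand in find_candidates("Mr. Smith arrived. Was he late?"):
        print(cand)
```

Each `Candidate` would then be handed to a classifier that labels it as a real boundary or not; in this example the period in "Mr." is a candidate that such a classifier should reject.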
Method
First, a candidate token is identified; the following features are then used to classify whether the candidate is a valid sentence boundary.
The paper describes two systems, each using a different set of features.
1. The first system takes advantage of the structure of the English language, which makes it domain specific. It uses the prefix and suffix of the candidate token, along with some domain-specific features, such as whether it is an honorific (Mr., Dr., etc.) or