Reynar et al, A maximum entropy approach to identifying sentence boundaries. 1997

Citation

Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, March–April 1997.

Online Version

http://www.aclweb.org/anthology-new/A/A97/A97-1004.pdf

Summary

This paper addresses the problem of sentence boundary detection in raw text. It uses contextual information to decide whether an occurrence of '.', '?', or '!' (or any other annotated sentence-boundary marker) is a valid sentence boundary. The features used by the portable system are not domain specific, which means the model can easily be trained for another domain.

Method

First, each candidate token is identified; the following features are then used to classify whether the candidate is a valid sentence boundary.

The paper describes two systems, each using a different set of features:

1. The first system takes advantage of the structure of English text, which makes it domain specific. It uses the prefix and suffix of the candidate token (the parts of the token before and after the candidate punctuation mark), domain-specific features such as whether the token is an honorific (Mr., Dr., etc.) or a corporate designator (Corp., etc.), and the words before and after the candidate.

2. The second system does not use any features specific to English. It uses only the context around the candidate: the previous word, the next word, and the prefix and suffix of the candidate token. It also checks whether the prefix or suffix is on a list of abbreviations.
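The feature extraction for the second (portable) system could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the feature string format, the whitespace tokenization, and the small `ABBREVIATIONS` set are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list, hard-coded here only for illustration.
ABBREVIATIONS = {"mr", "dr", "corp", "inc"}

# Candidate sentence-boundary characters.
CANDIDATE = re.compile(r"[.?!]")


def candidate_features(tokens, i, j):
    """Return contextual features for the candidate at offset j of tokens[i]."""
    token = tokens[i]
    prefix, suffix = token[:j], token[j + 1:]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    feats = {
        "prefix=" + prefix,
        "suffix=" + suffix,
        "prev=" + prev_word,
        "next=" + next_word,
    }
    # Abbreviation-list membership of the prefix and the suffix.
    if prefix.lower() in ABBREVIATIONS:
        feats.add("prefix_is_abbrev")
    if suffix.lower() in ABBREVIATIONS:
        feats.add("suffix_is_abbrev")
    return feats


def extract_candidates(text):
    """Find every '.', '?' or '!' in whitespace-separated tokens."""
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        for match in CANDIDATE.finditer(token):
            j = match.start()
            candidates.append((i, j, candidate_features(tokens, i, j)))
    return candidates


for i, j, feats in extract_candidates("Mr. Smith left. He returned!"):
    print(i, j, sorted(feats))
```

Each candidate is returned with its token index, character offset, and a set of feature strings that a classifier can consume.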

A maximum entropy model is then trained to decide whether a candidate is a valid sentence boundary given its surrounding features.
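As a rough illustration of such a classifier, the sketch below trains a two-class model over binary features. With only two outcomes, a conditional maximum entropy model has the same form as logistic regression, so the sketch uses simple stochastic gradient ascent on the log-likelihood; this is a swapped-in training procedure chosen for brevity and is not claimed to be the one used in the paper. The function names, feature strings, toy examples, and the 0.5 decision threshold are assumptions for the example.

```python
import math
from collections import defaultdict


def prob_boundary(weights, feats):
    """P(boundary | features) under a log-linear model with binary features."""
    score = sum(weights.get(f, 0.0) for f in feats)
    score = max(-30.0, min(30.0, score))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(examples, epochs=200, lr=0.5):
    """examples: list of (set of feature strings, label), label 1 = boundary."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            p = prob_boundary(weights, feats)
            # Gradient of the log-likelihood for each active binary feature.
            for f in feats:
                weights[f] += lr * (label - p)
    return weights


# Toy, made-up training candidates (not data from the paper).
examples = [
    ({"prefix=Mr", "next=Smith", "prefix_is_abbrev"}, 0),
    ({"prefix=left", "next=He"}, 1),
    ({"prefix=Corp", "next=said", "prefix_is_abbrev"}, 0),
    ({"prefix=returned", "next=</s>"}, 1),
]

weights = train_maxent(examples)
unseen = {"prefix=Dr", "prefix_is_abbrev"}
print(prob_boundary(weights, unseen) > 0.5)
```

A candidate is labeled a sentence boundary when the estimated probability exceeds 0.5; in the toy run, the unseen "Dr." candidate falls below that threshold because the abbreviation feature only ever co-occurred with non-boundaries.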

Experimental Results

The two systems were trained and tested on the Wall Street Journal and Brown corpora. The first system achieved accuracies of 98.8% and 97.9%, respectively. The second system, which is highly portable and not domain specific, achieved 98.0% and 97.5%, respectively. The results also showed that performance degrades as the quantity of training data decreases, yet with only 500 training examples the system still reaches an accuracy of 97%, which is much better than the baseline accuracy of 64%.
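To make the accuracy comparison concrete, the sketch below scores a trivial baseline on labeled candidates. It assumes the baseline simply labels every candidate punctuation mark as a sentence boundary; the helper names and the toy gold labels are invented for illustration and do not reproduce the paper's data.

```python
def accuracy(predictions, gold):
    """Fraction of candidates whose predicted label matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def always_boundary_baseline(gold):
    """Assumed baseline: predict 'boundary' (1) for every candidate."""
    return [1] * len(gold)


# Toy gold labels: 1 = true sentence boundary, 0 = not a boundary.
gold = [1, 1, 0, 1, 0]
print(accuracy(always_boundary_baseline(gold), gold))
```

Under this assumption, the baseline's accuracy equals the fraction of candidates that are true boundaries, so any trained classifier has to beat that proportion to be useful.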