Difference between revisions of "Reynar et al, A maximum entropy approach to identifying sentence boundaries. 1997"

From Cohen Courses

Revision as of 19:52, 27 September 2011

Citation

Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, March–April 1997.

Online Version

http://www.aclweb.org/anthology-new/A/A97/A97-1004.pdf

Summary

In this paper the authors address the problem of finding sentence boundaries in raw text. The approach uses contextual information to decide whether an occurrence of '?', '.', or '!' (or any other annotated sentence boundary marker) is a valid sentence boundary. In the more portable of the two systems described, the features are not domain specific, which means the model can easily be retrained for another domain.

Method

First, each candidate token is identified. Features of the candidate and its context are then used to classify whether the candidate is a valid sentence boundary.

The paper describes two systems, each using a different set of features:

1. The first system takes advantage of the structure of the English language, which makes it domain specific. It uses the prefix and suffix of the candidate token, together with domain-specific features such as whether the candidate is an honorific (Mr., Dr., etc.) or a corporate designator (Corp., etc.), and the words before and after the candidate.

2. The second system does not use any features specific to the English language. It uses only the context around the candidate token: the previous word, the next word, and the candidate's prefix and suffix. It also checks whether the prefix and suffix are on a list of abbreviations.
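The two feature sets above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word lists, the feature-name strings, the `<s>`/`</s>` padding symbols, and all function names are assumptions made for the example, and the prefix/suffix are taken here to be the parts of the candidate token before and after its first punctuation mark.

```python
BOUNDARY_CHARS = ".?!"

# Hypothetical word lists for illustration; not the lists used in the paper.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}
CORPORATE_DESIGNATORS = {"Corp.", "Inc.", "Co.", "Ltd."}
ABBREVIATIONS = {"Mr", "Dr", "Corp", "Inc", "etc"}


def find_candidates(tokens):
    """Return indices of tokens containing a potential boundary mark."""
    return [i for i, tok in enumerate(tokens)
            if any(ch in tok for ch in BOUNDARY_CHARS)]


def split_candidate(tok):
    """Split a candidate token around its first boundary mark."""
    pos = min(tok.index(ch) for ch in BOUNDARY_CHARS if ch in tok)
    return tok[:pos], tok[pos + 1:]


def context_features(tokens, i):
    """Features shared by both sketches: prefix, suffix, neighbouring words."""
    prefix, suffix = split_candidate(tokens[i])
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {"prefix=" + prefix, "suffix=" + suffix,
            "prev=" + prev_word, "next=" + next_word}


def english_features(tokens, i):
    """System 1 sketch: context plus English-specific cues."""
    feats = context_features(tokens, i)
    if tokens[i] in HONORIFICS:
        feats.add("honorific")
    if tokens[i] in CORPORATE_DESIGNATORS:
        feats.add("corporate_designator")
    return feats


def portable_features(tokens, i):
    """System 2 sketch: context plus abbreviation-list membership."""
    feats = context_features(tokens, i)
    prefix, suffix = split_candidate(tokens[i])
    if prefix in ABBREVIATIONS:
        feats.add("prefix_is_abbreviation")
    if suffix in ABBREVIATIONS:
        feats.add("suffix_is_abbreviation")
    return feats


tokens = ["Mr.", "Smith", "left", "Acme", "Corp.", "He", "retired."]
candidates = find_candidates(tokens)
```

Each candidate index in `candidates` can be passed to either feature function; the resulting feature set is what a classifier would see for that candidate.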

A maximum entropy model is then trained to decide whether the candidate is a valid sentence boundary, given the surrounding features.
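As a rough illustration of the classification step, the sketch below fits a two-class log-linear model over binary features, which for a yes/no decision has the same form as a conditional maximum entropy model. It is a sketch under stated assumptions: the training loop is plain stochastic gradient ascent chosen for brevity (not necessarily the estimation procedure used in the paper), and the function names, hyperparameters, and toy examples are all hypothetical.

```python
import math
from collections import defaultdict


def p_boundary(weights, features):
    """P(boundary | features) under a binary log-linear model."""
    score = sum(weights[f] for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent.

    examples: list of (feature_set, label) pairs, where label 1 means
    the candidate is a sentence boundary and 0 means it is not.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the conditional log-likelihood for one example.
            error = label - p_boundary(weights, features)
            for f in features:
                weights[f] += lr * error
    return weights


# Toy, hand-made training data (assumption; not from the paper).
examples = [
    ({"prefix=Mr", "next=Smith"}, 0),
    ({"prefix=Dr", "next=Jones"}, 0),
    ({"prefix=Corp", "next=He"}, 1),
    ({"prefix=retired", "next=</s>"}, 1),
]
weights = train(examples)
is_boundary = p_boundary(weights, {"prefix=retired", "next=</s>"}) > 0.5
```

A candidate is labelled a sentence boundary when the estimated probability exceeds 0.5; features never seen in training contribute a weight of zero.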