Difference between revisions of "Pattern Matching over Annotations"

From Cohen Courses

Revision as of 22:58, 29 November 2011

Pattern matching over annotations is a generalization of many methods used in structured data, such as regular expressions and graph traversal. Generally speaking, it allows complex sequential data to be analyzed on multiple dimensions at once, and for that information to be queried jointly for larger NLP systems.

Motivation

A key application of this method is question answering (QA). Most sources of information in QA are text to which some structure has been added. Occasionally the data sits in a database that can be queried directly. More often, however, it is stored in unstructured documents, which external NLP tools can decorate with labels. These decorations are then stored as annotation layers.

A challenge with these annotation layers arises in real-world settings, where different annotators segment the text differently. A sentence classifier, for example one that marks sentences containing a question, places boundaries at periods and other sentence-final punctuation. A named entity recognizer marks certain single- or multi-word expressions but leaves much of the text unannotated. A part-of-speech tagger places a boundary at every word and assigns each word exactly one label. A tree-based annotation, such as a syntactic parse, may have an even more complex structure.

To handle these varied levels of annotation in a single system, a variety of tools have been built that generalize over differing annotations and query them jointly.
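
As a rough illustration of the idea, the sketch below stores three annotation layers (sentence type, named entities, part-of-speech tags) as character-offset spans over the same text and runs one joint query across them. It is not the interface of any particular tool: the names `Span`, `contains` and `entities_in_questions` are hypothetical, and the annotations are hand-written stand-ins for what real annotators would produce.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    label: str


text = "Where is Pittsburgh? Pittsburgh is in Pennsylvania."

# Sentence layer: boundaries at sentence-final punctuation.
sentences = [
    Span(m.start(), m.end(), "QUESTION" if m.group().endswith("?") else "STATEMENT")
    for m in re.finditer(r"\S[^.?!]*[.?!]", text)
]

# Entity layer: covers only some expressions and leaves the rest blank.
entities = [
    Span(m.start(), m.end(), "LOCATION")
    for m in re.finditer(r"Pittsburgh|Pennsylvania", text)
]

# Part-of-speech layer: one span and exactly one label per token.
# The tags are assigned by hand here (an assumption, not tagger output).
tags = ["WRB", "VBZ", "NNP", ".", "NNP", "VBZ", "IN", "NNP", "."]
tokens = [
    Span(m.start(), m.end(), tag)
    for m, tag in zip(re.finditer(r"\w+|[^\w\s]", text), tags)
]

layers = {"sentence": sentences, "entity": entities, "pos": tokens}


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def entities_in_questions(layers):
    """Joint query: entities inside question sentences, with their POS tags."""
    results = []
    for sent in layers["sentence"]:
        if sent.label != "QUESTION":
            continue
        for ent in layers["entity"]:
            if contains(sent, ent):
                pos = [tok.label for tok in layers["pos"] if contains(ent, tok)]
                results.append((text[ent.start:ent.end], ent.label, pos))
    return results


print(entities_in_questions(layers))
```

Because every layer refers to the same character offsets, the layers do not need to agree on segmentation; the query only compares span boundaries. Tree-based annotations would need a richer representation than the flat spans assumed here.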