Pattern Matching over Annotations
Pattern matching over annotations is a generalization of many methods used in structured data, such as regular expressions and graph traversal. Generally speaking, it allows complex sequential data to be analyzed on multiple dimensions at once, and for that information to be queried jointly for larger NLP systems.
Motivation
A key application of this method is in question answering. Most sources of information in QA are text that has been structured in some way. Occasionally, this data is in a database which can easily be queried. More often, however, it is stored in unstructured documents which can be decorated by external NLP tools. These decorations are then stored as annotation layers.
A challenge with these annotation layers comes in a real-world setting where different annotators will use different segmentations of text. A sentence classifier, such as an annotation for a sentence containing a question, will detect boundaries at periods or other sentence boundaries. A named entity recognizer will detect certain single- or multi-word expressions but will leave much of the text blank. A part-of-speech tagger will identify a boundary at every word unit and will annotate every word with exactly one label. A tree-based annotation, such as a syntactic parse, might have even more complex structure.
In order to handle these varied levels of annotation simultaneously in a single system, a variety of tools have been built to generalize over differing annotations and query them simultaneously. Searching for patterns across annotation layers is an important task when a query needs to incorporate multiple types of knowledge - for instance, identifying a named entity in an object semantic role in a sentence containing the year "1961."
Representation
Annotations are generally traversed as if they are a dense lattice, with a start point and end point. However, storing this information in a convenient way is non-trivial. The generally accepted approach is to use a back-off annotation. In this paradigm, the source text is stored as a single document with no annotations. Then, future layers point at that source document and define their annotations in terms of the character offsets. By defining offsets in terms of characters rather than words, it allows for greater flexibility to new types of annotations, such as morphology, as well as for adaptation to languages where word segmentation is itself a non-trivial problem.
The two most well-known implementations of back-off annotation are UIMA and GATE.
Query Language
Once data is stored in a back-off annotation format and has been converted into a lattice, a way of querying that lattice must be derived. Simple string matching is often used for simple tasks, where there are relatively few annotations which can be flattened into a string. Because most standard representations, including both UIMA and GATE, store annotations in XML format, any language which parses XML can in theory right arbitrarily specific functions for querying those annotations. XQuery is the language most often used to do this. However, this requires a great deal of problem-specific programming.
Alternatively, queries can be converted to a finite state transducer, and annotation lattices can be traversed step-by-step as if they were inputs to that FST. The two most well-known formalisms or systems which perform querying this way are AFST and Sprout.
Traversal
Traversal of a lattice involves a predefined FST, generated from a query. Each step in an annotation lattice implies a transition in the FST. If there exists a path through the lattice that results in a success state, the pattern matches the input string. These patterns often have various priorities for identifying different levels of annotation - for instance, Name may always subsume FirstName and LastName in importance, adjusting the traversal. This approach benefits from the sequential nature of queries, and the structure of an annotation lattice, to make the problem more tractable than the general problem of subgraph matching given arbitrary graphs and subgraphs.