Boguarev and Neff, 2010
Citation
Branimir Boguraev and Mary Neff. A Framework for Traversing Dense Annotation Lattices. In Language Resources and Evaluation. Link
Summary
This paper defines a query language and traversal algorithm for pattern matching over annotations. The goal of the traversal is to unambiguously determine whether a specified subgraph is contained within the annotation lattice using a single pass. The traversal is guaranteed to follow the correct path based on a set of prioritizing constraints. The query language can also add annotations along the way, meaning that simple heuristics can be used to rapidly label data with new annotations based on a query.
Query Language
This paper introduces AFST, a query language which defines very powerful pattern matching constrainsts. The language supports, for example, the following functions:
- Matching on the type of an annotation, without regard to content (Token[] matches any annotation of type Token)
- Matching on the content of an annotation as well as the type (Person[kind=~"named"] matches an annotation of type Person which contains an attribute "kind" with the value "named").
- Adding new annotations as old annotations are traversed (NP[] /] Subj[passive="false"] matches an annotation of type NP, then adds a new annotation on that span marking it as a Subj, with attribute passive and value "false").
These functions can be appended together with . and modified with the usual symbols, such as | or * to build regular-expression-like patterns, e.g. (E here marks an empty transition, which always succeeds in traversal):
AFST has a number of additional keywords. For instance, the honour keyword allows previous annotations to be acknowledged explicitly as places to avoid outputting a new annotation. This is relevant in cases where there are places where the iterator would normally face ambiguities. For instance, in the address 1600 Pennsylvania Avenue, there is no way for a regular expression over text to know that "1600" here is not a date. It is cases like these where the AFST formalism is an improvement.
Traversal
UIMA stores a prioritization of annotations for traversal based on a simple natural order:
- Start position of annotations (earlier annotations obviously have entry points earlier in the lattice)
- Length of the annotation (if two annotations start at the same point, the annotation with the later endpoint receives priority)
- Type priority (for annotations with identical start and endpoints,
The AFST traversal delegates much of the traversal to UIMA's internal iterators. However, additional functionality exists particularly in the case of grammar cascading. In these cases, a query may be searching for two different annotations which happen simultaneously over multiple annotation layers. Because the normal UIMA iterators must choose a path, this would normally add ambiguity. AFST defines a complex directive for mixed iterators which traverse multiple levels of annotation simultaneously:
- Test for an annotation of a certain type
- Upon successful match, descend under this annotation
- Test whether a given pattern matches exactly the sequence of lower annotations covered by the higher match
This traversal directive allows a pattern in AFST to acknowledge not only horizontal positioning (a sequence of annotations), but also the vertical structure of multiple annotation levels (for instance, annotations describing trees). The advantage of AFST is that it can do all of this searching over a standard back-off annotation, rather than having to build specialized representations of further structure in data.
Evaluation
Because this paper is primarily focused on introducing the functionality of the query language and the method by which it traverses annotations, it does not explicitly evaluate its performance. Claims of speed improvements over existing alternatives are not backed up with experimental tests; claims of improved robustness are primarily verified through counterexamples, showing functionality that AFST can represent which existing tools could not define unambiguously.