Peñas and Hovy, COLING 2010
Anselmo Peñas and Eduard Hovy. Filling Knowledge Gaps in Text for Machine Reading. COLING 2010.
The gist of this paper is that there is a lot going on in real-world sentences that does not just fall out of simple constituency or dependency parses. For example, in the phrase "a seven-yard Young touchdown pass," those parsers would just give you a bunch of noun-noun or adjective-noun constituents. That's nice, but it doesn't tell you the semantics of the phrase. On the flip side, semantic role labelers might just skip over the noun "pass" as not needing further analysis. What this paper does, essentially, is recognize when phrases (noun-noun compounds, specifically) are in need of further semantic analysis and try to fill in the blanks. Their system turns the phrase "a seven-yard Young touchdown pass" into something more like this: "a seven-yard pass that Young threw for a touchdown." That's obvious to a human reader, but very difficult for computers, and it's pretty impressive that they were able to do it.
They accomplish this by building up a big database of distributional semantics over words and part-of-speech labels. This database consists of sequences of coarse part-of-speech tags, like N:V:N, along with counts for each tuple of words seen with that part-of-speech sequence. They actually do a dependency parse and then simplify it down to those sequences, so it's not as naive as just running a POS tagger.
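To make the shape of that database concrete, here is a minimal sketch of a pattern-to-counts store. It is only an illustration, not the authors' implementation: the class name `PatternCounts`, its methods, and all the counts are invented for this example, and the dependency-parse simplification step is skipped entirely (the tuples are entered by hand).

```python
from collections import Counter, defaultdict


class PatternCounts:
    """Toy store mapping a coarse POS pattern (e.g. "N:V:N") to word-tuple counts.

    Hypothetical structure for illustration; the paper does not specify one.
    """

    def __init__(self):
        self._counts = defaultdict(Counter)

    def add(self, pattern, words, n=1):
        """Record n observations of a word tuple under a pattern."""
        words = tuple(words)
        if len(pattern.split(":")) != len(words):
            raise ValueError("pattern and words must have the same length")
        self._counts[pattern][words] += n

    def count(self, pattern, words):
        """Return how often the tuple was seen; 0 if never seen."""
        # .get avoids creating empty entries for unseen patterns.
        return self._counts.get(pattern, Counter())[tuple(words)]

    def top(self, pattern, k=3):
        """Return the k most frequent word tuples for a pattern."""
        return self._counts.get(pattern, Counter()).most_common(k)


# Invented toy counts, not figures from the paper.
db = PatternCounts()
db.add("N:V:N", ("quarterback", "throw", "pass"), 30)
db.add("N:V:N", ("receiver", "catch", "pass"), 25)
db.add("N:P:N", ("pass", "for", "touchdown"), 12)

print(db.top("N:V:N", 1))
```

In a real system the tuples would come from simplified dependency parses over a large corpus rather than hand-entered rows.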
When they come across a noun-noun compound that they want to enrich, like "touchdown pass," they look in their database of counts for sequences that contain both nouns and relate them in a plausible way. In this instance they come up with an N:P:N sequence, pass:for:touchdown. They can get as specific as there are counts for, so they can do queries like "Young:throw:pass:?:touchdown." They also do some hierarchy and category instance learning, so that they know "Young" is a "quarterback," "quarterbacks" are "players," and "players" are "people." If they don't have enough evidence at one level, they can go up the tree until they reach a point where they can make some conclusion.
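The lookup and the "go up the tree" back-off can be sketched roughly as follows. This is an assumption-laden toy, not the paper's algorithm: the `TRIPLES` counts, the `IS_A` table, and the function names `fill_slot`, `ancestors`, and `fill_with_backoff` are all made up for illustration, and only three-word sequences with a wildcard in the middle are handled.

```python
from collections import Counter

# Invented toy counts over (word, connector, word) triples.
TRIPLES = Counter({
    ("pass", "for", "touchdown"): 12,
    ("pass", "to", "touchdown"): 1,
    ("quarterback", "throw", "pass"): 30,
    ("player", "catch", "pass"): 25,
})

# Hypothetical instance/class hierarchy, following the example in the text.
IS_A = {"Young": "quarterback", "quarterback": "player", "player": "person"}


def fill_slot(first, last, min_count=1):
    """Return (connector, count) pairs linking first and last, most frequent first."""
    candidates = Counter()
    for (a, middle, b), n in TRIPLES.items():
        if a == first and b == last:
            candidates[middle] += n
    return [(m, n) for m, n in candidates.most_common() if n >= min_count]


def ancestors(term):
    """Return the term followed by its increasingly general classes."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def fill_with_backoff(first, last, min_count=1):
    """Try the specific term first, then back off to more general classes."""
    for level in ancestors(first):
        found = fill_slot(level, last, min_count)
        if found:
            return level, found
    return None, []


print(fill_slot("pass", "touchdown"))      # [('for', 12), ('to', 1)]
print(fill_with_backoff("Young", "pass"))  # ('quarterback', [('throw', 30)])
```

Here there is no direct evidence for "Young ? pass," so the query backs off one level and finds that quarterbacks throw passes. Raising `min_count` forces the back-off to continue until a level has enough evidence.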
The paper does not offer much in the way of quantitative results. They mention that they are in the process of performing an extrinsic evaluation of their system by using it as input to a QA system. In this paper, they just give some anecdotal evidence that their system is good, plus a brief comparison of how complete their knowledge base is for their domain relative to other automatically built knowledge bases.