Sporleder&Li,EACL09
Citation
title = {Unsupervised recognition of literal and non-literal use of idiomatic expressions}, author = {Sporleder, Caroline and Li, Linlin}, booktitle = {Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics}, series = {EACL '09}, year = {2009}, location = {Athens, Greece}, pages = {754--762},
Abstract from the paper
We propose an unsupervised method for distinguishing literal and non-literal usages of idiomatic expressions. Our method determines how well a literal interpretation is linked to the overall cohesive structure of the discourse. If strong links can be found, the expression is classified as literal, otherwise as idiomatic. We show that this method can help to tell apart literal and non-literal usages, even for idioms which occur in canonical form.
Online version
Summary of approach
- The main goal of this article is to distinguish between literal and non-literal usages of idiomatic expressions. For example, given the expressions ‘break the ice’ and ‘spill the beans’, the algorithm should annotate both ‘Somehow I always end up spilling the beans all over the floor and looking foolish when the clerk comes to sweep them up.’ and ‘Dad had to break the ice on the chicken troughs so that they could get water’ as literal, since in each sentence the component words keep their ordinary meanings.
- The method is based on the insight that figurative language exhibits fewer semantic cohesive ties with its context than literal language does; in this respect idioms behave similarly to spelling errors. The approach is therefore similar to Hirst and St-Onge’s (1998) method for detecting malapropisms. The main idea is that if an expression is used literally rather than idiomatically, its component words will be semantically related to several words in the surrounding discourse. For example, when the expression ‘play with fire’ is used literally, words such as ‘smoke’, ‘burn’, ‘fire department’, and ‘alarm’ tend to be used nearby; when it is used idiomatically, they are not.
- The authors implement two classifiers based on the semantic relatedness of an expression’s component words to nearby words in the text. The first computes the lexical chains for the input text and classifies an expression as literal or non-literal depending on whether its component words participate in any of the chains. The second builds a cohesion graph and determines how this graph changes when the expression is inserted or left out. In the lexical-chain classifier, if one or more of the expression’s components are sufficiently related to enough nearby words to form a ‘lexical chain’, the usage is classified as literal; otherwise it is classified as idiomatic.
- As a measure of semantic relatedness the Normalized Google Distance is used, which computes relatedness on the basis of the page counts returned by a search engine.
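The two ideas above (a page-count-based distance and a cohesion-graph comparison) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The NGD formula is the standard one from the literature. The mapping from distance to similarity, the average-connectivity measure, the decision rule (literal if adding the idiom's words does not lower average connectivity) and the helper `toy_sim` are all assumptions made for illustration. Page counts are passed in as plain numbers rather than fetched from a search engine.

```python
import math
from itertools import combinations


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    n: assumed total number of indexed pages (must exceed fx and fy).
    """
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # terms never co-occur: maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom <= 0:
        raise ValueError("n must be larger than both single-term counts")
    return (max(lx, ly) - lxy) / denom


def relatedness(distance):
    """Map a distance to a similarity in [0, 1] (an assumed conversion)."""
    return 1.0 / (1.0 + distance)


def avg_connectivity(words, sim):
    """Mean pairwise similarity over all word pairs (the graph's edges)."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def classify(context_words, idiom_words, sim):
    """Assumed rule: literal if the idiom's words do not weaken cohesion."""
    without = avg_connectivity(list(context_words), sim)
    with_idiom = avg_connectivity(list(context_words) + list(idiom_words), sim)
    return "literal" if with_idiom >= without else "non-literal"


# Hypothetical similarity function standing in for NGD-based relatedness.
_COLD = {"ice", "water", "frozen", "troughs"}
_SOCIAL = {"meeting", "joke", "colleagues"}


def toy_sim(a, b):
    same_topic = ({a, b} <= _COLD) or ({a, b} <= _SOCIAL)
    return 0.75 if same_topic else 0.25


print(classify(["water", "frozen", "troughs"], ["ice"], toy_sim))
print(classify(["meeting", "joke", "colleagues"], ["ice"], toy_sim))
```

In a real setting, `toy_sim` would be replaced by a function that looks up page counts for each word pair and returns `relatedness(ngd(...))`.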
Experiments and results
The model was evaluated on an idiom set consisting of 3964 idiom occurrences (17 idiom types), which were manually labeled as ‘literal’ or ‘figurative’.
Two classifiers based on lexical chains were compared with a supervised method that trains a classifier for each expression based on the surrounding context. The results showed that the supervised classifier performed much better (90% F-score on literal uses) than the lexical chain classifiers (60% F-score).
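For readers unfamiliar with the metric, the following minimal sketch shows how an F-score restricted to the literal class can be computed from gold and predicted labels. The function name `f_score` and the example labels are hypothetical and are not taken from the paper's data.

```python
def f_score(gold, predicted, target="literal"):
    """F1 for one class: harmonic mean of precision and recall on `target`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
gold = ["literal", "literal", "figurative", "figurative"]
pred = ["literal", "figurative", "literal", "figurative"]
print(f_score(gold, pred))
```

Scoring only the literal class matters here because a classifier that favours the other label can look accurate overall while still missing many literal uses.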
Related Papers
- Linlin Li and Caroline Sporleder. "Linguistic Cues for Distinguishing Literal and Non-Literal Usage", Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), August 23-27, 2010, Beijing, China.