Hitting the Right Paraphrases in Good Time

From Cohen Courses

Citation

Hitting the Right Paraphrases in Good Time, by S. Kok and C. Brockett. In Proceedings of NAACL/HLT 2010, 2010.

This paper is available online [1].

Summary

The authors present an approach to paraphrasing based on random walks over graphs built from bilingual parallel phrase corpora. They first construct a graph whose nodes are phrases in multiple languages and whose edges connect phrases that have been aligned to each other by some phrase alignment model or system. Given a particular sentence, they break it into phrases and compute the hitting time between each phrase and every other phrase in the graph. The hitting time is the expected number of steps a random walk starting at node i takes to reach node j, given the transition probabilities of the graph (more on how this is computed in their particular case below). Sorting candidates by hitting time, lowest first, yields the paraphrases that are most probable under random walks on the graph. The authors discuss methods to prune the graph and to minimize the effect of spurious word alignments, and show how domain knowledge can be incorporated into the graph-based paraphrase model.

Label propagation and graph propagation are strongly related to random walks: labels propagate through a graph as if undergoing a random walk from a labeled node to an unlabeled node (or vice versa), and the label of an unlabeled node is simply the sum of the probabilities of the random walks from the labeled nodes that reach it.

Proposed Approach

Because the hitting time technically considers paths of length up to infinity, computing it exactly can be prohibitively expensive for large graphs. The authors thus use a "truncated hitting time", where random walks on the graph are limited to at most T steps.
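
Concretely, the truncated hitting time can be written as follows (this is the standard formulation of Sarkar and Moore, which this line of work builds on; the notation here, p_ik for transition probabilities and t_ij for the first-hit time, is ours rather than the paper's). Walks that fail to reach j within T steps contribute the ceiling value T:

    h_{ij}^{T} = \sum_{t=1}^{T-1} t \cdot \Pr(t_{ij} = t) \;+\; T \cdot \Pr(t_{ij} \ge T),
    \qquad
    h_{ij}^{T} =
    \begin{cases}
      0 & \text{if } i = j \text{ or } T = 0, \\
      1 + \sum_{k} p_{ik}\, h_{kj}^{T-1} & \text{otherwise.}
    \end{cases}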

The proposed algorithm takes as input a query phrase as well as a phrase table between various languages. It also has a number of parameters, which are explained below.
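
To keep those parameters straight, here is a minimal sketch in Python; the field names are ours, chosen to match the symbols used in the rest of this summary, not the paper's own notation:

    from dataclasses import dataclass

    @dataclass
    class ParaphraseConfig:
        """Illustrative parameter names, not the paper's notation."""
        T: int               # truncation length: random walks take at most T steps
        M: int               # number of walks sampled from the query node
        d: int               # BFS depth used to approximate the full graph
        max_out_degree: int  # cap on outgoing edges per node (prunes noisy alignments)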

First, a graph is created between aligned phrases. Nodes are phrases, and edge weights are computed as the probability of a foreign phrase given an English phrase, i.e., the count of the foreign and English phrases co-occurring divided by the total count of the English phrase. Unlike prior approaches, which model this graph as a simple bipartite graph between the foreign and English phrases of one particular language pair and therefore only consider random walks of length two (from an English phrase to a foreign phrase, and from that foreign phrase to another English phrase), the authors take a general graph approach: nodes can come from several languages, and the graph is not necessarily bipartite. While one can build the full graph G for small phrase tables, for large phrase tables the authors use an approximation similar to k-nearest neighbors. They approximate the full graph with a smaller graph G′, constructed by performing breadth-first search from the query node down to a depth d. Edges at the periphery are approximated by collapsing all edges between periphery nodes and nodes outside the periphery (but still in the full graph G) into a single edge.
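
A minimal Python sketch of these two steps; the data structures and the <OUTSIDE> pseudo-node are our simplifications, not the authors' implementation:

    from collections import defaultdict, deque

    def build_graph(phrase_pairs):
        """Build the phrase graph from phrase-pair counts.

        phrase_pairs: iterable of (phrase_a, phrase_b, count) from a phrase table.
        The weight of the edge a -> b is count(a, b) / count(a), i.e. the
        conditional probability described above.
        """
        counts = defaultdict(float)
        totals = defaultdict(float)
        for a, b, c in phrase_pairs:
            counts[(a, b)] += c
            counts[(b, a)] += c
            totals[a] += c
            totals[b] += c
        graph = defaultdict(dict)
        for (a, b), c in counts.items():
            graph[a][b] = c / totals[a]
        return graph

    def bfs_subgraph(graph, query, d):
        """Approximate the full graph by breadth-first search from the query
        node down to depth d. All edges from a retained node to nodes outside
        the subgraph are collapsed into a single edge to an <OUTSIDE>
        pseudo-node, standing in for the periphery of the full graph.
        """
        depth = {query: 0}
        frontier = deque([query])
        while frontier:
            u = frontier.popleft()
            if depth[u] == d:
                continue
            for v in graph.get(u, {}):
                if v not in depth:
                    depth[v] = depth[u] + 1
                    frontier.append(v)
        sub = {}
        for u in depth:
            sub[u] = {}
            outside = 0.0
            for v, p in graph.get(u, {}).items():
                if v in depth:
                    sub[u][v] = p
                else:
                    outside += p
            if outside > 0.0:
                sub[u]["<OUTSIDE>"] = outside  # collapsed periphery edge
        return sub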

Once the graph is created, the authors sample M independent walks of length at most T starting from the query node. Using these samples, they estimate the truncated hitting time between the query node and every other node in the approximate graph G′. Because a node may have a large out-degree, and many of its edges may correspond to spurious word alignments, the authors limit the out-degree of each node to a fixed maximum. The hitting times then yield a list of all other nodes in the graph, sorted by hitting time with respect to the query node. Nodes whose hitting time equals the truncation limit T (i.e., nodes never reached within T steps) are additionally pruned.
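
A sketch of the Monte Carlo estimator, again ours and under the same assumptions as above: sample M walks of length at most T from the query node, record the first step at which each node is hit, and charge the ceiling value T for nodes a walk never reaches.

    import random
    from collections import defaultdict

    def estimate_hitting_times(graph, query, T, M, rng=random):
        """Estimate truncated hitting times from `query` by sampling M walks
        of length at most T, then return candidate nodes ranked by ascending
        estimated hitting time."""
        totals = defaultdict(float)
        for _ in range(M):
            first_hit = {}
            node = query
            for step in range(1, T + 1):
                nbrs = graph.get(node)
                if not nbrs:
                    break  # dead end (e.g. the collapsed periphery node)
                targets = list(nbrs)
                node = rng.choices(targets, weights=[nbrs[t] for t in targets])[0]
                first_hit.setdefault(node, step)
            for v in graph:
                totals[v] += first_hit.get(v, T)
        hits = {v: s / M for v, s in totals.items() if v != query}
        # Drop nodes never reached in any sampled walk (estimate equals T).
        return sorted((v for v in hits if hits[v] < T), key=hits.get)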

Lastly, the authors can add domain knowledge to the representation. They add three types of features: n-gram features, syntactic features, and substring/superstring features. For each feature they add a node; for example, for the n-gram "where is" they add a node representing that n-gram, and an edge is created between a phrase node and a feature node whenever the phrase contains the feature that node represents. This brings phrases that share features closer together by reducing the hitting time between them.
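
A hedged sketch of the n-gram feature nodes, to make the mechanism concrete; the fraction of probability mass routed through feature nodes (`mass` below) is an illustrative constant, not a value from the paper:

    from collections import defaultdict

    def add_ngram_feature_nodes(graph, phrases, n=2, mass=0.1):
        """Link each phrase to a node per n-gram it contains, so that phrases
        sharing an n-gram are only two steps apart on the graph, which lowers
        the hitting time between them."""
        incoming = defaultdict(dict)
        for phrase in phrases:
            tokens = phrase.split()
            ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
            if not ngrams:
                continue
            # Rescale the phrase's existing edges to make room for feature edges.
            for v in graph.get(phrase, {}):
                graph[phrase][v] *= 1.0 - mass
            for g in ngrams:
                node = "NGRAM:" + g
                graph.setdefault(phrase, {})[node] = mass / len(ngrams)
                incoming[node][phrase] = 1.0
        # Feature nodes walk back uniformly to the phrases that contain them.
        for node, edges in incoming.items():
            graph[node] = {p: 1.0 / len(edges) for p in edges}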

Baseline and Results

The authors used the Europarl dataset. Phrases were aligned using Giza++, and the alignments were further refined using higher-order aligners and additional techniques.

The primary method that the authors compare their paraphrasing approach against is the Bannard and Callison-Burch (BCB) approach from 2005 and 2008 (they actually compared against an extension of the basic BCB approach called syntactic bilingual phrases, SBP). The authors evaluated paraphrase quality on a 0-2 scale using human evaluators (with relatively high Cohen's kappa agreement). The comparison with the BCB approach was complicated by the fact that, for a given query sentence, the two systems return different numbers of paraphrases; to make the comparison fair, precision and recall were computed at both the minimum and the maximum of the number of paraphrases returned by the two systems. Their system always generated more paraphrases than BCB/SBP.
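
For concreteness, precision at a cutoff k (with k set first to the smaller and then to the larger of the two systems' output sizes) can be computed as below; this is our reading of the protocol, not the authors' evaluation code:

    def precision_at_k(ranked, judged_correct, k):
        """Fraction of the top-k returned paraphrases judged correct by the
        human evaluators; truncated if fewer than k were returned."""
        top = ranked[:k]
        return sum(p in judged_correct for p in top) / len(top) if top else 0.0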

The authors tried two variants: they experimented with including and excluding the domain-knowledge feature nodes, and with enforcing the graph to be bipartite. In general, eliminating the bipartite constraint (thereby allowing random walks of more than two steps, which generate more paraphrases) improved performance, as did incorporating domain knowledge.

(Figures omitted: Vssbp.png and Bipartite.png.)

Relevant Work

The authors discuss two bodies of related work: work that applies hitting times in other areas (for example, computer vision and collaborative filtering), and prior work on paraphrasing. They note that earlier paraphrasing work relied primarily on parallel monolingual corpora, which are not widely available.