Satpal and Sarawagi PKDD 2007

== Citation ==

Satpal, S. and Sarawagi, S. Domain adaptation of conditional probability models via feature subsetting. Proceedings of PKDD’07 (2007).

== Online Version ==

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.8784&rep=rep1&type=pdf]
 
== Summary ==
  
This [[Category::paper]] introduces a method for transfer learning that encourages a model trained in one domain to rely on the features it shares with a target domain. This is useful when one domain (such as newswire articles) has abundant training data but the domain of interest (such as blogs) has little.
  
The key challenge in transfer learning is reconciling the differences between the two domains' distributions. This method addresses it by searching for the subset of features, present in both domains, whose expected values are closest across them. For instance, when recognizing names of people, we might take capitalization to be a useful, perhaps even essential, feature, yet that feature may be unreliable in informal blogs and should be ignored.
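
As a rough illustration of the idea above (this is not the authors' code: the corpus format and feature extractor are hypothetical, and the actual method measures expectations under the model rather than simple token-level averages), the cross-domain gap for each feature could be estimated as follows:

<pre>
# Hypothetical Python sketch: score each feature by how far its average value in
# the labeled source corpus is from its average value in the unlabeled target
# corpus.  Features with a small gap are the "shared" ones worth keeping;
# capitalization, for example, would score badly against lower-cased blog text.
from collections import defaultdict

def mean_feature_values(docs, extract_features):
    """Average value of every feature over a corpus of token sequences."""
    totals, count = defaultdict(float), 0
    for tokens in docs:
        for tok in tokens:
            for name, value in extract_features(tok).items():
                totals[name] += value
            count += 1
    return {name: total / count for name, total in totals.items()}

def expectation_gap(source_docs, target_docs, extract_features):
    src = mean_feature_values(source_docs, extract_features)
    tgt = mean_feature_values(target_docs, extract_features)
    # Squared difference per feature: a large gap means the feature behaves
    # differently across domains and is a poor candidate for transfer.
    return {name: (src.get(name, 0.0) - tgt.get(name, 0.0)) ** 2
            for name in set(src) | set(tgt)}
</pre>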
  
The best feature subset is found not by exploring the power set of all feature combinations but by relaxing the problem into a soft selection, in which features that diverge greatly between the two domains are strongly down-weighted. The formulation closely resembles regularizing a standard [[UsesMethod::CRF]] with a Gaussian prior whose per-feature variance depends on how different that feature is between the domains.
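
The per-feature prior analogy can be sketched as follows (our own simplification, not the paper's exact objective; here data_loss stands for the usual CRF negative log-likelihood and gaps for divergence scores such as those computed above, aligned with the weight vector):

<pre>
# Hypothetical sketch of the "soft selection" described above: instead of one
# global Gaussian prior, each weight gets its own penalty that grows with the
# feature's cross-domain expectation gap, so divergent features are shrunk
# toward zero while well-shared features are left almost unregularized.
import numpy as np

def penalized_objective(weights, data_loss, gaps, lam=1.0):
    """data_loss: negative log-likelihood term of a standard CRF/logistic model.
    gaps: per-feature divergence scores, aligned with the weight vector."""
    penalty = lam * np.sum(gaps * weights ** 2)   # larger gap => stronger shrinkage
    return data_loss(weights) + penalty
</pre>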
  
The model can be trained with standard optimization approaches, but the objective function requires special handling: it is non-convex, and its exact gradient takes quadratic time to evaluate. The authors work around this with a nested iterative scheme that holds certain expectations constant during each round of inner iterations. The method is tested on several combinations of source and target domains and is shown to improve over unadapted models.
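
A minimal sketch of that nested scheme (again our own reconstruction from the description, not the paper's algorithm): the outer loop recomputes the target-domain expectations, and the inner loop optimizes the resulting fixed objective with an off-the-shelf optimizer.

<pre>
# Hypothetical sketch of the nested training loop described above.  The full
# objective is non-convex because the target-domain expectations depend on the
# weights; freezing them for each inner solve sidesteps that and avoids the
# quadratic-time gradient.
import numpy as np
from scipy.optimize import minimize

def train_nested(init_w, data_loss, target_expectations, source_expectations,
                 lam=1.0, outer_iters=10):
    w = init_w
    for _ in range(outer_iters):
        tgt = target_expectations(w)              # frozen during the inner solve
        gaps = (source_expectations - tgt) ** 2   # per-feature divergence
        inner = lambda v: data_loss(v) + lam * np.dot(gaps, v ** 2)
        w = minimize(inner, w, method="L-BFGS-B").x
        # (a convergence check on w or on the objective would go here)
    return w
</pre>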
  
 
== Related Papers ==
 
The authors compare their method to [[RelatedPaper::Blitzer, EMNLP 2006]] (structural correspondence learning) and find that their method performs better on the majority of the tasks. The two methods are orthogonal, however, and can be combined to yield even stronger performance.
  
The method bears some resemblance to generalized expectation criteria [[RelatedPaper::Mann, ACL 2008]], which also use unlabeled data (though from the same domain) and constrain feature expectations; there, however, the constraints are supplied by experts.
  
[[RelatedPaper::Do and Ng, ANIPS 2006]] present a transfer learning approach that uses softmax regression to train a meta-learner that is effective across domains.
