Tsochantaridis, Joachims, Support vector machine learning for interdependent and structured output spaces 2004
Citation
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning, pages 104–111, Banff, Alberta, Canada, July 2004.
Online Version

http://www.cs.cornell.edu/people/tj/publications/tsochantaridis_etal_04a.pdf
Citation
D. Freitag and A. McCallum. Information extraction using HMMs and shrinkage. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, 1999.
Online
http://www.cs.umass.edu/~mccallum/papers/ieshrink-aaaiws99.pdf
Summary
In this paper the authors propose the use of HMMs for the task of information extraction, using "shrinkage" to obtain robust estimates of the HMM's word-emission probabilities when training data is limited. Experiments are then performed on real-world data sets, and it is shown that shrinkage outperforms absolute discounting.
Method
There is always a tradeoff between the expressive power of a model (its complexity) and the reliability of its parameter estimates: a complex model needs a large amount of data for robust parameter estimation, whereas a simple model is not expressive enough. The paper uses "shrinkage" to balance this tradeoff. Parameter estimates from the complex model (whose states are data-sparse) are combined with parameter estimates from simpler models using a weighted average, with the weights learned by Expectation-Maximization.
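As a concrete illustration, the sketch below fits such mixture weights by EM on held-out words, assuming two fixed component estimators (a complex per-state model and a simple parent-level one). All function names and data here are invented for illustration, not taken from the paper.

<pre>
# Minimal sketch of fitting two interpolation weights by EM on held-out data.
# The component estimators are held fixed; EM only re-estimates how much
# weight each deserves. All names here are illustrative, not from the paper.

def em_mixture_weights(heldout_words, p_complex, p_simple, iters=50):
    """p_complex / p_simple: dicts mapping word -> probability estimate."""
    w = [0.5, 0.5]  # initial mixture weights
    for _ in range(iters):
        # E-step: expected responsibility of each component for each word
        resp = [0.0, 0.0]
        for word in heldout_words:
            probs = [w[0] * p_complex.get(word, 0.0),
                     w[1] * p_simple.get(word, 0.0)]
            z = sum(probs) or 1.0  # guard against zero total mass
            resp[0] += probs[0] / z
            resp[1] += probs[1] / z
        # M-step: new weights proportional to accumulated responsibility
        total = sum(resp) or 1.0
        w = [r / total for r in resp]
    return w

# Example: the held-out words look more like the simple (parent) model,
# so its weight should grow over the iterations.
complex_est = {"cmu": 0.8, "university": 0.2}
simple_est  = {"cmu": 0.3, "university": 0.3, "pittsburgh": 0.4}
print(em_mixture_weights(["pittsburgh", "university", "cmu"],
                         complex_est, simple_est))
</pre>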
Shrinkage is defined over a hierarchical structure of states. States share a common parent if they are assumed to be drawn from the same word distribution, so in the simpler model those states can be represented by their parent state. The parameter estimate at a leaf is then a linear interpolation of all the estimates along the path from the leaf to the root. The probability estimate for node <math>s_j</math> is

<math>P(w|s_j) = w_0 L(x) + \sum_{i=1}^{k} w_j^i \, P(w|s_j^i)</math>
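As an illustration of the formula, the sketch below computes the interpolated estimate for one leaf state, reading the <math>w_0 L(x)</math> term as a uniform background distribution over the vocabulary; that reading, and all names below, are assumptions made for this sketch rather than details from the paper.

<pre>
# Sketch of the shrinkage estimate for one leaf state: a weighted average of
# the word distributions along the leaf-to-root path plus a background term.
# The w_0 * L(x) term is read here as a uniform distribution (an assumption).

def shrinkage_estimate(word, weights, path_estimates, vocab_size):
    """
    weights: [w_0, w_1, ..., w_k] summing to 1; w_0 scales the uniform term.
    path_estimates: dicts mapping word -> P(w | s_j^i), ordered leaf to root.
    """
    p = weights[0] * (1.0 / vocab_size)   # w_0 * L(x), uniform background
    for w_i, est in zip(weights[1:], path_estimates):
        p += w_i * est.get(word, 0.0)     # w_j^i * P(w | s_j^i)
    return p

# Example: a sparse leaf estimate smoothed toward its parent and the uniform.
leaf   = {"cmu": 1.0}                     # the leaf state saw only one word
parent = {"cmu": 0.2, "university": 0.3, "pittsburgh": 0.5}
print(shrinkage_estimate("pittsburgh", [0.1, 0.5, 0.4],
                         [leaf, parent], vocab_size=3))
# -> 0.1/3 + 0.5*0.0 + 0.4*0.5 = 0.2333...; the leaf alone would give 0.
</pre>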