Freitag et al AAAI-99, Information extraction using HMMs and shrinkage

From Cohen Courses

Citation

D. Freitag and A. McCallum. Information extraction using HMMs and shrinkage. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31-36, 1999.

Online

http://www.cs.umass.edu/~mccallum/papers/ieshrink-aaaiws99.pdf

Summary

In this paper, the authors propose hidden Markov models (HMMs) for the task of information extraction and use "shrinkage" to estimate the word emission distributions of HMM states when training data is limited. Experiments on real-world data sets show that shrinkage outperforms absolute discounting.

Method

There is always a tradeoff between the expressive power (complexity) of a model and the reliability of its parameter estimates: a complex model needs a large amount of data for robust parameter estimation, whereas a simple model is not expressive enough. The paper uses "shrinkage" to balance this tradeoff. Parameter estimates from the complex model (whose states are data-sparse) are combined with parameter estimates from simpler models in a weighted average, with the weights learned by Expectation Maximization (EM).

Shrinkage is defined over a hierarchical structure: states have a common parent if they are assumed to draw their words from the same distribution, so in the simpler model those states can be represented by the parent state. The paper considers four topologies (a small code sketch follows the list):

  • None - Only absolute discounting is used.
  • Uniform - Each single-state distribution is shrunk toward the uniform distribution.
  • Global - All target states share one common parent and all non-target states share another, and both parents are in turn children of the uniform distribution.
  • Hierarchical - Same as Global, but the non-target states have class-specific parents.
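
To make the topologies concrete, here is a minimal sketch that encodes each state's shrinkage path (leaf up to the uniform root) as a list of node names. The function and all node names are hypothetical, chosen only to illustrate the four structures above.

```python
# Hypothetical encoding of the four shrinkage topologies described above.
# A "path" lists the nodes whose estimates get interpolated for one state,
# leaf first, ending at the uniform distribution.

def shrinkage_path(name, is_target, cls, topology):
    """Return the interpolation path for a state (illustrative names only)."""
    if topology == "none":
        return [name]                          # absolute discounting only
    if topology == "uniform":
        return [name, "uniform"]               # shrink straight to uniform
    if topology == "global":
        parent = "all_targets" if is_target else "all_non_targets"
        return [name, parent, "uniform"]
    if topology == "hierarchical":
        parent = "all_targets" if is_target else "non_targets_" + cls
        return [name, parent, "uniform"]
    raise ValueError("unknown topology: " + topology)

# Example: a non-target prefix state in the "speaker" extraction class
print(shrinkage_path("prefix_1", False, "speaker", "hierarchical"))
# -> ['prefix_1', 'non_targets_speaker', 'uniform']
```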

The parameter estimate at a leaf is therefore a linear interpolation of all the estimates on the path from the root. The probability estimate for term t at node s_j is

\hat{P}(t \mid s_j) = \sum_i \lambda_i^j \, P(t \mid s_j^i)

where s_j^i is the ith node on the path from the root to node s_j and \lambda_i^j is the weight of that node, learned by EM. The weights are chosen to maximize the likelihood of some held-out data: a word is assumed to be generated by first choosing a node on the path to the root with some probability (that node's weight) and then drawing from the word distribution at that node. EM maximizes this likelihood.
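
A minimal sketch of this interpolation, assuming each node's word distribution and its weight are stored in plain dictionaries (`node_dists`, `weights`, and `path` are illustrative names, not from the paper):

```python
def shrunk_estimate(t, path, weights, node_dists):
    """Interpolated estimate of P(t | s_j): a weighted average of the
    maximum-likelihood estimates at every node on the leaf's path."""
    return sum(weights[n] * node_dists[n].get(t, 0.0) for n in path)
```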

E step (the degree to which each node predicts the terms in state s_j's held-out data H_j):

\beta_i^j = \sum_{t \in H_j} \frac{\lambda_i^j \, P(t \mid s_j^i)}{\sum_k \lambda_k^j \, P(t \mid s_j^k)}

M step (normalize the betas to obtain the new weights):

\lambda_i^j = \frac{\beta_i^j}{\sum_k \beta_k^j}
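
The loop below sketches this EM procedure under the same assumptions as the earlier snippets (dictionary-based distributions, illustrative names); it follows the standard deleted-interpolation recipe that the formulas above describe, not the authors' actual implementation.

```python
def learn_weights(path, node_dists, held_out, iters=50):
    """Learn the mixture weights lambda for one leaf state by EM.

    `held_out` maps each term t in H_j to its count in the held-out data.
    """
    lam = {n: 1.0 / len(path) for n in path}   # start from uniform weights
    for _ in range(iters):
        # E step: beta[n] = expected share of node n in generating H_j
        beta = {n: 0.0 for n in path}
        for t, count in held_out.items():
            z = sum(lam[n] * node_dists[n].get(t, 0.0) for n in path)
            if z == 0.0:
                continue                        # no node explains this term
            for n in path:
                beta[n] += count * lam[n] * node_dists[n].get(t, 0.0) / z
        # M step: renormalize the betas to obtain the new weights
        total = sum(beta.values())
        if total > 0.0:
            lam = {n: beta[n] / total for n in path}
    return lam
```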

Experiments

Experiments were performed on information extraction tasks over two corpora: a collection of seminar announcements and a collection of articles describing corporate acquisitions. The paper first compares shrinkage with absolute discounting across various configurations; shrinkage outperforms absolute discounting, and the Global topology performs best. The paper also compares the HMM approach with a rule-based method for information extraction, and the HMM outperforms the rule-based systems.