Jin et al, 2009

From Cohen Courses
Revision as of 13:07, 1 December 2010 by PastStudents

Citation

Jin, W., Ho, H., Srihari, R., 2009. OpinionMiner: A Novel Machine Learning System for Web Opinion Mining and Extraction. KDD'09.

Online version

[[1]]

Summary

This paper introduces a system that mines customer reviews of a product and extracts product features from them. The system returns the opinion expressions extracted from the reviews together with their opinion direction (positive or negative). Opinion mining has been studied widely in the machine learning and information extraction communities, and most previous approaches use statistical or rule-based learning to extract opinion expressions. Jin et al. introduce a new technique that uses lexicalized HMMs for opinion mining.

System Architecture

The architecture of their system is as follows:

- Pre-processing: The system first crawls pages from the Web, cleans the HTML files, and segments the text into sentences. The paper does not describe how the reviews of a product are extracted; it assumes that the reviews are given as input to the learning system.

- Entity types and tag sets: They define four entity types for each product review: components (e.g. a physical object of a camera), functions (e.g. zoom in a camera), features (e.g. color), and opinions (e.g. ideas and thoughts). For each of these types they define a set of tags that are used in the annotation process.

- Lexicalized HMMs: Given a product review as input, the goal of the lexicalized HMM is to assign an appropriate tag to each part of the review. For classification they choose the tag sequence that maximizes the conditional probability $P(T \mid W, S)$, where T is the sequence of tags to be assigned to the parts of the review, W is the sequence of words in the review, and S is the sequence of POS tags of those words. They use MLE to learn the parameters of the model.

- Information propagation: The goal of this step is to reduce the amount of training data the system requires. Suppose the sentence "Good picture quality" appears in a review in the training data, with the word "good" tagged as "<opinion_pos_exp>". The system adds more information by looking up "good" in a dictionary and substituting it with its synonyms. This idea is applied to all the words in the training data to increase the number of examples.

- Bootstrapping: The main contribution of this system is the bootstrapping step. The idea is to partition the training set into two disjoint sets and train an HMM on each of them. Then, for each instance of the test data (which has not been annotated by a human), if the two HMMs assign the same class to the input review and the confidence value is above a threshold T, the instance is added to the training examples.
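
The bootstrapping step above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the two trained HMMs are replaced by hypothetical keyword taggers, every name (`bootstrap_round`, `keyword_tagger`, the labels) is invented for the example, and "the confidence value is above a threshold" is read as requiring both models to exceed it.

```python
from typing import Callable, List, Set, Tuple

# A tagger maps a sentence to (predicted_label, confidence in [0, 1]).
Tagger = Callable[[str], Tuple[str, float]]
Example = Tuple[str, str]  # (sentence, label)


def bootstrap_round(
    tagger_a: Tagger,
    tagger_b: Tagger,
    unlabeled: List[str],
    threshold: float,
) -> Tuple[List[Example], List[str]]:
    """Return (newly labeled examples, sentences that stay unlabeled)."""
    accepted: List[Example] = []
    remaining: List[str] = []
    for sentence in unlabeled:
        label_a, conf_a = tagger_a(sentence)
        label_b, conf_b = tagger_b(sentence)
        # Assumption: accept only if both models agree and both
        # confidences are strictly above the threshold.
        if label_a == label_b and min(conf_a, conf_b) > threshold:
            accepted.append((sentence, label_a))
        else:
            remaining.append(sentence)
    return accepted, remaining


def keyword_tagger(positive_words: Set[str], confidence: float) -> Tagger:
    """Hypothetical stand-in for an HMM trained on one half of the data."""
    def tag(sentence: str) -> Tuple[str, float]:
        words = set(sentence.lower().split())
        label = "positive" if words & positive_words else "negative"
        return label, confidence
    return tag


tagger_a = keyword_tagger({"good", "great"}, 0.9)
tagger_b = keyword_tagger({"good", "excellent"}, 0.8)
accepted, remaining = bootstrap_round(
    tagger_a,
    tagger_b,
    ["good picture quality", "great zoom", "blurry photos"],
    threshold=0.7,
)
print(accepted)
print(remaining)
```

In this toy run the two taggers disagree on "great zoom", so it stays unlabeled, while the other two sentences would be added to the training set and both models retrained before the next round.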

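The information propagation step described above can be illustrated with a short Python sketch. It assumes a hypothetical toy synonym dictionary (`SYNONYMS`) and a placeholder tag `<feature>` for the non-opinion words; only the tag `<opinion_pos_exp>` comes from the example in the summary, and the paper may generate variants differently.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, tag) pairs

# Hypothetical toy synonym dictionary, for illustration only.
SYNONYMS: Dict[str, List[str]] = {"good": ["great", "nice"]}


def propagate(
    sentence: TaggedSentence, synonyms: Dict[str, List[str]]
) -> List[TaggedSentence]:
    """Return the original sentence plus one variant per synonym substitution."""
    variants: List[TaggedSentence] = [sentence]
    for i, (word, tag) in enumerate(sentence):
        for alt in synonyms.get(word.lower(), []):
            variant = list(sentence)
            variant[i] = (alt, tag)  # the synonym inherits the original tag
            variants.append(variant)
    return variants


sentence = [
    ("Good", "<opinion_pos_exp>"),
    ("picture", "<feature>"),  # placeholder tag, not taken from the paper
    ("quality", "<feature>"),  # placeholder tag, not taken from the paper
]
variants = propagate(sentence, SYNONYMS)
for v in variants:
    print(" ".join(word for word, _ in v))
```

Each variant keeps the annotation of the word it replaces, which is how a single annotated sentence can yield several training examples.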

The intuition behind their technique is to use global features to infer rules about the local features. For example, suppose that we know the titles of a set of books. By searching for those known titles on Amazon.com web pages, we can infer the position and font of a book title. We can then use these two features (the position and font of book titles in web pages) to extract new book titles from other web pages.
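
The book-title intuition can be made concrete with a toy Python sketch. It is a deterministic simplification, not the probabilistic model of the paper: the `Span` type, the font and position values, and the sample pages are all hypothetical, and the "rule" is simply the most frequent (font, position) pair among spans that match a known title.

```python
from collections import Counter
from typing import List, NamedTuple, Set, Tuple


class Span(NamedTuple):
    text: str
    font: str
    position: str


def infer_title_format(page: List[Span], known_titles: Set[str]) -> Tuple[str, str]:
    """Most common (font, position) among spans whose text is a known title."""
    counts = Counter((s.font, s.position) for s in page if s.text in known_titles)
    if not counts:
        raise ValueError("no known title found on this page")
    return counts.most_common(1)[0][0]


def extract_titles(page: List[Span], title_format: Tuple[str, str]) -> List[str]:
    """Return the text of every span that matches the inferred formatting."""
    return [s.text for s in page if (s.font, s.position) == title_format]


# Hypothetical pages from the same site, so they share local formatting.
seen_page = [
    Span("Dune", "bold", "top"),
    Span("Frank Herbert", "regular", "top"),
    Span("$9.99", "regular", "side"),
]
new_page = [
    Span("Neuromancer", "bold", "top"),
    Span("William Gibson", "regular", "top"),
]

title_format = infer_title_format(seen_page, {"Dune"})
new_titles = extract_titles(new_page, title_format)
print(title_format, new_titles)
```

Here the known title (the global knowledge) reveals that titles on this site are bold and at the top of the page (the local rule), which is then enough to pick out a title that was not in the original list.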

They describe both generative and discriminative approaches for classification and extraction tasks. Global features are governed by parameters that are shared by all the data, whereas local features are shared only by a subset of the data. For example, in an information extraction task, all the words in a web page (without considering formatting) can be considered global features. On the other hand, features such as the position or color of a piece of text are local features.

In the generative model, they model each document by introducing a random variable that governs the local features. The parameters of the model are:

- N words of documents are shown by

- Formatting features are shown by

- Class labels are shown by

The model can be shown by the following joint distribution over local parameters, class labels, words, and formatting features:

The parameters are estimated using maximum likelihood estimation on a set of training documents. For inference, one approach is to approximate the parameters with a point estimate and infer the class labels using MAP estimation. We can label each pair by the following formula:

can be approximated by . They use the EM algorithm to maximize the expected log-likelihood of the formatting features.

They tested their method on two different datasets. The first dataset contains 1000 HTML documents. Each document is automatically divided into sets of words with similar layout characteristics, and each set is then hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as those explained above. The second dataset contains 42,548 web pages from 330 web sites, where each web page is hand-labeled as being a press release or not. The global features are the set of words in each web page, and the local feature is the URL of the web page. Their experimental results show that this approach can obtain high precision and low to moderate recall.