Jin et al, 2009
Citation
Jin, W., Ho, H., Srihari, R., 2009, OpinionMiner: A Novel Machine Learning System for Web Opinion Mining and Extraction, KDD'09
Online version
[[1]]
Summary
This paper introduces a system that mines customer reviews of a product and extracts product features from the reviews. The system returns the opinion expressions extracted from the reviews as well as their opinion orientation (positive or negative). Opinion mining has been studied widely in the machine learning and information extraction communities, and most existing approaches use statistical or rule-based learning to extract opinion expressions. In this work, Jin et al. introduce a new technique that uses lexicalized HMMs for opinion mining.
System Architecture
The architecture of their system is as follows:
- Pre-processing: The system first crawls web pages from the Web, cleans the crawled HTML files, and segments all the sentences. The technique used to extract the reviews of a product from the input web pages is not described in the paper; for the remaining parts of the architecture, the authors assume that the reviews have already been extracted from the web pages and are given to the system.
- Entity types and tag sets: They define four entity types for each product review: components (e.g. the physical objects of a camera), functions (e.g. zoom in a camera), features (e.g. color), and opinions (e.g. ideas and thoughts). For each of these types they define a set of tags that are used in the annotation process.
- Lexicalized HMMs: Given a product review as input, the goal of the lexicalized HMM is to assign the appropriate tag type to each part of the review. For classification, they maximize the conditional probability <math>P(T|W,S)</math>, where <math>T</math> is the sequence of tags to be assigned to the different parts of a product review, <math>W</math> is the set of all words in the review, and <math>S</math> is the POS tag of each word. They use MLE to learn the parameters of the model (a decoding sketch is given after this list).
- Information propagation: The goal of this part is to decrease the amount of training data that the system requires. Suppose the sentence "Good picture quality" appears as part of a review in the training data, with the word "good" tagged as "<opinion_pos_exp>". The system then creates new training data by looking up a dictionary and substituting the word "good" with its synonyms; for example, the new sentence "great picture quality" can be added as a new training example. This idea is applied to all the words in the training data to increase the number of examples (see the propagation sketch after this list).
- Bootstrapping: The main contribution of this paper is the bootstrapping part. The idea is to partition the training set into two disjoint sets and train an HMM on each of them. Then, for each instance of the test data (which has not been annotated by a human), if the two HMMs produce the same classification and the confidence value is above a threshold T, the new instance is added to the training set. This idea can significantly decrease the amount of time a human must spend annotating training data (a sketch of the loop is given below).
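The paper specifies only that the tag sequence maximizing <math>P(T|W,S)</math> is chosen and that the parameters are estimated by MLE; the lexicalized HMM itself uses a richer factorization than a plain first-order HMM. The Python sketch below therefore illustrates just the decoding step under simplifying assumptions: a first-order HMM whose states are tags and whose observations are (word, POS) pairs, with a tiny fallback probability standing in for real smoothing. The function and parameter names are illustrative, not the paper's.

<pre>
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, unk=1e-8):
    """Viterbi decoding for a first-order HMM over one review sentence.
    tokens : list of (word, POS) pairs (assumed non-empty)
    tags   : list of candidate tag names, e.g. "<opinion_pos_exp>"
    start_p[t], trans_p[prev][t], emit_p[t][(word, POS)] : MLE-estimated probabilities
    unk    : tiny fallback probability for unseen events (stand-in for real smoothing)"""
    # best[i][t]: log-probability of the best tag sequence ending in tag t at position i
    best = [{t: math.log(start_p.get(t, unk)) +
                math.log(emit_p.get(t, {}).get(tokens[0], unk)) for t in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({}); back.append({})
        for t in tags:
            score, prev = max(
                (best[i - 1][p] + math.log(trans_p.get(p, {}).get(t, unk)) +
                 math.log(emit_p.get(t, {}).get(tokens[i], unk)), p)
                for p in tags)
            best[i][t], back[i][t] = score, prev
    # Follow the back-pointers from the best final tag to recover the full tag sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
</pre>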
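The information propagation step can be pictured with a short sketch. It assumes the training data is stored as lists of (word, tag) pairs and that a synonym dictionary is available; the tag "<opinion_pos_exp>" comes from the example above, while "<prod_feature>" and the data format are assumptions made only for illustration.

<pre>
def propagate(tagged_sentences, synonyms):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    synonyms: dict mapping a word to a list of its dictionary synonyms.
    Returns new sentences in which one word is replaced by a synonym,
    the synonym inheriting the original word's tag."""
    expanded = []
    for sent in tagged_sentences:
        for i, (word, tag) in enumerate(sent):
            for syn in synonyms.get(word.lower(), []):
                new_sent = list(sent)
                new_sent[i] = (syn, tag)      # keep the tag, swap the word
                expanded.append(new_sent)
    return expanded

# Example from the text: "Good picture quality", with "good" tagged <opinion_pos_exp>
# ("<prod_feature>" is a made-up tag name used only for this illustration).
train = [[("Good", "<opinion_pos_exp>"),
          ("picture", "<prod_feature>"),
          ("quality", "<prod_feature>")]]
print(propagate(train, {"good": ["great"]}))
# -> [[('great', '<opinion_pos_exp>'), ('picture', '<prod_feature>'), ('quality', '<prod_feature>')]]
</pre>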
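The bootstrapping loop might look roughly like the sketch below, assuming hypothetical train_hmm and decode callables and a per-sentence confidence score; the paper's actual split strategy, confidence measure, and stopping criterion may differ.

<pre>
def bootstrap(labeled, unlabeled, train_hmm, decode, threshold, rounds=5):
    """labeled  : list of tagged sentences (the small hand-annotated set)
    unlabeled: list of untagged sentences (token lists)
    train_hmm(data) -> model                  (hypothetical training callable)
    decode(model, sent) -> (tags, confidence) (hypothetical decoding callable)"""
    pool = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        # Train one HMM on each disjoint half of the current labeled pool.
        half = len(pool) // 2
        m1, m2 = train_hmm(pool[:half]), train_hmm(pool[half:])
        still_unlabeled, added = [], 0
        for sent in remaining:
            tags1, conf1 = decode(m1, sent)
            tags2, conf2 = decode(m2, sent)
            # Accept only if both models agree and both are confident enough.
            if tags1 == tags2 and min(conf1, conf2) >= threshold:
                pool.append(list(zip(sent, tags1)))
                added += 1
            else:
                still_unlabeled.append(sent)
        remaining = still_unlabeled
        if added == 0:            # nothing new passed the filter; stop early
            break
    return train_hmm(pool)        # final model trained on the enlarged pool
</pre>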
Evaluation Results
They have tested their system on reviews of different cameras collected from Amazon.com. They manually annotated the reviews of 6 cameras to use as training data, and the system is evaluated using 4-fold cross-validation. They use the system developed by Turney, 2002 as the baseline for comparison. The results show that their system can improve the accuracy of mining opinion expressions by a factor of 2 compared to the baseline system.