Difference between revisions of "Hu and Liu, AAAI 2004"


Revision as of 08:28, 27 September 2012

This is a summary of a research paper, written as part of Social Media Analysis 10-802, Fall 2012.

Citation

M. Hu and B. Liu. Mining Opinion Features in Customer Reviews. In Proceedings of the Nineteenth National Conference on Artificial Intelligence. 2004.

Online Version

Direct PDF link

Abstract from the paper

It is a common practice that merchants selling products on the Web ask their customers to review the products and associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds. This makes it difficult for a potential customer to read them in order to make a decision on whether to buy the product. In this project, we aim to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we are only interested in the specific features of the product that customers have opinions on and also whether the opinions are positive or negative. We do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in the classic text summarization. In this paper, we only focus on mining opinion/product features that the reviewers have commented on. A number of techniques are presented to mine such features. Our experimental results show that these techniques are highly effective.

Summary

This paper proposes techniques for feature-based opinion summarization of customer reviews of products sold on e-commerce websites such as Amazon.com. The authors propose to perform this task in two steps:

  1. Identifying the product features (such as size or picture quality for a camera) and listing them according to their frequency of occurrence and the opinions customers express.
  2. For each feature, identifying the customer reviews that express a positive or negative opinion.

The paper focuses mainly on the first task: finding the product features on which customers have expressed an opinion. The authors also note that their approach differs from traditional text summarization because it produces a more structured summary of the reviews and limits the summary to opinions expressed about product features.

The paper lists some common problems with obtaining the list of product features from manufacturers or sellers:

  • A manufacturer/seller may not be able to provide an exhaustive list of features for its entire catalog.
  • The manufacturer/seller and the customer/reviewer may use different terms for the same feature, which can lead to ambiguity.
  • A manufacturer/seller may not reveal all the product features.
  • A customer/reviewer may express an opinion about features that are missing from a product.

Proposed Techniques

The paper proposes a system that, for a given set of inputs, crawls the reviews to build a review database. It then identifies the product features from the review database and finally, for each feature, determines the polarity of the opinion expressed in the reviews. The authors restrict themselves to explicitly mentioned product features, using the NLProcessor parser and POS tagger to obtain the candidate noun phrases.

Frequent Features Generation

The system uses CBA, an association rule miner based on the Apriori algorithm, to find the frequent itemsets/features, i.e. those that occur above a user-defined threshold (1% of the total number of reviews in this paper). It considers only itemsets/features of at most three words. This gives a list of candidate features from the reviews. The system then prunes irrelevant and redundant features from the candidate list using the following two methods:

  1. Compactness Pruning: For each candidate feature, the system checks the distances between the feature's words in each sentence where the feature appears. If the distances are below a threshold in at least a certain number of those sentences, the feature is kept; otherwise it is rejected.
  2. Redundancy Pruning: The paper defines the p-support of a feature as the number of sentences in which the feature (a single word) appears without any of its superset feature phrases. Features with p-support below a threshold are pruned from the list.
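The steps above can be illustrated with a small sketch. It is not CBA and not the authors' code: it is a plain level-wise (Apriori-style) search written for illustration. The function names (frequent_itemsets, is_compact, p_support), the parameters max_gap and min_sentences with their default values, the NOUNS set, and the toy sentences are all assumptions. The sketch also assumes that each sentence's nouns form one transaction, and that a phrase is "present" in a sentence when all of its words occur there.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=3):
    """Level-wise search for word sets found in >= min_support of transactions."""
    sets = [frozenset(t) for t in transactions]
    min_count = min_support * len(sets)
    counts = Counter(w for s in sets for w in s)
    level = {frozenset([w]): c for w, c in counts.items() if c >= min_count}
    frequent = {}
    for size in range(2, max_size + 2):
        frequent.update(level)
        if not level or size > max_size:
            break
        # Join frequent (size-1)-sets, keeping candidates whose subsets are all frequent.
        candidates = {a | b for a, b in combinations(list(level), 2) if len(a | b) == size}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, size - 1))}
        level = {}
        for c in candidates:
            count = sum(1 for s in sets if c <= s)
            if count >= min_count:
                level[c] = count
    return frequent

def is_compact(feature, sentences, max_gap=3, min_sentences=2):
    """Keep a phrase if its words sit close together in enough sentences.
    max_gap and min_sentences are illustrative defaults (assumed)."""
    compact = 0
    for words in sentences:
        if not set(feature) <= set(words):
            continue
        # Simplification: use the first occurrence of each word.
        positions = sorted(words.index(w) for w in feature)
        if all(b - a <= max_gap for a, b in zip(positions, positions[1:])):
            compact += 1
    return compact >= min_sentences

def p_support(word, sentences, feature_phrases):
    """Count sentences containing `word` but none of its superset phrases."""
    supersets = [f for f in feature_phrases if word in f and len(f) > 1]
    return sum(1 for words in sentences
               if word in words and not any(set(f) <= set(words) for f in supersets))

# Toy data (hypothetical): tokenized sentences and a hand-made noun list.
sentences = [
    "battery life is great".split(),
    "the battery life lasts long".split(),
    "battery died and ruined my life".split(),
    "picture quality is sharp".split(),
    "life is busy".split(),
    "life is short".split(),
]
NOUNS = {"battery", "life", "picture", "quality"}
transactions = [[w for w in s if w in NOUNS] for s in sentences]

frequent = frequent_itemsets(transactions, min_support=0.3, max_size=3)
phrases = [tuple(sorted(f)) for f in frequent if len(f) > 1]
compact = {p: is_compact(p, sentences) for p in phrases}
support = {w: p_support(w, sentences, phrases) for w in ("battery", "life")}
```

In this toy run, "battery" never appears without "life", so its p-support is 0 and it would be pruned under any positive threshold, while "life" keeps a p-support of 2 from the two sentences where it occurs alone.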

Opinion Words Extraction and Infrequent Feature Identification

The system finds the adjectives near a frequent feature in a given sentence and treats each such adjective as an opinion word describing that feature. In this way, it builds a list of opinion words. The paper then proposes to use this list to identify infrequent features, on the assumption that the same opinion word can describe many features, both frequent and infrequent. Based on this assumption, the system identifies the noun phrases near an opinion word in a sentence and treats them as infrequent features. The experimental results show that infrequent features make up around 15-20% of all the features identified for a product.
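A minimal sketch of this two-pass idea follows. It assumes sentences that are already POS-tagged with Penn Treebank style tags (JJ for adjectives, NN for nouns) and it interprets "near" as "within a fixed token window". The function names, the window parameter, and the toy data are hypothetical and are not taken from the paper.

```python
def extract_opinion_words(tagged_sentences, frequent_features, window=3):
    """Collect adjectives that occur within `window` tokens of a frequent feature."""
    opinion_words = set()
    for sent in tagged_sentences:
        feature_pos = [i for i, (w, _) in enumerate(sent) if w in frequent_features]
        for i, (word, tag) in enumerate(sent):
            if tag.startswith("JJ") and any(abs(i - j) <= window for j in feature_pos):
                opinion_words.add(word)
    return opinion_words

def find_infrequent_features(tagged_sentences, frequent_features, opinion_words, window=3):
    """Collect nouns near a known opinion word that are not already frequent features."""
    infrequent = set()
    for sent in tagged_sentences:
        opinion_pos = [i for i, (w, _) in enumerate(sent) if w in opinion_words]
        for i, (word, tag) in enumerate(sent):
            if (tag.startswith("NN") and word not in frequent_features
                    and any(abs(i - j) <= window for j in opinion_pos)):
                infrequent.add(word)
    return infrequent

# Toy input (hypothetical): each sentence is a list of (word, tag) pairs.
tagged = [
    [("the", "DT"), ("picture", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("the", "DT"), ("software", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("great", "JJ"), ("battery", "NN")],
    [("the", "DT"), ("strap", "NN"), ("is", "VBZ"), ("horrible", "JJ")],
]
frequent_features = {"picture", "battery"}

opinion_words = extract_opinion_words(tagged, frequent_features)
infrequent = find_infrequent_features(tagged, frequent_features, opinion_words)
```

Here "software" is picked up because it shares the opinion word "amazing" with the frequent feature "picture". The noun "strap" is missed because "horrible" never appears near a frequent feature, which shows how the approach depends on the coverage of the opinion word list.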

Evaluation

Related Papers

Study Plan

  • NLProcessor
  • Association Rule mining
  • Apriori algorithm