Hu and Liu, AAAI 2004
This is a summary of a research paper, written as part of Social Media Analysis 10-802, Fall 2012.
Citation
M. Hu and B. Liu. Mining Opinion Features in Customer Reviews. In Proceedings of Nineteenth National Conference on Artificial Intelligence. 2004.
Online Version
Abstract from the paper
It is a common practice that merchants selling products on the Web ask their customers to review the products and associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds. This makes it difficult for a potential customer to read them in order to make a decision on whether to buy the product. In this project, we aim to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we are only interested in the specific features of the product that customers have opinions on and also whether the opinions are positive or negative. We do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in the classic text summarization. In this paper, we only focus on mining opinion/product features that the reviewers have commented on. A number of techniques are presented to mine such features. Our experimental results show that these techniques are highly effective.
Summary
Overview
This paper proposes techniques for feature-based opinion summarization of customer reviews of products sold on e-commerce websites such as Amazon.com. The authors propose to perform this task in two steps:
- Identifying the product features (such as size or picture quality for a camera) and listing them by their frequency of occurrence and by the opinions customers express about them.
- For each feature, identifying the customer reviews that express a positive or negative opinion.
In this paper, they mainly focus on the first task - finding the product features for which customers have expressed some opinion. They also mention that their approach is different from traditional text summarization as they provide a more structured summary of reviews and also limit the summary to the opinions expressed about product features.
The paper lists some common problems with obtaining the list of product features from manufacturers or sellers:
- Manufacturer/Seller may not be able to provide an exhaustive list of features for the entire catalog.
- Manufacturer/Seller and customer/reviewer may use different terms for the same feature, which can lead to ambiguity.
- Manufacturer/Seller may not reveal all the product features.
- Customer/Reviewer may express an opinion about features that are missing from a product.
Proposed Techniques
The paper proposes a system that, for a given set of inputs, crawls the reviews to build a review database. It then identifies the product features from the review database and finally, for each feature, determines the polarity of the opinions expressed in the reviews. The authors restrict themselves to explicitly mentioned product features, using the NLProcessor parser and POS tagger to obtain the candidate noun phrases.
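The candidate extraction step can be sketched as follows. This is only an illustration, not the authors' code: it assumes the sentences have already been POS-tagged (the paper uses NLProcessor; the tags below are hand-written in Penn Treebank style), and the function name `noun_transactions` and the idea of turning each sentence into one "transaction" of nouns are assumptions made for the example.

```python
from typing import List, Set, Tuple

# A sentence is a list of (token, POS tag) pairs. The tags here are written
# by hand for illustration; a real pipeline would obtain them from a tagger.
TaggedSentence = List[Tuple[str, str]]


def noun_transactions(tagged_sentences: List[TaggedSentence]) -> List[Set[str]]:
    """Turn each tagged sentence into a set of its distinct nouns.

    Sentences without any noun are dropped, since they cannot
    contribute a candidate feature.
    """
    transactions = []
    for sentence in tagged_sentences:
        nouns = {tok.lower() for tok, tag in sentence if tag.startswith("NN")}
        if nouns:
            transactions.append(nouns)
    return transactions


tagged = [
    [("The", "DT"), ("picture", "NN"), ("quality", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("Battery", "NN"), ("life", "NN"), ("is", "VBZ"), ("short", "JJ")],
    [("I", "PRP"), ("love", "VBP"), ("it", "PRP")],
]
transactions = noun_transactions(tagged)
```

The resulting sets of nouns are the kind of input a frequent itemset miner expects.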
Frequent Features Generation
The system uses the association rule miner CBA, which is based on the Apriori algorithm, to find frequent itemsets, i.e., sets of words that occur together in more than a user-defined fraction of the review sentences (the paper uses a minimum support of 1%). Only itemsets of at most three words are considered. This gives a list of candidate features from the reviews. The system then prunes irrelevant and redundant candidates using the following two methods:
- Compactness Pruning: For each candidate feature, the system checks the distances between the feature's words in each sentence where they appear. The feature is kept if the words are within a distance threshold of each other in at least a certain number of sentences; otherwise it is rejected.
- Redundancy Pruning: The paper defines the p-support of a single-word feature as the number of sentences in which the feature appears without any of its superset feature phrases. Features with p-support below a threshold are pruned from the list. For example, "life" would be pruned if it almost always occurs as part of "battery life".
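The frequent feature generation and the two pruning steps can be sketched as below. This is a minimal sketch under stated assumptions, not the CBA implementation: itemsets are counted by brute force rather than with Apriori's level-wise candidate generation, nouns are picked with a hypothetical hand-made `NOUNS` set instead of a POS tagger, and the parameter values (`min_support=0.5`, `max_gap=3`, `min_compact=2`, `min_p_support=1`) are chosen only to suit the toy data.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Keep itemsets (up to max_size words) found in more than
    min_support * len(transactions) transactions. Brute-force counting."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(transactions)
    return {s: c for s, c in counts.items() if c > threshold}


def is_compact(feature, tokens, max_gap):
    """True if all feature words occur in the sentence and neighbouring
    occurrences are at most max_gap positions apart (first occurrence only)."""
    if not all(w in tokens for w in feature):
        return False
    positions = sorted(tokens.index(w) for w in feature)
    return all(b - a <= max_gap for a, b in zip(positions, positions[1:]))


def compactness_prune(features, sentences, max_gap=3, min_compact=2):
    """Drop multi-word features that are compact in fewer than min_compact sentences."""
    kept = []
    for f in features:
        if len(f) == 1 or sum(is_compact(f, s, max_gap) for s in sentences) >= min_compact:
            kept.append(f)
    return kept


def p_support(word, sentences, features):
    """Number of sentences containing word but none of its superset features."""
    supersets = [f for f in features if word in f and len(f) > 1]
    return sum(
        1 for s in sentences
        if word in s and not any(f <= set(s) for f in supersets)
    )


def redundancy_prune(features, sentences, min_p_support=1):
    """Drop single-word features whose p-support is below min_p_support."""
    kept = []
    for f in features:
        if len(f) == 1:
            (word,) = f
            if p_support(word, sentences, features) < min_p_support:
                continue
        kept.append(f)
    return kept


# Toy data: NOUNS stands in for the output of a POS tagger (an assumption).
NOUNS = {"battery", "life", "picture", "charger"}
sentences = [s.split() for s in [
    "the battery life is great",
    "battery life is too short",
    "life is short and the charger drains the battery",
    "the picture is sharp",
    "the battery charges quickly",
]]
transactions = [set(s) & NOUNS for s in sentences]

frequent = frequent_itemsets(transactions, min_support=0.5)
compact = compactness_prune(list(frequent), sentences)
final = redundancy_prune(compact, sentences)
```

On this toy data, "life" on its own is pruned because every sentence that contains it also contains "battery", while "battery" survives because it appears once without "life".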
Opinion Words Extraction and Infrequent Feature Identification
The system finds the adjectives near a frequent feature in a given sentence and treats them as opinion words describing that feature. In this way, it builds a list of opinion words. It then uses this list to identify infrequent features, on the assumption that the same opinion word can describe both frequent and infrequent features. Based on this assumption, the system finds the noun phrases near an opinion word in a sentence and treats them as infrequent features. The experimental results show that infrequent features make up around 15-20% of all the features identified for a product.
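A minimal sketch of this two-pass idea is shown below. It is an illustration under assumptions, not the authors' implementation: the sentences are hand-tagged, "near" is taken to mean within a hypothetical `window` of two tokens, only single-word nouns are considered, and sentences that already contain a frequent feature are skipped when looking for infrequent ones.

```python
def opinion_words(tagged_sentences, frequent_features, window=2):
    """Collect adjectives that lie within `window` tokens of a frequent feature word."""
    opinions = set()
    for sent in tagged_sentences:
        feature_pos = [i for i, (tok, _) in enumerate(sent)
                       if tok.lower() in frequent_features]
        for i, (tok, tag) in enumerate(sent):
            if tag.startswith("JJ") and any(abs(i - j) <= window for j in feature_pos):
                opinions.add(tok.lower())
    return opinions


def infrequent_features(tagged_sentences, frequent_features, opinions):
    """In sentences with an opinion word but no frequent feature,
    take the noun nearest to the opinion word as an infrequent feature."""
    found = set()
    for sent in tagged_sentences:
        tokens = [tok.lower() for tok, _ in sent]
        if any(t in frequent_features for t in tokens):
            continue  # assumption: these sentences are already covered
        noun_pos = [i for i, (_, tag) in enumerate(sent) if tag.startswith("NN")]
        for i, tok in enumerate(tokens):
            if tok in opinions and noun_pos:
                nearest = min(noun_pos, key=lambda j: abs(i - j))
                found.add(tokens[nearest])
    return found


# Hand-tagged toy sentences (Penn Treebank-style tags, written by hand).
tagged = [
    [("The", "DT"), ("picture", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("The", "DT"), ("software", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
]
frequent = {"picture"}

opinions = opinion_words(tagged, frequent)
infrequent = infrequent_features(tagged, frequent, opinions)
```

Here "amazing" is learned as an opinion word from the sentence about the frequent feature "picture", and it then exposes "software" as an infrequent feature in the second sentence.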
The bootstrapping method for identifying the polarity of each opinion word, and the subsequent techniques for finding the opinion orientation of each sentence that describes a product feature, are presented in a separate paper.
Evaluation
The paper presents experimental results for five electronic products: two digital cameras, one DVD player, one MP3 player and one cellular phone. The reviews (the first 100 reviews for each product) were collected from amazon.com and c|net.com. The gold standard feature list was obtained by having a human tagger manually list all the features present in the review collection for each product, including both explicit and implicit features. The results show that pruning significantly improves precision over the frequent features produced by association mining alone. Adding infrequent features also significantly improves recall without lowering precision much, since the number of infrequent features is limited. Overall, the system achieves an average recall of 80% and an average precision of 72%, which looks very promising.
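For reference, precision and recall against a manually tagged feature list can be computed as in the sketch below. The feature sets are made up for illustration and are not taken from the paper's data; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(extracted, gold):
    """Precision and recall of an extracted feature set against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical feature lists for one product.
extracted = {"battery life", "picture", "size", "price", "store"}
gold = {"battery life", "picture", "size", "price", "zoom", "flash"}

precision, recall = precision_recall(extracted, gold)
```

With these sets, four of the five extracted features are correct and four of the six gold features are found, so a spurious feature such as "store" lowers precision while missed features such as "zoom" lower recall.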
Discussion
This paper proposes a multi-step approach to extracting product features from the customer reviews available on e-commerce websites. The approach rests on some simple assumptions about how opinion words occur around feature words in reviews. The results show that the method works well, but I believe it would have been interesting to see its performance on reviews from other product categories such as apparel or sports items. This could give us insight into how the structure of reviews differs across product categories for the purpose of feature extraction.
Related Papers
- Ana-Maria Popescu, Oren Etzioni, Extracting product features and opinions from reviews, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.339-346, October 06-08, 2005, Vancouver, British Columbia, Canada [1]
Study Plan
Resources useful for understanding this paper
- Article: Automatic Text Summarization
- Article: Opinion Mining
- Paper: Rakesh Agrawal, Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, p.487-499, September 12-15, 1994
- NLProcessor - Text Analysis Toolkit. 2000. [2]