Difference between revisions of "L. Ku, Y. Liang, and H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006"
Revision as of 16:36, 26 October 2012
Citation
Lun-Wei Ku, Yu-Ting Liang, Hsin-Hsi Chen: Opinion Extraction, Summarization and Tracking in News and Blog Corpora. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs 2006: 100-107
Online Version
Summary
Abstract
Humans like to express their opinions and are eager to know others’ opinions. Automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. Opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. Opinion extraction mines opinions at word, sentence and document levels from articles. Opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. In this paper, both news and web blog articles are investigated. TREC, NTCIR and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at word, sentence and document level are proposed. The issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. Opinion summarizations are visualized by representative sentences. Text-based summaries in different languages, and from different sources, are compared. Finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system.
Data
- TREC 2003 (Soboroff and Harman, 2003): 50 document sets from the 2003 TREC novelty corpus, each containing 25 documents. All documents in the same set are relevant.
- NTCIR (Chen and Chen, 2001): The test collection consists of 50 topics, 6 of which are opinionated. A total of 192 documents relevant to these six topics are chosen as training data. Documents for an additional topic, “animal cloning”, from NTCIR 3 are selected from the CIRB011 and CIRB020 document collections and used for testing.
- Blogs: Blogs are a newly rising community for expressing opinions. To investigate the opinions expressed in blogs, documents are retrieved from blog portals with the query “animal cloning”. There are 20 documents in total.
Annotation
Inter-annotator agreement was measured for labeling sentences and documents as positive, neutral, negative, or non-sentiment. Instances on which the annotators' labels varied widely were dropped.
Task
The paper covers three tasks: opinion extraction, opinion summarization, and opinion tracking. Details are explained in the following subsections.
Opinion Extraction
Method
- Word level: First collect sentiment words, then enlarge the set with thesauri. Word sentiment is decided by a simple weighted voting algorithm.
- Sentence level: Fuse the sentiment words and opinions found in the sentence into a sentence-level sentiment.
- Document level: A simple combination of the sentiment of every sentence in the document.
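The three levels above can be illustrated with a minimal sketch. It is not the paper's implementation: the seed lexicon, the thesaurus, the weights, and the use of a plain weighted sum as the "vote" are all assumptions made for illustration, and the function names are hypothetical.

```python
# Hypothetical seed lexicon: word -> sentiment weight (positive > 0, negative < 0).
SEED = {"good": 1.0, "support": 0.8, "bad": -1.0, "oppose": -0.8}

# Hypothetical thesaurus: seed word -> related words that inherit its weight.
THESAURUS = {"good": ["beneficial"], "bad": ["harmful"], "oppose": ["reject"]}


def expand_lexicon(seed, thesaurus):
    """Word level: enlarge the seed sentiment words with thesaurus entries."""
    lexicon = dict(seed)
    for word, synonyms in thesaurus.items():
        if word in seed:
            for synonym in synonyms:
                # Keep an existing weight if the synonym is already a seed word.
                lexicon.setdefault(synonym, seed[word])
    return lexicon


def sentence_score(sentence, lexicon):
    """Sentence level: weighted vote (sum of weights) over the sentiment words."""
    tokens = (tok.strip(".,!?").lower() for tok in sentence.split())
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def document_score(sentences, lexicon):
    """Document level: simple combination (sum) of the sentence scores."""
    return sum(sentence_score(s, lexicon) for s in sentences)


def polarity(score):
    """Map a numeric score to a coarse label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `polarity(document_score(["Cloning is harmful."], expand_lexicon(SEED, THESAURUS)))` yields a negative label because "harmful" inherits the weight of "bad" through the assumed thesaurus.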
Performance
The method outperforms the machine learning baselines (SVM and C5 decision tree). The reason given is that the semantics within a word alone are not enough.
Opinion Summarization
The summarizer considers not only the sentiment of a sentence but also whether the sentence is related to the topic. Only topic-related sentences are included in the sentiment analysis.
Algorithm
TF-IDF is used to identify the key words for a given topic. The sentences with the highest sentiment degree are then chosen as the opinion summary.
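A minimal sketch of this selection step follows. It assumes a standard TF-IDF weighting (term frequency in the topic documents times the log inverse document frequency over all documents), treats a sentence as topic-related if it contains at least one key word, and measures sentiment degree as the absolute sum of lexicon weights. These choices and all names are illustrative assumptions, not details confirmed by the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (t.strip(".,!?\"'").lower() for t in text.split())
    return [t for t in tokens if t]


def tfidf_keywords(topic_docs, background_docs, top_k=3):
    """Rank the words of the topic documents by TF-IDF over all documents."""
    all_docs = topic_docs + background_docs
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(tokenize(doc)))
    term_freq = Counter()
    for doc in topic_docs:
        term_freq.update(tokenize(doc))
    n_docs = len(all_docs)
    scores = {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def summarize(sentences, keywords, lexicon, max_sentences=2):
    """Keep topic-related sentences, then pick those with the highest sentiment degree."""
    keyword_set = set(keywords)
    scored = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not keyword_set & set(tokens):
            continue  # not related to the topic
        degree = abs(sum(lexicon.get(t, 0.0) for t in tokens))
        if degree > 0:
            scored.append((degree, sentence))
    scored.sort(key=lambda pair: -pair[0])
    return [sentence for _, sentence in scored[:max_sentences]]
```

Filtering by topic before ranking by sentiment keeps strongly opinionated but off-topic sentences out of the summary, which matches the description above.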