L. Ku, Y. Liang, and H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006


Citation

Lun-Wei Ku, Yu-Ting Liang, Hsin-Hsi Chen: Opinion Extraction, Summarization and Tracking in News and Blog Corpora. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs 2006: 100-107

Online Version

paper

Summary

Abstract

Humans like to express their opinions and are eager to know others’ opinions. Automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. Opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. Opinion extraction mines opinions at word, sentence and document levels from articles. Opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. In this paper, both news and web blog articles are investigated. TREC, NTCIR and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at word, sentence and document level are proposed. The issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. Opinion summarizations are visualized by representative sentences. Text-based summaries in different languages, and from different sources, are compared. Finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system.

Data

  • TREC 2003 (Soboroff and Harman, 2003). 50 document sets of the 2003 TREC novelty corpus, and each set contains 25 documents. All documents in the same set are relevant.

  • NTCIR (Chen and Chen, 2001). The test collection consists of 50 topics, and 6 of them are opinionated topics. A total of 192 documents relevant to the six topics are chosen as training data in this paper. Documents of an additional topic, “animal cloning”, of NTCIR 3 are selected from the CIRB011 and CIRB020 document collections and used for testing.

  • Blog. Blogs are a newly rising community for expressing opinions. To investigate the opinions expressed in blogs, documents are retrieved from blog portals with the query “animal cloning”. There are 20 documents in total.

Annotation

Inter-annotator agreement was investigated for labeling sentences and documents as positive, neutral, negative, or non-sentiment. Instances whose labels varied widely among the annotators were dropped.
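
As an illustration only, the sketch below computes agreement between two annotators and drops disputed instances. This summary does not say which agreement statistic or which dropping rule was used, so Cohen's kappa, the function names, and the `min_share` threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def cohen_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators (an assumed choice of statistic)."""
    n = len(labels_a)
    # Share of instances on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label distribution.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def drop_disputed(annotations: Dict[str, List[str]], min_share: float = 0.5) -> Dict[str, str]:
    """Keep an instance only if its majority label reaches min_share (hypothetical rule)."""
    kept = {}
    for instance, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_share:
            kept[instance] = label
    return kept
```

For example, `drop_disputed({"s1": ["pos", "pos", "neg"], "s2": ["pos", "neg", "neu"]}, min_share=0.6)` keeps only `s1`, since no label for `s2` reaches the threshold.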


Task

Opinion Extraction, Opinion Summarization, and Opinion Tracking. Details are explained in the following subsections.

Opinion Extraction

Method
  • Word Level: First collect sentiment words, then enlarge the set with thesauri. Sentiment is decided by a simple weighted voting algorithm.
  • Sentence Level: Combine the sentiment words found in the sentence to decide the opinion of the sentence.
  • Document Level: A simple combination of the sentence-level results in each document.
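
The three levels above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the seed weights, the toy thesaurus, the whitespace tokenization, and the use of plain sums as the "voting" and "combination" steps are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical seed lexicon: word -> signed weight (positive > 0, negative < 0).
SEED_WEIGHTS: Dict[str, float] = {"good": 1.0, "benefit": 0.8, "bad": -1.0, "risk": -0.6}

# Hypothetical thesaurus used to enlarge the seed lexicon.
THESAURUS: Dict[str, List[str]] = {"good": ["fine"], "bad": ["awful"], "risk": ["danger"]}


def expand_lexicon(seed: Dict[str, float], thesaurus: Dict[str, List[str]]) -> Dict[str, float]:
    """Word level: copy each seed word's weight to its synonyms."""
    lexicon = dict(seed)
    for word, synonyms in thesaurus.items():
        if word in seed:
            for synonym in synonyms:
                # Existing entries are kept; only unseen synonyms are added.
                lexicon.setdefault(synonym, seed[word])
    return lexicon


def sentence_score(sentence: str, lexicon: Dict[str, float]) -> float:
    """Sentence level: weighted vote, i.e. the sum of the sentiment-word weights."""
    # Naive whitespace tokenization; punctuation is not handled.
    return sum(lexicon.get(token, 0.0) for token in sentence.lower().split())


def document_score(sentences: List[str], lexicon: Dict[str, float]) -> float:
    """Document level: simple combination (sum) of the sentence scores."""
    return sum(sentence_score(sentence, lexicon) for sentence in sentences)


def polarity(score: float) -> str:
    """Map a signed score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A call such as `polarity(document_score(["cloning is good", "the danger is real"], expand_lexicon(SEED_WEIGHTS, THESAURUS)))` labels the toy document from the sum of its sentence scores.
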
Performance

The proposed method performs better than the machine learning algorithms it is compared with (SVM and C5 decision tree). The reason given is that the semantics within a word alone is not enough.

Opinion Summarization

Summarization considers not only the sentiment of a sentence but also whether the sentence is related to the topic. Only sentences related to the topic are considered in sentiment analysis.

Algorithm

TF-IDF is used to identify the key words for a given topic. In the end, the sentences with the higher sentiment degree are chosen as the opinion summary.
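
A minimal sketch of this two-step idea follows, assuming pre-tokenized documents, a plain `tf * log(N / df)` weighting, and the absolute sentiment score as the "sentiment degree". The function names and the `sentiment` callback are hypothetical, and the paper's exact weighting may differ.

```python
import math
from collections import Counter
from typing import Callable, List


def tfidf_keywords(topic_docs: List[List[str]],
                   background_docs: List[List[str]],
                   top_k: int = 3) -> List[str]:
    """Rank the terms of the topic documents by TF-IDF over all documents."""
    all_docs = topic_docs + background_docs
    n_docs = len(all_docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))
    # Term frequency counted over the topic documents only.
    tf = Counter(token for doc in topic_docs for token in doc)
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


def summarize(sentences: List[str],
              keywords: List[str],
              sentiment: Callable[[str], float],
              top_n: int = 1) -> List[str]:
    """Keep topic-relevant sentences, then return those with the strongest sentiment."""
    keyword_set = set(keywords)
    relevant = [s for s in sentences if keyword_set & set(s.lower().split())]
    return sorted(relevant, key=lambda s: -abs(sentiment(s)))[:top_n]
```

Filtering by keyword before ranking means a strongly opinionated but off-topic sentence never reaches the summary, which mirrors the relevance requirement described above.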

Opinion Tracking

A corpus on the 2000 Taiwan presidential election was selected for tracking. No specific technique is used in this section; summarization is simply performed on the data collected each month.
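
As a rough illustration of month-by-month tracking, the sketch below groups signed sentence scores by calendar month to produce the points of an opinion curve. The input format and the function name are assumptions for the example; how the scores are produced is left to whatever sentence-level scoring is in use.

```python
from collections import defaultdict
from datetime import date
from typing import List, Tuple


def monthly_opinion_curve(dated_scores: List[Tuple[date, float]]) -> List[Tuple[Tuple[int, int], float]]:
    """Sum signed sentence scores per (year, month), returned in time order."""
    totals = defaultdict(float)
    for day, score in dated_scores:
        totals[(day.year, day.month)] += score
    # Sorting the (year, month) keys yields chronological order.
    return sorted(totals.items())
```

Positive totals indicate a supportive month and negative totals a non-supportive one, under the assumed sign convention.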