Project Draft Overwijk

Team Members

Arnold Overwijk

Project

A decade ago the main channels of news were the newspaper, radio, and television. News was mainly conveyed by professional journalists who were trained to be objective. Nowadays news spreads around the world over the internet at a much faster pace. Moreover, news is increasingly conveyed by people without any professional training, e.g. via social networks, blogs, etc. This makes it intractable to consume it all, but more importantly, news is more often biased towards a certain perspective. For example, a political article written by a Republican is likely to reflect a different viewpoint than an article about the same event written by a Democrat. Another example is the word choice between 'terrorists' and 'freedom fighters'.

Wei-Hao Lin [1] has addressed the problem of predicting the perspectives of news videos based on visual concepts. This could potentially be used to make people aware of highly biased articles and to suggest content from a different perspective. In this project I would like to continue his work. First I want to predict perspectives based on closed captions and compare the results with those obtained using visual concepts. The next step would then be to combine both types of features to do an even better job.

We chose the language as an indicator of the perspective of each video. This means that there are three perspectives in the dataset described below. This is justified by the work that Wei-Hao Lin [1] did; he showed that it is possible to predict the language within the same topic. However, when the videos are from different topics, his approach was no longer able to predict the language with a higher accuracy than a random guess. The intuition behind this is that there can only be a bias towards a certain perspective within a certain topic. More importantly, this indicates that his approach was not training on characteristics of the broadcasters, such as the appearance of a logo, different language characters in the subtitles, etc.

The visual concepts in the dataset do not contain features that are specific to one of the broadcasters. The closed captions, on the other hand, do mention words that are specific to one of the broadcasters, such as the broadcaster's name. In this project we have to be careful not to use those features. We can avoid this by manually inspecting the most discriminative features for each perspective and filtering out the words that are related to the broadcaster.
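
As a rough illustration, the following sketch shows how such an inspection could work (hypothetical code; the variables `captions` and `labels` and the use of scikit-learn are our assumptions, not part of [1]). It ranks caption words by their chi-squared association with the perspective label, so broadcaster-specific terms can be spotted and added to a manual stoplist:

  # Sketch: rank caption words by their association with the perspective
  # label, so broadcaster-specific terms can be spotted and stoplisted.
  # `captions` (one string per segment) and `labels` are hypothetical.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.feature_selection import chi2

  def most_discriminative_words(captions, labels, k=50, stoplist=()):
      vectorizer = CountVectorizer(stop_words=list(stoplist) or None)
      X = vectorizer.fit_transform(captions)
      scores, _ = chi2(X, labels)
      vocab = vectorizer.get_feature_names_out()
      ranked = sorted(zip(scores, vocab), reverse=True)
      return [word for _, word in ranked[:k]]

After inspecting the top-ranked words, broadcaster names, channel call signs, and similar giveaways can be added to the stoplist and the ranking rerun until the list looks clean.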

To briefly summarize the above: we will predict perspectives of videos within the same topic, using supervised learning techniques. Our approach will be general, i.e. it will not be restricted to videos or to the chosen dataset and its perspectives.
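
As a baseline for this within-topic setup, a minimal sketch could look as follows (hypothetical code; the segment fields `topic`, `caption`, and `perspective` are assumed names, and the choice of a linear SVM over captions is ours, not the model from [1]):

  # Sketch: within-topic evaluation of a caption-based perspective
  # classifier. Each segment is a dict with hypothetical field names.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import LinearSVC

  def within_topic_accuracy(segments, topic):
      subset = [s for s in segments if s["topic"] == topic]
      captions = [s["caption"] for s in subset]
      labels = [s["perspective"] for s in subset]
      clf = make_pipeline(TfidfVectorizer(), LinearSVC())
      return cross_val_score(clf, captions, labels, cv=5).mean()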

The project consists of several steps. First we develop a system that is able to predict perspectives, based on Wei-Hao Lin's thesis [1]. In the next step we replicate the experiments Wei-Hao Lin did using the visual features, to verify that our system is implemented correctly. We will then also be able to use the closed captions instead of the visual concepts. Finally we attempt to combine the visual concepts and the closed captions. This can be done in multiple ways; the simplest are early and late fusion, but it would be more interesting to extend the graphical model to incorporate both types of features.
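
For concreteness, here is a minimal sketch of the two fusion baselines (hypothetical code, not the graphical model from [1]; `X_text`, `X_vis`, and `y` are assumed to be aligned row-for-row, with `X_vis` the binary concept matrix described under Dataset):

  # Sketch: early vs. late fusion of caption and visual-concept features.
  import numpy as np
  from scipy.sparse import hstack
  from sklearn.linear_model import LogisticRegression

  def early_fusion(X_text, X_vis, y):
      # Concatenate the two feature spaces and train one classifier.
      return LogisticRegression(max_iter=1000).fit(hstack([X_text, X_vis]), y)

  def late_fusion_predict(X_text, X_vis, y, X_text_new, X_vis_new):
      # Train one classifier per modality and average their probabilities.
      clf_t = LogisticRegression(max_iter=1000).fit(X_text, y)
      clf_v = LogisticRegression(max_iter=1000).fit(X_vis, y)
      proba = (clf_t.predict_proba(X_text_new) +
               clf_v.predict_proba(X_vis_new)) / 2.0
      return clf_t.classes_[np.argmax(proba, axis=1)]

Early fusion lets the classifier weigh individual features across modalities, while late fusion keeps the modalities independent and only combines their decisions; extending the graphical model would instead tie the two feature types together at the modeling level.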

Dataset

For this project I plan to use the TRECVID 2005 dataset. It contains Arabic, Chinese, and English television news from November 2004. I have created the following five topics:

  • Arafat's death (133 segments)
  • Iraq war (134 segments)
  • Al Qaeda (307 segments)
  • AIDS (139 segments)
  • United States elections (203 segments)

All segments have a duration between 30 seconds and 7 minutes. Furthermore, each perspective is represented reasonably well in each topic.

A segment consists of multiple shots, and for each shot we have human-annotated visual concepts, closed captions in English, and of course the audio and video itself. There are about 300 visual concepts, which are binary labels marking their appearance in a shot. The concepts include objects (e.g. weapons, newspaper), scenery information (e.g. outer space, oceans, nighttime), events (e.g. parade), etc.
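
Concretely, one simple segment-level representation (a sketch; the shot-to-segment aggregation below is our choice, not prescribed by the dataset) is a binary vector in which a concept is on if it appears in any shot of the segment:

  # Sketch: aggregate shot-level binary concept annotations into one
  # feature vector per segment. `shots` is a hypothetical list of
  # binary vectors (one per shot), each of length ~300 concepts.
  import numpy as np

  def segment_concept_vector(shots):
      # A concept is present in a segment if any of its shots has it.
      return np.asarray(shots).max(axis=0)

  # Example with three shots and four concepts:
  shots = [[1, 0, 0, 1],
           [0, 0, 1, 1],
           [0, 0, 0, 1]]
  print(segment_concept_vector(shots))  # -> [1 0 1 1]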

References

1. W.-H. Lin and A. Hauptmann. Identifying News Videos' Ideological Perspectives Using Emphatic Patterns of Visual Concepts. In Proceedings of ACM Multimedia 2009, Beijing, China.