Y. Borghol et al. Performance Evaluation 68 2011

From Cohen Courses
Revision as of 23:32, 1 October 2012 by Lujiang

Citation

Youmna Borghol, Siddharth Mitra, Sebastien Ardon, Niklas Carlsson, Derek L. Eager, Anirban Mahanti: Characterizing and modelling popularity of user-generated videos. Perform. Eval. 68(11): 1037-1055 (2011)

Online version

[1]

Summary

This paper proposes a model for the popularity dynamics of YouTube videos, based on a data set the authors collected over 8 months during 2008-2009. Specifically, the authors claim that the time at which an individual video reaches its peak views follows a distribution they call the time-to-peak distribution. Based on it, they divide a video's lifetime into three phases: before, at, and after the peak. Finally, they put forward a three-phase evolution model to explain the dynamics of views for newly uploaded videos.

Data set

Metadata for about 1.1 million videos, collected weekly over 8 months in two ways: 1) sampling from recently uploaded videos (29,791 videos collected) and 2) sampling using keyword search (1,135,253 videos collected).

There are two potential drawbacks in their collection methods. First, collecting views weekly discards important information about popular videos such as viral videos. According to Google's paper [2], many videos reach their peak views within a week and receive 25% of their social views on their first day after upload.

Second, about 97% of the video metadata was collected by keyword search, which is significantly biased by the videos' current view counts and their age. According to Google, given similar keyword relevance, the ranking algorithm often favors newly updated and popular videos. In other words, the data set may not be a representative random sample of YouTube videos (in contrast, the data set used in [3] was derived from random sampling).

Therefore, the conclusions drawn from this data set need careful examination and further validation.
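To make the first drawback concrete, here is a minimal sketch of how weekly aggregation can hide a short-lived peak. The daily view counts are hypothetical numbers chosen for illustration, not data from the paper, and `weekly_totals` is a helper name introduced here.

```python
def weekly_totals(daily_views):
    """Sum daily view counts into consecutive 7-day buckets."""
    return [sum(daily_views[i:i + 7]) for i in range(0, len(daily_views), 7)]

# Hypothetical daily views over two weeks (illustrative numbers only).
# "viral" spikes on its second day and decays quickly; "steady" is flat.
viral = [50, 5000, 1500, 300, 100, 30, 20] + [10] * 7
steady = [1000] * 7 + [10] * 7

viral_weekly = weekly_totals(viral)
steady_weekly = weekly_totals(steady)
```

Under these assumed numbers the two videos have identical weekly totals, so a weekly crawl cannot tell the one-day spike from the flat week.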

Method

Through empirical analysis, the authors claim that the time-to-peak distribution is approximately exponential, and they find that a large fraction of videos peak within the first six weeks; see Fig. 7.
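As a rough illustration of what an exponential time-to-peak implies, the sketch below fits an exponential rate by maximum likelihood and evaluates the share of videos peaking within six weeks. The sample data are synthetic, and the mean of 4 weeks is an assumption made here for illustration, not a value reported in the paper; the paper's data are weekly and hence discrete, which this continuous sketch ignores.

```python
import math
import random

def fit_exponential_rate(times_to_peak):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    return len(times_to_peak) / sum(times_to_peak)

def fraction_peaking_within(rate, weeks):
    """Exponential CDF: P(T <= weeks) = 1 - exp(-rate * weeks)."""
    return 1.0 - math.exp(-rate * weeks)

# Synthetic times-to-peak in weeks; the 4-week mean is an assumed value.
rng = random.Random(0)
samples = [rng.expovariate(1 / 4.0) for _ in range(10_000)]

rate = fit_exponential_rate(samples)
share_within_six_weeks = fraction_peaking_within(rate, 6)
```

With a real data set, `samples` would be replaced by the observed week in which each video reached its highest weekly view count.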

[Figures: Characterizing and modelling popularity of user-generated videos-fig7.png (Fig. 7 of the paper); Chart.png]

As they mention in the paper, both exogenous and endogenous factors influence popularity. However, they ignore exogenous events entirely in their analysis. For example, in the view plot I generated for a popular video, "Dog Fight", the video's peak clearly results from some exogenous event (probably Event D, "First embedded on jaramsie.pl"). According to Google's paper, events such as YouTube search, related-video recommendations, and referrals from social sites are the driving force behind popular videos, so an analysis that does not consider them would be of little value.

They then propose a three-phase model: before, at, and after the peak. This assumption disagrees with the observation reported in [4], which found that some videos' view patterns are bi-modal; see the above figure. In addition, they assume that the weekly viewing rates within each phase are invariant, and on this basis they propose the following model. Suppose

N the total number of newly uploaded videos
d the total number of weeks
the number of videos at week
time-to-peak distribution
the view distribution for videos in the before-peak phase
the view distribution for videos in the at-peak phase
the view distribution for videos in the after-peak phase

Step 0 For each week i = 1, ..., d:

Step 1 Sample N values from the time-to-peak distribution and count the number of videos in the at-peak phase; update

,

,

such that

Let when

Step 2 Sample weekly views from the before-peak, at-peak and after-peak view distributions, respectively.

Step 3 Within each phase, rank the sampled views and assign them to the videos in that phase according to the videos' own ranking (such that the video with the highest views during week i−1 is assigned the highest views in week i).

Step 4 Let sampled videos peak in this week.
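The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each video's peak week is drawn once up front rather than per week, the function and parameter names (`generate_views`, `draw_peak_week`, `draw_views`) are introduced here, and the exponential and log-normal distributions with their parameters are placeholders, not fitted values from the paper.

```python
import random

def generate_views(n_videos, n_weeks, draw_peak_week, draw_views, seed=0):
    """Return (peak_week, views) where views[v][w] is video v's views in week w+1."""
    rng = random.Random(seed)
    # Assumption: the peak week of every video is drawn once from time-to-peak.
    peak_week = [draw_peak_week(rng) for _ in range(n_videos)]
    views = [[0.0] * n_weeks for _ in range(n_videos)]
    for week in range(1, n_weeks + 1):
        # Split the videos into the three phases for this week.
        groups = {"before": [], "at": [], "after": []}
        for v in range(n_videos):
            if peak_week[v] > week:
                groups["before"].append(v)
            elif peak_week[v] == week:
                groups["at"].append(v)
            else:
                groups["after"].append(v)
        for phase, members in groups.items():
            # Sample one view count per video from this phase's distribution.
            sampled = sorted((draw_views[phase](rng) for _ in members), reverse=True)
            # Rank videos by last week's views so that rank order is preserved.
            if week > 1:
                members.sort(key=lambda v: views[v][week - 2], reverse=True)
            for v, count in zip(members, sampled):
                views[v][week - 1] = count
    return peak_week, views

# Placeholder distributions (assumed for illustration only).
def draw_peak_week(rng):
    return 1 + int(rng.expovariate(1 / 4.0))

draw_views = {
    "before": lambda rng: rng.lognormvariate(3.0, 1.0),
    "at": lambda rng: rng.lognormvariate(6.0, 1.0),
    "after": lambda rng: rng.lognormvariate(2.0, 1.0),
}

peak_week, views = generate_views(200, 12, draw_peak_week, draw_views)
```

Because the placeholder phase distributions overlap, a video's at-peak week is not guaranteed to hold its largest sampled value in this sketch; well-separated distributions would be needed for that.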

Based on the model, they generate synthetic data and compare it against the real-world data set. According to their experiments, the distributions of total views in the two are similar. They also introduce an extension that shuffles the popularity of videos within each phase.

Related papers

There is a more convincing paper from Google addressing a similar problem [5].

The authors published another interesting paper at KDD 2012 [6].