Y. Borghol et al. Performance Evaluation 68 2011

Citation

Youmna Borghol, Siddharth Mitra, Sebastien Ardon, Niklas Carlsson, Derek L. Eager, Anirban Mahanti: Characterizing and modelling popularity of user-generated videos. Perform. Eval. 68(11): 1037-1055 (2011)

Online version

[1]

Summary

This paper proposes a model for the popularity dynamics of YouTube videos, based on a data set the authors collected over 8 months in 2008-2009. Specifically, the authors claim that the time at which an individual video reaches its peak views follows a certain distribution, called the time-to-peak distribution. Based on it, they divide a video's views into three phases: before, at, and after the peak. Finally, they put forward a three-phase evolution model to explain the view dynamics of newly uploaded videos.

Data set

Metadata for about 1.1 million videos, collected weekly over 8 months in the following two ways:

sampling from the recently-uploaded videos (29,791 videos collected)
sampling using keyword search (1,135,253 videos collected).

There are two potential drawbacks in their collection method. First, collecting views on a weekly basis discards important information about popular videos, such as viral videos. According to Google's paper [2], many videos reach their peak views within a week and receive 25% social views on their first day after upload.

Second, about 97% of the video metadata was collected by keyword search, which is significantly biased by the videos' current view counts and their age. According to Google, given similar keyword relevance, the ranking algorithm often favors newly updated and popular videos. In other words, the data set may not be a representative random sample of YouTube videos (in contrast, the data set used in [3] was derived from random sampling).

Therefore, the conclusions drawn from this data set need careful examination and further validation.

Method

Through empirical analysis, the authors claim that the time-to-peak distribution approximately follows an exponential distribution, and they find that a large fraction of videos peak within the first six weeks; see Fig. 7.

Fig. 7 Time-to-peak distribution
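
To make the exponential claim concrete, here is a minimal sketch that draws peak weeks from a discretised exponential and measures the share of videos peaking within six weeks. The mean of 4 weeks, the 32-week horizon and the function names are illustrative assumptions, not values or names taken from the paper.

```python
import math
import random

def sample_peak_weeks(num_videos, mean_weeks, max_week, seed=0):
    """Draw a peak week in 1..max_week for each video.

    mean_weeks is a hypothetical parameter; the paper's fitted value is not used here.
    """
    rng = random.Random(seed)
    weeks = []
    for _ in range(num_videos):
        t = rng.expovariate(1.0 / mean_weeks)    # continuous time to peak, in weeks
        weeks.append(min(max_week, int(t) + 1))  # week 1 covers t in [0, 1)
    return weeks

def fraction_peaking_within(weeks, cutoff):
    """Share of videos whose peak week is at most `cutoff`."""
    return sum(1 for w in weeks if w <= cutoff) / len(weeks)

peak_weeks = sample_peak_weeks(num_videos=10000, mean_weeks=4.0, max_week=32)
share = fraction_peaking_within(peak_weeks, cutoff=6)
expected = 1.0 - math.exp(-6 / 4.0)  # exponential CDF evaluated at 6 weeks
print(round(share, 3), round(expected, 3))
```

Under these assumed parameters the sampled share should sit close to the closed-form value of the exponential CDF, which is one way to check a fitted time-to-peak distribution against Fig. 7.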

Fig. 8 View distribution for the YouTube video Dog Fight

As they mention in the paper, both exogenous and endogenous factors influence popularity. However, they ignore exogenous events entirely in their analysis. For example, Fig. 8 is a view plot I generated for a popular video, "Dog Fight"; its peak in views clearly results from some exogenous event (probably Event D, "First embedded on jaramsie.pl"). According to Google's paper, events such as YouTube Search, related-video recommendations, and referrals from social sites are the driving force behind popular videos, so an analysis that does not consider them is of little value.

They then propose a three-phase model: before, at, and after the peak. This assumption disagrees with the observation reported in [4], where some videos' view patterns were found to be bi-modal or multi-modal; see Fig. 8 for a counterexample. In addition, they assume that the weekly viewing rates within each phase are invariant, and on that basis they propose the following model. Suppose

$N\,$	the total number of newly uploaded videos
$d\,$	the total number of weeks
$n_{i}\,$	the number of videos at week $i$
$P_{peak}\,$	time-to-peak distribution
$P^{before}\,$	the view distribution for videos in the before-peak phase
$P^{at}\,$	the view distribution for videos in the at-peak phase
$P^{after}\,$	the view distribution for videos in the after-peak phase

Algorithm

Step 0 For each week i = 1, ..., d:

Step 1 Sample N values from $P_{peak}$ and count the number of videos in the at-peak phase ( $n_{i}^{at}$ ); update

$n_{i}^{before}=n_{i-1}^{before}-n_{i}^{at}$ ,

$n_{i}^{after}=n_{i-1}^{after}+n_{i-1}^{at}$ ,

such that $n_{i-1}^{before}+n_{i-1}^{at}+n_{i-1}^{after}=n_{i}^{before}+n_{i}^{at}+n_{i}^{after}$

Let $n_{i}^{after}=0$ when $i=1$
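
The bookkeeping in Step 1 can be sketched as follows. This is a minimal illustration assuming each video's peak week has already been drawn from $P_{peak}$; the function name and the example peak weeks are hypothetical.

```python
from collections import Counter

def phase_counts(peak_weeks, num_weeks):
    """Return a (before, at, after) tuple of video counts for each week 1..num_weeks."""
    at_per_week = Counter(peak_weeks)  # how many videos peak in each week
    total = len(peak_weeks)
    before, at, after = total, 0, 0    # week 0: every video is still before its peak
    rows = []
    for week in range(1, num_weeks + 1):
        after = after + at             # n_i^after = n_{i-1}^after + n_{i-1}^at
        at = at_per_week.get(week, 0)  # n_i^at
        before = before - at           # n_i^before = n_{i-1}^before - n_i^at
        assert before + at + after == total  # the total number of videos is conserved
        rows.append((before, at, after))
    return rows

rows = phase_counts([1, 1, 2, 3, 3, 3, 5], num_weeks=5)
print(rows)  # [(5, 2, 0), (4, 1, 2), (1, 3, 3), (1, 0, 6), (0, 1, 6)]
```

Starting from zero at-peak and after-peak videos in week 0 gives $n_{1}^{after}=0$ automatically, matching the initial condition above.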

Step 2 Sample $n_{i}^{before},n_{i}^{at},n_{i}^{after}$ view counts from $P^{before},P^{at},P^{after}\,$ respectively.

Step 3 Rank the views sampled for $n_{i}^{before}$ and assign them to the videos in $n_{i-1}^{before}$ according to their ranked list (such that the video with the most views during week i−1 is assigned the most views in week i).

Rank the views sampled for $n_{i}^{at}$ and assign them to videos in $n_{i-1}^{before}$.

Rank the views sampled for $n_{i}^{after}$ and assign them to the videos in $n_{i-1}^{after}$ and $n_{i-1}^{at}$.

Step 4 Let the sampled $n_{i}^{at}$ videos peak in this week.
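
The rank-preserving assignment in Step 3 can be sketched for a single phase as below. The function name, video ids and view counts are made up for illustration; in the model, the sampled views would come from the distribution of the corresponding phase.

```python
def assign_by_rank(video_ids, previous_views, sampled_views):
    """Give the largest sampled view count to last week's most viewed video, and so on."""
    if len(video_ids) != len(sampled_views):
        raise ValueError("need exactly one sampled view count per video")
    # Videos ordered by last week's views, most viewed first.
    ranked_videos = sorted(video_ids, key=lambda v: previous_views[v], reverse=True)
    # Sampled view counts ordered the same way, largest first.
    ranked_views = sorted(sampled_views, reverse=True)
    return dict(zip(ranked_videos, ranked_views))

previous = {"a": 120, "b": 15, "c": 300}  # hypothetical views in week i-1
sampled = [40, 7, 95]                     # hypothetical draws for week i
new_views = assign_by_rank(["a", "b", "c"], previous, sampled)
print(new_views)  # {'c': 95, 'a': 40, 'b': 7}
```

Because both lists are sorted in the same direction before pairing, the relative order of videos within a phase never changes from one week to the next. This is the rigidity that the shuffling extension mentioned in the Experiments section relaxes.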

Experiments

Based on the model, they generate synthetic data and compare it against the real-world data set. In their experiments, the distribution of total views is similar to the empirical distribution found in their data set; see Fig. 13. They also evaluate the hot-set overlap, which is the overlap between the most popular videos in two consecutive weeks; see Fig. 14 (the 1% and 10% curves show the overlap of the top 1% and top 10% of videos in two consecutive weeks, respectively). They also introduce an extension that shuffles the popularity of videos within each phase, which improves the results.
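
As an illustration of the hot-set metric, the sketch below computes the overlap between the top fraction of videos in two consecutive weeks. It assumes the overlap is normalised by the size of the earlier week's hot set; the paper's exact definition may differ, and the function names and view counts are hypothetical.

```python
def hot_set(views, fraction):
    """Ids of the top `fraction` of videos by weekly views (at least one video)."""
    k = max(1, int(len(views) * fraction))
    ranked = sorted(views, key=views.get, reverse=True)
    return set(ranked[:k])

def hot_set_overlap(views_prev, views_curr, fraction):
    """Share of last week's hot set that is still in this week's hot set."""
    prev = hot_set(views_prev, fraction)
    curr = hot_set(views_curr, fraction)
    return len(prev & curr) / len(prev)

# Hypothetical weekly views for ten videos.
week1 = {"v0": 100, "v1": 90, "v2": 80, "v3": 70, "v4": 60,
         "v5": 50, "v6": 40, "v7": 30, "v8": 20, "v9": 10}
week2 = {"v0": 95, "v1": 20, "v2": 99, "v3": 60, "v4": 50,
         "v5": 40, "v6": 30, "v7": 25, "v8": 15, "v9": 5}

overlap = hot_set_overlap(week1, week2, fraction=0.2)
print(overlap)  # 0.5: v0 stays in the top two, v1 is replaced by v2
```

Applying the same function to the synthetic and the empirical weekly views, week by week, would give curves comparable to the 1% and 10% curves of Fig. 14.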

Related papers

There is a more convincing paper from Google addressing a similar problem [5].

The authors published another interesting paper at KDD 2012 [6].
