Y. Borghol et al. Performance Evaluation 68 2011

Latest revision as of 23:47, 3 October 2012

Citation

Youmna Borghol, Siddharth Mitra, Sebastien Ardon, Niklas Carlsson, Derek L. Eager, Anirban Mahanti: Characterizing and modelling popularity of user-generated videos. Perform. Eval. 68(11): 1037-1055 (2011)

Online version

[1]

Summary

This paper (YouTube Analysis) proposes a model for the popularity dynamics of YouTube videos, based on a data set the authors collected over 8 months in 2008-2009. Specifically, the authors claim that the time at which an individual video reaches its peak views follows a certain distribution, called the time-to-peak distribution. Based on it, they divide a video's views into three phases: before, at, and after the peak. Finally, they put forward a three-phase evolution model to explain the dynamics of views for newly uploaded videos.

Data set

Metadata for 1.1 million videos was collected weekly over 8 months in the following two ways:

  1. sampling from the recently-uploaded videos (29,791 videos collected)
  2. sampling using keyword search (1,135,253 videos collected).

There are two potential drawbacks in the collection method. First, collecting view counts weekly discards important information about popular videos, such as viral videos. According to Google's paper [2], many videos reach their peak views within a week and receive 25% social views on the day they are uploaded.

Second, about 97% of the video metadata was collected by keyword search, which is significantly biased towards the videos' current view counts and their age. According to Google, given similar keyword relevance, the ranking algorithm often favors newly updated and popular videos. In other words, the data set may not be a representative random sample of YouTube videos. (In contrast, the data set used in [3] is derived from random sampling.)
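The 97% figure can be checked from the two sample sizes listed above. The snippet is only a small arithmetic sketch; the variable names are chosen here for illustration.

```python
# Sample sizes as reported in the Data set section.
recent_uploads = 29_791       # videos sampled from recently uploaded videos
keyword_search = 1_135_253    # videos sampled using keyword search

total = recent_uploads + keyword_search
keyword_share = keyword_search / total

print(f"{total:,} videos in total, {keyword_share:.1%} from keyword search")
```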

Therefore, the conclusions drawn from this data set need careful examination and further validation.

Method

Through empirical analysis, the authors claim that the time-to-peak distribution approximately follows an exponential distribution, and they find that a large fraction of videos peak within the first six weeks; see Fig. 7.

Fig. 7 Time-to-peak distribution
Fig. 8 The view distribution for the YouTube video "Dog Fight"
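To make the exponential claim concrete, the following sketch draws time-to-peak values from an exponential distribution, rounds them up to whole weeks, and estimates the fraction of videos that peak within a given number of weeks. The mean of 4 weeks used in the usage note and both function names are assumptions for illustration; they are not values or code from the paper.

```python
import math
import random


def fraction_peaking_within(weeks, mean_weeks, n_videos=100_000, seed=0):
    """Estimate the share of videos whose peak week is <= `weeks`.

    Assumes time-to-peak ~ Exponential(mean = mean_weeks). The mean is a
    hypothetical value chosen by the caller, not a figure from the paper.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_videos):
        # Continuous time-to-peak, rounded up to a whole week (1, 2, ...).
        peak_week = max(1, math.ceil(rng.expovariate(1.0 / mean_weeks)))
        if peak_week <= weeks:
            hits += 1
    return hits / n_videos


def exact_fraction(weeks, mean_weeks):
    """Closed form of the same quantity: 1 - exp(-weeks / mean_weeks)."""
    return 1.0 - math.exp(-weeks / mean_weeks)
```

For example, `fraction_peaking_within(6, 4.0)` should land close to `exact_fraction(6, 4.0)`. Under this assumed mean a clear majority of videos peak within six weeks, which is the kind of behaviour the authors report.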

As the authors mention in the paper, both exogenous and endogenous factors influence popularity. However, they ignore exogenous events entirely in their analysis. For example, Fig. 8 is a view plot I generated for a popular video, "Dog Fight"; the peak in its views clearly results from an exogenous event (probably Event D, "First embedded on jaramsie.pl"). According to Google's paper, events such as YouTube search, related-video recommendations, and referrals from social sites are the driving force behind popular videos, so an analysis that does not consider them is of limited value.

The authors then propose a three-phase model: before, at, and after the peak. This assumption disagrees with the observation reported in [4], which found that some videos' view patterns are bi-modal or multi-modal; see Fig. 8 for a counterexample. In addition, they assume that the weekly viewing rate within each phase is invariant, and on that basis they propose the following model. Suppose

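{| border="1"
|-
| <math>N \, </math> || the total number of newly uploaded videos
|-
| <math>d \, </math> || the total number of weeks
|-
| <math>n_i \, </math> || the number of videos at week <math>i</math>
|-
| <math>P_{peak} \, </math> || time-to-peak distribution
|-
| <math>P^{before} \, </math> || the view distribution for videos in the before-peak phase
|-
| <math>P^{at} \, </math> || the view distribution for videos in the at-peak phase
|-
| <math>P^{after} \, </math> || the view distribution for videos in the after-peak phase
|}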

Algorithm

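Step 0 For each week <math>i = 1, \ldots, d</math>:

Step 1 Sample <math>N</math> values from <math>P_{peak}</math> and count the number of videos in the at-peak phase (<math>n_i^{at}</math>); update

<math>n_i^{before} = n_{i-1}^{before} - n_i^{at}</math>,

<math>n_i^{after} = n_{i-1}^{after} + n_{i-1}^{at}</math>,

such that <math>n_{i-1}^{before} + n_{i-1}^{at} + n_{i-1}^{after} = n_{i}^{before} + n_{i}^{at} + n_{i}^{after}</math>

Let <math>n_{i}^{after}=0</math> when <math>i=1</math>.

Step 2 Sample <math>n_i^{before}, n_i^{at}, n_i^{after}</math> view counts from <math>P^{before}, P^{at}, P^{after} \, </math>, respectively.

Step 3 Rank the <math>n_i^{before}</math> sampled views and assign them to the videos in <math>n_{i-1}^{before}</math> according to their ranked list (such that the video with the highest views during week i−1 is assigned the highest views in week i).

Rank the <math>n_i^{at}</math> sampled views and assign them to videos in <math>n_{i-1}^{before}</math>.

Rank the <math>n_i^{after}</math> sampled views and assign them to the videos in <math>n_{i-1}^{after}</math> and <math>n_{i-1}^{at}</math>.

Step 4 Let the sampled <math>n_i^{at}</math> videos peak in this week. Go to Step 1.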
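The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four `sample_*` callables are hypothetical stand-ins for the peak-week distribution and the three per-phase view distributions, and the peak week of every video is drawn once up front rather than inside the weekly loop.

```python
import random


def simulate(n_videos, n_weeks, sample_peak_week, sample_before,
             sample_at, sample_after, seed=0):
    """Return views[v][i - 1]: synthetic views of video v in week i.

    Each sample_* argument is a hypothetical callable that takes a
    random.Random and returns one draw from P_peak, P^before, P^at or
    P^after respectively.
    """
    rng = random.Random(seed)
    # Step 1 (simplified): draw a peak week for every video once.
    peak_week = [sample_peak_week(rng) for _ in range(n_videos)]
    views = [[0.0] * n_weeks for _ in range(n_videos)]
    prev = [0.0] * n_videos  # previous week's views, used for ranking

    for week in range(1, n_weeks + 1):
        # Partition the videos into the three phases for this week.
        before = [v for v in range(n_videos) if peak_week[v] > week]
        at = [v for v in range(n_videos) if peak_week[v] == week]
        after = [v for v in range(n_videos) if peak_week[v] < week]

        phases = ((before, sample_before), (at, sample_at),
                  (after, sample_after))
        for group, sampler in phases:
            # Step 2: one sampled view count per video in the phase.
            draws = sorted((sampler(rng) for _ in group), reverse=True)
            # Step 3: the video with the most views last week receives
            # the largest sampled view count within its phase.
            ranked = sorted(group, key=lambda v: prev[v], reverse=True)
            for v, count in zip(ranked, draws):
                views[v][week - 1] = count

        prev = [views[v][week - 1] for v in range(n_videos)]
    return views
```

Because the phases are derived from the fixed peak weeks, the counts of before-peak, at-peak and after-peak videos follow the update equations of Step 1 automatically, and the total number of videos is conserved from week to week.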

Experiments

Based on the model, the authors generate synthetic data and compare it against the real-world data set. According to their experiments, the distribution of total views in the synthetic data is similar to the empirical distribution found in their data set; see Fig. 13. They also evaluate the overlap of the hot set, that is, the overlap between the most popular videos in two consecutive weeks; see Fig. 14 (the 1% and 10% curves represent the overlap of the top 1% and top 10% of videos in two consecutive weeks, respectively).

Fig. 13 Distribution of the total views by week
Fig. 14 Churn in video popularity measured by changes to the hot set for the recently-uploaded dataset and the basic model
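The hot-set overlap can be computed along the following lines. The sketch assumes that each week's views are given as a mapping from video id to view count, and it normalizes the overlap by the size of the current week's hot set; both choices are assumptions made here, since this summary does not spell out the paper's exact definition.

```python
def hot_set(week_views, top_fraction):
    """Return the ids of the top `top_fraction` of videos by views.

    `week_views` maps a video id to its views in one week.
    """
    k = max(1, int(len(week_views) * top_fraction))
    ranked = sorted(week_views, key=week_views.get, reverse=True)
    return set(ranked[:k])


def hot_set_overlap(views_prev, views_curr, top_fraction):
    """Fraction of this week's hot set that was also hot last week."""
    prev_hot = hot_set(views_prev, top_fraction)
    curr_hot = hot_set(views_curr, top_fraction)
    return len(prev_hot & curr_hot) / len(curr_hot)
```

With `top_fraction` set to 0.01 or 0.10, this corresponds to the 1% and 10% curves mentioned above. A low overlap indicates high churn among the most popular videos.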

They also introduce an extension that shuffles the popularity of videos within each phase, which improves the results.

However, no quantitative evaluation criterion is adopted, and the authors do not compare against a baseline model such as the BA model. This makes the whole comparison unconvincing, and it is difficult to measure the efficacy of the proposed model.


Related papers

There is a more convincing paper from Google addressing a similar problem: Tom Broxton, Yannet Interian, Jon Vaver, and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19. [5].


The authors published another interesting paper at KDD 2012: Youmna Borghol, Sebastien Ardon, Niklas Carlsson, Derek Eager, and Anirban Mahanti, The Untold Story of the Clones: Content-agnostic Factors that Impact YouTube Video Popularity, Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Beijing, China, Aug. 2012. [6].