Difference between revisions of "Tom Broxton el al., Catching a viral video, J Intell Inf Syst 2011"

From Cohen Courses
Jump to navigationJump to search
 
(48 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
Tom Broxton and Yannet Interian and Jon Vaver and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19.
 
Tom Broxton and Yannet Interian and Jon Vaver and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19.
 
  
 
== Online version ==
 
== Online version ==
Line 8: Line 7:
  
 
== Summary ==
 
== Summary ==
This is a [[Category::paper]] of Google Research introducing the preliminary analysis on virus video [http://en.wikipedia.org/wiki/Viral_video]([[AddressesProblem::Viral Video Analysis]]). Since the data set used in the study is large-scale, confidential and exclusive to other researchers, the revealed conclusion are valuable.  
+
This is a [[Category::paper]] of Google Research introducing the preliminary analysis on viral videos [http://en.wikipedia.org/wiki/Viral_video]([[AddressesProblem::YouTube Analysis]]). The data set used in the study is a large-scale and confidential data set, thus the revealed conclusion from which is considerable valuable.  
Specifically it
+
Specifically it analyzes the correlation between the degree of social sharing and the video view growth, the video category and the social sites linking to it.
 +
 
 +
Different research reaches the same conclusion that the most distinguishing characteristic of the viral video is its lifespan. Compared with "popular videos" which are also capable of attracting large number of views, the viral videos gain traction in social media quickly and fade quickly as well.
  
 +
== Data set ==
 +
1.5 million video randomly selected from the set of video uploaded to YouTube between Apirl 2009 and March 2010. Each video is associated with the meta-data including its category, the number of view at daily level and most importantly the "referrer" (It seems impossible for obtain such information outside of Google) which accounts the source from which the user came to watch a particular video. The authors further classify the referrer into social and non-social categories:
  
== Method ==
+
*'''Social''': External link and embeds (from a social site such as Facebook, blogs or instant messages) and  Unknown (the user typed or copied URL into browser)
  
=== Social segmentation and video growth ===
+
*'''Non-social''': Youtube internal link (related or recommended videos) and Youtube search(found by an search engine).
First of all, pre-processing is conducted to eliminate the noisy phrases within the data set including:
 
  
1. remove the phrases whose word-length is less than 4.
+
== Conclusions==
 +
First of all, the authors categorize the videos into 10 group according to their level of "socialness" which is quantified by the fraction of social views during the first 30 days of viewing. The least social group has 0.0 to 6.1% social views whereas the most social group enjoys 81.8 to 100% social views (the number of videos in each group is approximately the same).
  
2. remove the phrases whose term-frequency is less than 10.
+
=== Socialness & view growth ===
  
3. eliminate the phrases whose domain-frequency is at least 25% (avoid spammers).
+
Fig.3(a) and Fig.3(b) illustrate the pattern of the view growth from three socialness segments from which which the following observations could be reached:
  
=== Graph construction ===
+
[[File:Catching a viral video-fig3a.png|300px|thumb|alt=none|Fig.3(a) growth of relative views from three video segments]]
Each node <math>p</math> in the phrase graph <math>G</math> represents a phrase extracted from the corpus. An edge <math>(p,q)</math> is included for every pair of phrases p and q, which always points from shorter phrases to longer phrases. Two phrases are connected either the edit-distance (treating a word as a token) is smaller than 1 or there is at least a 10-word consecutive overlap between them. In other words, the edge implies the inclusion relation between the phrases and since the direction is strictly pointing to longer phrases the graph becomes a directed acyclic graph (DAG).
+
[[File:Catching a viral video-fig3b.png|300px|thumb|alt=none|Fig.3(b) growth of absolute views from three video segments]]
  
The authors fail to elaborate how the weight <math>w_{pq}</math> on each edge is calculated. They only state that the weight is increased as the directed edit distance as well as the frequency of q grows.
+
*'''Observation I: The videos with higher socialness has the highest growth leading up to the peak as well as the highest declination rate.'''
  
=== Clustering ===
+
*'''Observation II: The videos with higher socialness enjoys the peaks that are systematically higher than less social videos.'''
The goal of [[UsesMethod::Clustering]] is to retrieve all ''single rooted'' components so that all phrases in a component are closely related by deleting a set of edges of minimum total weight. The single rooted indicates a directed acyclic sub-graph if it contains exactly one root node (out-degree = 0). As other clustering problem, it proves to be a NP-hard problem. Therefore the authors propose three heuristic towards a feasible clustering solution. And the authors claim using the heuristic (although, in my opinion , the contribution of the heuristic is obscure) they found that keeping an edge to the shortest phrase yields 9% improvement over the baseline, 12% improvement keeping an edge to the most frequent phrases and 13% greedily assigning the node to the cluster with the most edges([[UsesMethod::Hill Climbing]]). The experimental result ([[UsesMethod::Network Structure Analysis]]) demonstrates that the volume distribution for both phrase (solid blue) and phrase cluster (dashed green) generated by their cluster method follows power law distribution .
 
[[File:Meme-tracking_and_the_Dynamics_of_the_News_Cycle_3.png‎]]
 
  
== Data set ==
+
=== Socialness & categories ===
90 million news and blog articles 390GB collected over the final three months of the 2008 U.S. Presidential Election (from August 1 to October 31 2008).
+
Different measurement for the level of socialness would yields different categories that are most social, see Fig.5.
 +
[[File:Catching a viral video-fig5.png|300px|thumb|alt=none|Fig.5 Three measures for
 +
ranking the level of sharing within video categories]]
  
== Experimental Result ==
+
*'''Observation III: In terms of the fraction of videos within the category that are highly social, the most social category is "Pets & Animal"; in terms of the fraction of views within the category that are social the category becomes "Education"; Regarding the absolute number of social views within category the answer is "Music".'''
  
Based on the 35,800 non-trivial clusters (at least two phrases), the author extracted 50 largest threads which can be regarded as the cluster of the cluster containing some phrases and the threads are depicted in the following famous figure.
+
If Facebook and Twitter are regarded as two categories representing social network sites and micro-blogs they found:
  
[[File:Meme-tracking and the Dynamics of the News Cycle 2.png]]
+
*'''Observation IV: The Twitter views (by a factor of 4.5) are more highly concentrated near the day of peak viewing than the Facebook views (by a factor of 2.4) which may be result of Twitter's real-time sharing paradigm. In addition the Twitter views are more likely to be associated with highly shared videos than Facebook views are.
  
From the above figure we can not only obtain a clue about the news cycle but also get an idea about the popular news in each period. In addition, the authors also conclude their findings by global analysis and local analysis.
+
=== Viral vs. Popular videos===
 +
The authors track two videos a viral video and a popular music video which received most of its views from searches. They found that
  
=== Global Analysis ===
+
*'''Observation V: The viral video tends to peak more sharply and wane more rapidly whereas the popular music video exhibits a steady and regular growth pattern after the peak view.'''
The authors compare the dynamics of news threads to the follicular development within an ecosystem and claims two ingredients affecting the dynamics: '''imitation''' (different sources imitate one another) and '''recency''' (up-to-date news are always favored to old ones). Based on the idea, the authors propose a generation model based on the famous preferential attachment model ([[UsesMethod::BA model]]). At each discrete time step <math>t = 1,2,3...</math>, a source chooses thread j with probability proportional to <math>f(n_j)\delta(t-t_j)</math>, where <math>f(\cdot)</math> is a monotonically increasing function and <math>n_j</math> denotes the #stories about thread j; <math>\delta(\cdot)</math> is a monotonically decreasing function and <math>t_j</math> is the first time j was proposed. Intuitively the attachment is governed by the two factors and is preferential to "richer threads" ('''imitation''') and the novelty of the threads ('''recency''').
 
  
An interesting theoretical property of the dynamics is that:
+
*'''Observation VI: YouTube related video and search are the other two major reasons account for large number of views besides the social sharing. Most of the popular videos (top 1% videos in terms of views) result from the related videos and search.'''
  
Suppose we focus on thread j and let <math>X(t)</math> be the thread's volume at time t and <math>t_j = 0</math> then we have:
+
If defining popular ratio as:
  
<math>
+
<math>PR(video ) = \frac{\text{Views in the Second Month}}{\text{Views in the First Ten Days}}\, </math>
X(t+1) = cf(X(t))delta(t)
 
</math>
 
  
Subtracting X(t) and dividing <math>\delta(t)</math> on both sides we have:
+
A video is called short-term popular if its PR is low and the one with large PR is called long-term popular, see Fig 12.
  
<math>
+
[[File:Catching a viral video-fig12.png|300px|thumb|alt=none|Fig.12 Density of percent of
\frac{X(t+1)-X(t)}{\delta(t)} = \frac{dX}{d\delta} = cf(X(t))-\frac{X(t)}{\delta(t)}
+
social views for popular videos in the first ten days]]
</math>
 
  
which is essentially an differential equation. For certain <math>f(\cdot)</math> and <math>\delta(\cdot)</math> we can obtain closed form for X.
+
*'''Observation VII: the density of viral and non-viral videos in short-term popular videos set are similar. However, almost no viral videos belongs to the long-term popular videos (because viral video fade quickly after its peek view).'''
  
=== Local Analysis ===
+
=== Ranking social sites ===
The authors find some interesting observations through local analysis:
 
  
1. News stories will gradually diffused to Blog after its cycle.
+
By the viral videos, the authors propose a method to rank the social blogs in terms of their propensity of spreading the viral videos. Let <math>V_p\, </math> be the set of viral videos (with at least 60% of social views in the first month) and <math>V_u\, </math> as the set of unpopular videos (with less than 100 views in the first 30 days) and <math>W_{100}(u)\, </math> be the set of videos with at least of 100 views from url u.
  
2. Quotes can migrate from blogs to news media
+
<math>r(u) = \frac{|V_p \cap W_{100}(u)|}{|V_p \cup V_u|}\, </math>
  
3. Different sites have different respond time for a phrase.
+
r(u) is applied to filter out (<math>r(u) < r_{low}\, </math>) some outliers such as Facebook. For each video <math>v \in V_p\, </math>, <math>views(u,v)\, </math> denotes its number of view coming from url u. Based on the notions, the rank function for each url becomes:
  
== Notes ==
+
<math>R(u) = \sum_{v \in V_p} views(u,v)\, </math>
[http://memetracker.org/supp]
+
[[File:Catching a viral video-tab2.png|300px|thumb|alt=none|Tab.2 Viral video rank for
Support website
+
top 25 urls]]
  
[http://cs.stanford.edu/people/jure/pubs/blogs-sdm07.pdf]
+
The author perform the ranking function through all external links and rank social blogs in terms of their propensity of spreading the viral videos. Tab. 2 presents top 25 sites. To verify the ranking list they compare against the rank on technorati.com which applies different methods for ranking blogs. Though the ranking function seems simple, the comparison demonstrates that their method exhibits a reasonable correlation with the rank list from technorati.com.
J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst. Cascading behavior in large blog graphs.SDM’07.
 
  
[http://www.google.com/urlsa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCcQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.89.7278%26rep%3Drep1%26type%3Dpdf&ei=kcNjUO-ZAYTo0QGF_YGwDg&usg=AFQjCNG_LOaTmD6hJ9Z4gMgXC914XSvYSA&sig2=fqyTk1yl1PlBS7Kh1comrg]
+
== Related Papers ==
X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends.Proc. KDD, 2006.
 
  
[http://www.xuanhui.me/pub/kdd07-bursty.pdf]
+
There is a paper highly cited paper approaching the similar problem:
X. Wang, C. Zhai, X. Hu, R. Sproat. Mining correlated bursty topic patterns from coordinated text streams.KDD, 2007.
+
Gábor Szabó, Bernardo A. Huberman: Predicting the popularity of online content. Commun. ACM 53(8): 80-88 (2010)[http://www.hpl.hp.com/research/idl/papers/predictions/predictions.pdf]

Latest revision as of 23:41, 3 October 2012

Citation

Tom Broxton and Yannet Interian and Jon Vaver and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19.

Online version

[1]

Summary

This is a paper of Google Research introducing the preliminary analysis on viral videos [2](YouTube Analysis). The data set used in the study is a large-scale and confidential data set, thus the revealed conclusion from which is considerable valuable. Specifically it analyzes the correlation between the degree of social sharing and the video view growth, the video category and the social sites linking to it.

Different research reaches the same conclusion that the most distinguishing characteristic of the viral video is its lifespan. Compared with "popular videos" which are also capable of attracting large number of views, the viral videos gain traction in social media quickly and fade quickly as well.

Data set

1.5 million video randomly selected from the set of video uploaded to YouTube between Apirl 2009 and March 2010. Each video is associated with the meta-data including its category, the number of view at daily level and most importantly the "referrer" (It seems impossible for obtain such information outside of Google) which accounts the source from which the user came to watch a particular video. The authors further classify the referrer into social and non-social categories:

  • Social: External link and embeds (from a social site such as Facebook, blogs or instant messages) and Unknown (the user typed or copied URL into browser)
  • Non-social: Youtube internal link (related or recommended videos) and Youtube search(found by an search engine).

Conclusions

First of all, the authors categorize the videos into 10 group according to their level of "socialness" which is quantified by the fraction of social views during the first 30 days of viewing. The least social group has 0.0 to 6.1% social views whereas the most social group enjoys 81.8 to 100% social views (the number of videos in each group is approximately the same).

Socialness & view growth

Fig.3(a) and Fig.3(b) illustrate the pattern of the view growth from three socialness segments from which which the following observations could be reached:

none
Fig.3(a) growth of relative views from three video segments
none
Fig.3(b) growth of absolute views from three video segments
  • Observation I: The videos with higher socialness has the highest growth leading up to the peak as well as the highest declination rate.
  • Observation II: The videos with higher socialness enjoys the peaks that are systematically higher than less social videos.

Socialness & categories

Different measurement for the level of socialness would yields different categories that are most social, see Fig.5.

none
Fig.5 Three measures for ranking the level of sharing within video categories
  • Observation III: In terms of the fraction of videos within the category that are highly social, the most social category is "Pets & Animal"; in terms of the fraction of views within the category that are social the category becomes "Education"; Regarding the absolute number of social views within category the answer is "Music".

If Facebook and Twitter are regarded as two categories representing social network sites and micro-blogs they found:

  • Observation IV: The Twitter views (by a factor of 4.5) are more highly concentrated near the day of peak viewing than the Facebook views (by a factor of 2.4) which may be result of Twitter's real-time sharing paradigm. In addition the Twitter views are more likely to be associated with highly shared videos than Facebook views are.

Viral vs. Popular videos

The authors track two videos a viral video and a popular music video which received most of its views from searches. They found that

  • Observation V: The viral video tends to peak more sharply and wane more rapidly whereas the popular music video exhibits a steady and regular growth pattern after the peak view.
  • Observation VI: YouTube related video and search are the other two major reasons account for large number of views besides the social sharing. Most of the popular videos (top 1% videos in terms of views) result from the related videos and search.

If defining popular ratio as:

A video is called short-term popular if its PR is low and the one with large PR is called long-term popular, see Fig 12.

none
Fig.12 Density of percent of social views for popular videos in the first ten days
  • Observation VII: the density of viral and non-viral videos in short-term popular videos set are similar. However, almost no viral videos belongs to the long-term popular videos (because viral video fade quickly after its peek view).

Ranking social sites

By the viral videos, the authors propose a method to rank the social blogs in terms of their propensity of spreading the viral videos. Let be the set of viral videos (with at least 60% of social views in the first month) and as the set of unpopular videos (with less than 100 views in the first 30 days) and be the set of videos with at least of 100 views from url u.

r(u) is applied to filter out () some outliers such as Facebook. For each video , denotes its number of view coming from url u. Based on the notions, the rank function for each url becomes:

none
Tab.2 Viral video rank for top 25 urls

The author perform the ranking function through all external links and rank social blogs in terms of their propensity of spreading the viral videos. Tab. 2 presents top 25 sites. To verify the ranking list they compare against the rank on technorati.com which applies different methods for ranking blogs. Though the ranking function seems simple, the comparison demonstrates that their method exhibits a reasonable correlation with the rank list from technorati.com.

Related Papers

There is a paper highly cited paper approaching the similar problem: Gábor Szabó, Bernardo A. Huberman: Predicting the popularity of online content. Commun. ACM 53(8): 80-88 (2010)[3]