Tom Broxton et al., Catching a viral video, J Intell Inf Syst 2011

From Cohen Courses

Revision as of 22:11, 30 September 2012

Citation

Tom Broxton and Yannet Interian and Jon Vaver and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19.


Online version

[1]

Summary

This paper from Google Research presents a preliminary analysis of viral videos [2] (Viral Video Analysis). The study uses a large-scale, confidential data set that is exclusive to the authors, which makes its conclusions considerably valuable. Specifically it

Different studies reach the same conclusion: the most distinguishing characteristic of a viral video is its lifespan. Compared with "popular videos", which are capable of attracting a large number of views, viral videos gain traction in social media quickly and fade just as quickly.


Data set

The data set consists of 1.5 million videos randomly selected from the set of videos uploaded to YouTube between April 2009 and March 2010. Each video is associated with metadata including its category, its daily view counts and, most importantly, the "referrer", which records the source from which the user came to watch a particular video (it seems impossible to obtain such information outside of Google). The authors further classify the referrers into social and non-social categories:

Social: External links and embeds (from a social site such as Facebook, blogs or instant messages) and Unknown (the user typed or copied the URL into the browser).

Non-social: YouTube internal links (related or recommended videos) and YouTube search (found by a search engine).
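The two-way split above can be sketched in a few lines of Python. This is only an illustration: the referrer labels (external_link, embed, unknown, youtube_internal, youtube_search) are hypothetical names chosen for this sketch, since the summary does not give the labels used in the actual data.

```python
from collections import Counter

# Hypothetical referrer labels (assumptions for this sketch, not from the paper).
SOCIAL_REFERRERS = {"external_link", "embed", "unknown"}
NON_SOCIAL_REFERRERS = {"youtube_internal", "youtube_search"}


def classify_referrer(referrer: str) -> str:
    """Map a referrer label to 'social' or 'non-social'."""
    if referrer in SOCIAL_REFERRERS:
        return "social"
    if referrer in NON_SOCIAL_REFERRERS:
        return "non-social"
    raise ValueError(f"unrecognized referrer: {referrer!r}")


def count_views_by_class(referrers):
    """Count views per class, given one referrer label per view."""
    return Counter(classify_referrer(r) for r in referrers)


# Toy input: one referrer label per recorded view of a single video.
views = ["embed", "youtube_search", "unknown", "youtube_internal", "external_link"]
counts = count_views_by_class(views)
print(counts["social"], counts["non-social"])
```

Raising an error on unknown labels, rather than defaulting to one class, keeps a mislabeled referrer from silently shifting the social share.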


Conclusions

First, the authors categorize the videos into 10 groups according to their level of "socialness".
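A minimal sketch of such a grouping follows. It rests on two assumptions that this summary does not confirm: that "socialness" is the fraction of a video's views coming from social referrers, and that the 10 groups are equal-width bins over that fraction. The function names are hypothetical.

```python
def socialness(social_views: int, total_views: int) -> float:
    """Assumed definition: share of a video's views that came from social referrers."""
    if total_views <= 0:
        raise ValueError("total_views must be positive")
    return social_views / total_views


def socialness_group(score: float, n_groups: int = 10) -> int:
    """Return a group index in 0..n_groups-1 using equal-width bins (an assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # A score of exactly 1.0 is placed in the last group instead of overflowing.
    return min(int(score * n_groups), n_groups - 1)


# Toy example: 1 of 4 views was social.
print(socialness_group(socialness(1, 4)))
```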

Social segmentation and video growth

First, pre-processing is conducted to eliminate noisy phrases from the data set:

1. Remove phrases whose word length is less than 4.

2. Remove phrases whose term frequency is less than 10.

3. Remove phrases whose domain frequency is at least 25% (to avoid spammers).
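The three filtering rules can be sketched as follows. The input format (one (phrase, domain) pair per occurrence) and the function name filter_phrases are assumptions made for illustration. "Domain frequency" is interpreted here as the largest share of a phrase's occurrences that come from a single domain; the summary does not define the term, so treat that reading as an assumption as well.

```python
from collections import Counter


def filter_phrases(mentions, min_words=4, min_count=10, max_domain_share=0.25):
    """Apply the three pre-processing rules to (phrase, domain) occurrence pairs."""
    term_freq = Counter()
    domain_freq = {}
    for phrase, domain in mentions:
        term_freq[phrase] += 1
        domain_freq.setdefault(phrase, Counter())[domain] += 1

    kept = []
    for phrase, count in term_freq.items():
        if len(phrase.split()) < min_words:
            continue  # rule 1: fewer than 4 words
        if count < min_count:
            continue  # rule 2: term frequency below 10
        # Assumed meaning of "domain frequency": share held by the top domain.
        top_share = max(domain_freq[phrase].values()) / count
        if top_share >= max_domain_share:
            continue  # rule 3: one domain dominates, likely spam
        kept.append(phrase)
    return kept


# Toy corpus: only the first phrase passes all three rules.
mentions = (
    [("a b c d e", f"site{i}.example") for i in range(10)]
    + [("a b c", f"site{i}.example") for i in range(10)]
    + [("w x y z", f"site{i}.example") for i in range(5)]
    + [("buy this now please", "spam.example")] * 6
    + [("buy this now please", f"site{i}.example") for i in range(6)]
)
print(filter_phrases(mentions))
```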

Graph construction

Each node in the phrase graph represents a phrase extracted from the corpus. An edge (p, q) is included for each pair of connected phrases p and q, and it always points from the shorter phrase to the longer one. Two phrases are connected if either the edit distance (treating a word as a token) is smaller than 1 or there is at least a 10-word consecutive overlap between them. In other words, an edge implies an inclusion relation between the phrases, and since every edge points strictly to a longer phrase, the graph is a directed acyclic graph (DAG).
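The construction above can be sketched as follows. The summary does not define how the edit distance is measured between phrases of different lengths, so this sketch assumes a "directed" word-level distance: the minimum number of word insertions, deletions or substitutions needed to turn p into some contiguous span of q. All function names are hypothetical, and pairs of equal length are skipped because edges must point to a strictly longer phrase.

```python
def directed_edit_distance(p, q):
    """Assumed definition: min word-level edits turning p into a contiguous span of q."""
    prev = [0] * (len(q) + 1)  # a match may start anywhere in q at no cost
    for i, pw in enumerate(p, start=1):
        curr = [i] + [0] * len(q)
        for j, qw in enumerate(q, start=1):
            cost = 0 if pw == qw else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return min(prev)  # a match may end anywhere in q


def longest_common_run(p, q):
    """Length of the longest run of consecutive words shared by p and q."""
    best = 0
    prev = [0] * (len(q) + 1)
    for pw in p:
        curr = [0] * (len(q) + 1)
        for j, qw in enumerate(q, start=1):
            if pw == qw:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def build_phrase_graph(phrases, max_distance=1, min_overlap=10):
    """Return directed edges (shorter phrase, longer phrase) as a list of pairs."""
    tokens = {phrase: phrase.split() for phrase in phrases}
    edges = []
    for p in phrases:
        for q in phrases:
            if len(tokens[p]) >= len(tokens[q]):
                continue  # edges point strictly from shorter to longer
            close = directed_edit_distance(tokens[p], tokens[q]) < max_distance
            overlap = longest_common_run(tokens[p], tokens[q]) >= min_overlap
            if close or overlap:
                edges.append((p, q))
    return edges


phrases = [
    "we will win",
    "yes we will win today",
    "totally different words here now",
]
print(build_phrase_graph(phrases))
```

Because every edge goes from a phrase to a strictly longer one, no cycle can form, which matches the DAG property stated above. The pairwise loop is quadratic in the number of phrases, so it is suitable only for small illustrations.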

The authors do not explain how the weight on each edge is calculated. They state only that the weight increases as the directed edit distance and the frequency of q grow.


Notes

[3] Support website

[4] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst. Cascading behavior in large blog graphs. SDM, 2007.

[5] X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. KDD, 2006.

[6] X. Wang, C. Zhai, X. Hu, R. Sproat. Mining correlated bursty topic patterns from coordinated text streams. KDD, 2007.