Tom Broxton et al., Catching a viral video, J Intell Inf Syst 2011

From Cohen Courses
Revision as of 22:00, 30 September 2012 by Lujiang

Citation

Tom Broxton and Yannet Interian and Jon Vaver and Mirjam Wattenhofer: Catching a viral video. Journal of Intelligent Information Systems 2011: 1-19.


Online version

[1]

Summary

This paper from Google Research presents a preliminary analysis of viral videos [2] (Viral Video Analysis). The study uses a large-scale, confidential, and exclusive data set, which makes its conclusions considerably valuable. Specifically it

Different studies reach the same conclusion: the most distinguishing characteristic of a viral video is its lifespan. Compared with "popular videos", which attract a large number of views, viral videos gain traction in social media quickly and fade just as quickly.


Data set

1.5 million videos randomly selected from the set of videos uploaded to YouTube between April 2009 and March 2010. Each video


Conclusions

First of all, the authors categorize the videos into 10 groups according to their level of "socialness".

Social segmentation and video growth

First, pre-processing is applied to eliminate noisy phrases from the data set:

1. Remove phrases whose word length is less than 4.

2. Remove phrases whose term frequency is less than 10.

3. Remove phrases whose domain frequency is at least 25% (to avoid spammers).
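
The three filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the input format (one `(phrase, domain)` pair per mention), the function name `filter_phrases`, and the constant names are hypothetical. In particular, "domain frequency" is not defined in this summary; the sketch assumes it means the share of a phrase's mentions that come from its single most frequent domain.

```python
from collections import Counter

MIN_WORDS = 4            # rule 1: minimum phrase length in words
MIN_TERM_FREQ = 10       # rule 2: minimum number of mentions
MAX_DOMAIN_SHARE = 0.25  # rule 3: assumed meaning of "domain frequency"


def filter_phrases(occurrences):
    """Return the phrases that survive all three filters.

    occurrences: iterable of (phrase, domain) pairs, one per mention
    (a hypothetical input format chosen for this sketch).
    """
    term_freq = Counter()
    domain_counts = {}
    for phrase, domain in occurrences:
        term_freq[phrase] += 1
        domain_counts.setdefault(phrase, Counter())[domain] += 1

    kept = []
    for phrase, freq in term_freq.items():
        if len(phrase.split()) < MIN_WORDS:
            continue  # rule 1: too short
        if freq < MIN_TERM_FREQ:
            continue  # rule 2: too rare
        # Assumption: share of mentions coming from the top domain.
        top_share = max(domain_counts[phrase].values()) / freq
        if top_share >= MAX_DOMAIN_SHARE:
            continue  # rule 3: dominated by one domain, likely spam
        kept.append(phrase)
    return kept
```

Phrases are returned in order of first appearance. If "domain frequency" is meant differently in the paper, only the `top_share` computation would need to change.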

Graph construction

Each node in the phrase graph represents a phrase extracted from the corpus. An edge is added between a pair of phrases p and q, always pointing from the shorter phrase to the longer one, if either the edit distance (treating each word as a token) is smaller than 1 or the two phrases share a consecutive overlap of at least 10 words. In other words, an edge expresses an inclusion relation between the phrases, and because every edge points strictly toward the longer phrase, the graph is a directed acyclic graph (DAG).
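
A minimal sketch of this construction follows, written under assumptions that this summary does not confirm. The function names are hypothetical. The "edit distance" is interpreted here as a directed, word-level distance: the minimum number of word insertions, deletions, or substitutions needed to turn p into some contiguous run of words in q. Under that reading, a distance below 1 means p appears verbatim inside q, which matches the inclusion relation described above.

```python
from itertools import combinations


def directed_edit_distance(p, q):
    """Min word-level edits turning p into some contiguous run of q.

    p, q: lists of words. Assumed interpretation, see lead-in.
    """
    prev = [0] * (len(q) + 1)  # a match may start anywhere in q for free
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(q)
        for j in range(1, len(q) + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # drop a word of p
                         cur[j - 1] + 1,      # absorb an extra word of q
                         prev[j - 1] + cost)  # match or substitute
        prev = cur
    return min(prev)  # a match may end anywhere in q


def has_overlap(p, q, k):
    """True if p and q share at least k consecutive words."""
    runs = {tuple(p[i:i + k]) for i in range(len(p) - k + 1)}
    return any(tuple(q[j:j + k]) in runs for j in range(len(q) - k + 1))


def build_phrase_graph(phrases, max_dist=1, overlap=10):
    """Return directed edges (shorter, longer) between related phrases."""
    tokens = {ph: ph.split() for ph in phrases}
    edges = []
    for a, b in combinations(phrases, 2):
        p, q = sorted((a, b), key=lambda ph: len(tokens[ph]))
        if len(tokens[p]) == len(tokens[q]):
            continue  # edges go strictly from shorter to longer
        if (directed_edit_distance(tokens[p], tokens[q]) < max_dist
                or has_overlap(tokens[p], tokens[q], overlap)):
            edges.append((p, q))
    return edges
```

Because every edge goes from a phrase with fewer words to one with more, no cycle can form, so the result is a DAG by construction. The all-pairs loop is quadratic in the number of phrases and is meant only to make the rule concrete.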

The authors do not explain how the weight on each edge is calculated. They state only that the weight increases as the directed edit distance and the frequency of q grow.

