Guralnik 99

== Summary ==

=== Task Definition ===

* Develop a general approach to change-point detection that generalizes across a wide range of applications.

=== Method ===

==== Batch Algorithm ====

* Algorithm overview
 
The algorithm takes the set of approximating basis functions MSet and the time series T:

# new-change-point = find-candidate(T, MSet)
# Change-Points = <math>\phi</math>
# Candidates = <math>\phi</math>
# T1, T2 = get-new-time-ranges(T, Change-Points, new-change-point)
# '''while''' (stopping criterion is not met) '''do begin'''
## c1 = find-candidate(T1, MSet)
## c2 = find-candidate(T2, MSet)
## Candidates = Candidates <math>\cup c_1</math>
## Candidates = Candidates <math>\cup c_2</math>
## new-change-point = the c <math>\in</math> Candidates that minimizes Q(Change-Points, c)
## Candidates = Candidates \ new-change-point
## T1, T2 = get-new-time-ranges(T, Change-Points, new-change-point)
## Change-Points = Change-Points <math>\cup</math> new-change-point
# '''end'''
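The batch pseudocode above can be sketched concretely. This is a hypothetical illustration, not the paper's implementation: it assumes straight-line least-squares fits as the approximating basis (a stand-in for MSet), total squared error as the likelihood criterion Q, and a relative-improvement threshold as the stopping criterion; `find_candidate`, `batch_change_points`, and `tol` are names chosen here.

```python
import numpy as np

def sse_linear(t, y):
    """Squared error of the best least-squares line through (t, y)."""
    if len(t) < 2:
        return 0.0
    coeffs = np.polyfit(t, y, 1)           # one basis model: a line
    resid = y - np.polyval(coeffs, t)
    return float(resid @ resid)

def find_candidate(t, y, min_seg=2):
    """Best change point inside one segment: (split index, two-piece error)."""
    best_i, best_err = None, np.inf
    for i in range(min_seg, len(t) - min_seg + 1):
        err = sse_linear(t[:i], y[:i]) + sse_linear(t[i:], y[i:])
        if err < best_err:
            best_i, best_err = i, err
    return best_i, best_err

def batch_change_points(t, y, tol=0.05, min_seg=2):
    """Greedy batch segmentation: repeatedly add the candidate change point
    that most reduces the total squared error; stop when the relative
    improvement drops below tol (the stopping criterion)."""
    change_points = []
    prev_err = sse_linear(t, y)
    while True:
        if prev_err < 1e-12:                # already an (almost) perfect fit
            break
        bounds = [0] + change_points + [len(t)]
        best = None                         # (new total error, absolute index)
        for lo, hi in zip(bounds, bounds[1:]):
            i, err = find_candidate(t[lo:hi], y[lo:hi], min_seg)
            if i is None:
                continue
            # total error if this segment is split at lo + i
            total = prev_err - sse_linear(t[lo:hi], y[lo:hi]) + err
            if best is None or total < best[0]:
                best = (total, lo + i)
        if best is None:
            break
        new_err, cp = best
        if (prev_err - new_err) / prev_err < tol:   # stopping criterion
            break
        change_points = sorted(change_points + [cp])
        prev_err = new_err
    return change_points
```

On a series that is flat and then jumps to a constant level, the sketch recovers the single change point at the jump and then stops, since further splits no longer improve the fit.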
  
* Stopping Criteria

Consider the likelihood criteria in iterations k and k+1: the algorithm should stop when their difference, taken as a proportion of the previous iteration's likelihood, falls below a small constant.
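The stopping criterion above can be stated compactly. Writing <math>L_k</math> for the likelihood criterion after iteration k and <math>\epsilon</math> for the small constant (notation chosen here, not the paper's):

:<math>\frac{L_k - L_{k+1}}{L_k} < \epsilon</math>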
* Experiment with Traffic Data

The data used in the experiments was taken from highway traffic sensors, called loop detectors, in the Minneapolis-St. Paul metro area. A loop detector is a sensor embedded in the road, with an electromagnetic field around it that is broken when a vehicle passes over it. Each such breaking, and the subsequent re-establishment, of the field is recorded as a count; traffic volume is defined as the vehicle count per unit time. The evaluation asks how well the change-point detection algorithm performs compared to a person doing the same task through visual inspection.

* The results seem a little weak: there is no ground truth. Statistically speaking, though, the algorithm does perform better than the 4 human subjects.
==== Incremental Algorithm ====

If the next data point collected by the sensor reflects a significant change in the phenomenon, then its likelihood criterion of being a change point is going to be smaller than the likelihood criterion that it is not. However, if the difference in likelihood is small, it can just be noise. So the incremental algorithm changes the stopping criterion to:

* The difference between the no_change-period likelihood and the change-period likelihood is below a small constant times the no_change-period likelihood.
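The incremental idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes straight-line least-squares fits as the model family, total squared error as the likelihood criterion, and uses `delta` for the small constant in the bullet above; all function and parameter names are chosen here.

```python
import numpy as np

def sse_linear(t, y):
    """Squared error of the best least-squares line through (t, y)."""
    if len(t) < 2:
        return 0.0
    coeffs = np.polyfit(t, y, 1)
    resid = y - np.polyval(coeffs, t)
    return float(resid @ resid)

def incremental_change_points(t, y, delta=0.5, min_seg=3):
    """Process points one at a time; declare a change point only when the
    best two-piece fit beats the single-model ("no change") fit of the data
    seen since the last change point by more than delta times the no-change
    error -- otherwise treat the difference as noise."""
    change_points, start = [], 0
    for n in range(1, len(t) + 1):
        if n - start < 2 * min_seg:
            continue                        # not enough points to test yet
        seg_t, seg_y = t[start:n], y[start:n]
        no_change = sse_linear(seg_t, seg_y)
        if no_change < 1e-9:                # one model already fits perfectly
            continue
        best_i, best_err = None, np.inf
        for i in range(min_seg, len(seg_t) - min_seg + 1):
            err = (sse_linear(seg_t[:i], seg_y[:i])
                   + sse_linear(seg_t[i:], seg_y[i:]))
            if err < best_err:
                best_i, best_err = i, err
        if best_i is not None and no_change - best_err > delta * no_change:
            change_points.append(start + best_i)
            start += best_i                 # restart from the new change point
    return change_points
```

Because the decision is made as soon as the threshold is crossed, the detected point can land a couple of samples before the true change, which is exactly the local-optimum behavior discussed below.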
===== Performance Evaluation =====

Not as good as the batch model: the incremental algorithm reaches only a local optimum, since the data still to come has not yet been observed.

== Interesting points ==

It is a non-Bayesian model, so no prior model is required. It is very easy to implement, and it takes the time span into account, which is very important; but since it is a quite old paper, the experiments are obsolete.
  

Revision as of 22:07, 1 October 2012

== Online version ==

http://delivery.acm.org/10.1145/320000/312190/p33-guralnik.pdf?ip=128.237.122.250&acc=ACTIVE%20SERVICE&CFID=119212228&CFTOKEN=52277574&__acm__=1348531826_377333b00daa1db4fd36cb60f6bb28fb


== Citation ==

@inproceedings{conf/kdd/GuralnikS99,
  author    = {Valery Guralnik and Jaideep Srivastava},
  title     = {Event Detection from Time Series Data},
  booktitle = {KDD},
  year      = {1999},
  pages     = {33-42},
  ee        = {http://doi.acm.org/10.1145/312129.312190},
  bibsource = {http://dblp.uni-trier.de}
}






== Related Papers ==

* T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 851–860, New York, NY, USA, April 2010. ACM.
* J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: news in tweets. In GIS '09: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42–51, New York, NY, USA, November 2009. ACM Press.