Guralnik 99

From Cohen Courses

== Citation ==

@inproceedings{:conf/kdd/GuralnikS99,
 author    = {Valery Guralnik and
              Jaideep Srivastava},
 title     = {Event Detection from Time Series Data},
 booktitle = {KDD},
 year      = {1999},
 pages     = {33-42},
 ee        = {http://doi.acm.org/10.1145/312129.312190},
 bibsource = {http://dblp.uni-trier.de}
}
== Abstract from the paper ==

In the past few years there has been increased interest in using data-mining techniques to extract interesting patterns from time series data generated by sensors monitoring temporally varying phenomenon. Most work has assumed that raw data is somehow processed to generate a sequence of events, which is then mined for interesting episodes. In some cases the rule for determining when a sensor reading should generate an event is well known. However, if the phenomenon is ill-understood, stating such a rule is difficult. Detection of events in such an environment is the focus of this paper. Consider a dynamic phenomenon whose behavior changes enough over time to be considered a qualitatively significant change. The problem we investigate is that of identifying the time points at which the behavior change occurs. In the statistics literature this has been called the change-point detection problem. The standard approach has been to (a) a priori determine the number of change-points that are to be discovered, and (b) decide the function that will be used for curve fitting in the interval between successive change-points. In this paper we generalize along both these dimensions. We propose an iterative algorithm that fits a model to a time segment, and uses a likelihood criterion to determine if the segment should be partitioned further, i.e. if it contains a new change-point. In this paper we present algorithms for both the batch and incremental versions of the problem, and evaluate their behavior with synthetic and real data. Finally, we present initial results comparing the change-points detected by the batch algorithm with those detected by people using visual inspection.
  
 
== Online version ==

[http://delivery.acm.org/10.1145/320000/312190/p33-guralnik.pdf?ip=128.237.122.250&acc=ACTIVE%20SERVICE&CFID=119212228&CFTOKEN=52277574&__acm__=1348531826_377333b00daa1db4fd36cb60f6bb28fb pdf link to the paper]

== Summary ==
=== Task Definition ===

*Develop a general approach to change-point detection that generalizes across a wide range of applications.

=== Method ===

==== Batch Algorithm ====

*Algorithm overview
 
The algorithm takes the set of approximating basis functions MSet and the time series T (a Python sketch of this procedure follows the pseudocode below):
# new-change-point = find-candidate(T, MSet)
# Change-Points = <math>\phi</math>
# Candidates = <math>\phi</math>
# T1, T2 = get-new-time-ranges(T, Change-Points, new-change-point)
# '''while''' (stopping criteria is not met) do begin
## c1 = find-candidate(T1, MSet)
## c2 = find-candidate(T2, MSet)
## Candidates = Candidates <math>\cup c_1</math>
## Candidates = Candidates <math>\cup c_2</math>
## new-change-point = c <math>\in</math> Candidates such that Q(Change-Points, c) is minimal
## Candidates = Candidates \ new-change-point
## T1, T2 = get-new-time-ranges(T, Change-Points, new-change-point)
## Change-Points = Change-Points <math>\cup</math> new-change-point
# '''end'''
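To make the control flow concrete, here is a minimal Python sketch of a batch top-down segmentation in the spirit of this pseudocode; it is not the paper's implementation. It assumes the model set MSet is a small set of polynomial degrees fit by least squares, uses the total sum of squared residuals of a segmentation as a stand-in for the likelihood criterion Q, and the function names (fit_cost, find_candidate, segmentation_cost) as well as the rel_tol and min_len parameters are illustrative.

<pre>
import numpy as np

def fit_cost(y, mset):
    """Cost of the best model in mset on one segment (sum of squared residuals)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    best = np.inf
    for degree in mset:
        if len(y) <= degree + 1:                 # too few points for this model
            continue
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        best = min(best, float(resid @ resid))
    return best

def find_candidate(y, start, end, mset, min_len=5):
    """find-candidate: the split point in [start, end) minimizing the two-segment cost."""
    best_t, best_cost = None, np.inf
    for t in range(start + min_len, end - min_len):
        cost = fit_cost(y[start:t], mset) + fit_cost(y[t:end], mset)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

def segmentation_cost(y, change_points, mset):
    """Q: total cost of the segments induced by a set of change points."""
    bounds = [0] + sorted(change_points) + [len(y)]
    return sum(fit_cost(y[a:b], mset) for a, b in zip(bounds, bounds[1:]))

def batch_change_points(y, mset=(1, 2), rel_tol=0.05, min_len=5):
    """Top-down batch procedure: keep a candidate pool, promote the candidate that
    minimizes Q, then re-search the two segments around the newly added point."""
    y = np.asarray(y, dtype=float)
    change_points, candidates = [], set()        # Change-Points = empty, Candidates = empty
    prev_q = fit_cost(y, mset)                   # cost with no change points at all
    new_cp = find_candidate(y, 0, len(y), mset, min_len)
    while new_cp is not None:
        q = segmentation_cost(y, change_points + [new_cp], mset)
        if (prev_q - q) / prev_q < rel_tol:      # stopping criterion: negligible relative gain
            break
        change_points.append(new_cp)
        prev_q = q
        bounds = [0] + sorted(change_points) + [len(y)]
        i = bounds.index(new_cp)                 # T1, T2 = get-new-time-ranges(...)
        for a, b in ((bounds[i - 1], bounds[i]), (bounds[i], bounds[i + 1])):
            c = find_candidate(y, a, b, mset, min_len)
            if c is not None:
                candidates.add(c)
        if not candidates:
            break
        new_cp = min(candidates,                 # candidate with the smallest Q
                     key=lambda c: segmentation_cost(y, change_points + [c], mset))
        candidates.discard(new_cp)
    return sorted(change_points)

# Tiny usage example on synthetic piecewise-linear data with breaks at 100 and 200.
if __name__ == "__main__":
    t = np.arange(300)
    y = np.concatenate([0.5 * t[:100], 50 - 0.2 * t[:100], 30 + 0.8 * t[:100]])
    y += np.random.default_rng(0).normal(0, 1.0, size=300)
    print(batch_change_points(y))                # expect points near 100 and 200
</pre>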
  
*Stopping Criteria

If the likelihood criteria of iterations k and k+1 differ by less than a small constant times the likelihood criterion of iteration k, the algorithm stops; in other words, the search ends when the relative improvement of the fit becomes negligible (a formula version is given below).
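Written as a formula, and assuming <math>L_k</math> denotes the (to-be-minimized) likelihood criterion after iteration k and <math>s</math> is the small user-chosen constant (this notation is ours, not the paper's), the batch algorithm stops when

<math>\frac{L_k - L_{k+1}}{L_k} < s</math>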
  
*Experiment with Traffic Data

The data used in the experiments was taken from highway traffic sensors, called loop detectors, in the Minneapolis-St. Paul metro area. A loop detector is a sensor, embedded in the road, with an electro-magnetic field around it, which is broken when a vehicle goes over it. Each such breaking, and the subsequent re-establishment, of the field is recorded as a count. Traffic volume is defined as the vehicle count per unit time. The evaluation asks how the change-point detection algorithm performs compared to a person doing the same task through visual inspection.

* The results seem a little weak: there is no ground truth. Statistically speaking, though, the algorithm does perform better than the four human subjects.
  
==== Incremental Algorithm ====

If the next data point collected by the sensor reflects a significant change in the phenomenon, the likelihood criterion under the assumption that it is a change point will be smaller than the likelihood criterion under the assumption that it is not. However, if the difference between the two is small, it may just be noise. The incremental algorithm therefore changes the stopping criterion to:
  
* The difference between the likelihood criteria of the no-change and change hypotheses is below a small constant times the no-change likelihood criterion (written out as a formula below).
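Spelled out, and assuming <math>L_{nc}</math> is the likelihood criterion when the newest points are treated as a continuation of the current segment (no change), <math>L_{c}</math> the criterion when a change point is hypothesized, and <math>s</math> a small constant (again our notation, not the paper's), the test is roughly

<math>|L_{nc} - L_{c}| < s \cdot L_{nc}</math>

i.e. the point is treated as noise while the inequality holds, and a change point is declared once the gap grows larger. A hedged sketch of how such an online check could be wired up, reusing fit_cost and find_candidate from the batch sketch above (the segment bookkeeping and thresholds are illustrative):

<pre>
import numpy as np

def incremental_change_points(stream, mset=(1, 2), s=0.05, min_len=10):
    """Process readings one at a time; declare a change point when the no-change and
    change likelihood criteria differ by more than s times the no-change criterion."""
    y, seg_start, change_points = [], 0, []
    for value in stream:                          # one sensor reading at a time
        y.append(float(value))
        t = len(y)
        if t - seg_start < 2 * min_len:           # wait for enough data in this segment
            continue
        seg = np.asarray(y[seg_start:t])
        l_nc = fit_cost(seg, mset)                # one model over the whole segment
        split = find_candidate(seg, 0, len(seg), mset, min_len)
        if split is None:
            continue
        l_c = fit_cost(seg[:split], mset) + fit_cost(seg[split:], mset)
        if abs(l_nc - l_c) >= s * l_nc:           # gap too large to be noise
            change_points.append(seg_start + split)
            seg_start = seg_start + split
    return change_points
</pre>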

===== Performance Evaluation =====

Not as good as the batch algorithm: the incremental version can only reach a local optimum, since future data has not yet been observed when each decision is made.

== Interesting points ==

It is a non-Bayesian model, so no prior needs to be specified. It is very easy to implement and it takes the time span into account, which is important; however, since this is a fairly old paper, the experiments are somewhat dated.
  
== Related Papers ==

*T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In WWW ’10: Proceedings of the 19th International Conference on World Wide Web, pages 851–860, New York, NY, USA, April 2010. ACM.

*J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: news in tweets. In GIS ’09: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42–51, New York, NY, USA, November 2009. ACM Press.
* [[RelatedPaper::Lin_et_al_KDD_2011|A Statistical Model for Popular Events Tracking in Social Communities. Lin et al, KDD 2011]] This paper addresses a method to observe and track popular events and topics that evolve over time in social communities.

* [[RelatedPaper::Yang et al, SIGIR 98|A study on retrospective and online event detection. Yang et al, SIGIR 98]] This paper addresses the problem of detecting events in news stories.

* [[RelatedPaper::Zhao et al, AAAI 07|Temporal and information flow based event detection from social text streams. Zhao et al, AAAI 07]] This paper addresses the problem of detecting events from social text streams.

* [[RelatedPaper::Automatic_Detection_and_Classification_of_Social_Events|Automatic Detection and Classification of Social Events. Agarwal and Rambow, ACL 10]] This paper aims at detecting and classifying social events using tree kernels.

* [[RelatedPaper::Popescu and Pennacchiotti, CIKM 10|Detecting controversial events from Twitter. Popescu and Pennacchiotti, CIKM 10]] This paper addresses the task of identifying controversial events using Twitter as a starting point.

* [[RelatedPaper::Castillo_2011|Information credibility on twitter. Castillo et al, WWW 11]] This paper studies automatic methods for assessing the credibility of news and information propagated through Twitter.
