Difference between revisions of "Comparison: A Latent Variable Model for Geographic Lexical Variation and A probabilistic approach to spatiotemporal theme pattern mining on weblogs"

From Cohen Courses
(Created page with '==Papers== #[http://malt.ml.cmu.edu/mw/index.php/A_Latent_Variable_Model_for_Geographic_Lexical_Variation A Latent Variable Model for Geographic Lexical Variation] #[http://malt.…')
 
Line 5: Line 5:
 
== Problem ==
 
  
Hassan et al. were trying to target the problem of ranking documents in a set based on their similarity to identify the representative blogs in a given set usually based on different topics, similar to [[AddressesProblem::Blog summarization|blog summarization]].<br>
+
Jacob Eisenstein et al. aim to analyze how vernacular word usage varies with geography. In particular, their model captures lexical variation by both topic and geography, partitions regions into coherent linguistic communities, and can predict an author's location from raw text with some accuracy.
<p>Arguello et al. were trying to target the problem of [[AddressesProblem::Blog retrieval|blog retrieval]] - retrieving ranked list of blogs relevant to the given user query.</p>
 
<p>Basically these two papers are trying to achieve different goals. Hassan et al. proposed methods of ranking blogs within a given topic collection. This ranking of blogs based on their importance in a topic collection can be useful for blog search tasks. Arguello et al. experimented with various models to try and improve the blog search results.</p>
 
  
== Big Idea ==
+
 
The two papers differ in their respective central ideas as they both try to solve different problems as mentioned in Problem section above. They do use a common data set to evaluate their experiments, but their results can't be compared due to the difference in the problem they are addressing.
+
Q. Mei et al. aim to analyze weblogs by mining their spatiotemporal patterns. In particular, they address the limitation that earlier approaches to finding subtopics in weblogs consider only spatial information or only temporal information, but not both.
  
 
== Method ==
 
Hassan et al. have used BlogRank algorithm to rank blogs according to their popularity which considers lexical similarity between two blogs to identify graphical links between nodes representing the two blogs. And then based on the iterative algorithm like a random walk of this graph, it determines the rank for each blog. It also enhances diversity by penalizing blogs which are similar to a higher ranked blog.
 
  
<p>
+
Jacob et al. use an extended version of LDA that incorporates location information when modeling word distributions and defines a probabilistic model over both locations and documents.
Arguello et al. have used different blog representation models and query expansion techniques to enhance the blog retrieval results. They have tried representation models considering entire blog as one large document or treating each blog post as a small document within a collection. For query expansion they experimented with the traditional pseudo-relevance feedback model and another method where they extended the query using ranked anchor text from Wikipedia corpus related to the base query.
+
 
</p>
+
Q. Mei et al. base their model on pLSI, which does not provide a probabilistic (generative) model for documents.
  
 
== Dataset Used ==
 
Hassan et al. used the [[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets for experiments, whereas Arguello et al. used only the [[UsesDataset::TREC BLOG06]] dataset for its experiments.
+
Jacob et al. use the [[UsesDataset::GeoTagged Twitter Dataset]], and Q. Mei et al. use the [[UsesDataset:: Hurricane Katrina]], [[UsesDataset:: Hurricane Rita]], and [[UsesDataset:: IPod Nano]] weblog collections.
 +
 
 +
 
 +
== Big Idea ==
 +
 
 +
These two papers differ in all three aspects above, i.e., the problem addressed, the methods, and the datasets used.
 +
Problem: Jacob et al. try to find topics that relate to a specific user by incorporating that user's location information, while Q. Mei et al. aim to find subtopics across different times and locations in documents that share the same topic.
 +
 
 +
Method: Jacob et al. use an LDA-type model, while Q. Mei et al. adapt a pLSI-type method.
 +
 
 +
Dataset: Because they address different problems, the datasets they use are very different. Jacob et al. use Twitter-type documents, which are very short. Q. Mei et al. use weblogs, which are relatively long.
  
 
== Other Discussions ==
 
  
Both the papers show significant improvement in results from baseline with their proposed methods. Both these papers deal with problems which together are essential in better understanding of the blogosphere and will be helpful in blog retrieval and summarization.
+
It would be interesting to apply the methods used in Jacob et al. to the problems that Q. Mei et al. try to address, since LDA is generally claimed to improve on pLSI.
  
 
== Other Questions ==
 
#How much time did you spend reading the (new, non-wikified) paper you summarized? 3 hours
+
#How much time did you spend reading the (new, non-wikified) paper you summarized? 2.5 hours
#How much time did you spend reading the old wikified paper? 1 hour 30 min
+
#How much time did you spend reading the old wikified paper? 4 hours
#How much time did you spend reading the summary of the old paper? 20 min
+
#How much time did you spend reading the summary of the old paper? 10 min
#How much time did you spend reading background material? 1 hour
+
#How much time did you spend reading background material? 3 hours
#Was there a study plan for the old paper? No
+
#Was there a study plan for the old paper? Yes
##if so, did you read any of the items suggested by the study plan? and how much time did you spend with reading them? Study Plan Not Available
+
##if so, did you read any of the items suggested by the study plan? and how much time did you spend with reading them? 3 hours
 
#Give us any additional feedback you might have about this assignment.
 
#*I think, it might be useful to compare papers which address same problems. In this case, the two papers were trying to deal with different problems and hence there wasn't much to compare between the two. The time spent on this task can also be reduced if both the papers have been summarized by the same person comparing it, but that would depend on whether or not we want a different opinion for an existing summary (provided the comparison is between papers addressing same problems).
+
I like this task very much; it helps me think more deeply about the paper I read. It would be great if we could read papers that have great impact and real applications. The experiments in the paper I read are quite restricted and hard to re-implement.

Revision as of 11:11, 6 November 2012

Papers

  1. A Latent Variable Model for Geographic Lexical Variation
  2. A probabilistic approach to spatiotemporal theme pattern mining on weblogs

Problem

Jacob Eisenstein et al. aim to analyze how vernacular word usage varies with geography. In particular, their model captures lexical variation by both topic and geography, partitions regions into coherent linguistic communities, and can predict an author's location from raw text with some accuracy.


Q. Mei et al. aim to analyze weblogs by mining their spatiotemporal patterns. In particular, they address the limitation that earlier approaches to finding subtopics in weblogs consider only spatial information or only temporal information, but not both.

Method

Jacob et al. use an extended version of LDA that incorporates location information when modeling word distributions and defines a probabilistic model over both locations and documents.

Q. Mei et al. base their model on pLSI, which does not provide a probabilistic (generative) model for documents.
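As a rough illustration of the distinction drawn above, the textbook forms of the two model families can be written side by side. This is only a sketch of generic pLSI and LDA: it omits the location and time variables that the two papers add, and the symbols (z for topics, \theta_d, \alpha, \beta) are the usual conventions rather than notation taken from either paper.

```latex
% pLSI: each training document d has its own free mixing weights p(z | d),
% so there is no generative story for a new, unseen document.
p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)

% LDA: the mixing weights are drawn from a shared Dirichlet prior,
% which gives a probabilistic model at the document level.
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
p(w \mid \theta_d, \beta) = \sum_{z} \theta_{d,z}\, \beta_{z,w}
```

In both cases a word is explained as a mixture over topics; the difference is whether the per-document mixture is a free parameter (pLSI) or a random variable with a prior (LDA).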

Dataset Used

Jacob et al. use the GeoTagged Twitter Dataset, and Q. Mei et al. use the Hurricane Katrina, Hurricane Rita, and IPod Nano weblog collections.


Big Idea

These two papers differ in all three aspects above, i.e., the problem addressed, the methods, and the datasets used.

Problem: Jacob et al. try to find topics that relate to a specific user by incorporating that user's location information, while Q. Mei et al. aim to find subtopics across different times and locations in documents that share the same topic.

Method: Jacob et al. use an LDA-type model, while Q. Mei et al. adapt a pLSI-type method.

Dataset: Because they address different problems, the datasets they use are very different. Jacob et al. use Twitter-type documents, which are very short. Q. Mei et al. use weblogs, which are relatively long.
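To make the pLSI side of the method comparison a little more concrete, here is a minimal sketch of plain pLSI fitted with EM on a toy word-count matrix. It is not the model from Q. Mei et al., which additionally conditions themes on time and location; the function name `plsi`, its parameters, and the toy counts are all hypothetical and chosen only for illustration.

```python
import random


def plsi(counts, n_topics, n_iters=50, seed=0):
    """Fit a basic pLSI model with EM (illustrative sketch only).

    counts[d][w] is the number of times word w occurs in document d.
    Returns (p_z_d, p_w_z): per-document topic mixtures p(z|d) and
    per-topic word distributions p(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0:
            # Fall back to a uniform distribution if there is no mass.
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random positive initialisation, normalised to valid distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iters):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior over topics for this (document, word) pair.
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    expected = counts[d][w] * post[z]
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


if __name__ == "__main__":
    # Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
    toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
    doc_topics, topic_words = plsi(toy_counts, n_topics=2)
    for d, mix in enumerate(doc_topics):
        print(d, [round(p, 3) for p in mix])
```

Note that `p_z_d` is estimated separately for every training document, which is the sense in which pLSI has no document-level model; an LDA-style approach would instead place a prior over these mixtures.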

Other Discussions

It would be interesting to apply the methods used in Jacob et al. to the problems that Q. Mei et al. try to address, since LDA is generally claimed to improve on pLSI.

Other Questions

  1. How much time did you spend reading the (new, non-wikified) paper you summarized? 2.5 hours
  2. How much time did you spend reading the old wikified paper? 4 hours
  3. How much time did you spend reading the summary of the old paper? 10 min
  4. How much time did you spend reading background material? 3 hours
  5. Was there a study plan for the old paper? Yes
    1. if so, did you read any of the items suggested by the study plan? and how much time did you spend with reading them? 3 hours
  6. Give us any additional feedback you might have about this assignment.

I like this task very much; it helps me think more deeply about the paper I read. It would be great if we could read papers that have great impact and real applications. The experiments in the paper I read are quite restricted and hard to re-implement.