Latent Friend Mining from Blog Data, ICDM 2006


== Citation ==

Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen. Latent Friend Mining from Blog Data. ICDM 2006.

== Online version ==

Latent Friend Mining from Blog Data

== Summary ==

The paper proposed a new problem: finding latent friends for web bloggers. It compared three different algorithms, i.e. cosine similarity, a topic model, and an ad-hoc two-phase algorithm, to address this problem. Moreover, the authors built a dataset from MSN Spaces to evaluate the three methods.

== Discussion ==

This paper proposed the novel problem of finding latent friends among web bloggers based on their interests. The authors gave a formal definition of a "latent friend" and argued for the importance of this new problem. Three methods were proposed and compared in the paper:

 1. Cosine similarity. This approach builds a bag-of-words vector for each user, and the "friendship" of two users is measured by the cosine similarity between the corresponding word vectors.
 2. Topic model. Each user is represented by a topic distribution, and "friendship" is based on the KL divergence between the two distributions (a minimal sketch of both measures follows this list).
 3. Ad-hoc two-phase algorithm. In the first phase, the authors compute similarity at the topic level, where topics come from a predefined hierarchy; in the second phase, they compute similarity within each topic (see the two-phase sketch further below).
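
To make the first two measures concrete, here is a minimal, self-contained Python sketch. The whitespace tokenization, the smoothing constant, and the example distributions are illustrative assumptions, not details from the paper:

<pre>
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Method 1: bag-of-words cosine similarity between two users' blog text."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kl_divergence(p, q, eps=1e-10):
    """Method 2: KL divergence D(p || q) between two users' topic
    distributions; smaller means more similar. Note KL is asymmetric,
    so a symmetrized variant may be what the paper actually uses;
    eps is a smoothing assumption to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Usage: higher cosine = closer; lower KL = closer.
blog_a = "i love hiking and landscape photography"
blog_b = "photography tips for hiking trips"
print(cosine_similarity(blog_a, blog_b))

p = [0.7, 0.2, 0.1]   # e.g. topic distribution inferred for user A
q = [0.6, 0.3, 0.1]   # topic distribution inferred for user B
print(kl_divergence(p, q))
</pre>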

After presenting the algorithms, the authors built a dataset from MSN Spaces to evaluate the three methods, and found that the ad-hoc two-phase algorithm worked best.
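
Since the two-phase approach performed best, a hedged sketch of the two-phase idea follows. The function names, the Jaccard within-topic measure, the top-k candidate-topic selection, and the multiplicative combination of the two scores are all illustrative assumptions, not the paper's exact algorithm:

<pre>
import math

def topic_level_similarity(p, q, eps=1e-10):
    # Phase 1: symmetrized KL-style distance over topic distributions,
    # turned into a similarity (assumption: the paper may use another form).
    d = sum(pi * math.log((pi + eps) / (qi + eps)) +
            qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 1.0 / (1.0 + d)

def within_topic_similarity(words_a, words_b):
    # Phase 2: set overlap (Jaccard) of the words each user wrote under
    # one topic; cosine over per-topic term vectors would work equally well.
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def two_phase_score(user_a, user_b, k_topics=2):
    # user = (topic_distribution, {topic_id: [words written under topic]})
    dist_a, words_a = user_a
    dist_b, words_b = user_b
    coarse = topic_level_similarity(dist_a, dist_b)
    # Restrict phase 2 to the topics user A cares most about.
    top = sorted(range(len(dist_a)), key=lambda t: dist_a[t], reverse=True)[:k_topics]
    fine = sum(within_topic_similarity(words_a.get(t, []), words_b.get(t, []))
               for t in top) / k_topics
    return coarse * fine   # assumption: multiplicative combination

user_a = ([0.7, 0.2, 0.1], {0: ["hiking", "trail"], 1: ["camera"]})
user_b = ([0.6, 0.3, 0.1], {0: ["trail", "maps"], 1: ["camera", "lens"]})
print(two_phase_score(user_a, user_b))
</pre>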

== Study plan ==
