Difference between revisions of "Jaccard similarity"

From Cohen Courses
(Created page with 'This is a technical [[category::method]] discussed in Social Media Analysis 10-802 in Spring 2010. == What problem does it address == Quantifying similarity between two vec…')
 
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]].
+
Jaccard similarity measures the similarity between two sample sets. In its basic form it applies to sets represented as binary attribute vectors. An extended version that handles attributes with counts or continuous values is called the [[UsesMethod::Tanimoto coefficient]].
  
== What problem does it address ==
+
== Algorithm ==
  
Quantifying similarity between two vectors. Refers to measuring the angular distance (cosine) between two vectors. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight.
+
* Input 
 
+
:<math> \mathbf{A} : \text{Binary Vector 1}</math>
== Algorithm ==
+
:<math> \mathbf{B} : \text{Binary Vector 2}</math>
  
* Input -
+
A and B must be the same size.
          A : Vector 1
 
          B : Vector 2
 
 
            
 
            
 +
* Output
  
* Output - cosine : cosine of angle between the vectors
+
:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>
 
+
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
:<math>\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta</math>
+
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math>
+
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>
 
 
Given two vectors of attributes, ''A'' and ''B'', the cosine similarity, ''θ'', is represented using a dot product and magnitude as
 
 
 
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
 
  
== Used in ==
+
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math>
  
Widely used for calculating the similarity of documents using the bag-of-words and vector space models
+
:<math> \text{Jaccard dissimilarity} = 1 - \mathbf{J} </math>
  
 
== Relevant Papers ==
 
== Relevant Papers ==
  
{{#ask: [[UsesMethod::Cosine_similarity]]
+
{{#ask: [[UsesMethod::Jaccard_similarity]]
 
| ?AddressesProblem
 
| ?AddressesProblem
 
| ?UsesDataset
 
| ?UsesDataset
 
}}
 
}}

Latest revision as of 22:21, 30 March 2011

Jaccard similarity measures the similarity between two sample sets. In its basic form it applies to sets represented as binary attribute vectors. An extended version that handles attributes with counts or continuous values is called the Tanimoto coefficient.

Algorithm

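  • Input

:<math> \mathbf{A} : \text{Binary Vector 1}</math>
:<math> \mathbf{B} : \text{Binary Vector 2}</math>

A and B must be the same size.

  • Output

:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math>
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>

:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math>
:<math> \text{Jaccard dissimilarity} = 1 - \mathbf{J} </math>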
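
A minimal Python sketch of the computation, under stated assumptions: it counts the four attribute combinations M11, M01, M10 and M00 for two equal-length binary vectors A and B, then computes J = M11 / (M01 + M10 + M11) and the dissimilarity 1 - J. M00 is counted but does not enter the ratio. The function names `match_counts`, `jaccard_similarity` and `jaccard_dissimilarity` are illustrative and do not come from the course page. Returning 1.0 when both vectors are all zeros is an assumed convention, since the ratio is undefined in that case.

```python
from typing import Sequence, Tuple


def match_counts(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (M11, M01, M10, M00) for two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must be the same size")
    pairs = list(zip(a, b))
    m11 = sum(1 for x, y in pairs if x == 1 and y == 1)  # A is 1 and B is 1
    m01 = sum(1 for x, y in pairs if x == 0 and y == 1)  # A is 0 and B is 1
    m10 = sum(1 for x, y in pairs if x == 1 and y == 0)  # A is 1 and B is 0
    m00 = sum(1 for x, y in pairs if x == 0 and y == 0)  # A is 0 and B is 0
    return m11, m01, m10, m00


def jaccard_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """J = M11 / (M01 + M10 + M11); M00 is ignored."""
    m11, m01, m10, _ = match_counts(a, b)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return m11 / denominator


def jaccard_dissimilarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard dissimilarity is 1 - J."""
    return 1.0 - jaccard_similarity(a, b)


A = [1, 0, 1, 1, 0]
B = [1, 1, 0, 1, 0]
print(match_counts(A, B))           # (2, 1, 1, 1)
print(jaccard_similarity(A, B))     # 0.5
print(jaccard_dissimilarity(A, B))  # 0.5
```

In the example, A and B share two attributes set to 1 and differ on two others, so J = 2 / (1 + 1 + 2) = 0.5. The final attribute, where both vectors are 0, has no effect on the result.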

Relevant Papers