Difference between revisions of "Jaccard similarity"

From Cohen Courses
Jump to navigationJump to search
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]].
+
Jaccard similarity is used to measure the similarity between two sample sets. Jaccard similarity can be applied to binary sets. An extended version of Jaccard similarity which deals with attributes with counts or continuous values is called [[UsesMethod::Tanimoto coefficient]].
  
== What problem does it address ==
+
== Algorithm ==
  
Quantifying similarity between two vectors. Refers to measuring the angular distance (cosine) between two vectors. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight.
+
* Input 
 +
:<math> \mathbf{A} : \text{Binary Vector 1}</math>
 +
:<math> \mathbf{B} : \text{Binary Vector 2}</math>
  
== Algorithm ==
+
The size of A and B are same.
 
 
* Input -
 
          A : Binary Vector 1
 
          B : Binary Vector 2
 
 
            
 
            
 +
* Output
  
* Output -
+
:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>
 
+
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
:<math>\mathbf{a}\cdot\mathbf{b}
+
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math>
=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta</math>
+
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>
 
 
Given two vectors of attributes, ''A'' and ''B'', the cosine similarity, ''θ'', is represented using a dot product and magnitude as
 
:<math> M_{11} : \text{the number of attributes where A is 1 and B is 1}</math>  
 
 
 
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
 
 
 
 
 
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:
 
 
 
    M11 represents the total number of attributes where A and B both have a value of 1.
 
    M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
 
    M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
 
    M00 represents the total number of attributes where A and B both have a value of 0.
 
  
== Used in ==
+
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{00} }</math>
  
Widely used for calculating the similarity of documents using the bag-of-words and vector space models
+
:<math> \text{Jaccard dissimilarity} = 1 - \mathbf{J} </math>
  
 
== Relevant Papers ==
 
== Relevant Papers ==
  
{{#ask: [[UsesMethod::Cosine_similarity]]
+
{{#ask: [[UsesMethod::Jaccard_similarity]]
 
| ?AddressesProblem
 
| ?AddressesProblem
 
| ?UsesDataset
 
| ?UsesDataset
 
}}
 
}}

Latest revision as of 22:21, 30 March 2011

Jaccard similarity is used to measure the similarity between two sample sets. Jaccard similarity can be applied to binary sets. An extended version of Jaccard similarity which deals with attributes with counts or continuous values is called Tanimoto coefficient.

Algorithm

  • Input

The size of A and B are same.

  • Output

Relevant Papers