Difference between revisions of "Jaccard similarity"

From Cohen Courses
Line 7: Line 7:
 
== Algorithm ==
 
  
* Input -
+
* Input
          A : Binary Vector 1
+
A : Binary Vector 1
          B : Binary Vector 2  
+
B : Binary Vector 2  
 
            
 
            
 
+
* Output  
* Output -
 
  
 
:<math>\mathbf{a}\cdot\mathbf{b}
 
:<math>\mathbf{a}\cdot\mathbf{b}
Line 23: Line 22:
 
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>
 
  
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
+
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math>
  
 
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share in their attributes. Each attribute of A and B can be either 0 or 1. The counts of each combination of attribute values for A and B are specified as follows:
 

Revision as of 21:51, 30 March 2011

This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.

What problem does it address

Quantifying the similarity between two binary vectors. The Jaccard similarity measures the overlap between two vectors as the number of attributes set in both, divided by the number of attributes set in at least one of them. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. The similarity between two documents can thus be assessed by computing the Jaccard similarity between the vectors corresponding to these two documents, with each element of vector A and vector B taken to be 1 if the word occurs in the document and 0 otherwise.

Algorithm

  • Input

A : Binary Vector 1
B : Binary Vector 2

  • Output

Given two binary vectors of attributes, A and B, the Jaccard similarity, J, is computed from the counts of matching and mismatching attributes as

:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math>

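Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share in their attributes. Each attribute of A and B can be either 0 or 1. The counts of each combination of attribute values for A and B are specified as follows:

:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math>
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>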
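The counting step can be sketched in a few lines of Python. This is an illustrative sketch, not code from the course: the function names `jaccard_counts` and `jaccard_similarity` are hypothetical, the vectors are assumed to be equal-length lists of 0/1 values, and returning 1.0 for two all-zero vectors is a convention chosen here (the ratio is otherwise undefined in that case).

```python
def jaccard_counts(a, b):
    """Count the four attribute combinations for two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")
    pairs = list(zip(a, b))
    m11 = sum(1 for x, y in pairs if x == 1 and y == 1)  # A is 1 and B is 1
    m01 = sum(1 for x, y in pairs if x == 0 and y == 1)  # A is 0 and B is 1
    m10 = sum(1 for x, y in pairs if x == 1 and y == 0)  # A is 1 and B is 0
    m00 = sum(1 for x, y in pairs if x == 0 and y == 0)  # A is 0 and B is 0
    return m11, m01, m10, m00


def jaccard_similarity(a, b):
    """Jaccard similarity M11 / (M01 + M10 + M11); M00 is not used."""
    m11, m01, m10, _ = jaccard_counts(a, b)
    denom = m01 + m10 + m11
    # Assumption: two all-zero vectors are treated as identical (J = 1.0).
    return m11 / denom if denom else 1.0


A = [1, 0, 1, 1, 0, 0]
B = [1, 1, 0, 1, 0, 0]
print(jaccard_counts(A, B))      # (2, 1, 1, 2)
print(jaccard_similarity(A, B))  # 0.5
```

Note that the two attributes where both vectors are 0 do not affect the result: only attributes that are set in at least one vector enter the denominator.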

Used in

Widely used for calculating the similarity of documents under the bag-of-words and vector space models.
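For documents, a binary bag-of-words vector is equivalent to the set of distinct words in the document, so the Jaccard similarity reduces to the size of the intersection of the two word sets divided by the size of their union. The sketch below illustrates this under stated assumptions: `tokenize` and `jaccard_documents` are hypothetical helper names, and lowercased whitespace splitting stands in for whatever tokenization a real pipeline would use.

```python
def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real pipelines
    # typically also handle punctuation, stemming, and stop words.
    return set(text.lower().split())


def jaccard_documents(doc_a, doc_b):
    """Jaccard similarity of two documents viewed as sets of distinct words."""
    words_a, words_b = tokenize(doc_a), tokenize(doc_b)
    union = words_a | words_b
    if not union:
        # Assumption: two empty documents are treated as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


d1 = "the cat sat on the mat"
d2 = "the dog sat on the log"
# Shared words: the, sat, on (3); distinct words overall: 7.
print(jaccard_documents(d1, d2))
```

Working with sets avoids materializing a vector over the full vocabulary: words absent from both documents would only contribute to the 0/0 count, which the Jaccard similarity ignores.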

Relevant Papers