Difference between revisions of "Jaccard similarity"

From Cohen Courses
Line 1: Line 1:
This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]].
 
 
 
== What problem does it address ==
 
Cosine similarity quantifies the similarity between two vectors by the cosine of the angle between them. In text domains, a document is generally treated as a bag of words in which each unique word in the vocabulary is one dimension of the vector. The similarity between two documents can therefore be assessed by computing the cosine similarity between their vectors. Each element of vectors A and B is generally a tf-idf weight.
+
Jaccard similarity is used to measure the similarity between two sample sets.  
  
 
== Algorithm ==
 
Line 15: Line 13:
 
* Output  
 
Given two vectors of attributes, ''A'' and ''B'', the cosine similarity, ''θ'', is represented using a dot product and magnitude as
 
 
:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>  
 
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
 
Line 23: Line 20:
 
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math>
  
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient measures the overlap between the attributes of A and B. Each attribute of A and B is either 0 or 1. The number of attributes with each combination of values is denoted <math>M_{11}</math>, <math>M_{01}</math>, <math>M_{10}</math>, and <math>M_{00}</math>, where the first subscript is the value in A and the second is the value in B.
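
The counting described above can be sketched in Python. This is a minimal illustration, not code from the course page: the function name <code>jaccard_binary</code> is hypothetical, and returning 0.0 when neither object has any attribute set to 1 is an assumed convention (the ratio is undefined in that case).

```python
from typing import Sequence


def jaccard_binary(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity of two equal-length 0/1 attribute vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")

    # Count the attribute combinations that enter the coefficient.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    # M00 (both 0) is deliberately not counted: shared absences
    # do not contribute to Jaccard similarity.

    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumed convention: neither object has any attribute set.
        return 0.0
    return m11 / denominator


a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(jaccard_binary(a, b))  # M11 = 2, M10 = 1, M01 = 1, so 2 / 4 = 0.5
```

Note that the last attribute, which is 0 in both A and B, has no effect on the result.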
 
 
== Used in ==
 
 
Widely used for calculating the similarity of documents under the bag-of-words and vector space models.
 
  
 
== Relevant Papers ==
 

Revision as of 21:12, 30 March 2011

What problem does it address

Jaccard similarity is used to measure the similarity between two sample sets.
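
For two sets, the measure is commonly defined as the size of the intersection divided by the size of the union. A minimal Python sketch of that definition follows; the function name <code>jaccard_sets</code> and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
from typing import Hashable, Iterable


def jaccard_sets(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sample sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets have similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


doc1 = {"social", "media", "analysis"}
doc2 = {"social", "network", "analysis"}
print(jaccard_sets(doc1, doc2))  # 2 shared words out of 4 distinct words: 0.5
```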

Algorithm

  • Input

A and B must be the same size, that is, have the same number of attributes.

  • Output


Relevant Papers