Jaccard similarity

From Cohen Courses

This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.

What problem does it address

Quantifying similarity between two binary vectors, or equivalently between two sets. The Jaccard coefficient measures the overlap between them: the number of attributes that both share, relative to the number of attributes that at least one of them has. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. For the Jaccard coefficient, each element of vector A and vector B is binary, recording whether the word occurs in the document, so the similarity between two documents is the fraction of their combined vocabulary that they have in common.
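As a minimal sketch of the set view of Jaccard similarity for documents: each document is reduced to the set of words it contains, and the score is the size of the intersection divided by the size of the union. The function name jaccard_words, the lowercase-and-split tokenization, and the convention for two empty documents are assumptions made for illustration, not part of the course material.

```python
def jaccard_words(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity between the word sets of two documents."""
    # Assumption: lowercasing and whitespace splitting stand in for a real tokenizer.
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())

    union = words_a | words_b
    if not union:
        # Assumption: two empty documents are treated as identical.
        return 1.0

    # Shared words divided by all distinct words seen in either document.
    return len(words_a & words_b) / len(union)


# The two sentences share {"the", "sat", "on"} out of 7 distinct words, so the score is 3/7.
print(jaccard_words("the cat sat on the mat", "the dog sat on the log"))
```

Because the documents are reduced to sets, repeated words count only once; word frequencies do not affect the score.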

Algorithm

  • Input -
         A : Binary Vector 1
         B : Binary Vector 2 
         
  • Output -
         J : Jaccard coefficient of A and B, a value between 0 and 1


Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share in their attributes. Each attribute of A and B can be either 0 or 1. The total number of attributes for each combination of values in A and B is specified as follows:

   M11 represents the total number of attributes where A and B both have a value of 1.
   M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
   M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
   M00 represents the total number of attributes where A and B both have a value of 0.
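Each attribute falls into exactly one of these four categories, so M11 + M01 + M10 + M00 = n. The Jaccard coefficient is then J = M11 / (M01 + M10 + M11); M00 does not enter the formula, so attributes absent from both objects do not count as agreement. The counts above can be sketched as follows. The function names and the value returned when the denominator is zero are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def count_combinations(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (M11, M01, M10, M00) for two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")

    m11 = m01 = m10 = m00 = 0
    for x, y in zip(a, b):
        if x not in (0, 1) or y not in (0, 1):
            raise ValueError("attributes must be 0 or 1")
        if x == 1 and y == 1:
            m11 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        else:
            m00 += 1
    return m11, m01, m10, m00


def jaccard_binary(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard coefficient J = M11 / (M01 + M10 + M11)."""
    m11, m01, m10, _m00 = count_combinations(a, b)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return m11 / denominator


A = [1, 1, 0, 0, 1]
B = [1, 0, 1, 0, 1]
print(count_combinations(A, B))  # (M11, M01, M10, M00) = (2, 1, 1, 1)
print(jaccard_binary(A, B))      # 2 / (1 + 1 + 2) = 0.5
```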

Used in

Widely used for calculating the similarity of documents using the bag-of-words and vector space models.

Relevant Papers