Difference between revisions of "Jaccard similarity"

Revision as of 21:01, 30 March 2011

This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.

What problem does it address

Quantifying similarity between two binary vectors. The Jaccard similarity measures the overlap between two vectors: the number of attributes set in both, relative to the number of attributes set in at least one of them. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus the similarity between two documents can be assessed by computing the Jaccard similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is taken to be 1 if the corresponding word occurs in the document and 0 otherwise.
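As a minimal sketch of the bag-of-words representation described above, the following Python snippet turns two short documents into binary vectors over a shared vocabulary. The function name `binary_vectors`, the lowercase whitespace tokenizer, and the sample sentences are illustrative assumptions, not part of the course material.

```python
def binary_vectors(doc_a, doc_b):
    """Return (vocabulary, vector_a, vector_b) for two documents.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    # Sorted union of both word sets, so both vectors share the same dimensions.
    vocabulary = sorted(words_a | words_b)
    # 1 if the word occurs in the document, 0 otherwise.
    vector_a = [1 if w in words_a else 0 for w in vocabulary]
    vector_b = [1 if w in words_b else 0 for w in vocabulary]
    return vocabulary, vector_a, vector_b


vocab, a, b = binary_vectors("the cat sat", "the dog sat")
print(vocab)  # ['cat', 'dog', 'sat', 'the']
print(a)      # [1, 0, 1, 1]
print(b)      # [0, 1, 1, 1]
```

Because the vocabulary here is built only from the two documents being compared, no dimension is 0 in both vectors. With a larger, corpus-wide vocabulary there would be many such dimensions.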

Algorithm

Input

\mathbf {A} :{\text{Binary Vector 1}}

\mathbf {B} :{\text{Binary Vector 2}}

A and B have the same size.

Output

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B is specified as follows:

\mathbf {M_{11}} :{\text{the number of attributes where A is 1 and B is 1}}

\mathbf {M_{01}} :{\text{the number of attributes where A is 0 and B is 1}}

\mathbf {M_{10}} :{\text{the number of attributes where A is 1 and B is 0}}

\mathbf {M_{00}} :{\text{the number of attributes where A is 0 and B is 0}}

Each attribute falls into exactly one of these four categories, so M_{11}+M_{01}+M_{10}+M_{00}=n. The Jaccard similarity is the number of attributes set in both A and B divided by the number of attributes set in at least one of them; attributes that are 0 in both (M_{00}) are not counted:

{\text{Jaccard similarity}}=\mathbf {J} ={\frac {M_{11}}{M_{01}+M_{10}+M_{11}}}
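The counts above can be computed directly from two binary vectors. The following is a minimal sketch in Python, using the standard definition J = M11 / (M01 + M10 + M11), in which attributes that are 0 in both vectors (M00) do not enter the ratio. The function name `jaccard_similarity` and the choice to return 0.0 when both vectors are all zeros are assumptions made for illustration. The coefficient is undefined in that case, and other conventions exist.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same size")
    # Count the attribute combinations that enter the coefficient.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denominator = m01 + m10 + m11
    # Assumption: two all-zero vectors are given similarity 0.0.
    if denominator == 0:
        return 0.0
    return m11 / denominator


A = [1, 0, 1, 1, 0]
B = [0, 1, 1, 1, 0]
# M11 = 2, M01 = 1, M10 = 1, M00 = 1, so J = 2 / 4
print(jaccard_similarity(A, B))  # 0.5
```

Note that the last attribute, which is 0 in both vectors, has no effect on the result: appending more such attributes to A and B leaves J unchanged.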

Used in

Widely used for calculating the similarity of documents under the bag-of-words and vector space models.
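For documents, the same quantity can be computed on word sets rather than explicit binary vectors, since the Jaccard similarity equals the size of the intersection divided by the size of the union. The sketch below assumes a lowercase whitespace tokenizer. The function name `jaccard_sets` and the example sentences are hypothetical.

```python
def jaccard_sets(doc_a, doc_b):
    """Jaccard similarity of two documents viewed as sets of words."""
    # Assumption: tokens are lowercased, whitespace-separated words.
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    union = words_a | words_b
    # Assumption: two empty documents are given similarity 0.0.
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# 3 shared words ("the", "sat", "on") out of 7 distinct words overall.
print(jaccard_sets("the cat sat on the mat", "the dog sat on the log"))
```

Repeated words count only once here, which matches the binary-vector view in which each vocabulary word is either present or absent.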

@@ Line 8: / Line 8: @@
 * Input
-A : Binary Vector 1
+:<math> \mathbf{A} : \text{Binary Vector 1}</math>
-B : Binary Vector 2
+:<math> \mathbf{B} : \text{Binary Vector 2}</math>
+The size of A and B are same.
 * Output
-:<math>\mathbf{a}\cdot\mathbf{b}
-=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta</math>
 Given two vectors of attributes, ''A'' and ''B'', the cosine similarity, ''θ'', is represented using a dot product and magnitude as
