Difference between revisions of "Jaccard similarity"
Revision as of 20:45, 30 March 2011
This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.
What problem does it address
Quantifying the similarity between two vectors. Cosine similarity measures the cosine of the angle between two vectors. In text domains, a document is generally treated as a bag of words in which each unique word in the vocabulary is one dimension of the vector, so the similarity between two documents can be assessed by computing the cosine similarity between their vectors. Each element of vector A and vector B is generally taken to be a tf-idf weight.
Algorithm
- Input -
A : Binary Vector 1
B : Binary Vector 2
- Output -
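Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using the dot product and the vector magnitudes as
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>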
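As a rough illustration, cosine similarity (the dot product of A and B divided by the product of their magnitudes) could be computed along the following lines. This is only a sketch: the function name `cosine_similarity` and the choice to raise an error for a zero vector are assumptions made for this example, not part of the page.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric vectors.

    Hypothetical helper written for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product, sum of A_i * B_i.
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the two Euclidean norms.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the zero vector as an error, since the ratio is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy weight vectors over a three-word vocabulary (made-up numbers).
    doc_a = [1.0, 2.0, 0.0]
    doc_b = [2.0, 4.0, 0.0]
    print(cosine_similarity(doc_a, doc_b))
```

Vectors pointing in the same direction score close to 1 regardless of their length, which is why the measure is popular for comparing documents of different sizes.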
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap between the attributes of A and B. Each attribute of A and B is either 0 or 1. The number of attributes falling into each combination of values for A and B is denoted as follows:
M11 represents the total number of attributes where A and B both have a value of 1.
M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M00 represents the total number of attributes where A and B both have a value of 0.
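The page does not yet state how these counts are combined. Under the standard definition, the Jaccard coefficient is J = M11 / (M01 + M10 + M11), so M00 is ignored. A minimal sketch under that assumption is shown here; the function names `jaccard_counts` and `jaccard_coefficient` are hypothetical, and raising an error when no attribute is set in either vector is a choice made for this example.

```python
def jaccard_counts(a, b):
    """Count the four attribute combinations for two binary vectors.

    Returns the tuple (m11, m01, m10, m00). Hypothetical helper for illustration.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    m11 = m01 = m10 = m00 = 0
    for x, y in zip(a, b):
        if x not in (0, 1) or y not in (0, 1):
            raise ValueError("attributes must be 0 or 1")
        if x == 1 and y == 1:
            m11 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        else:
            m00 += 1
    return m11, m01, m10, m00


def jaccard_coefficient(a, b):
    """Return M11 / (M01 + M10 + M11), the standard Jaccard coefficient."""
    m11, m01, m10, _m00 = jaccard_counts(a, b)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: with no attribute set in either vector the ratio is 0/0.
        raise ValueError("Jaccard coefficient is undefined when no attribute is set")
    return m11 / denominator


if __name__ == "__main__":
    a = [1, 1, 0, 0, 1]
    b = [1, 0, 1, 0, 1]
    print(jaccard_counts(a, b))       # (2, 1, 1, 1)
    print(jaccard_coefficient(a, b))  # 0.5
```

Because M00 is left out of the denominator, attributes that are absent from both objects do not inflate the score, which matters for sparse data such as word-occurrence vectors.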
Used in
Widely used for calculating the similarity of documents under the bag-of-words and vector space models.