Difference between revisions of "Jaccard similarity"
Revision as of 21:01, 30 March 2011
This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.
What problem does it address
Quantifying similarity between two binary vectors. The Jaccard coefficient measures how much the attributes of the two vectors overlap. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the Jaccard similarity between the vectors corresponding to these two documents. Because the inputs are binary, each element of vector A and vector B records whether the corresponding word is present (1) or absent (0) in the document.
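To illustrate the bag-of-words representation for the binary case, here is a minimal Python sketch. It assumes whitespace tokenization and lowercasing, which the text above does not specify; the function name `binary_vectors` is hypothetical.

```python
def binary_vectors(doc_a, doc_b):
    """Return (vocab, vec_a, vec_b), with one 0/1 entry per vocabulary word.

    Assumption: words are split on whitespace and lowercased.
    """
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    # The shared vocabulary fixes the dimensions, so both vectors have the same size.
    vocab = sorted(words_a | words_b)
    vec_a = [1 if w in words_a else 0 for w in vocab]
    vec_b = [1 if w in words_b else 0 for w in vocab]
    return vocab, vec_a, vec_b


vocab, a, b = binary_vectors("the cat sat", "the dog sat")
print(vocab)  # ['cat', 'dog', 'sat', 'the']
print(a)      # [1, 0, 1, 1]
print(b)      # [0, 1, 1, 1]
```

Building both vectors over the union of the two vocabularies guarantees that they have the same length.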
Algorithm
- Input
A: binary vector 1
B: binary vector 2
A and B must be the same size.
- Output
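The Jaccard coefficient of A and B, a value between 0 and 1.
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share in their attributes. Each attribute of A and B is either 0 or 1. The total number of each combination of attributes for A and B is specified as follows:
- M11: the number of attributes where A and B are both 1.
- M01: the number of attributes where A is 0 and B is 1.
- M10: the number of attributes where A is 1 and B is 0.
- M00: the number of attributes where A and B are both 0.
Each attribute falls into exactly one of these four categories, so M11 + M01 + M10 + M00 = n. The Jaccard coefficient is then
:<math> J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} </math>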
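A minimal Python sketch of this computation, assuming A and B are equal-length lists of 0/1 values. It uses the standard definition of the Jaccard coefficient, M11 / (M01 + M10 + M11), where M11 counts the attributes that are 1 in both vectors, M01 those that are 1 only in B, and M10 those that are 1 only in A. The function name `jaccard` is hypothetical, and returning 0.0 when both vectors are all zeros is an arbitrary choice made here, since the coefficient is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard coefficient of two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must be the same size")
    # Count each combination of attribute values that involves at least one 1.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: both vectors are all zeros; the coefficient is undefined,
        # and this sketch returns 0.0 by convention.
        return 0.0
    return m11 / denominator


print(jaccard([1, 0, 1, 1], [0, 1, 1, 1]))  # 0.5
```

In the example, two attributes are 1 in both vectors, one is 1 only in A, and one is 1 only in B, giving 2 / (1 + 1 + 2) = 0.5. Attributes that are 0 in both vectors (M00) do not affect the result.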
Used in
Widely used for calculating the similarity of documents under the bag-of-words and vector space models.