Difference between revisions of "Jaccard similarity"
Revision as of 20:51, 30 March 2011
This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.
What problem does it address
Quantifying similarity between two binary vectors, or equivalently between two sets. The Jaccard coefficient measures how much two vectors overlap: the number of attributes present in both, relative to the number of attributes present in at least one. In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by computing the Jaccard similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is 1 if the corresponding word occurs in the document and 0 otherwise.
Algorithm
- Input
A : Binary Vector 1
B : Binary Vector 2
- Output
J : the Jaccard similarity (Jaccard coefficient) of A and B, a value between 0 and 1
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a measure of the overlap between the attributes of A and B. Each attribute of A and B is either 0 or 1, so every attribute position falls into exactly one of four combinations. The number of attributes in each combination is defined as follows:
:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math>
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math>
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math>
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math>
These four counts sum to n. The Jaccard similarity divides the number of attributes set in both objects by the number set in at least one of them, so the joint absences counted by M_00 do not enter the formula:
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}</math>
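As an illustration, the computation can be sketched in Python by counting M11, M01 and M10 over the paired attributes and dividing M11 by M01 + M10 + M11, which is the standard definition of the Jaccard coefficient (M00 is not needed). The function name `jaccard_similarity` and the choice to return 1.0 for two all-zero vectors are assumptions made for this sketch; the page does not specify either, and other conventions for the all-zero case exist.

```python
from typing import Sequence


def jaccard_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard similarity of two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")

    # Count the three combinations that appear in the formula.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)

    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return m11 / denominator


A = [1, 0, 1, 1, 0]
B = [1, 1, 0, 1, 0]
print(jaccard_similarity(A, B))  # M11 = 2, M01 = 1, M10 = 1, so J = 2 / 4 = 0.5
```

In this example the last attribute is 0 in both vectors; it contributes only to M00 and therefore does not change the result.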
Used in
Widely used for calculating the similarity of documents represented with the bag-of-words and vector space models.
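For documents, a binary bag-of-words vector is equivalent to the set of distinct words in the document, so the Jaccard similarity becomes the size of the intersection of the two word sets divided by the size of their union. The following is a minimal sketch under simplifying assumptions: the function name `document_jaccard` is hypothetical, tokenization is plain lowercasing and whitespace splitting, and two empty documents are treated as identical. A real pipeline would likely use a proper tokenizer and stop-word handling.

```python
def document_jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of two documents viewed as sets of distinct words."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())

    union = words_a | words_b
    if not union:
        # Assumption: two empty documents are treated as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


# Shared words: {"the", "sat", "on"}; 7 distinct words overall, so J = 3 / 7.
print(document_jaccard("the cat sat on the mat", "the dog sat on the log"))
```

Because the documents are reduced to sets, repeated words such as "the" count only once, which is the main difference from weighted schemes like tf-idf.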