Cosine similarity
Latest revision as of 23:49, 6 February 2011
This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.
What problem does it address
Cosine similarity quantifies the similarity between two vectors as the cosine of the angle between them. In text domains, a document is generally treated as a bag of words, where each unique word in the vocabulary is one dimension of the vector. The similarity between two documents can thus be assessed by computing the cosine similarity between their corresponding vectors. Each element of vector A and vector B is generally taken to be a tf-idf weight.
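As a rough illustration of the bag-of-words representation described above, the following Python sketch turns a small set of documents into tf-idf vectors over a shared vocabulary. The function name tfidf_vectors, the whitespace tokenization, the raw-count term frequency, and the idf definition log(N / df) are all assumptions made for this example; tf-idf has several common variants and the source does not say which one is intended.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document (a string) to a tf-idf vector over a shared vocabulary.

    Assumptions for this sketch: lowercase whitespace tokenization,
    tf = raw term count, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    # One vector dimension per unique word, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [counts[word] * math.log(n_docs / df[word]) for word in vocabulary]
        )
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
```

Under this weighting, a word that occurs in every document (here "the") gets idf log(1) = 0, so it contributes nothing to any vector.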
Algorithm
- Input -
A : Vector 1
B : Vector 2
- Output -
cosine : cosine of the angle between the vectors
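Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed in terms of their dot product and magnitudes as
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$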
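A minimal Python sketch of this computation, dividing the dot product of A and B by the product of their magnitudes, might look like the following. The function name cosine_similarity and the choice to return 0.0 when either vector is all zeros (where the cosine is undefined) are assumptions of this example, not part of the original description.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is undefined for a zero vector; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity([3, 4], [4, 3]) is 24 / 25, while orthogonal vectors such as [1, 0] and [0, 1] give 0.0. With non-negative weights such as tf-idf, the result always lies between 0 and 1.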
Used in
Widely used for calculating the similarity of documents under the bag-of-words and vector space models.