Difference between revisions of "Cosine similarity"
(Created page with 'Refers to measuring the angular distance (cosine) between two vectors. Cosine of two vectors can be easily derived by using the [[Euclidean vector#Dot product|Euclidean Dot Prod…') |
|||
Line 1: | Line 1: | ||
− | Refers to measuring the angular distance (cosine) between two vectors. | + | In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight. |
− | + | ||
+ | This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]]. | ||
+ | |||
+ | == What problem does it address == | ||
+ | |||
+ | Quantifying similarity between two vectors. Refers to measuring the angular distance (cosine) between two vectors. | ||
+ | |||
+ | == Algorithm == | ||
+ | |||
+ | * Input - | ||
+ | A : Vector 1 | ||
+ | B : Vector 2 | ||
+ | |||
+ | |||
+ | * Output - cosine : cosine of angle between the vectors | ||
:<math>\mathbf{a}\cdot\mathbf{b} | :<math>\mathbf{a}\cdot\mathbf{b} | ||
Line 9: | Line 23: | ||
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math> | :<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math> | ||
− | + | == Used in == | |
+ | |||
+ | Widely used for calculating the similarity of documents using the bag-of-words and vector space models | ||
+ | |||
+ | == Relevant Papers == | ||
+ | |||
+ | {{#ask: [[UsesMethod::Cosine_similarity]] | ||
+ | | ?AddressesProblem | ||
+ | | ?UsesDataset | ||
+ | }} |
Revision as of 23:09, 6 February 2011
In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus, the similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vectors A and B is generally taken to be the tf-idf weight of the corresponding word.
This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.
What problem does it address
Quantifying the similarity between two vectors by measuring the cosine of the angle between them. The measure depends on the orientation of the vectors, not on their magnitudes.
Algorithm
- Input -
A : Vector 1
B : Vector 2
- Output - cosine : cosine of angle between the vectors
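Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>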
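The definition above (the dot product of A and B divided by the product of their magnitudes) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this example, since the ratio is undefined when either magnitude is zero.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")

    # Numerator: dot product, the sum over i of A_i * B_i.
    dot = sum(x * y for x, y in zip(a, b))

    # Denominator: product of the Euclidean magnitudes of A and B.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Assumption for this sketch: reject zero vectors, for which the
    # angle (and therefore its cosine) is undefined.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction give a value close to 1, perpendicular vectors give 0, and vectors pointing in opposite directions give a value close to -1, regardless of their lengths.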
Used in
Widely used for calculating the similarity of documents under the bag-of-words and vector space models.
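As an illustration of the document use case, the sketch below builds tf-idf weighted bag-of-words vectors and compares them with cosine similarity. Several details are assumptions made for this example and are not specified on this page: whitespace tokenization, raw term counts as tf, log(N / df) as idf, the helper names `tfidf_vectors` and `cosine_similarity`, and the convention of returning 0.0 when a document vector is all zeros.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document to a tf-idf vector over the shared vocabulary."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    # Each unique word in the vocabulary is one dimension of the vector.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed weighting: raw count times log(N / df).
        vectors.append(
            [counts[word] * math.log(n_docs / df[word]) for word in vocabulary]
        )
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention for this sketch: an all-zero document vector
        # is treated as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
_, vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing several words
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no words
```

Here the first two documents share words and therefore have a positive similarity, while the first and third share none, so their dot product and similarity are zero.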