Cosine similarity

This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]].

== What problem does it address ==

Quantifying the similarity between two vectors by measuring the cosine of the angle between them. In text domains, a document is generally treated as a bag of words, where each unique word in the vocabulary is one dimension of the vector. The similarity between two documents can then be assessed by computing the cosine similarity between the vectors corresponding to the two documents. Each element of vectors A and B is generally taken to be a tf-idf weight.

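As an illustrative sketch (not part of the original page), the Python snippet below builds tf-idf weighted bag-of-words vectors for a toy corpus and compares documents by cosine similarity. The corpus, the whitespace tokenizer, and the smoothed idf factor <math>\log(1 + N/df)</math> are assumptions made for the example.

<pre>
import math
from collections import Counter

# Toy corpus and whitespace tokenization are assumptions for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase dogs",
]
tokenized = [doc.split() for doc in corpus]

# Bag of words: one vector dimension per unique word in the vocabulary.
vocabulary = sorted({w for doc in tokenized for w in doc})

# Document frequency of each word, used for the idf factor.
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(corpus)

def tfidf_vector(tokens):
    """Represent one document as a tf-idf weighted vector over the vocabulary."""
    tf = Counter(tokens)
    return [tf[w] * math.log(1 + n_docs / df[w]) for w in vocabulary]

def cosine(a, b):
    """Cosine of the angle between two vectors (dot product over norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vectors = [tfidf_vector(doc) for doc in tokenized]
print(cosine(vectors[0], vectors[1]))  # share "the", "sat", "on": similarity > 0
print(cosine(vectors[0], vectors[2]))  # no shared words: similarity = 0.0
</pre>

Documents that share many weighted terms point in similar directions and score close to 1, while documents with no terms in common are orthogonal and score 0.
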
== Algorithm ==

* Input -
          A : Vector 1
          B : Vector 2

* Output - cosine : cosine of the angle between the vectors
  
 
Given two vectors of attributes, A and B, the cosine similarity, <math>\cos(\theta)</math>, can be derived from the [[Euclidean vector#Dot product|Euclidean Dot Product]] formula and is represented using a dot product and magnitude as

:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
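As a minimal sketch (not taken from the original page), the snippet below is a direct translation of the formula above: the dot product of A and B divided by the product of their magnitudes. The function name and sample vectors are assumptions for the example.

<pre>
import math

# Illustrative translation of the cosine similarity formula; not from the wiki page.
def cosine_similarity(A, B):
    """Return cos(theta) for two equal-length numeric vectors A and B."""
    dot = sum(a * b for a, b in zip(A, B))           # A . B
    norm_A = math.sqrt(sum(a * a for a in A))        # ||A||
    norm_B = math.sqrt(sum(b * b for b in B))        # ||B||
    return dot / (norm_A * norm_B)

# Vectors pointing in the same direction give 1, orthogonal vectors give 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ~1.0 (up to floating point)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
</pre>
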
== Used in ==

Widely used for calculating the similarity of documents using the bag-of-words and vector space models.

== Relevant Papers ==

{{#ask: [[UsesMethod::Cosine_similarity]]
| ?AddressesProblem
| ?UsesDataset
}}
