Difference between revisions of "Cosine similarity"

From Cohen Courses
Jump to navigationJump to search
(Created page with 'Refers to measuring the angular distance (cosine) between two vectors. Cosine of two vectors can be easily derived by using the [[Euclidean vector#Dot product|Euclidean Dot Prod…')
 
Line 1: Line 1:
Refers to measuring the angular distance (cosine) between two vectors.  
+
In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight.
Cosine of two vectors can be easily derived by using the [[Euclidean vector#Dot product|Euclidean Dot Product]] formula:
+
 
 +
This is a technical [[category::method]] discussed in [[Social Media Analysis 10-802 in Spring 2010]].
 +
 
 +
== What problem does it address ==
 +
 
 +
Quantifying similarity between two vectors. Refers to measuring the angular distance (cosine) between two vectors.  
 +
 
 +
== Algorithm ==
 +
 
 +
* Input -
 +
          A : Vector 1
 +
          B : Vector 2
 +
         
 +
 
 +
* Output - cosine : cosine of angle between the vectors
  
 
:<math>\mathbf{a}\cdot\mathbf{b}
 
:<math>\mathbf{a}\cdot\mathbf{b}
Line 9: Line 23:
 
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
 
:<math> \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }</math>
  
In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight.
+
== Used in ==
 +
 
 +
Widely used for calculating the similarity of documents using the bag-of-words and vector space models
 +
 
 +
== Relevant Papers ==
 +
 
 +
{{#ask: [[UsesMethod::Cosine_similarity]]
 +
| ?AddressesProblem
 +
| ?UsesDataset
 +
}}

Revision as of 23:09, 6 February 2011

In text domains, a document is generally treated as a bag of words where each unique word in the vocabulary is a dimension of the vector. Thus similarity between two documents can be assessed by finding the cosine similarity between the vectors corresponding to these two documents. Each element of vector A and vector B is generally taken to be tf-idf weight.

This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.

What problem does it address

Quantifying similarity between two vectors. Refers to measuring the angular distance (cosine) between two vectors.

Algorithm

  • Input -
         A : Vector 1
         B : Vector 2 
         
  • Output - cosine : cosine of angle between the vectors

Given two vectors of attributes, A and B, the cosine similarity, θ, is represented using a dot product and magnitude as

Used in

Widely used for calculating the similarity of documents using the bag-of-words and vector space models

Relevant Papers