Difference between revisions of "Cosine similarity"

From Cohen Courses

Latest revision as of 00:49, 7 February 2011

This is a technical method discussed in Social Media Analysis 10-802 in Spring 2010.

What problem does it address

Quantifying the similarity between two vectors by measuring the cosine of the angle between them. In text domains, a document is generally treated as a bag of words, where each unique word in the vocabulary is one dimension of the vector. The similarity between two documents can therefore be assessed by computing the cosine similarity between their corresponding vectors. Each element of vector A and vector B is generally taken to be the tf-idf weight of the corresponding word.

Algorithm

  • Input -
         A : Vector 1
         B : Vector 2 
         
  • Output - cosine : cosine of angle between the vectors

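Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using the dot product and the vector magnitudes as

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$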
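
The computation can be sketched in a few lines of Python: the dot product of A and B divided by the product of their magnitudes. This is a minimal illustration, not a reference implementation. The function name `cosine_similarity` is chosen here for illustration, and returning 0.0 for a zero vector is an assumed convention, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    # Dot product: sum of element-wise products.
    dot = sum(x * y for x, y in zip(a, b))

    # Euclidean magnitudes of each vector.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Assumed convention: the cosine is undefined for a zero vector,
    # so this sketch returns 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0

    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1. With non-negative weights such as tf-idf, the result stays between 0 and 1.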

Used in

Widely used for calculating the similarity of documents under the bag-of-words and vector space models.
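
As an illustration of the document use case, the following Python sketch builds sparse tf-idf vectors for a few toy documents and compares them with cosine similarity. The weighting used here (raw term count times log(N / document frequency)) is only one common tf-idf variant and is an assumption, as are the helper names `tfidf_vectors` and `sparse_cosine` and the example sentences.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {word: tf-idf weight} dict.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = sparse_cosine(vecs[0], vecs[1])  # documents sharing several words
sim_02 = sparse_cosine(vecs[0], vecs[2])  # documents sharing no words
```

The first two documents share words, so `sim_01` is positive, while `sim_02` is zero because the first and third documents have no words in common. Storing vectors as dictionaries avoids allocating one dimension per vocabulary word, which matters when the vocabulary is large and each document uses only a small part of it.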

Relevant Papers