Sklearn Cosine Similarity: Implementation Step By Step


We can import the cosine_similarity function from sklearn.metrics.pairwise. It calculates the cosine similarity between two NumPy arrays. In this article, we will implement cosine similarity step by step.


sklearn cosine similarity: Python –

Suppose you have two documents of different sizes. How would you compare them or find out how similar they are? Cosine similarity is a metric that lets you measure the similarity of the documents. It depends on the angle between the document vectors, not on their lengths, so the difference in size does not get in the way.

The formula for cosine similarity is given below. For two vectors A and B, it is their dot product divided by the product of their Euclidean norms (lengths):

cos(θ) = (A · B) / (‖A‖ × ‖B‖)



We will implement this in a few small steps. Let’s start.

Step 1: Importing package –

First, we will import the cosine_similarity function from the sklearn.metrics.pairwise module. We will also import NumPy for array creation. Here is the syntax.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

Step 2: Vector Creation –

Second, to demonstrate the cosine_similarity function we need vectors. Here the vectors are NumPy arrays. Let's create them.

array_vec_1 = np.array([[12,41,60,11,21]])
array_vec_2 = np.array([[40,11,4,11,14]])  # 4, not 04: a leading zero is a syntax error in Python 3
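
The double brackets above matter: cosine_similarity() works on 2-D input, with one row per sample and one column per feature. If your data starts out as a flat, 1-D array, a minimal sketch of the conversion could look like this (the names flat_vec and row_vec are only illustrative):

```python
import numpy as np

# A flat (1-D) vector, as you might get from a plain list of values.
flat_vec = np.array([12, 41, 60, 11, 21])

# reshape(1, -1) turns it into a single row; -1 lets NumPy work out
# the number of columns from the data.
row_vec = flat_vec.reshape(1, -1)

print(flat_vec.shape)  # (5,)
print(row_vec.shape)   # (1, 5)
```

After the reshape, row_vec has the same layout as array_vec_1 and array_vec_2.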

Step 3: Cosine Similarity –

Finally, once we have the vectors, we can call cosine_similarity() and pass both of them. It calculates the cosine similarity between the two. In general the value lies in [-1, 1]. For vectors with no negative components, like the ones here, it lies in [0, 1]. A value of 0 means the vectors are orthogonal, that is, they have nothing in common. A value of 1 means they point in exactly the same direction.

cosine_similarity(array_vec_1, array_vec_2)
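
To get a feel for what the score means, here is a small sketch with hand-picked toy vectors (they are assumptions made for illustration, not data from this article). It compares one vector with a scaled copy of itself, with a perpendicular vector, and with its mirror image. Note that once negative components are allowed, the score can go below 0, down to -1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

base = np.array([[1, 2]])
same_direction = np.array([[2, 4]])  # base scaled by 2
orthogonal = np.array([[2, -1]])     # dot product with base is 0
opposite = np.array([[-1, -2]])      # base scaled by -1

# The result is always a 2-D array, so [0, 0] picks out the single score.
score_same = cosine_similarity(base, same_direction)[0, 0]
score_orth = cosine_similarity(base, orthogonal)[0, 0]
score_opp = cosine_similarity(base, opposite)[0, 0]

print(round(score_same, 4), round(score_orth, 4), round(score_opp, 4))
```

The three scores come out at roughly 1, 0 and -1. Scaling a vector does not change its score, because only the direction counts.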

Complete code with output –

Let's put the code from each step together. Here it is:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
array_vec_1 = np.array([[12,41,60,11,21]])
array_vec_2 = np.array([[40,11,4,11,14]])
print(cosine_similarity(array_vec_1, array_vec_2))

Here we have used two different vectors. After applying the function, we get a cosine similarity of around 0.45227, which indicates that the vectors are neither very similar nor very different. In a real scenario, we use text embeddings as the NumPy vectors. We can use TF-IDF, CountVectorizer, FastText, BERT, etc. to generate the embeddings.
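
As a rough illustration of the text case, the sketch below turns three short documents into TF-IDF vectors with sklearn's TfidfVectorizer and compares them. The documents are made-up examples, and the default vectorizer settings are an assumption; a real project would likely tune them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, made up for this illustration.
documents = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]

# Each document becomes one row of a sparse TF-IDF matrix.
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Passing a single matrix compares every row with every other row.
similarity_matrix = cosine_similarity(tfidf_matrix)

print(similarity_matrix.shape)  # (3, 3)
print(similarity_matrix.round(2))
```

Entry [i, j] of the matrix is the similarity between document i and document j. The first two documents share most of their words and score high, while the third shares none with the first and scores 0.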

Conclusion –

Cosine similarity is a widely used way to measure the similarity between documents, and it works well irrespective of their size. We can also implement it without the sklearn module, but that is a more tedious task; sklearn simplifies it. I hope this article has made the implementation clear. If you still find any gaps in the information, please let us know by leaving a comment below.
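
For reference, here is a minimal sketch of a version without sklearn, using only the Python standard library. The function name cosine_similarity_manual is hypothetical, and raising an error for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity_manual(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when one vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (norm_a * norm_b)

vec_1 = [12, 41, 60, 11, 21]
vec_2 = [40, 11, 4, 11, 14]
result = cosine_similarity_manual(vec_1, vec_2)
print(round(result, 5))  # 0.45227
```

This matches the value from the sklearn version above, but it handles only one pair of vectors at a time.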


Data Science Learner Team
