OpenAI Embeddings & Cosine Similarity with Python


What Are Text Embeddings?

Text embeddings are numerical vectors that represent text in a way that captures semantic meaning. Sentences with similar meanings map to vectors that point in similar directions in high-dimensional space, which makes embeddings useful for search, recommendations, clustering, and classification.

💡 Why Embeddings Matter

Unlike keyword matching, embeddings capture meaning rather than exact wording. "I love dogs" and "Dogs are wonderful pets" get a high similarity score even though they share only one word, because their meanings are close.
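
To see why keyword matching falls short here, the sketch below computes a simple word-overlap (Jaccard) score for the example sentences. The `tokenize` and `jaccard` helpers, and the naive whitespace tokenization, are illustrative assumptions rather than part of any library.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap score: shared words divided by total distinct words."""
    words_a, words_b = tokenize(a), tokenize(b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


related = jaccard("I love dogs", "Dogs are wonderful pets")
unrelated = jaccard("I love dogs", "SQL Server is a database")

print(f"Related pair:   {related:.2f}")   # one shared word out of six
print(f"Unrelated pair: {unrelated:.2f}")  # no shared words
```

The related pair shares a single word out of six distinct ones, so a keyword-based score stays low even though the two sentences mean nearly the same thing.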

How It Works

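Embedding & Similarity Flow

📝 Input Text → 🔢 OpenAI Embedding API → 📊 1536-D Vector → 📐 Cosine Similarity → 🏆 Similarity Score

Each text is sent to the OpenAI embedding model, which returns a high-dimensional vector. The cosine similarity between two such vectors is a score between -1 and 1; for natural-language text the scores usually land between 0 and 1.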

Step 1: Setup & Imports

Start by importing the required libraries and loading your API key from a .env file:

python

from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)
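
If the key is missing, `os.getenv` silently returns `None` and the error only surfaces later, at the first API call. A small guard can fail earlier with a clearer message. This is a sketch, and the `require_env` helper name is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it "
            "in your shell before running the script."
        )
    return value


# Usage, after load_dotenv() has run:
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```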

Step 2: Define Your Sentences

Prepare the sentences you want to compare. We'll use three examples — two related to dogs, and one about databases:

python

sentences = [
    "I love dogs",
    "Dogs are wonderful pets",
    "SQL Server is a database"
]

Step 3: Generate Embeddings

Loop through each sentence and call the OpenAI Embeddings API using the text-embedding-3-small model:

python

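vectors = []

for sentence in sentences:
    # One API request per sentence; the response holds one embedding per input.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )

    # response.data[0].embedding is a plain list of floats.
    vectors.append(response.data[0].embedding)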
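
The embeddings endpoint also accepts a list of strings as `input`, so all three sentences can be embedded in one request instead of three. The sketch below assumes a `client` created as in Step 1; the `embed_batch` helper name is hypothetical, and sorting by each item's `index` field is a defensive step to keep the vectors aligned with the input order.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one API call and return vectors in input order."""
    response = client.embeddings.create(model=model, input=list(texts))
    # Each item in response.data carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage with the client and sentences defined earlier:
# vectors = embed_batch(client, sentences)
```

Batching mainly saves round trips; the resulting vectors are used exactly as in the loop version.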

🔑 Key Point

By default, each embedding from text-embedding-3-small is a 1536-dimensional vector, returned as a list of floats. These vectors capture the semantic essence of the input text in a format that machines can compare mathematically.

Step 4: Cosine Similarity Function

Define a function to compute the cosine similarity between two vectors. The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction):

python

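def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Dot product divided by the product of the vector lengths (L2 norms).
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )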
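
A quick check on small hand-made vectors shows the three landmark values of cosine similarity. This sketch re-implements the formula with the standard library so it runs without NumPy, and adds a guard for all-zero vectors, for which the formula is undefined. The `cosine_similarity_safe` name and the choice to return 0.0 in that case are assumptions.

```python
import math


def cosine_similarity_safe(a, b):
    """Cosine similarity of two equal-length sequences; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


same = cosine_similarity_safe([1, 0], [2, 0])        # same direction
orthogonal = cosine_similarity_safe([1, 0], [0, 3])  # perpendicular
opposite = cosine_similarity_safe([1, 0], [-4, 0])   # opposite direction

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Note that the vector lengths differ in every pair, yet the scores depend only on direction.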

Step 5: Compare & Print Results

Compare the first sentence against the second and third to see how semantic similarity is reflected in the scores:

python

score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)

Expected Output

Dog vs Dog: ~0.85

Dog vs Database: ~0.30

✅ What This Shows

The similarity between "I love dogs" and "Dogs are wonderful pets" is high (~0.85) because they share semantic meaning, while the similarity with "SQL Server is a database" is much lower (~0.30) because the topics are unrelated. Treat these figures as approximate: exact scores depend on the embedding model, and what matters is the gap between the two. Ranking texts by this score is the foundation of semantic search and RAG systems.
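
To compare every sentence against every other in one step, the vectors can be stacked into a matrix, normalized row by row, and multiplied by their own transpose. The sketch below uses small made-up vectors in place of real embeddings so that it runs without an API call; `similarity_matrix` is a hypothetical helper, and it assumes no row is all zeros.

```python
import numpy as np


def similarity_matrix(vectors):
    """Return an (n, n) matrix whose [i, j] entry is cosine(vectors[i], vectors[j])."""
    matrix = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a plain dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    return unit @ unit.T


# Toy 3-D stand-ins for the three sentence embeddings (not real API output).
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

scores = similarity_matrix(toy_vectors)
print(np.round(scores, 2))
```

With real embeddings, `similarity_matrix(vectors)` would give the two scores from Step 5 at positions `[0, 1]` and `[0, 2]`.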

Complete Code

Here's the complete script in one block. It runs as-is once the dependencies listed below are installed and OPENAI_API_KEY is set:

python

from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

sentences = [
    "I love dogs",
    "Dogs are wonderful pets",
    "SQL Server is a database"
]

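vectors = []

for sentence in sentences:
    # One API request per sentence; the response holds one embedding per input.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )

    # response.data[0].embedding is a plain list of floats.
    vectors.append(response.data[0].embedding)


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Dot product divided by the product of the vector lengths (L2 norms).
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )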


score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)

Dependencies

Install these packages before running the script:

bash

pip install openai python-dotenv numpy

⚠️ Requirements

• Create a .env file with your OPENAI_API_KEY
• Ensure you have billing set up on your OpenAI account
• The text-embedding-3-small model is cost-effective for most use cases

Real-World Use Cases

Text embeddings with cosine similarity power many AI applications:

Semantic Search — Find documents that match the meaning of a query, not just keywords
Recommendation Engines — Suggest similar products, articles, or content based on description similarity
Document Clustering — Group similar documents together for organization or analysis
Duplicate Detection — Identify near-duplicate content even with different wording
Chatbot Memory — Retrieve the most relevant past conversation context
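
Most of these use cases reduce to the same operation: embed a query, score it against stored vectors, and keep the best matches. A minimal sketch of that ranking step follows, using toy vectors instead of real embeddings. The `top_k` helper, the sample documents, and their vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vector, documents, k=2):
    """Rank (text, vector) pairs by similarity to the query; return the best k."""
    scored = [(cosine(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-D vectors standing in for stored embeddings (not real API output).
documents = [
    ("Dogs are wonderful pets", [0.8, 0.2, 0.1]),
    ("SQL Server is a database", [0.0, 0.1, 0.9]),
    ("Puppies need daily walks", [0.7, 0.3, 0.0]),
]
query = [0.9, 0.1, 0.0]  # stand-in for the embedding of "I love dogs"

for score, text in top_k(query, documents):
    print(f"{score:.2f}  {text}")
```

A linear scan like this is fine for a few thousand vectors; larger collections are where a dedicated vector index becomes worthwhile.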

Next Steps

Now that you understand embeddings and similarity, take it further:

Build a semantic search engine over your own documents
Combine embeddings with a vector index or database such as FAISS or ChromaDB
Implement RAG (Retrieval-Augmented Generation) to ground LLM responses in your data
Try text-embedding-3-large (3072 dimensions by default) for higher-quality embeddings at a higher per-token cost
