
OpenAI Embeddings & Cosine Similarity with Python

Learn how to convert text into numerical vectors using OpenAI's embedding models and measure semantic similarity with cosine similarity.

📅 June 2026 ⏱️ 5 min read 👤 Trenzy Vibes
Python OpenAI Embeddings NumPy NLP Vector Search

What Are Text Embeddings?

Text embeddings are numerical vector representations of text that capture semantic meaning. Sentences with similar meanings tend to have vectors that point in similar directions in high-dimensional space. This makes embeddings well suited to search, recommendations, clustering, and classification tasks.

💡 Why Embeddings Matter

Unlike keyword matching, embeddings compare meaning rather than exact words. "I love dogs" and "Dogs are wonderful pets" get a high similarity score even though they share only one word, because their meanings are closely related.
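To see why keyword matching falls short here, it helps to measure how few words the two dog sentences actually share. The snippet below is a minimal sketch that uses Jaccard overlap on lowercase words as a stand-in for "keyword matching". The helper name jaccard_overlap is made up for this illustration, and real keyword search engines use more elaborate scoring.

```python
def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase words that two texts have in common."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty texts: define the overlap as zero
    return len(words_a & words_b) / len(words_a | words_b)


# Only "dogs" is shared, out of six distinct words in total
overlap = jaccard_overlap("I love dogs", "Dogs are wonderful pets")
print(f"Keyword overlap: {overlap:.2f}")  # prints "Keyword overlap: 0.17"
```

A keyword-based score treats these two sentences as barely related, which is the gap embeddings are meant to close.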

How It Works

Embedding & Similarity Flow

📝 Input Text → 🔢 OpenAI Embedding Model → 📊 1536-D Vector → 📐 Cosine Similarity → 🏆 Similarity Score (-1 to 1; related texts score closer to 1)

Step 1: Setup & Imports

Start by importing the required libraries and loading your API key from a .env file:

python
from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)
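If the .env file is missing or the variable name is misspelled, os.getenv returns None and the problem only shows up later as an authentication error. One way to fail earlier with a clearer message is a small guard like the one below. This is a sketch, and require_env is a hypothetical helper name, not part of the openai or python-dotenv packages.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

You could then create the client with OpenAI(api_key=require_env("OPENAI_API_KEY")) after calling load_dotenv().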

Step 2: Define Your Sentences

Prepare the sentences you want to compare. We'll use three examples — two related to dogs, and one about databases:

python
sentences = [
    "I love dogs",
    "Dogs are wonderful pets",
    "SQL Server is a database"
]

Step 3: Generate Embeddings

Loop through each sentence and call the OpenAI Embeddings API using the text-embedding-3-small model:

python
vectors = []

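# Request one embedding per sentence, keeping the vectors in input order
for sentence in sentences:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )

    # A single input string yields exactly one item in response.data
    vectors.append(response.data[0].embedding)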

🔑 Key Point

By default, each embedding from text-embedding-3-small is a 1536-dimensional vector. These vectors encode the meaning of the input text in a form that can be compared mathematically.
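The embeddings endpoint also accepts a list of strings, so the three sentences can be embedded in one request instead of three. The sketch below wraps that in a helper. The name embed_batch is hypothetical, and the function takes the client as an argument so it stays independent of how you created it. The optional dimensions argument maps to the API parameter of the same name, which the text-embedding-3 models support for requesting shorter vectors.

```python
def embed_batch(client, texts, model="text-embedding-3-small", dimensions=None):
    """Embed several texts in one request and return vectors in input order."""
    kwargs = {"model": model, "input": list(texts)}
    if dimensions is not None:
        # Only sent when requested; otherwise the model's default size is used
        kwargs["dimensions"] = dimensions

    response = client.embeddings.create(**kwargs)

    # Each returned item carries the index of the input it belongs to,
    # so sorting by index guarantees the output lines up with `texts`
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the client and sentences from the steps above, vectors = embed_batch(client, sentences) would replace the loop.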

Step 4: Cosine Similarity Function

Define a function to compute the cosine similarity between two vectors. Cosine similarity ranges from -1 (opposite) to 1 (identical direction):

python
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)

    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )
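When you have more than a couple of vectors, comparing them pair by pair gets tedious. The sketch below computes every pairwise score at once by scaling each row to unit length and taking a matrix product. The function name cosine_similarity_matrix is an illustrative choice, and the code assumes no vector is all zeros, since that would divide by zero.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return an (n, n) matrix whose [i, j] entry is the cosine similarity
    between vectors[i] and vectors[j]."""
    matrix = np.asarray(vectors, dtype=float)

    # Scale every row to unit length; assumes no row is the zero vector
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms

    # The dot product of unit vectors is exactly their cosine similarity
    return unit @ unit.T


# Toy 2-D vectors covering the full range of scores
toy = [[1, 0], [2, 0], [0, 1], [-1, 0]]
sims = cosine_similarity_matrix(toy)
print(sims[0, 1])  # same direction, different length
print(sims[0, 2])  # perpendicular
print(sims[0, 3])  # opposite direction
```

The three printed scores are 1, 0, and -1, which shows that cosine similarity ignores vector length and depends only on direction. OpenAI's documentation describes its embeddings as normalized to length 1, in which case a plain dot product gives the same score.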

Step 5: Compare & Print Results

Compare the first sentence against the second and third to see how semantic similarity is reflected in the scores:

python
score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)

Expected Output

Dog vs Dog: ~0.85
Dog vs Database: ~0.30

✅ What This Shows

The similarity between "I love dogs" and "Dogs are wonderful pets" is high (~0.85) because they share semantic meaning. The similarity with "SQL Server is a database" is much lower (~0.30) because the topics are unrelated. Exact scores vary with the embedding model, so treat these values as illustrative; what matters is the gap between the two. This is the foundation of semantic search and RAG systems.
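Semantic search applies the same comparison at scale: embed a query, score it against every stored vector, and return the best matches. Here is a minimal sketch of that ranking step using toy 2-D vectors in place of real embeddings. The function name rank_by_similarity and its top_k parameter are assumptions made for this example, and the code assumes no vector is all zeros.

```python
import numpy as np


def rank_by_similarity(query_vector, doc_vectors, top_k=3):
    """Return (index, score) pairs for the top_k documents most similar
    to the query, best match first."""
    query = np.asarray(query_vector, dtype=float)
    docs = np.asarray(doc_vectors, dtype=float)

    # Cosine similarity between the query and every document row
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

    # Sort by descending score and keep the first top_k positions
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins for embeddings: document 0 points almost the same way as the query
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = rank_by_similarity([1.0, 0.1], doc_vectors, top_k=2)
print([index for index, _ in results])  # prints [0, 2]
```

With real data, doc_vectors would hold the embeddings of your documents and the query vector would come from embedding the user's question with the same model.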

Complete Code

Here's the complete script in one block — ready to run:

python
from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

sentences = [
    "I love dogs",
    "Dogs are wonderful pets",
    "SQL Server is a database"
]

vectors = []

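# Request one embedding per sentence, keeping the vectors in input order
for sentence in sentences:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )

    # A single input string yields exactly one item in response.data
    vectors.append(response.data[0].embedding)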


def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)

    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )


score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)

Dependencies

Install these packages before running the script:

bash
pip install openai python-dotenv numpy

⚠️ Requirements

• Create a .env file with your OPENAI_API_KEY
• Ensure you have billing set up on your OpenAI account
• The text-embedding-3-small model is cost-effective for most use cases

Real-World Use Cases

Text embeddings with cosine similarity power many AI applications, including semantic search, recommendations, clustering, and classification.

Next Steps

Now that you understand embeddings and similarity, take it further:

  1. Build a semantic search engine over your own documents
  2. Combine embeddings with a vector database like FAISS or ChromaDB
  3. Implement RAG (Retrieval-Augmented Generation) to ground LLM responses in your data
  4. Try text-embedding-3-large (3072 dimensions by default) for higher-quality embeddings at a higher per-token cost
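To give a feel for the RAG step in the list above, here is a rough sketch of how retrieved passages might be placed into a prompt before it is sent to a language model. Everything in it is an assumption made for illustration: the function name build_rag_prompt, the prompt wording, and the idea that passages arrive already sorted from most to least similar.

```python
def build_rag_prompt(question, passages, max_passages=3):
    """Combine the best-matching passages and the user's question
    into a single prompt string."""
    # Assumes `passages` is already ordered from most to least similar
    selected = passages[:max_passages]

    # Number each passage so an answer can refer back to its source
    context = "\n".join(
        f"[{number}] {text}" for number, text in enumerate(selected, start=1)
    )

    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is SQL Server?",
    ["SQL Server is a database", "I love dogs"],
    max_passages=1,
)
print(prompt)
```

The retrieval half of RAG is the similarity ranking covered in this tutorial; the prompt above is only the hand-off to the generation half.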
