What Are Text Embeddings?
Text embeddings are numerical vector representations of text that capture semantic meaning. Sentences with similar meanings have vectors that point in similar directions in high-dimensional space. This makes embeddings well suited to search, recommendation, clustering, and classification tasks.
💡 Why Embeddings Matter
Unlike keyword matching, embeddings understand meaning. "I love dogs" and "Dogs are wonderful pets" share a high similarity score even though they use different words — because the underlying meaning is the same.
How It Works
Text → OpenAI Embedding Model → High-dimensional Vector → Cosine Similarity → Similarity Score (-1 to 1, higher means more similar)
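The pipeline above can be tried without an API call by substituting small hand-made vectors for real embeddings. The sketch below is illustrative only: the three-dimensional vectors and the names toy_vectors and cosine are invented for this example and are not output from an OpenAI model.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; real ones have far more dimensions.
toy_vectors = {
    "I love dogs": np.array([0.9, 0.1, 0.0]),
    "Dogs are wonderful pets": np.array([0.8, 0.2, 0.1]),
    "SQL Server is a database": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog_vs_dog = cosine(toy_vectors["I love dogs"], toy_vectors["Dogs are wonderful pets"])
dog_vs_db = cosine(toy_vectors["I love dogs"], toy_vectors["SQL Server is a database"])

# The two dog sentences point in nearly the same direction, so they score higher.
print(dog_vs_dog > dog_vs_db)
```

The steps that follow replace the toy vectors with embeddings returned by the API; the comparison logic stays the same.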
Step 1: Setup & Imports
Start by importing the required libraries and loading your API key from a .env file:
import os

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

# Read variables from .env into the process environment
load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)
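If the .env file is missing or the variable name is misspelled, os.getenv returns None and the problem only shows up later, at the first API call. One possible way to fail earlier is a small helper like the one below. The name require_env and the choice of RuntimeError are assumptions for illustration, not part of the OpenAI or python-dotenv libraries.

```python
import os

def require_env(name):
    """Return an environment variable's value, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

After load_dotenv() has run, you could pass require_env("OPENAI_API_KEY") as the api_key argument instead of calling os.getenv directly.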
Step 2: Define Your Sentences
Prepare the sentences you want to compare. We'll use three examples — two related to dogs, and one about databases:
sentences = [
"I love dogs",
"Dogs are wonderful pets",
"SQL Server is a database"
]
Step 3: Generate Embeddings
Loop through each sentence and call the OpenAI Embeddings API using the text-embedding-3-small model:
vectors = []

for sentence in sentences:
    # One API request per sentence
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )
    vectors.append(response.data[0].embedding)
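The loop above makes one request per sentence. The embeddings endpoint also accepts a list of strings as input, so the same result can usually be obtained in a single request. The sketch below assumes each returned item exposes index and embedding attributes, as the current OpenAI Python client does; embed_batch is a hypothetical helper name, and the client is passed in as a parameter so the function can be exercised with a stand-in object.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the output order matches the order of `texts`.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the client from Step 1, vectors = embed_batch(client, sentences) would stand in for the loop. Batching mainly saves round trips; the vectors themselves should be the same.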
🔑 Key Point
Each embedding from text-embedding-3-small is a 1536-dimensional vector by default.
These vectors capture the semantic essence of the input text in a format that machines can compare mathematically.
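A quick sanity check on the returned vectors can catch mix-ups such as combining embeddings from different models. The helper below is a minimal sketch; to_matrix and its expected_dim parameter are hypothetical names introduced here.

```python
import numpy as np

def to_matrix(vectors, expected_dim=None):
    """Stack a list of embeddings into a 2-D array, checking their dimension."""
    matrix = np.asarray(vectors, dtype=float)
    if matrix.ndim != 2:
        raise ValueError("expected a non-empty list of equal-length vectors")
    if expected_dim is not None and matrix.shape[1] != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {matrix.shape[1]}"
        )
    return matrix
```

For the vectors from Step 3, to_matrix(vectors, expected_dim=1536) should return an array of shape (3, 1536).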
Step 4: Cosine Similarity Function
Define a function to compute the cosine similarity between two vectors. Cosine similarity ranges from -1 (opposite) to 1 (identical direction):
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )
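To build intuition for the -1 to 1 range, it helps to run the same formula on vectors whose angle is known in advance. The sketch below also guards against zero-length vectors, which would otherwise cause a division by zero. The name safe_cosine_similarity and the decision to raise ValueError are assumptions made for this example.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity that rejects zero-length vectors instead of dividing by zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

same_direction = safe_cosine_similarity([1, 0], [2, 0])   # 1.0: length does not matter
perpendicular = safe_cosine_similarity([1, 0], [0, 1])    # 0.0: unrelated directions
opposite = safe_cosine_similarity([1, 0], [-1, 0])        # -1.0: opposite directions
print(same_direction, perpendicular, opposite)
```

Note that the first pair scores 1.0 even though the vectors differ in length: cosine similarity compares direction only.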
Step 5: Compare & Print Results
Compare the first sentence against the second and third to see how semantic similarity is reflected in the scores:
# "I love dogs" vs "Dogs are wonderful pets"
score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

# "I love dogs" vs "SQL Server is a database"
score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)
Expected Output
✅ What This Shows
The similarity between "I love dogs" and "Dogs are wonderful pets" is high (~0.85) because they share semantic meaning. The similarity with "SQL Server is a database" is much lower (~0.30) because the topics are unrelated. Exact scores depend on the embedding model, so treat these figures as approximate; what matters is the gap between the two. This comparison is the foundation of semantic search and RAG systems.
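With more than a few sentences, comparing pairs one at a time gets tedious. One common approach is to compute every pairwise score at once as a matrix. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and similarity_matrix is a hypothetical helper name.

```python
import numpy as np

def similarity_matrix(vectors):
    """Return an n x n matrix whose [i, j] entry is the cosine similarity of rows i and j."""
    m = np.asarray(vectors, dtype=float)
    # Scale each row to unit length; the dot product of unit vectors is the cosine.
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# Invented stand-ins for the three sentence embeddings.
toy = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
scores = similarity_matrix(toy)
print(np.round(scores, 2))
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric, so only the entries above the diagonal carry new information.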
Complete Code
Here's the complete script in one block — ready to run:
import os

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

# Read variables from .env into the process environment
load_dotenv()

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

sentences = [
    "I love dogs",
    "Dogs are wonderful pets",
    "SQL Server is a database"
]

vectors = []

for sentence in sentences:
    # One API request per sentence
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=sentence
    )
    vectors.append(response.data[0].embedding)

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    # Dot product divided by the product of the vector lengths
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )

score1 = cosine_similarity(
    vectors[0],
    vectors[1]
)

score2 = cosine_similarity(
    vectors[0],
    vectors[2]
)

print("Dog vs Dog:", score1)
print("Dog vs Database:", score2)
Dependencies
Install these packages before running the script:
pip install openai python-dotenv numpy
⚠️ Requirements
• Create a .env file with your OPENAI_API_KEY
• Ensure you have billing set up on your OpenAI account
• The text-embedding-3-small model is cost-effective for most use cases
Real-World Use Cases
Text embeddings with cosine similarity power many AI applications:
- Semantic Search — Find documents that match the meaning of a query, not just keywords
- Recommendation Engines — Suggest similar products, articles, or content based on description similarity
- Document Clustering — Group similar documents together for organization or analysis
- Duplicate Detection — Identify near-duplicate content even with different wording
- Chatbot Memory — Retrieve the most relevant past conversation context
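Several of the use cases above reduce to the same operation: score a query vector against a set of stored vectors and keep the best matches. A minimal sketch of that ranking step follows, assuming the embeddings have already been computed. The function name top_k and the two-dimensional example vectors are invented for illustration.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity of the query against every document row at once.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    best = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]

# Made-up 2-dimensional vectors standing in for document embeddings.
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], documents, k=2)
print(results)
```

For semantic search the indices map back to documents; for duplicate detection you might instead keep every pair whose score exceeds a threshold you choose.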
Next Steps
Now that you understand embeddings and similarity, take it further:
- Build a semantic search engine over your own documents
- Combine embeddings with a vector database like FAISS or ChromaDB
- Implement RAG (Retrieval-Augmented Generation) to ground LLM responses in your data
- Try text-embedding-3-large for higher-quality embeddings at a higher per-token cost
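For the RAG step in particular, the part that follows retrieval is often just prompt assembly: the retrieved text is placed in front of the question before it is sent to a language model. The sketch below shows one possible shape for that step. The function name build_grounded_prompt and the prompt wording are assumptions, and the call to the language model itself is left out.

```python
def build_grounded_prompt(question, retrieved_chunks):
    """Assemble a prompt that asks the model to answer only from the given chunks."""
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

The retrieved_chunks list would typically hold the texts whose embeddings scored highest against the question's embedding.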