Embeddings are the translation layer between human concepts and machine math. They take words, sentences, images, or any chunk of data and convert it into a list of numbers that captures meaning. Similar things end up close together in this numerical space; dissimilar things end up far apart. This abstraction powers everything from semantic search to recommendation systems to RAG pipelines.
The concept is elegant. The execution? Fragile in ways most teams never see coming.
Meaning as Geometry
The core idea is deceptively simple: represent meaning as position in space. A neural network learns to place semantically similar items close together in a high-dimensional vector space. "King" lands near "queen." "Dog" lands near "cat" but far from "democracy." The model learns these positions through training on massive datasets, optimizing until the geometry reflects human intuitions about similarity.
Modern embedding models like those based on BERT use deep neural networks to produce vectors with hundreds or thousands of dimensions. According to Pinecone's technical documentation, these high-dimensional representations capture nuances that simpler approaches miss. The result is a mathematical space where you can do arithmetic with concepts: the classic example is "king - man + woman ≈ queen."
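On toy vectors the arithmetic can be sketched directly. The 4-dimensional "embeddings" below are hand-crafted for illustration only; real models learn hundreds or thousands of dimensions from data:

```python
import numpy as np

# Hypothetical hand-crafted vectors -- not from any real model.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "dog":   np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: the angle between two vectors, ignoring length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" lands nearest "queen" in this toy space.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

The analogy only holds because the toy vectors were built to make it hold; in real embedding spaces the relation is approximate and famously inconsistent.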
Useful, yes. But the industry has built an entire infrastructure layer on top of this abstraction, and that's where things get messy.
Nearly every production embedding system uses cosine similarity as its default metric. It measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in roughly the same direction are "similar," regardless of how long they are. The assumption is that direction encodes meaning while magnitude is noise.
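The magnitude-blindness is easy to demonstrate: scale a vector by 100 and its cosine similarity with the original is unchanged, while the dot product balloons. A minimal sketch:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 100.0 * a          # same direction, 100x the magnitude

# Cosine similarity cannot see the rescaling...
print(cosine(a, b))    # 1.0 (up to floating point)

# ...while the dot product can.
print(float(a @ a), float(a @ b))   # 14.0 vs 1400.0
```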
This assumption is often wrong.
Research from Cornell and Netflix demonstrates that cosine similarity can produce "arbitrary and meaningless results" for embeddings from regularized models. The problem stems from a degree of freedom in how models learn: embeddings can be rescaled during training without affecting model predictions, but this rescaling dramatically changes similarity measurements. The researchers found that for some models, similarities are not even unique. The same data can produce different similarity scores depending on training choices that have nothing to do with semantic meaning.
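The rescaling freedom is concrete enough to show in a few lines. In a toy factorization model whose predictions are U @ V.T, multiplying one factor by a diagonal matrix D and the other by its inverse leaves every prediction untouched while changing the cosine similarities between embeddings. The matrices below are hand-picked for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy factorization: predictions are U @ V.T, items embedded as rows of U.
U = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Per-dimension rescaling: D on one factor, D^-1 on the other.
D = np.diag([10.0, 1.0, 0.1])
U2, V2 = U @ D, V @ np.linalg.inv(D)

# The model's predictions are identical...
assert np.allclose(U @ V.T, U2 @ V2.T)

# ...but the cosine similarity between the "same" two embeddings changed.
print(round(cosine(U[0], U[1]), 3))    # 0.5
print(round(cosine(U2[0], U2[1]), 3))  # 0.995
```

Both factorizations are equally valid solutions to the training problem; only the similarity scores disagree.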
Their explicit recommendation: stop "blindly using cosine similarity."
This is not a minor edge case. Regularization is standard practice in modern deep learning. The techniques that make models generalize better also make cosine similarity measurements potentially meaningless.
Even when cosine similarity "works" in a technical sense, it has systematic biases. Research published at ACL found that it systematically underestimates similarity for high-frequency words compared to human judgment. Common words like "the," "is," and "have" show lower similarity scores than humans would assign, even after controlling for polysemy. The cause is geometric: high-frequency and low-frequency words occupy different regions of the embedding space with different representational properties.
Our read: if your RAG system underperforms on queries with common words, this might be why.
When Production Systems Quietly Degrade
The theoretical fragility of cosine similarity connects to practical production failures that teams often misdiagnose.
Embedding drift happens when the same text produces different embeddings over time. This can result from preprocessing changes, model updates, or index inconsistencies. Your document embeddings and query embeddings slowly drift out of alignment. Retrieval quality degrades, but gradually enough that you blame the model rather than your infrastructure.
Chunking trade-offs create a different failure mode entirely. Embeddings work best on coherent semantic units, but production systems must split documents into chunks. As noted by Shaped.ai's engineering team, chunk too large and embeddings become averaged mush, diluting the specific information you need to retrieve. Chunk too small and you strip away context, producing embeddings of fragments that no longer capture the document's meaning.
There is no universal right answer. There is only the wrong answer you've chosen for your specific use case.
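The simplest baseline is a fixed-size sliding window with overlap, sketched below. The default `size` and `overlap` values are hypothetical; good numbers depend on your documents, your embedding model's context window, and your queries, and production systems often split on sentence or section boundaries instead:

```python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size sliding-window chunking -- a baseline, not a recommendation.

    The defaults are hypothetical; tune them for your corpus and model.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window already reached the end
            break
    return chunks
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk; it trades index size for recall.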
Model migration is perhaps the most painful failure. Weaviate's engineering blog documents the problem: upgrading embedding models breaks retrieval because new query vectors don't map to old document vectors. The new model might be objectively better, but it learned a completely different geometry. Your entire index becomes incompatible. Re-embedding datasets at scale is expensive. Workarounds like normalization risk losing important variance. Embedding model choice is not a one-time technical decision; it is an ongoing architectural constraint.
What actually helps? The research points toward several practices.
Train directly on cosine similarity if that's your intended metric. A model optimized for one objective and then evaluated with cosine similarity is being measured by something it was never trained to encode. Normalize during training, not after: post-hoc normalization can introduce exactly the arbitrary results the Cornell/Netflix paper warns about. If you need normalized embeddings, build that into the training process.
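As a sketch of what "normalize during training" means in practice: compute the loss on L2-normalized vectors, so the training objective and the serving metric are the same quantity. This is a simplified numpy illustration of the objective, not a full training loop:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_loss(anchor, positive):
    """1 - mean cosine similarity between paired rows.

    Because normalization happens inside the objective, minimizing this
    loss directly optimizes the metric used at query time.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    return 1.0 - float(np.sum(a * p, axis=-1).mean())
```

Identical pairs give a loss of 0, orthogonal pairs a loss of 1; a real contrastive setup would also push negatives apart, which is omitted here for brevity.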
Consider alternative metrics. Dot products preserve magnitude information that might be meaningful. Euclidean distance measures absolute position rather than direction. Fine-tuned semantic similarity models can outperform generic embeddings on domain-specific tasks. Shaped.ai recommends evaluating these alternatives rather than defaulting to cosine.

And critically: monitor for drift. Embedding systems degrade silently. Build monitoring that catches when retrieval quality drops, not just when systems fail entirely.
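One cheap monitoring tactic is a canary set: re-embed a fixed list of reference texts on a schedule and compare against stored baseline vectors. Everything in the sketch below — the `embed` callable, the canary set, the `threshold` — is a hypothetical placeholder to be tuned for your pipeline:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_drift(embed, canary_texts, baseline, threshold=0.98):
    """Flag canary texts whose fresh embedding has drifted from baseline.

    `embed` is any callable text -> vector; `baseline` holds the vectors
    recorded when the index was built. Returns (text, similarity) pairs
    that fell below `threshold` (a hypothetical cutoff -- tune it).
    """
    drifted = []
    for text, ref in zip(canary_texts, baseline):
        sim = cosine(embed(text), np.asarray(ref))
        if sim < threshold:
            drifted.append((text, sim))
    return drifted
```

If the same text no longer embeds near its baseline, something in the pipeline changed: preprocessing, model version, or the index itself, which is exactly the silent degradation described above.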
Embeddings are a brilliant abstraction. They turn the fuzzy human notion of "similar" into something you can compute. But the tooling around them has been built on assumptions that don't always hold. Cosine similarity became the default because it was convenient, not because it was optimal. Chunking strategies became standardized before we understood their trade-offs. Model upgrades were treated as drop-in replacements when they are actually architectural migrations.
These are infrastructure problems with infrastructure solutions. The failure modes are predictable. Teams that monitor for drift, test their similarity metrics against ground truth, and plan for model migration will avoid the worst pitfalls.
Most teams aren't doing this yet. They're treating embeddings as a solved problem when the research says otherwise.