Loss Functions: The Math Behind "You Get What You Measure"

Neural networks learn by chasing a single number from a loss function. Point it at the right target and your model learns what you want. Point it wrong and you've built a very expensive optimization machine solving the wrong problem.


Every neural network learns by chasing a number. That number comes from a loss function: a mathematical formula measuring how wrong the model's predictions are. Get it right, and your model learns what you actually want. Get it wrong, and you've built an optimization machine pointed at the wrong target.

This is machine learning's quiet curriculum. The loss function doesn't just evaluate performance; it actively shapes what the model optimizes for, how it weights different types of errors, and ultimately what behaviors emerge from training.

Two inputs, one number, all the leverage

A loss function takes what the model predicted and what the correct answer was. It outputs a single number representing how bad the prediction was. The model's entire goal during training? Minimize that number.

One constraint matters enormously: the loss function must be differentiable to work with gradient descent. Backpropagation needs to compute how small changes to each parameter affect the loss. If you can't take derivatives, you can't train.

This gives loss functions a dual role. They evaluate accuracy and guide parameter adjustment during training. The choice of loss function directly determines what "accurate" even means.

Mean squared error and the outlier problem

The simplest regression loss is mean squared error (MSE): take the difference between prediction and target, square it, average across examples. Simple enough.
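The definition above fits in a few lines. A minimal sketch (plain Python, no framework assumed):

```python
def mse(preds, targets):
    # Mean squared error: square each difference, average across examples.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

print(mse([2.0, 4.0], [1.0, 1.0]))  # (1 + 9) / 2 = 5.0
```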

But that squaring operation has consequences.

Squaring large errors dramatically amplifies penalties while errors below 1 become proportionally smaller. An error of 10 doesn't count as 10 times worse than an error of 1; it counts as 100 times worse.

This makes MSE highly sensitive to outliers. A single wildly wrong prediction can dominate the entire loss, forcing the model to prioritize getting that prediction less wrong at the expense of everything else. Sometimes you want exactly this. If you're predicting equipment failure and an off-by-10 error could mean catastrophe, you want large errors treated as catastrophically worse. Other times, outliers are noise you'd rather the model mostly ignore.

Mean absolute error (MAE) offers a different perspective: instead of squaring, take the absolute value. An error of 10 is exactly 10 times worse than an error of 1. No amplification, no special treatment. The tradeoff? MAE gradients are constant regardless of error magnitude, which can make optimization less stable near the solution.

Huber loss splits the difference. It behaves like MSE for small errors (smooth gradients, fast convergence) but switches to MAE for large errors (outlier resistance). You choose the threshold where the switchover happens. Each function encodes different assumptions about what matters. There's no universally correct answer; there's only "correct for your problem."
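To see the three losses diverge, here is a sketch of MAE and Huber alongside a dataset with one outlier. The delta=1.0 threshold and the 0.5 scaling on the quadratic region follow the common convention; other formulations exist:

```python
def mae(preds, targets):
    # Mean absolute error: no squaring, so an error of 10 costs 10x an error of 1.
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def huber(preds, targets, delta=1.0):
    # Quadratic (MSE-like) inside the delta threshold, linear (MAE-like) outside.
    total = 0.0
    for p, t in zip(preds, targets):
        e = abs(p - t)
        if e <= delta:
            total += 0.5 * e ** 2
        else:
            total += delta * (e - 0.5 * delta)
    return total / len(preds)

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Two near-misses and one outlier: MSE is dominated by the outlier,
# MAE and Huber keep it in proportion.
preds, targets = [1.1, 1.2, 11.0], [1.0, 1.0, 1.0]
print(round(mse(preds, targets), 2))    # 33.35
print(round(mae(preds, targets), 2))    # 3.43
print(round(huber(preds, targets), 2))  # 3.18
```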

Cross-entropy and confident mistakes

Classification needs something completely different. Cross-entropy loss measures how surprised the model is when it sees the true label.

The formula for binary cross-entropy: -(y×log(p) + (1-y)×log(1-p)), where y is the true label (0 or 1) and p is the predicted probability. The intuition is straightforward: cross-entropy increases rapidly as predicted probability diverges from actual labels.

A prediction of 0.9 for a positive example produces low loss. A prediction of 0.1 for the same positive example produces very high loss. Crucially, the logarithm means confident wrong predictions are punished exponentially harder.
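The 0.9 versus 0.1 asymmetry is easy to check by plugging both into the formula from above:

```python
import math

def bce(y, p):
    # Binary cross-entropy for one example: -(y*log(p) + (1-y)*log(1-p)).
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(1, 0.9), 3))  # 0.105: confident and right, low loss
print(round(bce(1, 0.1), 3))  # 2.303: confident and wrong, ~22x higher
```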

This shapes model behavior in important ways. Cross-entropy penalizes confident wrong predictions heavily, encouraging the model to express appropriate uncertainty when it's not sure. A model trained on MSE for classification wouldn't have this property.

Large language models use cross-entropy in a specific way. For each token position, the model predicts a probability distribution over the entire vocabulary (often 50,000+ tokens). The loss is -log(P(true_token)), averaged across all tokens in the sequence. This is why perplexity matters as an LLM evaluation metric: perplexity is just the exponential of the cross-entropy (e^CrossEntropy with natural log, or 2^CrossEntropy if the loss is measured in bits), an intuitive measure of how "confused" the model is over vocabulary choices. A perplexity of 10 means the model is, on average, equally uncertain between about 10 tokens at each position.
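The relationship is direct enough to compute by hand. A sketch assuming natural-log cross-entropy, where token_probs holds the probability each model assigned to the true token at each position:

```python
import math

def perplexity(token_probs):
    # Mean negative log-likelihood (natural log), then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns the true token probability 0.1 at every position
# behaves like one picking uniformly among 10 tokens.
print(round(perplexity([0.1, 0.1, 0.1]), 6))  # 10.0
```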

The cross-entropy objective is also why LLMs are fundamentally next-token predictors. The loss function defines the goal, and the goal is: given all previous tokens, predict the next one. Everything these models learn flows from optimizing that objective.

Beyond prediction error

Modern machine learning increasingly uses compound losses tailored to specific tasks.

Focal loss addresses class imbalance by down-weighting well-classified examples. Standard cross-entropy gives equal weight to easy and hard examples. Focal loss says: if the model is already confident and correct, don't waste gradient signal on that example. Focus training on the hard cases.
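A minimal binary sketch of that idea: scale cross-entropy by (1 - p_t)^gamma, where p_t is the probability assigned to the true class. The gamma=2.0 default follows common usage; the exact weighting scheme varies across implementations:

```python
import math

def focal_loss(y, p, gamma=2.0):
    # p_t: probability the model gave the correct class.
    p_t = p if y == 1 else 1 - p
    # (1 - p_t)^gamma shrinks toward 0 for confident, correct predictions,
    # so easy examples contribute almost no gradient signal.
    return -((1 - p_t) ** gamma) * math.log(p_t)

print(round(focal_loss(1, 0.9), 4))  # 0.0011: easy example, heavily down-weighted
print(round(focal_loss(1, 0.1), 4))  # 1.8651: hard example, near full weight
```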

Dice loss is popular for image segmentation, where you're classifying each pixel. It directly optimizes the overlap between predicted and ground-truth regions rather than treating each pixel independently.
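A sketch of the soft Dice formulation over flattened per-pixel scores in [0, 1]; the small epsilon guarding against division by zero is a common convention, not part of the definition:

```python
def dice_loss(pred, target, eps=1e-6):
    # 1 - 2|A ∩ B| / (|A| + |B|): measures region overlap, not per-pixel error.
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

print(round(dice_loss([1, 1, 0], [1, 1, 0]), 4))  # 0.0: perfect overlap
print(round(dice_loss([1, 0], [0, 1]), 4))        # 1.0: no overlap
```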

Adversarial losses power generative models like GANs. Instead of comparing outputs to a fixed target, the loss comes from a discriminator network that learns to distinguish real from generated samples. The generator's loss is "how convincingly did I fool the discriminator?"

The arXiv survey on loss functions notes that modern generative models often combine multiple losses: L1 for pixel-level detail, adversarial for realism, perceptual losses for feature-level quality. Each component shapes a different aspect of what the model learns.

Loss functions can also include terms beyond prediction error. L1 and L2 regularization add penalties based on the magnitude of model weights. L2 (ridge) adds the sum of squared weights, pushing them toward zero but rarely exactly there. L1 (lasso) adds the sum of absolute weight values, often driving many weights exactly to zero and effectively performing feature selection. Both modify what the model optimizes for: it's no longer just "minimize prediction error" but "minimize prediction error while keeping weights small."
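The modified objective is just an added term. A sketch, with the lam coefficient as the (hypothetical) regularization strength you would tune:

```python
def loss_with_l2(data_loss, weights, lam=0.01):
    # Ridge: data loss + lambda * sum of squared weights.
    return data_loss + lam * sum(w * w for w in weights)

def loss_with_l1(data_loss, weights, lam=0.01):
    # Lasso: data loss + lambda * sum of absolute weight values.
    return data_loss + lam * sum(abs(w) for w in weights)

print(loss_with_l2(1.0, [3.0, 4.0], lam=0.1))   # 1.0 + 0.1 * 25 = 3.5
print(loss_with_l1(1.0, [3.0, -4.0], lam=0.1))  # 1.0 + 0.1 * 7  = 1.7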

The gradient's origin story

Loss functions exist at the end of the computational graph. When you run backpropagation, gradients flow backward from the loss through every layer, telling each parameter how to change.

This is why differentiability matters so much.

The loss function is the starting point for all gradient computation. If you can't differentiate it, you can't propagate gradients, and you can't train with gradient descent. It's also why the choice has such profound effects: every parameter update, across millions or billions of weights, is driven by gradients originating at the loss function. Change the loss, change what those gradients push toward.
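That origin role can be made concrete without a framework. For MSE, the analytic gradient with respect to prediction i is 2(p_i - t_i)/n; a central finite difference on the loss recovers the same number, which is essentially what autodiff systems verify with gradient checks:

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def numeric_grad(preds, targets, i, h=1e-6):
    # Central finite difference of the loss w.r.t. prediction i.
    up = list(preds); up[i] += h
    dn = list(preds); dn[i] -= h
    return (mse(up, targets) - mse(dn, targets)) / (2 * h)

preds, targets = [2.0, 5.0], [1.0, 1.0]
# Analytic gradient for i=1: 2 * (5 - 1) / 2 = 4.0
print(round(numeric_grad(preds, targets, 1), 4))  # 4.0
```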

Our read: Loss functions are the most underappreciated design choice in machine learning. Everyone obsesses over architectures, but the loss function is where you actually encode what you want the model to do. MSE makes outliers catastrophic. Cross-entropy demands confidence calibration. Focal loss shifts attention to hard examples. Regularization adds implicit preferences about model complexity. When a model behaves unexpectedly, the loss function is often the culprit. It's not that the model failed to optimize; it optimized exactly what it was told to. That wasn't quite what you meant.

