Pre-training

The initial compute-intensive training phase where a model learns general patterns from massive unlabeled datasets using self-supervision.

Pre-training is the first stage of the foundation model pipeline, where models learn to predict masked words, next tokens, or image patches without human-provided labels; the training signal comes from the data itself. This phase requires enormous compute resources, with estimated costs ranging from under $1,000 for the original Transformer to $191 million for Gemini Ultra. The resulting pre-trained model captures general knowledge and patterns that can later be specialized for specific tasks.
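To make the self-supervision idea concrete, here is a minimal illustrative sketch (not how real foundation models are trained) showing how next-token "labels" are derived directly from raw text. The toy bigram model below simply counts which token follows which; the function names and corpus are invented for illustration.

```python
# Toy illustration of self-supervision: the training targets are just
# the next tokens in the raw text, so no human labeling is needed.
from collections import Counter, defaultdict

def make_next_token_pairs(tokens):
    # Each (context, target) pair comes straight from the data itself.
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for ctx, tgt in make_next_token_pairs(tokens):
        counts[ctx][tgt] += 1
    # Normalize counts into conditional probabilities P(next | current).
    return {ctx: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for ctx, cnt in counts.items()}

corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
# After "the", the model has seen "cat" and "mat" once each.
```

Real pre-training replaces the count table with a neural network trained by gradient descent over billions of such pairs, but the objective is the same: predict the next token given the preceding context.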

Also known as

pretraining, pre-train, base training