Context Window

The maximum number of tokens a language model can process in a single forward pass, determining how much text the model can "see" at once.

Context window size directly determines which tasks a model can perform: longer windows enable processing entire codebases or documents in a single pass. However, because self-attention scales quadratically with sequence length (O(n²) in both time and memory), expanding the context window dramatically increases compute and memory costs. Modern models range from 4K to over 1M tokens, with techniques like FlashAttention and hybrid architectures helping manage the scaling challenge.
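The quadratic cost is easy to see by counting attention-score entries: each of n tokens attends to all n tokens, so the score matrix alone has n² entries per head. A minimal back-of-the-envelope sketch (the head count and fp16 precision are illustrative assumptions, not from the text):

```python
def attention_score_bytes(seq_len: int, num_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Bytes needed to materialize the full n x n attention-score
    matrix for one layer (one matrix per head, fp16 assumed)."""
    return seq_len * seq_len * num_heads * dtype_bytes

# Doubling the context quadruples the score-matrix memory.
for n in (4_096, 8_192, 16_384):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:5.1f} GiB per layer")
```

This is exactly the materialization cost that techniques like FlashAttention avoid, by computing attention in tiles without ever storing the full n × n matrix.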

Also known as

context length, sequence length, max context, token limit