Causal Masking

A constraint in decoder models (like GPT) that prevents each token from attending to future tokens, ensuring the model generates text left-to-right without seeing the answer.

Causal masking sets attention scores to negative infinity for all positions after the current token before the softmax, so those positions receive zero attention weight. This ensures autoregressive generation is valid: when predicting token t, the model can only use information from tokens 1 through t-1. Encoder models like BERT don't use causal masking, allowing bidirectional attention where every token sees the full sequence.
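
A minimal sketch of the idea in PyTorch (a single attention head, no batch dimension; the function and variable names here are illustrative, not a standard API):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d) query, key, and value matrices
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5  # raw attention scores, shape (seq_len, seq_len)

    # Upper-triangular mask: True wherever column index > row index,
    # i.e. wherever a token would attend to a future position.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    # Softmax turns the -inf entries into exactly 0 attention weight,
    # so each row i only mixes values from positions 1..i.
    weights = F.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(5, 8)
out = causal_attention(q, k, v)  # out[i] depends only on tokens 0..i
```

Removing the `masked_fill` line recovers the bidirectional attention used by encoder models such as BERT.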

Also known as

autoregressive masking, decoder masking