Flash Attention
An exact, memory-efficient attention algorithm that speeds up transformer training and inference and cuts GPU memory usage by never materializing the full attention matrix.
Flash Attention restructures the attention computation into tiles and fuses the operations into a single GPU kernel, keeping intermediate results in fast on-chip SRAM and sharply reducing reads and writes to high-bandwidth memory, which is the dominant bottleneck for standard attention. The technique has become standard in modern LLM training and serving stacks, with Flash Attention 3 delivering further throughput gains on recent hardware such as NVIDIA Hopper GPUs.
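The core trick that makes tiling possible is the online softmax: scores are processed one block of keys/values at a time while a running row maximum and normalizer are maintained, so earlier partial outputs can be rescaled as new blocks arrive. The sketch below illustrates this in NumPy for a single head (function names, the block size, and the two-dimensional shapes are illustrative choices, not part of any Flash Attention API); the real algorithm runs this logic inside one fused GPU kernel.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full n x n score matrix (O(n^2) memory).
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    # Tiled attention with an online softmax: K and V are consumed in
    # blocks, so only an n x block tile of scores exists at any time.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))          # running (unnormalized) output
    m = np.full(n, -np.inf)       # running row maximum of the scores
    l = np.zeros(n)               # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                    # n x block score tile
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)                 # rescale earlier partials
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]         # normalize once at the end
```

Because the rescaling is exact, the tiled version returns the same result as naive attention up to floating-point error; the savings come from never holding the full score matrix, not from approximation.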
Also known as
FlashAttention, flash attention, Flash Attention 2, Flash Attention 3, FA2, FA3