FlashAttention

An IO-aware optimization that computes exact attention using memory tiling to avoid materializing the full n×n attention matrix in GPU high-bandwidth memory, yielding 2-4x wall-clock speedups.

FlashAttention restructures the attention computation to work within GPU SRAM (fast on-chip memory) instead of storing the full n×n attention matrix in the much slower high-bandwidth memory (HBM). It tiles the query, key, and value matrices into blocks that fit in SRAM and uses an online (running) softmax, rescaling partial results as each block is processed, so the output is exact even though the full score matrix is never formed. This reduces the number of HBM reads and writes and delivers 2-4x wall-clock speedups. It does not change the O(n²) algorithmic complexity of attention, but it dramatically improves the constant factor in practice.
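A minimal NumPy sketch of the tiling idea follows. It is not the fused CUDA kernel FlashAttention actually ships; the function name, block size, and running-statistics variables are illustrative. It shows how exact softmax attention can be computed one key/value block at a time using the online-softmax rescaling trick, without ever forming the n×n score matrix.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Exact softmax attention computed one key/value block at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)   # running (unnormalized) output
    row_max = np.full(n, -np.inf)              # running max score per query row
    row_sum = np.zeros(n)                      # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]       # (b, d) key block
        Vb = V[start:start + block_size]       # (b, d) value block
        scores = (Q @ Kb.T) * scale            # (n, b) scores for this block only

        block_max = scores.max(axis=1)
        new_max = np.maximum(row_max, block_max)
        # Rescale previously accumulated output and denominator to the new max,
        # then fold in this block's contribution.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])  # unnormalized block probabilities
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention that materializes the full matrix.
rng = np.random.default_rng(0)
Q = rng.normal(size=(256, 64))
K = rng.normal(size=(256, 64))
V = rng.normal(size=(256, 64))
scores = (Q @ K.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), naive)
```

In the real kernel the loop body runs entirely in SRAM and the per-row statistics are what make the blockwise result exact; only the final output is written back to HBM, which is where the speedup comes from.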

Also known as

Flash Attention