Flash Attention 3
The third generation of Flash Attention, an efficient attention algorithm that reduces memory usage and speeds up transformer training.
Flash Attention 3 is the latest version of the Flash Attention algorithm, which computes attention in transformers using tiling and kernel fusion to minimize memory reads and writes. In Karpathy's nanochat, it delivered roughly a 9% improvement in training tokens per second and enabled sliding-window attention patterns that alternate between short and long context lengths. Flash Attention's memory and throughput benefits are well established across model scales.
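As a rough illustration of how such a kernel is invoked, the sketch below uses the `flash_attn` package's `flash_attn_func` call (the Flash Attention 2.x Python interface; FA3 exposes a similar but separate interface) to compare full causal attention with a sliding-window variant. The 256-token window and the per-layer alternation pattern are hypothetical choices for illustration, not nanochat's actual configuration.

```python
# Minimal sketch: requires a CUDA GPU and the flash-attn package. Uses the
# flash-attn 2.x Python API (flash_attn_func); FA3's interface is similar,
# but its exact import path may differ.
import torch
from flash_attn import flash_attn_func

B, T, H, D = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Full causal attention: each token attends to every earlier token.
out_full = flash_attn_func(q, k, v, causal=True)

# Sliding-window causal attention: each token attends only to the previous
# 256 tokens (an illustrative window size, not nanochat's value).
out_window = flash_attn_func(q, k, v, causal=True, window_size=(256, 0))

# One hypothetical way to alternate short and long context across layers:
# even layers use the short window, odd layers attend over the full context.
def layer_attention(q, k, v, layer_idx, window=256):
    if layer_idx % 2 == 0:
        return flash_attn_func(q, k, v, causal=True, window_size=(window, 0))
    return flash_attn_func(q, k, v, causal=True)
```

Because the fused kernel never materializes the full T x T attention matrix, the sliding-window call also cuts compute roughly in proportion to the window size relative to the sequence length.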
Also known as
FA3