Standard transformer attention has a dirty secret: the math is the easy part. Moving data around? That's where everything breaks down. FlashAttention, introduced in 2022 by Tri Dao and collaborators at Stanford, isn't a clever approximation or a new attention mechanism. It computes exact attention. It just does it in a way that respects how GPU memory actually works.
Context windows went from 2-4K tokens to 128K-1M in the same period FlashAttention matured. Not a coincidence.
Modern GPUs are memory-starved
An H100 can push over 1,000 teraflops of matrix multiplication. Absurdly fast. But that speed is useless if the chip spends most of its time waiting for data to arrive from memory.
The architecture that matters: GPUs have high-bandwidth memory (HBM), which is large (tens of gigabytes) but relatively slow, and SRAM, which is tiny (tens of megabytes) but extremely fast. Standard attention implementations compute the full N×N attention matrix, store it in HBM, then read it back for softmax and output computation.
For a 4K context window, that's 16 million elements. For 100K tokens? Ten billion.
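For concreteness, here is a minimal NumPy sketch of that standard pattern, with the full score matrix materialized. This is illustrative only; real implementations run as fused GPU kernels, but the data-movement shape is the same.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)                     # N x N scores, written to HBM
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # read back for softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # read back again for the output

N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
out = naive_attention(Q, K, V)
# The intermediate S alone holds N * N = 16,777,216 elements.
```

Every one of those intermediate reads and writes of S and P is traffic between HBM and SRAM, which is exactly the cost FlashAttention eliminates.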
The original FlashAttention paper makes a key observation: standard attention is memory-bound, not compute-bound. GPUs aren't waiting on calculations; they're waiting on data transfers between HBM and SRAM.
The algorithm is mathematically simple. The I/O pattern is the killer.
FlashAttention's insight is that you don't need to materialize the full N×N matrix. Instead, you process attention in tiles: small blocks that fit entirely in fast SRAM. But there's a catch. Computing softmax normally requires knowing the maximum value across the entire row (to prevent numerical overflow) and the sum of all exponentials (to normalize). You can't do that if you're only looking at a chunk of the row at a time.
The solution is online softmax, a technique for computing softmax incrementally. As you process each tile, you keep running statistics: the current maximum and a running sum. When you encounter a new maximum in a later tile, you rescale your previous partial results. The math works out to give you the exact same answer as computing softmax over the full row. This lets FlashAttention reduce HBM accesses from O(N²) to O(N). The algorithm is proven IO-optimal for a range of SRAM sizes, meaning you cannot do better in terms of memory transfers given the fundamental constraints.
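The running-statistics update can be shown in a few lines of NumPy. This is a sketch of the idea over a single row, not the fused kernel:

```python
import numpy as np

def online_softmax(row, tile_size):
    """Softmax over `row`, seen one tile at a time."""
    m, s = -np.inf, 0.0                # running max, running sum of exponentials
    for start in range(0, len(row), tile_size):
        tile = row[start:start + tile_size]
        m_new = max(m, tile.max())
        # Rescale the old sum to the new maximum, then fold in this tile.
        s = s * np.exp(m - m_new) + np.exp(tile - m_new).sum()
        m = m_new
    return np.exp(row - m) / s         # identical to the full-row softmax

x = np.random.default_rng(1).standard_normal(10_000)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x, tile_size=512), full)
```

The rescaling step `s * np.exp(m - m_new)` is the whole trick: whenever a later tile raises the maximum, previously accumulated exponentials are shifted to the new reference point, so the final result is exact, not approximate.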
Three generations, each faster than the last
The first version delivered 3x speedup on GPT-2 at 1K token contexts, 2.4x on long-range benchmarks at 1K-4K tokens. More importantly, it enabled training on sequence lengths that were previously impossible. The original paper reports being the first to achieve 61.4% accuracy on the Path-X challenge, which requires handling 16K token sequences.
FlashAttention-2, released in July 2023, doubled down on parallelization. The original version was already fast, but it wasn't fully utilizing GPU resources. FA2 improved work partitioning at the warp level (the GPU's basic unit of parallel execution) and reached 50-73% of theoretical maximum FLOPS on A100 GPUs, compared to 25-40% for the original. On GPT-style training: around 225 TFLOPS per A100.
FlashAttention-3, released in 2024, exploits features specific to NVIDIA's Hopper architecture. Hopper GPUs have dedicated hardware for asynchronous operations: one part of the chip can be doing matrix multiplication while another handles softmax. FA3's "pingpong scheduling" keeps both busy simultaneously. The result is 740 TFLOPS, hitting 75% of the H100's theoretical maximum, compared to 35% for FA2 on the same hardware.
Speed is the smaller win.
The memory reduction from O(N²) to O(N) didn't just make attention faster. It made long context possible. Before FlashAttention, a 100K context window would require storing a 10-billion-element attention matrix in GPU memory. For 16-bit precision, that's 20GB just for the attention matrix at a single layer. With 96 layers, you're looking at nearly 2 terabytes.
Not happening.
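The arithmetic behind those figures checks out directly:

```python
# Back-of-envelope check: 100K tokens, fp16 (2 bytes per element), 96 layers.
n_tokens = 100_000
bytes_per_element = 2
per_layer = n_tokens ** 2 * bytes_per_element   # one layer's attention matrix
total = per_layer * 96                          # all layers at once

print(per_layer / 1e9)   # 20.0  -> 20 GB per layer
print(total / 1e12)      # 1.92  -> nearly 2 TB across 96 layers
```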
FlashAttention sidesteps this entirely by never materializing the full matrix. Memory usage grows linearly with sequence length, not quadratically. This is why context windows could jump from the 4-8K range of early models to 128K (Claude, Gemini) and beyond a million tokens (Gemini 1.5 Pro). Our self-attention explainer covered FlashAttention briefly, noting it achieves 2-4x speedups through "clever memory tiling." Now you know what that tiling actually does: it restructures computation to minimize HBM round-trips, using online softmax to avoid ever needing the full attention matrix in memory at once.
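Putting the pieces together, the whole tiled forward pass can be sketched in NumPy. This shows the structure of the algorithm (assuming no masking or dropout), not a performant implementation:

```python
import numpy as np

def flash_attention_forward(Q, K, V, tile=256):
    """Exact attention without ever forming the N x N score matrix."""
    N, d = Q.shape
    O = np.zeros((N, d))               # running (unnormalized) output
    m = np.full(N, -np.inf)            # running row maxima
    s = np.zeros(N)                    # running row sums
    scale = 1.0 / np.sqrt(d)
    for j in range(0, N, tile):        # one K/V tile at a time ("fits in SRAM")
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale         # only an N x tile block, never N x N
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)      # rescale factor for the old statistics
        P = np.exp(S - m_new[:, None])
        s = s * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / s[:, None]

# Agrees with the naive full-matrix computation:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_forward(Q, K, V), ref)
```

A real kernel also tiles Q and fuses everything into a single GPU kernel launch; tiling over K/V with on-the-fly rescaling is the essential idea.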
This isn't a research prototype
The GitHub repository shows FlashAttention integrated into every major transformer library and LLM inference system. It supports NVIDIA Ampere, Ada, and Hopper GPUs (A100, RTX 3090/4090, H100) as well as AMD's MI200x and MI300x. Recent versions added sliding window attention, paged KV cache for inference, and torch.compile support.
This is the kind of optimization that enabled Karpathy's nanochat to train GPT-2 for $20. FlashAttention-3 was listed as the single biggest contributor to that project's training speed improvements.
FlashAttention's lesson extends beyond attention. As chips get faster, the bottleneck increasingly shifts from compute to memory bandwidth. Algorithms that minimize data movement, even at the cost of doing more arithmetic, often win. The specific technique (tiling with online softmax) matters less than the underlying insight: modern hardware is wildly unbalanced between compute and memory throughput. An algorithm optimized for FLOP count may perform worse than one that does more FLOPs but keeps data in fast memory.
If you grew up on Big-O analysis of computational complexity, this feels counterintuitive. O(N²) compute with O(N) memory transfers can beat O(N log N) compute with O(N²) memory transfers. The constants matter, and on modern GPUs, the constants favor keeping data close.
Our read: FlashAttention proved this wasn't just theory. It removed the memory wall that was holding back context windows, and every long-context model you use today is built on that insight.