FP8
An 8-bit floating-point number format used in AI training that theoretically doubles throughput compared to BF16, though practical gains are often smaller.
FP8 (8-bit floating point) is a reduced-precision numerical format supported by modern GPUs such as NVIDIA's H100. It offers a theoretical 2× FLOPS advantage over BF16 (Brain Float 16) by halving the bit width of matrix-multiply operands. In practice, overhead from scale conversions and the characteristics of specific workloads (such as the small matrix sizes in GPT-2-scale models) reduce the actual speedup. FP8's benefits tend to grow with matrix size, making it more effective for larger models.
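The scale conversions mentioned above come from the fact that FP8 has a very narrow dynamic range, so tensors are typically stored alongside a higher-precision scale factor. The sketch below is a minimal, hypothetical illustration of per-tensor FP8 quantization using PyTorch's float8_e4m3fn dtype (available in PyTorch 2.1+); it is not the kernel-level path used by production FP8 training libraries, just a demonstration of where the scaling overhead appears.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    # Per-tensor scale chosen so the largest |value| maps near the FP8 maximum.
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    # Cast back to BF16 and undo the scaling.
    return x_fp8.to(torch.bfloat16) * scale

x = torch.randn(4, 4, dtype=torch.bfloat16)
x_fp8, scale = quantize_fp8(x)
x_round_trip = dequantize_fp8(x_fp8, scale)
print((x - x_round_trip).abs().max())  # error introduced by 8-bit rounding
```

These quantize/dequantize steps (and the amax bookkeeping behind the scale) are pure overhead relative to a BF16 matmul, which is one reason the realized speedup falls short of the theoretical 2×, especially when the matrices are small.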
Also known as
FP8 quantized training, 8-bit floating point