Muon optimizer
An optimizer used for the weight matrices in transformer training; it orthogonalizes each update and employs factored variance reduction.
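As a rough sketch of the orthogonalization step, the widely used Muon implementation approximates the orthogonal factor of the update matrix with a quintic Newton-Schulz iteration; the coefficients below come from that original implementation, while Polar Express (mentioned in the next paragraph) instead tunes per-step coefficients, which this sketch omits. Nothing here is nanochat's exact code.

```python
import torch

@torch.no_grad()
def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix G.

    Runs a quintic Newton-Schulz iteration that drives the singular
    values of G toward 1 while keeping its singular vectors, i.e. it
    approximates the orthogonal polar factor of G.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # fixed coefficients from the original Muon
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT  # iterate on the wide orientation
    X = X / (X.norm() + 1e-7)  # scale so the spectral norm is at most 1
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```

In Muon this iteration is applied to the momentum-smoothed gradient of each 2D weight matrix before the learning-rate step, which is why the optimizer is reserved for matrix parameters rather than embeddings or scalars.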
The Muon optimizer is used alongside AdamW (which handles the embeddings) in nanochat's training setup. It incorporates Polar Express orthogonalization, factored variance reduction, and a 'cautious' weight decay strategy that applies decay only to coordinates where the gradient and parameter signs align. These selective refinements contributed measurable training-efficiency gains in the GPT-2 speedrun.
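A minimal sketch of the masking idea, assuming the mask compares gradient and parameter signs as described above; `cautious_decay_` is a hypothetical helper for illustration, not nanochat's actual code.

```python
import torch

@torch.no_grad()
def cautious_decay_(p: torch.Tensor, grad: torch.Tensor,
                    lr: float, wd: float) -> None:
    """Apply weight decay in place, but only where it agrees with the gradient.

    A coordinate is decayed only when its gradient and parameter share a
    sign, so the decay never pushes against the descent direction.
    """
    mask = (grad * p > 0).to(p.dtype)  # 1.0 where signs align, else 0.0
    p.mul_(1.0 - lr * wd * mask)       # decay masked coordinates toward zero
```

The design intuition is that unconditional L2 decay sometimes opposes the gradient step; masking the decay keeps its regularizing pull only where it reinforces, rather than fights, the current update.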