Microsoft’s open source Dion optimizer aims to make huge AI models faster and cheaper to train


Microsoft is pushing a new open-source optimizer called Dion, and it might be exactly what big AI labs have been waiting for. You see, Dion builds on the ideas behind Muon, the optimizer that stunned many last year by letting teams train huge models with half the GPUs compared to the old AdamW method.

Muon made waves after a nanoGPT “speedrun” showed it could deliver big efficiency gains, but it also came with a drawback. At massive scale, the repeated Newton-Schulz matrix multiplications Muon uses to orthonormalize its updates eat up compute and slow distributed setups like Fully Sharded Data Parallel (FSDP) and tensor parallelism.
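To see where that cost comes from, here is a short sketch of the Newton-Schulz iteration Muon uses to orthonormalize an update matrix, with the quintic coefficients from Keller Jordan's public Muon implementation. It is an illustration rather than Microsoft's code, but the bottleneck is easy to spot: every iteration multiplies full-size matrices, which grows expensive and communication-heavy once the weights are sharded across many GPUs.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthonormalize G, Muon-style (illustrative sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the Muon repo
    X = G / (G.norm() + 1e-7)          # rescale so singular values are <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:                     # keep the square matmul on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T                            # full-size matmul every iteration:
        X = a * X + (b * A + c * A @ A) @ X    # this is what bogs down at scale
    return X.T if transposed else X
```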

Dion attacks that problem by getting picky about what it orthonormalizes. Instead of processing an entire weight-update matrix, it focuses on just the top-ranked singular vectors, which cuts down the communication and computation overhead. According to Microsoft, the rank needed for strong performance grows much more slowly than model size, meaning even trillion-parameter models do not need full-rank updates.
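To make the low-rank idea concrete, here is a toy comparison with made-up sizes, using a direct SVD that Dion itself never actually computes: a full orthonormalized update keeps every singular direction, while a rank-r version keeps only the strongest r of them.

```python
import torch

m, n = 4096, 4096
r = n // 16                          # a 1/16 rank fraction, per Microsoft's example
G = torch.randn(m, n)                # stand-in for a weight-update matrix

U, S, Vh = torch.linalg.svd(G, full_matrices=False)
full_update = U @ Vh                 # full-rank orthonormalization (Muon-style)
low_rank_update = U[:, :r] @ Vh[:r]  # keep only the top-r singular directions
```

The clever part, described next, is that Dion recovers that low-rank piece without ever running an SVD.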

It works through a method called amortized power iteration, which extracts the largest singular directions gradually over several steps while needing only two matrix multiplications per optimization step. A QR decomposition then builds an approximate orthonormal basis, all without ever computing a full SVD. An error-feedback mechanism keeps track of what each low-rank update leaves out so it can be applied later, making sure nothing important is lost.
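Pieced together from that description, a single optimizer step might look roughly like the sketch below. This is a loose, single-device reconstruction, not the released implementation; the function name and the exact form of the error-feedback line are my own guesses at how the pieces fit.

```python
import torch

def dion_like_step(X, G, M, Q, lr=0.01, mu=0.95):
    """One illustrative low-rank orthonormalized update (hypothetical sketch).

    X: (m, n) weights   G: (m, n) gradient
    M: (m, n) momentum  Q: (n, r) low-rank basis carried over from the last step
    """
    B = M + G                     # fold the fresh gradient into the momentum buffer
    P = B @ Q                     # matmul 1: one amortized power-iteration step
    P, _ = torch.linalg.qr(P)     # QR builds the approximate orthonormal basis
    R = B.T @ P                   # matmul 2: project the buffer onto that basis
    M = B - (1 - mu) * (P @ R.T)  # error feedback: whatever the rank-r update
                                  # misses stays in the buffer for later steps
    Q = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)  # refresh the basis
    X = X - lr * (P @ Q.T)        # apply the low-rank, orthonormal update
    return X, M, Q
```

Run step after step, the power iteration homes in on the buffer's largest singular directions, which is why one cheap iteration per optimizer step is enough.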

On smaller models, Dion can be a bit slower than Muon. But when the models get huge, like the 405B-parameter LLaMA-3, Microsoft says Dion can run up to 10 times faster than Muon with a rank fraction as low as 1/16. It also reportedly holds up better at large batch sizes, where update quality tends to slip with other optimizers.

The best part is that it is available now for anyone to try. Microsoft has released Dion’s PyTorch implementation with support for FSDP2 and tensor parallel setups, along with Muon for side-by-side testing. If the benchmarks hold up in the wild, Dion could become the go-to optimizer for anyone looking to train large models without breaking the GPU budget.

Written by Brian Fagioli

Brian Fagioli is a technology journalist and founder of NERDS.xyz. A former BetaNews writer, he has spent over a decade covering Linux, hardware, software, cybersecurity, and AI with a no-nonsense approach for real nerds.
