Training large foundation models has traditionally required complex parallelism strategies to overcome memory limits and data-movement bottlenecks. Once models reach a certain scale, teams often rely on expert or tensor sharding to distribute parameters across multiple devices, which adds latency and increases the risk of training instability. Zyphra’s latest work shows that there is another path. The company has trained its ZAYA1 Mixture-of-Experts model entirely on AMD Instinct MI300X GPUs without expert or tensor sharding, using the GPU’s large on-package memory and distributed I/O to simplify the system architecture while maintaining throughput.
Reducing Complexity in Large-Scale Training
Avoiding expert sharding removes one of the most persistent operational challenges in MoE development. As parameter counts rise, synchronising experts across multiple accelerators can increase communication overhead to the point where scaling gains start to flatten. With 192 GB of high-bandwidth memory per device, the MI300X provides enough capacity to hold ZAYA1-Base’s full parameter set on a single GPU, allowing Zyphra to keep routing local. The result is a model that matches or exceeds the performance of Llama-3-8B, Qwen3-4B and OLMoE while using only 760 million active parameters. For teams facing scaling constraints, this demonstrates how memory architecture can shape training strategy, rather than leaving the burden to software optimisation alone.
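To put the 192 GB figure in context, a rough back-of-envelope estimate (ours, not Zyphra’s, and assuming bf16 weights and gradients with standard fp32 Adam optimiser state) suggests that the full 8.3B-parameter model and its training state can plausibly sit on one device:

```python
# Illustrative memory estimate only; actual footprints depend on the training
# recipe, precision choices and activation memory, which the announcement does not detail.
total_params = 8.3e9                   # ZAYA1 total parameter count
bytes_per_param = 2 + 2 + 4 + 4 + 4    # bf16 weights + bf16 grads + fp32 master copy, m, v (Adam)
state_gb = total_params * bytes_per_param / 1e9
print(f"~{state_gb:.0f} GB of parameter and optimiser state vs 192 GB of HBM per MI300X")
# -> ~133 GB, leaving headroom for activations without sharding the experts
```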
Memory Architecture and System Performance
ZAYA1 is an 8.3-billion-parameter Mixture-of-Experts model designed for reasoning, mathematical and coding workloads using sparse activation. Only a fraction of the parameters are active for any given token, reducing compute demand while maintaining coverage across tasks. The MI300X’s memory capacity removes the need to distribute the full expert set across nodes, eliminating the communication steps associated with expert parallelism. Zyphra also reports checkpoint saves that are more than ten times faster using AMD-optimised distributed I/O, improving reliability during long training cycles. The model was trained on a cluster jointly engineered by AMD and IBM using MI300X accelerators and Pensando networking, supported by IBM Cloud’s storage fabric to sustain data movement without creating bottlenecks.
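As a concrete illustration of sparse activation, the sketch below shows a generic top-k expert router in PyTorch. The sizes and routing logic are illustrative assumptions, not ZAYA1’s actual architecture; the point is that only the k selected experts run for each token, and when every expert’s weights live in local HBM the dispatch reduces to an indexing step rather than a cross-device exchange.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k Mixture-of-Experts layer (illustrative, not ZAYA1's router)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalise over the k chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:                # expert e received no tokens this step
                continue
            # Only the selected experts execute: this is the sparse-activation saving.
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# With all experts resident in a single GPU's memory, routing stays a local operation.
print(TopKMoE()(torch.randn(16, 512)).shape)     # torch.Size([16, 512])
```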
Practical Integration and Workflows
For engineers evaluating MoE architectures, the takeaway is that platform selection can influence design choices earlier than expected. Storing full expert groups locally reduces scheduling overhead, which is often one of the hidden costs in sparse models. It also lowers the barrier for experimentation since early-stage research no longer depends on layered parallelism stacks. While ZAYA1 focuses on large language pretraining, the same approach applies to multimodal systems and industrial models where frequent checkpointing is required. The collaboration highlights a shift toward cloud hardware designed around accelerator memory rather than maximum compute density.
Shifts in AI Hardware Strategy
As foundation models continue to scale, performance bottlenecks are moving away from raw FLOPs and toward data movement and memory locality. Zyphra’s results suggest that GPUs with high memory capacity may offer a simpler scaling path for MoE architectures than increasing parallelism depth. This could influence future infrastructure decisions, particularly for organisations aiming to reduce training complexity while maintaining performance ceilings. It also reinforces the value of co-designing silicon, networking and software, since the efficiency gains came from the combined platform rather than a single component. For teams planning next-generation AI systems, ZAYA1 provides a real-world example of how memory-centric hardware can reshape the design space for large models.
Learn more and read the original announcement at www.amd.com
Image credit: AMD