At Hot Chips 2025, AMD architects presented a deep dive into the CDNA 4 architecture powering the new MI350 accelerator family. Building on the MI300 foundation, MI350 introduces major architectural refinements and performance enhancements.
The AI Boom and Hardware Demands #
Large Language Models (LLMs) continue to scale rapidly, requiring longer context lengths and greater memory capacity.
To sustain performance, hardware must deliver:
- Higher memory bandwidth and capacity
- Better energy efficiency
- Scalable multi-GPU clustering for massive AI models
MI350 Series Launch #
The MI350 family is now shipping, with two platform options:
- MI350X → air-cooled
- MI355X → liquid-cooled
Architectural Highlights #
- 185 billion transistors
- Chiplet + 3D stacking design
- 8 compute dies (XCDs) 3D-stacked on top of 2 I/O dies (4 XCDs per I/O die)
- Compute dies built on TSMC N3P 3nm
- I/O dies remain on 6nm
- Peak frequency: 2.4 GHz
- Liquid-cooled TDP: 1.4 kW
Infinity Fabric upgraded to IF 4:
- +2 TB/s bandwidth vs IF 3
- Fewer cross-die links → wider, lower-frequency D2D connections → higher efficiency
- 7 IF links per socket
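The wider-but-slower die-to-die trade-off above can be sketched with a toy model: aggregate bandwidth is lanes × per-lane rate, so doubling the width while halving the signaling rate keeps bandwidth constant, and if energy per bit rises with per-lane rate (a common first-order assumption), the wide link is more efficient. All numbers here are illustrative assumptions, not AMD specifications.

```python
def link_bandwidth_gbps(lanes: int, rate_gbps_per_lane: float) -> float:
    """Aggregate one-direction bandwidth of a die-to-die link, in Gb/s."""
    return lanes * rate_gbps_per_lane

def energy_pj_per_bit(rate_gbps_per_lane: float, k: float = 0.05) -> float:
    """Assumed first-order model: energy per bit scales with per-lane rate."""
    return k * rate_gbps_per_lane

# A narrow, fast link and a wide, slow link carry the same traffic...
narrow = link_bandwidth_gbps(lanes=64, rate_gbps_per_lane=32.0)   # 2048 Gb/s
wide = link_bandwidth_gbps(lanes=128, rate_gbps_per_lane=16.0)    # 2048 Gb/s
assert narrow == wide

# ...but under this model the wide link spends less energy per bit moved.
assert energy_pj_per_bit(16.0) < energy_pj_per_bit(32.0)
```

The lane counts, rates, and the linear pJ/bit model are placeholders; the point is only the shape of the trade-off the slide describes.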
Cache improvements:
- LDS doubled compared to MI300
- Each XCD has 4 MB L2 cache with coherence across dies
Data Formats and Compute Performance #
CDNA 4 introduces:
- New FP6 and FP4 formats
- Nearly 2× throughput for key data types
→ Net result: AMD claims AI math throughput more than 2× that of competing accelerators on these formats.
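To make FP4 concrete, the sketch below rounds values to a 4-bit floating-point grid. The value set shown is the E2M1 layout used by the OCP Microscaling (MX) FP4 format; whether CDNA 4's FP4 matches this encoding exactly is an assumption for illustration.

```python
# Positive values representable in an E2M1 (2 exponent bits, 1 mantissa bit)
# 4-bit float, as defined by the OCP MX FP4 format.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Full signed grid of representable values.
FP4_GRID = sorted({s * v for v in FP4_E2M1_VALUES for s in (1.0, -1.0)})

def quantize_fp4(x: float) -> float:
    """Round x to the nearest representable E2M1 value (nearest-value rounding)."""
    return min(FP4_GRID, key=lambda v: abs(v - x))

print(quantize_fp4(2.4))   # -> 2.0
print(quantize_fp4(5.1))   # -> 6.0
print(quantize_fp4(-0.3))  # -> -0.5
```

Note that in practice MX-style formats pair these 4-bit elements with a shared per-block scale factor, which is what makes such a coarse grid usable for AI math; the scale handling is omitted here for brevity.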
System and Platform Design #
- Configurable as a single NUMA domain or as dual NUMA domains
- XCDs can be partitioned into multiple logical GPUs
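Spatial partitioning can be pictured as grouping the 8 XCDs into equal logical GPUs. The helper below is a hypothetical sketch of that grouping, not the actual ROCm partitioning interface, and the set of allowed partition counts is an assumption.

```python
XCD_COUNT = 8  # compute dies per MI350 socket, per the slide above

def partition_xcds(num_partitions: int) -> list[list[int]]:
    """Split XCD ids 0..7 into equal contiguous groups (one per logical GPU)."""
    if num_partitions <= 0 or XCD_COUNT % num_partitions != 0:
        raise ValueError("partition count must evenly divide the XCD count")
    size = XCD_COUNT // num_partitions
    return [list(range(i * size, (i + 1) * size)) for i in range(num_partitions)]

print(partition_xcds(2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(partition_xcds(8))  # eight single-XCD logical GPUs
```

Smaller partitions trade aggregate capacity for isolation, which suits serving many small models on one socket; the single-partition mode keeps the whole accelerator visible as one GPU.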
Connectivity:
- Up to 8 GPUs in a fully connected topology via Infinity Fabric
- PCIe connects GPUs to CPUs and NICs
OAM modules + universal baseboard (UBB):
- Supports 8 GPUs per board
- Air-cooled rack: up to 64 GPUs
- Liquid-cooled rack: up to 96–128 GPUs
Software and Performance #
The ROCm 7 software stack is maturing alongside the hardware, delivering steady performance improvements.
Inference and training benchmarks show strong gains across workloads.
Roadmap Outlook #
AMD reaffirmed its roadmap:
- MI350 shipping now
- MI400 arriving next year with up to a 10× AI performance uplift
Conclusion #
- MI350/CDNA 4 continues the chiplet + 3D stacking strategy
- Bandwidth, cache, and efficiency are significantly improved
- AI data formats expanded (FP6, FP4), nearly doubling math throughput
- Flexible system design: NUMA partitioning and large-scale GPU topologies
- ROCm software keeps pace with hardware gains
- Roadmap remains solid with MI400 on the horizon