Colloquium Series: Weiyang "Frank" Wang, "Workload-Aware Networks for Machine Learning"

Bahen Centre for Information Technology, Room 3200, 40 Saint George Street, Toronto, ON, M5S 2E4, Canada

Speaker:

Weiyang "Frank" Wang

Talk Title:

Workload-Aware Networks for Machine Learning

Date and Location:

Thursday, March 26, 2026

Bahen Centre for Information Technology, BA 3200

This lecture is open to the public. No registration is required, but space is limited.

The grad roundtable that follows the talk is open only to current University of Toronto Department of Computer Science graduate students.

Abstract:

Today's ML workloads require networks that connect tens to hundreds of thousands of GPUs. Existing GPU clusters rely on network designs offering any-to-any connectivity while remaining agnostic to the data they carry. These traits are carried over from legacy CPU datacenters, limiting scalability and hindering GPU utilization.

I will present workload-aware networking, a systematic approach that exploits structures inherent to machine learning traffic to co-design networks with ML workloads. I start by showing that the network traffic of large language models (LLMs) exhibits a surprising property: it stays within the bottom layer of a switched network. This insight enables rail-only network designs that dramatically reduce cost and complexity. I then discuss TopoOpt, which uses reconfigurable networks to adapt to the repetitive, predictable traffic patterns of ML training, delivering performance improvements over today's network designs. Finally, I show that understanding traffic content in the network unlocks new functionality. I introduce Checkmate, a system that embeds checkpointing into the network through gradient replication, enabling per-iteration checkpointing with zero GPU overhead. I conclude with future directions for extending these principles to emerging workloads such as agentic AI and for building orchestration frameworks that automate network-workload co-design.

About Weiyang "Frank" Wang:

Weiyang "Frank" Wang is a final-year Ph.D. candidate at MIT CSAIL, working with Professor Manya Ghobadi. His research spans computer networking, machine learning (ML) systems, and reconfigurable networks. Frank designs network systems that co-optimize with ML workloads, revealing and exploiting the structure of ML traffic to improve performance and reduce cost and complexity. Frank is the author of Rail-only networks and TopoOpt, which have helped shape how the industry builds large-scale ML training networks. His Rail-only design has been supported in production systems at Juniper and Broadcom, and serves as a reference for recent network designs at Alibaba, ByteDance, and Meta. 

Frank’s webpage can be found at https://frank.csail.mit.edu.