Recently I attended the 50th (golden) anniversary celebration of Ethernet at the Computer History Museum. It was a reminder of how familiar and widely deployed Ethernet is and how it has evolved by orders of magnitude. Since the 1970s, it has progressed from a shared collision network at 2.94 megabits per second in the file/print/share era to the promise of Terabit Ethernet switching in the AI/ML era. Legacy Ethernot* alternatives such as Token Ring, FDDI, and ATM were ultimately subsumed by Ethernet. I believe history is going to repeat itself for AI networks.
Intense Networking for AI Data Exchange
AI workloads are demanding on networks because they are both data- and compute-intensive. The models are so large that their parameters must be distributed across thousands of processors. Large Language Models (LLMs) such as GPT-3, Chinchilla, and PaLM, as well as recommendation systems like DLRM and DHEN, are trained on clusters of many thousands of GPUs that share parameters with the other processors involved in the computation. In this compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown from a poor or congested network can critically impact AI application performance.
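To put that exchange volume in perspective, here is a back-of-the-envelope sketch (my own illustration; the parameter count, precision, and GPU count are assumptions, not figures from any particular deployment) of the traffic a single ring all-reduce step generates per GPU:

```python
# Back-of-the-envelope: bytes each GPU moves in one ring all-reduce step.
# Assumptions (illustrative only): 175B parameters, fp16 gradients,
# 1,024 GPUs participating in the ring.
params = 175e9            # GPT-3-scale parameter count (assumed)
bytes_per_grad = 2        # fp16 gradient (assumed)
n_gpus = 1024

payload = params * bytes_per_grad                 # ~350 GB of gradients
per_gpu = 2 * (n_gpus - 1) / n_gpus * payload     # classic ring all-reduce cost
print(f"each GPU sends/receives ~{per_gpu / 1e9:.0f} GB per step")
# ~699 GB per training step -- one congested link stalls every GPU in the ring.
```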
Ultra Ethernet Consortium
As AI applications drive inference and training with massive compute processors (GPUs/CPUs/TPUs), one must reimagine the high-speed transit of these mission-critical workloads. The upshot is wire-rate delivery of large synchronized bursts of data over a familiar, standards-based Ethernet network.
Arista and the founding members of the Ultra Ethernet Consortium (UEC) have set out on a mission to enhance the capabilities of Ethernet for AI and HPC. Ethernet brings the economics of broad deployment, familiarity with tools, and merchant silicon that tracks Moore's law and shrinking silicon geometries. Building on the proven IEEE Ethernet standards, UEC will advance Ethernet across the L2/L3, optical, and physical layers.
AI at Scale Needs Ethernet at Scale
As AI jobs grow, the underlying Ethernet network must be designed for high speed and scale to improve job completion time. UEC is endorsing three improvements:
- Packet Spraying: Networking topology in the 1990s was based on a spanning tree, ensuring one path from A to B to eliminate loops in the network. Then came multipathing in the 2010s with technologies like ECMP, where the network hashes each flow onto one of n equal-cost links between communicating partners. The next phase of AI network topology is packet spraying, which allows every flow to access all paths to the destination simultaneously (see the sketch after this list).
- Flexible Ordering: A key to AI job completion is a reliable bulk transfer from source to destination, where the AI application cares only that the complete message arrives, not the order in which its packets are delivered. The rigid ordering enforced by traditional InfiniBand technologies causes under-utilization of the available links and increases tail latencies for large-scale AI/ML. In contrast, flexible ordering keeps all Ethernet links optimally balanced, enforcing order only when the AI workload requires it in bandwidth-intensive operations.
- Congestion Management: Network congestion in high-performance networks is non-trivial. A common "incast" congestion problem can occur on the last link of the AI receiver when multiple uncoordinated senders simultaneously send traffic to it, and it becomes acute in "All-to-All" AI operations across GPU clusters, where the number of simultaneous senders grows with the cluster. Ethernet-based congestion control algorithms are critical for AI workloads to avoid hotspots and evenly spread the load across multipaths. They can be designed to work in conjunction with multipath packet spraying, enabling reliable transport of AI traffic (a toy rate-control loop follows the list).
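To make the first two ideas concrete, here is a minimal Python sketch, assuming a hypothetical sender with eight equal-cost paths: where classic ECMP hashes a whole flow onto one path, packet spraying sends each packet of a message down any available path, and a flexible-ordering receiver declares the message complete once every packet has arrived, in any order. The names and structure are illustrative, not the UEC specification.

```python
import random

N_PATHS = 8  # equal-cost paths between sender and receiver (assumed)

def ecmp_path(flow_id: int) -> int:
    """Classic ECMP: one hash pins the entire flow to a single path."""
    return hash(flow_id) % N_PATHS

def spray_paths(num_packets: int) -> list[int]:
    """Packet spraying: every packet of a flow may take any path."""
    return [seq % N_PATHS for seq in range(num_packets)]  # round-robin spray

class FlexibleReceiver:
    """Flexible ordering: track arrivals, not sequence -- the message is
    complete when every packet is in, regardless of arrival order."""
    def __init__(self, num_packets: int):
        self.missing = set(range(num_packets))

    def on_packet(self, seq: int) -> bool:
        self.missing.discard(seq)
        return not self.missing  # True once the whole message has landed

# A 16-packet message sprayed across all 8 paths, delivered out of order:
paths = spray_paths(16)
arrival = list(range(16))
random.shuffle(arrival)            # multipath delivery scrambles the order
rx = FlexibleReceiver(16)
done = [rx.on_packet(seq) for seq in arrival]
print("paths used:", sorted(set(paths)), "| message complete:", done[-1])
```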
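Congestion management can be sketched just as simply. The toy sender below, loosely in the spirit of DCTCP/DCQCN, halves its rate when acknowledgments come back ECN-marked and probes additively otherwise; UEC's actual algorithms are still being specified, so every constant here is an assumption.

```python
class EcnSender:
    """Toy ECN-reaction loop (assumed constants; not the UEC algorithm,
    which is still being defined)."""
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps

    def on_ack(self, ecn_marked: bool, line_rate_gbps: float):
        if ecn_marked:
            self.rate *= 0.5                      # back off fast on congestion
        else:
            self.rate = min(self.rate + 1.0,      # probe gently for headroom
                            line_rate_gbps)

# Incast: 4 senders converge on one 100G receiver link and adapt.
senders = [EcnSender(100.0) for _ in range(4)]
for _ in range(20):
    offered = sum(s.rate for s in senders)
    congested = offered > 100.0                   # last-hop link oversubscribed
    for s in senders:
        s.on_ack(congested, 100.0)
print([round(s.rate, 1) for s in senders])        # rates settle near a fair share
```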
Time for an RDMA Reboot
AI workloads cannot tolerate delays; a job completes only after all of its flows are successfully delivered. It takes just one worst-case congested link to throttle an entire AI workload. To predictably transfer vast amounts of data, AI networks need a transport protocol that, like TCP, works "out of the box." Arista and the Ultra Ethernet Consortium's founding members believe it is time to address and move beyond the limitations of RDMA (Remote Direct Memory Access). Traditional RDMA, as defined by the InfiniBand Trade Association (IBTA) decades ago, is showing its age under highly demanding AI/ML network traffic. RDMA transmits data in chunks of large flows, and those few large flows can leave links unbalanced and over-burdened, as the small simulation below illustrates.
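A small simulation (again my own illustration, not from the consortium) shows how quickly flow-level hashing goes wrong with a handful of elephant flows:

```python
import random
from collections import Counter

N_LINKS, N_FLOWS, TRIALS = 8, 8, 10_000

collisions = 0
for _ in range(TRIALS):
    # ECMP pins each large RDMA flow to one randomly hashed link.
    links = Counter(random.randrange(N_LINKS) for _ in range(N_FLOWS))
    if max(links.values()) >= 2:     # some link carries 2+ elephant flows
        collisions += 1

print(f"{collisions / TRIALS:.0%} of trials overload at least one link")
# With 8 flows hashed onto 8 links, nearly every trial has a collision;
# per-packet spraying would instead load all 8 links evenly.
```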
It is time to begin with a clean slate and build a modern transport protocol supporting RDMA for emerging applications. The UET (Ultra Ethernet Transport) protocol will incorporate the advantages of Ethernet/IP while addressing AI network scale for applications, endpoints, and processes, and while maintaining the goal of open standards and multi-vendor interoperability.
Summary: Ethernet Rises to the Occasion
Generative AI applications are pushing the envelope of networking scale, akin to using all highway lanes simultaneously and efficiently. Once again, Ethernet will ultimately emerge as the winner in networking for AI. Together with IP, Ethernet will drive numerous use cases for AI training and inference. Scalable, efficient mechanisms for packet spraying, flexible ordering, and modern congestion control will be infused into Ethernet and IP networks built for AI. Welcome to the new decade of AI Networking! I welcome your views at [email protected].
*Ethernot term coined by Bob Metcalfe, one of the original pioneers of Ethernet