Across industries, artificial intelligence (AI) is optimizing workflows, increasing efficiency, driving innovation—and prompting investments in accelerators, deep learning processors, and neural processing units (NPUs). Some organizations are starting small with retrieval-augmented generation (RAG) for inference tasks before progressively expanding to accommodate a larger number of users. Enterprises that handle large volumes of private data may prefer setting up their own training clusters to get the accuracy that custom models built on select data can deliver. Whether you’re investing in a small AI cluster with hundreds of accelerators or a massive setup with thousands, you’ll need a scale-out network to connect them all.
The key? Planning for and designing that network properly. A well-designed network ensures your accelerators hit peak performance, complete jobs faster, and keep tail latency to a minimum. To speed up job completion, the network needs to prevent congestion or, at the very least, catch it early. The network also needs to handle traffic smoothly, even during incast scenarios, where many senders transmit to one receiver at the same time. In other words, it should manage congestion promptly once it occurs.
That’s where Data Center Quantized Congestion Notification (DCQCN) comes in. DCQCN works best when explicit congestion notification (ECN) and priority flow control (PFC) are used together. ECN reacts early on a per-flow basis, while PFC serves as a hard mitigation measure to control congestion and prevent packet drops. Our Data Center Networking Blueprint for AI/ML Applications explains these concepts in detail. We have also introduced Nexus Dashboard AI fabric templates to facilitate deployment in accordance with the blueprint and best practices. In this blog, we’ll explain how Cisco Nexus 9000 Series Switches use a dynamic load-balancing approach to address congestion.
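To make the division of labor between ECN and PFC concrete, here is a minimal sketch of how a switch queue might escalate its response as it fills. The function name and both thresholds are hypothetical values chosen for illustration, not Nexus 9000 defaults, and real ECN marking is usually probabilistic between a minimum and maximum threshold rather than a single cutoff.

```python
def congestion_actions(queue_depth_kb, ecn_threshold_kb=150, pfc_xoff_kb=600):
    """Return the actions a switch queue might take at a given depth.

    Both thresholds are hypothetical, for illustration only.
    """
    actions = []
    if queue_depth_kb >= ecn_threshold_kb:
        # Early signal: mark packets so the affected flows slow down.
        actions.append("mark_ecn")
    if queue_depth_kb >= pfc_xoff_kb:
        # Hard mitigation: pause the upstream sender to avoid drops.
        actions.append("send_pfc_pause")
    return actions or ["forward"]


for depth in (50, 300, 700):
    print(depth, congestion_actions(depth))
```

The ECN threshold sits well below the PFC threshold, so marking has a chance to slow the senders before a pause is ever needed.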
Traditional and dynamic approaches to load balancing
Traditional load balancing uses equal-cost multipath (ECMP), a routing strategy in which a flow, once hashed to a path, typically stays on that path for its lifetime. When multiple flows stick to the same path, some links can become overused while others are underused, creating congestion on the overused links. In an AI training cluster, this can increase job completion times and tail latency, potentially jeopardizing the performance of training jobs.
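The pinning behavior described above can be illustrated with a small sketch. It assumes a generic hash over the flow's header fields; the hash function, the addresses, and the port numbers are all hypothetical and do not reflect the hashing a Nexus switch actually performs.

```python
import hashlib
from collections import Counter


def ecmp_path(flow, num_paths):
    """Pick a path index from a static hash of the flow's header fields."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port)
flows = [(f"10.0.0.{i}", "10.0.1.1", "udp", 49152 + i, 4791) for i in range(8)]
NUM_PATHS = 4

# Each flow is hashed once; every packet of that flow follows the same path.
assignment = {flow: ecmp_path(flow, NUM_PATHS) for flow in flows}
load = Counter(assignment.values())

for path in range(NUM_PATHS):
    print(f"path {path}: {load.get(path, 0)} flow(s)")
```

Nothing in the hash accounts for how busy a path is, so the flow counts per path can be uneven, and a flow never moves once it has been assigned.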
Because the network state is constantly changing, load balancing needs to be dynamic and driven by real-time feedback from network telemetry or user configurations. Dynamic load balancing (DLB) distributes traffic more efficiently by taking changes in the network into account. As a result, congestion can be avoided and overall performance improved. By continuously monitoring the network state, DLB can move a flow to a less-utilized path if its current path becomes overburdened.
The Nexus 9000 Series uses link utilization as a parameter when deciding how to distribute traffic across multiple paths. Since link utilization is dynamic, rebalancing flows based on path utilization allows for more efficient forwarding and reduces congestion. When comparing ECMP and DLB, understand this key difference: With ECMP, once a 5-tuple flow (identified by source and destination IP address, source and destination port, and protocol) is assigned to a particular path, it stays on that path, even if the link becomes congested or heavily utilized. DLB, on the other hand, starts by placing the 5-tuple flow on the least-used link. If that link becomes more heavily utilized, DLB dynamically shifts the next set of packets (known as a flowlet) to a different, less congested link.
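The flowlet behavior described above can be sketched roughly as follows. This is a simplified model, not the switch's implementation: it assumes a new flowlet starts after an idle gap in the flow (a common way to define flowlets), and it tracks utilization as a plain byte counter per link. The class name, the `flowlet_gap` parameter, and the numbers are all hypothetical.

```python
class FlowletBalancer:
    """Toy flowlet-based load balancer (illustrative assumptions throughout)."""

    def __init__(self, num_links, flowlet_gap):
        self.utilization = [0] * num_links  # bytes sent per link (simplified)
        self.flowlet_gap = flowlet_gap      # idle time that starts a new flowlet
        self.state = {}                     # flow -> (link, time of last packet)

    def least_used_link(self):
        return min(range(len(self.utilization)), key=lambda i: self.utilization[i])

    def forward(self, flow, size, now):
        entry = self.state.get(flow)
        if entry is None or now - entry[1] > self.flowlet_gap:
            # New flow or new flowlet: re-evaluate and pick the least-used link.
            link = self.least_used_link()
        else:
            # Same flowlet: stay on the current link to preserve packet order.
            link = entry[0]
        self.state[flow] = (link, now)
        self.utilization[link] += size
        return link


lb = FlowletBalancer(num_links=2, flowlet_gap=0.5)
a1 = lb.forward("A", 1000, now=0.0)  # new flow, both links idle
a2 = lb.forward("A", 1000, now=0.1)  # same flowlet, stays put
b1 = lb.forward("B", 100, now=0.2)   # new flow, goes to the idle link
a3 = lb.forward("A", 1000, now=1.0)  # idle gap exceeded, may move
print(a1, a2, b1, a3)  # prints "0 0 1 1"
```

Moving a flow only at a flowlet boundary is the design choice that keeps packets within a flowlet in order while still letting the flow react to changing utilization.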
For those who like to be in control, the Nexus 9000 Series’ DLB lets you fine-tune load balancing between input and output ports. By manually configuring pairings between input and output ports, you gain greater flexibility and precision in managing traffic, which lets you manage the load on output ports and reduce congestion. You can configure these pairings via the command-line interface (CLI) or an application programming interface (API), so manual traffic distribution remains practical even in large-scale networks.
The Nexus 9000 Series can spray packets across the fabric using per-packet load balancing, sending each packet over a different path to optimize traffic flow. Because packets are distributed randomly across the available links, this should provide optimal link utilization. However, packets may arrive out of order at the destination host. The host must either be capable of reordering packets or be able to process them as they arrive while still placing the data correctly in memory.
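To illustrate the reordering burden that per-packet spraying places on the host, here is a minimal sketch of a receive-side reorder buffer. It assumes each packet carries a sequence number starting at 0; the class name and the arrival order are hypothetical.

```python
class ReorderBuffer:
    """Hold out-of-order packets until the gap before them is filled."""

    def __init__(self):
        self.next_seq = 0   # next sequence number the application expects
        self.pending = {}   # seq -> payload, for packets that arrived early

    def receive(self, seq, payload):
        """Store a packet and return every payload now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


buf = ReorderBuffer()
arrival_order = [2, 0, 3, 1]  # packets took different paths through the fabric
delivered = []
for seq in arrival_order:
    delivered.extend(buf.receive(seq, f"pkt{seq}"))
print(delivered)  # prints ['pkt0', 'pkt1', 'pkt2', 'pkt3']
```

The alternative mentioned above, handling packets as they arrive, avoids this buffering but requires each packet to carry enough information to be placed at the right location in memory.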
Performance enhancements on the way
Looking toward the future, new standards will further improve performance. Members of the Ultra Ethernet Consortium, including Cisco, have been working to develop standards spanning many layers of the ISO/OSI stack to enhance both AI and high-performance computing (HPC) workloads. Here is what this could mean for Nexus 9000 Series Switches and what might be expected.
Scalable transport, better control
We’ve been focused on creating standards for a more scalable, flexible, secure, and integrated transport solution: Ultra Ethernet Transport (UET). The UET protocol defines a new connectionless transport method, meaning it does not require a “handshake” (a preliminary connection setup exchange between communicating devices) before data is sent. Instead, the connection is established as transport begins and is discarded once the transport is complete. This approach allows for better scalability and reduced latency and may even lower the cost of network interface cards (NICs).
Congestion control is built into the UET protocol, directing NICs to distribute traffic across all available paths in the fabric. Optionally, UET can use lightweight telemetry (round-trip time measurements) to collect information on network path utilization and congestion, delivering this data to the receiver. Packet trimming is another optional feature that helps detect congestion early. It works by forwarding only the header of a packet that would otherwise be dropped because a buffer is full. This gives the receiver a clear way to notify the sender about congestion, helping reduce retransmission delays.
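The packet trimming idea can be sketched as follows. This is a simplified model under stated assumptions, not the UET wire format: the `Packet` fields, the buffer capacity, and the separate list for trimmed headers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    header: bytes
    payload: bytes
    trimmed: bool = False


class TrimmingQueue:
    """Toy switch queue that trims payloads instead of dropping packets."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.queue = []            # full packets waiting to be forwarded
        self.trimmed_headers = []  # header-only packets, forwarded separately

    def enqueue(self, pkt):
        size = len(pkt.header) + len(pkt.payload)
        if self.used + size > self.capacity:
            # Buffer full: forward only the header instead of dropping silently.
            self.trimmed_headers.append(Packet(pkt.seq, pkt.header, b"", True))
        else:
            self.queue.append(pkt)
            self.used += size


q = TrimmingQueue(capacity_bytes=3000)
for seq in range(3):
    q.enqueue(Packet(seq, header=b"h" * 64, payload=b"p" * 1436))

# The receiver sees the trimmed headers and knows exactly what to ask for again.
needs_retransmit = [pkt.seq for pkt in q.trimmed_headers]
print(needs_retransmit)  # prints [2]
```

Because the header still arrives, the receiver learns which packet was lost without waiting for a timeout, which is where the reduction in retransmission delay comes from.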
UET is an end-to-end transport where endpoints (or NICs) participate equally with the network in transport. Connectionless transport originates and terminates at the sender and receiver. The network for this transport requires two traffic classes: one for data traffic and one for control traffic, which is used to acknowledge that data traffic is received. For data traffic, explicit congestion notification (ECN) is used to signal congestion on the path. Data traffic can also be transported over a lossless network, allowing flexible transport.
Ready for UET adoption and more
Nexus 9000 Series Switches are UEC-ready, making it easy to adopt the new UET protocol quickly and seamlessly with both your existing and new infrastructure. All the mandatory features are supported today. The nice-to-have optional features, such as packet trimming, are supported in Cisco Silicon One-based Nexus products. Additional features will be supported on Nexus 9000 Series Switches in the future.
Build your network for ultimate reliability, precise control, and peak performance with the Nexus 9000 Series. You can begin today by enabling dynamic load balancing for AI workloads. Then, once the UEC standards are ratified, we’ll be ready to help you upgrade to Ultra Ethernet NICs, unlocking the full potential of Ultra Ethernet and optimizing your fabric to future-proof your infrastructure. Ready to optimize your future? Start building it with the Nexus 9000 Series.