
Cisco IT deploys AI-ready data center in weeks, while scaling for the future


Cisco IT designed AI-ready infrastructure with Cisco compute, best-in-class NVIDIA GPUs, and Cisco networking that supports AI model training and inferencing across dozens of use cases for Cisco product and engineering teams. 

It’s no secret that the pressure to implement AI across the business presents challenges for IT teams. It pushes us to deploy new technology faster than ever before and to rethink how data centers are built to meet increasing demands across compute, networking, and storage. While the pace of innovation and business advancement is exhilarating, it can also feel daunting.

How do you quickly build the data center infrastructure needed to power AI workloads and keep up with critical business needs? This is exactly what our team, Cisco IT, was facing. 

The ask from the business

We were approached by a product team that needed a way to run AI workloads that would be used to develop and test new AI capabilities for Cisco products. It would eventually support model training and inferencing for multiple teams and dozens of use cases across the business. And they needed it done quickly. Given the need for the product teams to get innovations to our customers as fast as possible, we had to deliver the new environment in just three months.

The technology requirements

We began by mapping out the requirements for the new AI infrastructure. A non-blocking, lossless network was essential for the AI compute fabric to ensure reliable, predictable, and high-performance data transmission within the AI cluster, and Ethernet was the first-class choice. Other requirements included:

  • Intelligent buffering, low latency: As in any well-designed data center, these are essential for maintaining smooth data flow, minimizing delays, and enhancing the responsiveness of the AI fabric. 
  • Dynamic congestion avoidance for various workloads: AI workloads can vary significantly in their demands on network and compute resources. Dynamic congestion avoidance would ensure that resources were allocated efficiently, prevent performance degradation during peak usage, maintain consistent service levels, and prevent bottlenecks that could disrupt operations. 
  • Dedicated front-end and back-end networks, non-blocking fabric: With a goal to build scalable infrastructure, a non-blocking fabric would ensure sufficient bandwidth for data to flow freely, as well as enable a high-speed data transfer — which is crucial for handling large data volumes typical with AI applications. By segregating our front-end and back-end networks, we could enhance security, performance, and reliability. 
  • Automation for Day 0 to Day 2 operations: From initial deployment and configuration through ongoing management, we had to reduce manual intervention to keep processes quick and minimize human error (see the configuration-templating sketch after this list). 
  • Telemetry and visibility: Together, these capabilities would provide insights into system performance and health, which would allow for proactive management and troubleshooting. 
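
To illustrate the Day 0 side of that automation requirement, here is a minimal sketch of configuration templating for the leaf switches: generating per-switch configurations from a single template so that fabric bring-up needs no hand-typed CLI work. The hostnames, address pool, and template contents are hypothetical placeholders, not our production values.

# Minimal sketch of Day-0 automation: rendering switch configs from a
# template so fabric bring-up needs no manual CLI work. Hostnames, the
# loopback pool, and the template body are hypothetical placeholders.
from string import Template
from ipaddress import IPv4Network

LEAF_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface loopback0\n"
    "  ip address $loopback/32\n"
    "feature ospf\n"
    "router ospf UNDERLAY\n"
    "  router-id $loopback\n"
)

def render_leaf_configs(count: int, loopback_pool: str) -> dict[str, str]:
    """Return {hostname: config} for `count` leaf switches."""
    hosts = IPv4Network(loopback_pool).hosts()
    configs = {}
    for idx in range(1, count + 1):
        hostname = f"ai-leaf-{idx:02d}"
        configs[hostname] = LEAF_TEMPLATE.substitute(
            hostname=hostname, loopback=str(next(hosts))
        )
    return configs

if __name__ == "__main__":
    for name, cfg in render_leaf_configs(4, "10.0.0.0/28").items():
        print(f"--- {name} ---\n{cfg}")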

The plan – with a few challenges to overcome

With the requirements in place, we began figuring out where the cluster could be built. The existing data center facilities were not designed to support AI workloads. We knew that building from scratch with a full data center refresh would take 18-24 months, which was not an option. We needed to deliver an operational AI infrastructure in a matter of weeks, so we leveraged an existing facility with minor changes to cabling and device distribution to accommodate the new equipment. 

Our next concerns were around the data being used to train models. Since some of that data would not be stored locally in the same facility as our AI infrastructure, we decided to replicate data from other data centers into our AI infrastructure storage systems to avoid performance issues related to network latency. Our network team had to ensure sufficient network capacity to handle this data replication into the AI infrastructure.
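
As a rough illustration of that capacity question, the back-of-the-envelope sketch below estimates how long a one-time replication would take for a given dataset size and link speed. The dataset size, link speed, and utilization figures are illustrative assumptions, not the actual volumes we moved.

# Back-of-the-envelope sketch: how long replicating training data takes
# for a given link speed and average utilization. All figures below are
# illustrative assumptions, not values from this deployment.

def replication_hours(dataset_tb: float, link_gbps: float, utilization: float = 0.7) -> float:
    """Estimate hours to copy `dataset_tb` terabytes over a `link_gbps`
    gigabit/s link running at the given average utilization."""
    dataset_bits = dataset_tb * 8 * 10**12           # TB -> bits (decimal units)
    effective_bps = link_gbps * 10**9 * utilization  # usable bits per second
    return dataset_bits / effective_bps / 3600

if __name__ == "__main__":
    # e.g. 500 TB over a 100 Gbps link at 70% utilization -> ~16 hours
    print(f"{replication_hours(500, 100):.1f} hours")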

Now, getting to the actual infrastructure. We designed the heart of the AI infrastructure with Cisco compute, best-in-class GPUs from NVIDIA, and Cisco networking. On the networking side, we built a front-end Ethernet network and a back-end lossless Ethernet network. With this model, we were confident that we could quickly deploy advanced AI capabilities in any environment and continue to add them as we brought more facilities online.
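
For context on what "non-blocking" means in practice, here is a simple sketch of the oversubscription check for a leaf-and-spine design: a leaf is non-blocking when its uplink capacity toward the spines matches or exceeds its server-facing capacity. The port counts and speeds are illustrative, not our bill of materials.

# Sketch of the non-blocking check for a leaf-and-spine fabric: a leaf is
# non-blocking when uplink capacity to the spines is at least equal to its
# downlink capacity to the GPU servers. Port counts/speeds are illustrative.

def oversubscription_ratio(server_ports: int, server_gbps: int,
                           uplink_ports: int, uplink_gbps: int) -> float:
    """Downlink-to-uplink bandwidth ratio for one leaf; <= 1.0 is non-blocking."""
    downlink = server_ports * server_gbps
    uplink = uplink_ports * uplink_gbps
    return downlink / uplink

if __name__ == "__main__":
    # e.g. 16 x 400G server-facing ports and 16 x 400G spine-facing uplinks
    ratio = oversubscription_ratio(16, 400, 16, 400)
    print(f"oversubscription {ratio:.2f}:1 ->",
          "non-blocking" if ratio <= 1.0 else "oversubscribed")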


Supporting a growing environment

After making the initial infrastructure available, the business added more use cases each week and we added additional AI clusters to support them. We needed a way to make it all easier to manage, including managing the switch configurations and monitoring for packet loss. We used Cisco Nexus Dashboard, which dramatically streamlined operations and ensured we could grow and scale for the future. We were already using it in other parts of our data center operations, so it was easy to extend it to our AI infrastructure and didn’t require the team to learn an additional tool. 
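
As a simplified example of the packet-loss monitoring we wanted automated, the sketch below compares two interface-counter snapshots and flags any drop growth on the lossless back-end fabric. In our environment those counters come from Nexus Dashboard and the fabric's telemetry; the data source and sample values shown here are made up for illustration.

# Illustrative sketch of a packet-loss check: compare two counter snapshots
# per interface and flag any drop growth. The sample counters are made up;
# in practice they would come from the fabric's telemetry stream.

from typing import Dict

Counters = Dict[str, Dict[str, int]]  # interface -> {"tx_drops": n, "rx_drops": n}

def find_lossy_interfaces(before: Counters, after: Counters, threshold: int = 0) -> list:
    """Return interfaces whose total drop counters grew by more than `threshold`."""
    lossy = []
    for intf, new in after.items():
        old = before.get(intf, {"tx_drops": 0, "rx_drops": 0})
        delta = (new["tx_drops"] - old["tx_drops"]) + (new["rx_drops"] - old["rx_drops"])
        if delta > threshold:
            lossy.append(intf)
    return lossy

if __name__ == "__main__":
    t0 = {"Ethernet1/1": {"tx_drops": 0, "rx_drops": 0}}
    t1 = {"Ethernet1/1": {"tx_drops": 12, "rx_drops": 0}}
    print(find_lossy_interfaces(t0, t1))  # ['Ethernet1/1']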

The results

Our team moved fast and overcame several hurdles in designing the solution. We designed and deployed the back end of the AI fabric in under three hours and deployed the entire AI cluster and fabrics in three months, which was 80% faster than the alternative of a full rebuild.

Today, the environment supports more than 25 use cases across the business, with more added each week. This includes:

  • Webex Audio: Improving codec development for noise cancellation and lower bandwidth data prediction
  • Webex Video: Model training for background replacement, gesture recognition, and face landmarks
  • Custom LLM training for cybersecurity products and capabilities

Not only are we able to support the needs of the business today, but we are also designing how our data centers need to evolve for the future. We are actively building out more clusters and will share additional details on our journey in future blogs. The modularity and flexibility of Cisco’s networking, compute, and security give us confidence that we can keep scaling with the business. 

 


