Technology

Gartner: Key considerations when using GPUs in the datacentre


CIOs anticipate significant value from their artificial intelligence (AI) investments, including increased productivity, enhanced customer experience (CX) and digital transformation. As a result, Gartner client interest in deploying AI infrastructure – including graphics processing units (GPUs) and AI servers – has grown considerably.

Specifically, client enquiries regarding GPUs and AI infrastructure increased nearly fourfold annually from October 2022 through October 2024. Clients are exploring hosted, cloud and on-premise options for GPU deployment. In some cases, enterprises will select a “full-stack” AI offering that includes GPU, compute, storage and networking in a bundled package. In other instances, enterprises will select and deploy the components individually, choosing and integrating each themselves. The requirements of AI workloads differ from those of most existing datacentre workloads.

Several interconnect technologies are available to support GPU connectivity. A common question from Gartner clients is: “Should I use Ethernet, InfiniBand or NVLink to connect GPU clusters?” All three approaches can be valid, depending on the situation.

These technologies are not mutually exclusive. Enterprises can deploy them in conjunction with each other (for example, InfiniBand or Ethernet) to scale out beyond a rack. A common misconception is that only InfiniBand or a supplier-proprietary interconnect technology (such as NVLink) can deliver appropriate performance and reliability.

However, Gartner recommends that enterprises deploy Ethernet over other technologies, such as InfiniBand, for GPU clusters of up to several thousand GPUs. Ethernet-based infrastructure can provide the required reliability and performance, and there is widespread enterprise experience with the technology. Moreover, a broad ecosystem of suppliers is associated with Ethernet technology.

Optimise network deployments for GPU traffic

The current state of practice for central processing unit (CPU)-based, general-purpose computing workloads is a leaf-spine network topology.

However, leaf-spine topologies are not always optimal for AI workloads. In addition, running AI workloads colocated with existing datacentre networks can create noisy-neighbour effects that degrade performance for both AI and existing workloads. This can delay processing and job completion times for AI workloads, which is highly inefficient.
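Why topology matters comes down largely to switch hops: every extra switch a packet crosses adds forwarding latency. The toy sketch below illustrates this; the per-hop latency figure and the hop counts per topology are illustrative assumptions, not measured values.

```python
# Illustrative sketch: worst-case switch hops between two GPUs under
# different topologies, and the latency those hops add. The 1.0 us
# per-hop figure is an assumption for illustration only.

PER_HOP_LATENCY_US = 1.0  # assumed per-switch forwarding latency (microseconds)

def worst_case_hops(topology: str) -> int:
    """Worst-case number of switch hops between two GPUs."""
    hops = {
        "single-switch": 1,  # all GPUs attached to one switch
        "leaf-spine": 3,     # leaf -> spine -> leaf
        "three-tier": 5,     # leaf -> spine -> core -> spine -> leaf
    }
    return hops[topology]

def added_latency_us(topology: str) -> float:
    """Cumulative switch-forwarding latency along the worst-case path."""
    return worst_case_hops(topology) * PER_HOP_LATENCY_US

for t in ("single-switch", "leaf-spine", "three-tier"):
    print(f"{t}: {worst_case_hops(t)} hops, ~{added_latency_us(t):.1f} us")
```

A single-switch build cuts the worst-case path to one hop, which is the intuition behind the hop-minimisation advice that follows.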

In a buildout of AI infrastructure, networking switches typically represent 15% or less of the cost. As a result, saving money by reusing existing switches often leads to suboptimal overall price/performance for the AI workload investment. Consequently, Gartner makes several recommendations.

Because of the distinct traffic requirements and GPU costs, Gartner recommends building out dedicated physical switches for GPU connectivity. Furthermore, rather than defaulting to a leaf-spine topology, Gartner also recommends using a minimal number of physical switches to reduce physical “hops”. This could ultimately result in a leaf-spine topology, or in other topologies, including single-switch, two-switch, full-mesh, cube-mesh and dragonfly.

Avoid using the same switches for other generalised datacentre computing needs. For GPU clusters below 500 GPUs, one or two physical switches is ideal. For organisations with more than 500 GPUs, Gartner advises IT decision-makers to build out a dedicated AI Ethernet fabric. This is likely to require a deviation from the standard, state-of-practice top-of-rack topologies towards middle-of-row and/or modular switching implementations.
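As a rough illustration of why smaller clusters fit on one or two switches, the sketch below counts GPU-facing ports. The 64-port radix and the number of reserved inter-switch links are assumptions chosen for illustration; real switch radices and link reservations vary by product and design.

```python
# Hypothetical port-counting sketch, assuming 64-port switches.

SWITCH_PORTS = 64  # assumed switch radix (e.g. a 64 x 400GbE fixed switch)

def max_gpus(num_switches: int, inter_switch_links: int = 0) -> int:
    """GPU-facing ports remaining after reserving inter-switch links
    on each switch."""
    return num_switches * (SWITCH_PORTS - inter_switch_links)

# One switch: every port can face a GPU.
print(max_gpus(1))                          # 64
# Two switches, with (say) 8 ports per switch reserved to link the pair.
print(max_gpus(2, inter_switch_links=8))    # 112
```

Scaling the same arithmetic past a few hundred GPUs quickly exhausts one or two fixed switches, which is where a dedicated fabric with middle-of-row or modular switching comes in.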

Enhance Ethernet buildouts

Gartner recommends using dedicated switches for GPU connectivity. When deploying Ethernet (as opposed to InfiniBand or shelf/rack/row-optimised interconnects), use switches that meet specific requirements. Switches must support:

  • High-speed interfaces for GPUs, including access ports of 400Gbps and above.
  • Lossless Ethernet, including advanced congestion-handling mechanisms – for example, datacentre quantised congestion notification (DCQCN).
  • Advanced traffic-balancing capabilities, including congestion-aware load balancing.
  • Remote Direct Memory Access (RDMA)-aware load balancing and packet spraying.
  • Static pinning of flows.
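The contrast between static pinning and packet spraying can be shown with a toy model: per-flow hashing keeps every packet of a flow on one uplink, while spraying round-robins successive packets across all equal-cost uplinks. This is a hypothetical Python sketch of the distribution behaviour only; real switches implement both in hardware.

```python
# Toy model of two uplink-selection strategies across 4 equal-cost links.
import hashlib
from collections import Counter

UPLINKS = 4

def pin_flow(src: str, dst: str) -> int:
    """Static pinning: hash the flow identity so every packet of the
    flow takes the same uplink."""
    digest = hashlib.sha256(f"{src}->{dst}".encode()).digest()
    return digest[0] % UPLINKS

def spray(packet_seq: int) -> int:
    """Packet spraying: successive packets rotate across uplinks."""
    return packet_seq % UPLINKS

# One large "elephant" flow of 1,000 packets:
pinned = Counter(pin_flow("gpu0", "gpu7") for _ in range(1000))
sprayed = Counter(spray(i) for i in range(1000))
print(pinned)   # all 1,000 packets land on a single uplink
print(sprayed)  # 250 packets per uplink
```

Spraying keeps the large, long-lived flows typical of GPU collectives from saturating a single link, which is why RDMA-aware spraying appears in the requirements above; pinning remains useful when packet ordering must be preserved per flow.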

Moreover, the software used to manage AI networking fabrics must be enhanced as well. This requires functionality at the management layer to alert on, diagnose and remediate issues quickly. In particular, management software that provides advanced, granular telemetry (including sub-second and sub-100-millisecond intervals) is ideal for troubleshooting and visibility. In addition, the ability to monitor and alert in real time, and to provide historical reporting for bandwidth utilisation, packet loss, jitter, latency and availability at the sub-second level, is required.
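A minimal sketch of what sub-second telemetry collection might look like is below. The `get_interface_counters` callable, the 100ms interval and the loss threshold are all hypothetical assumptions; in practice the counters would come from a vendor's streaming-telemetry, gNMI or SNMP interface.

```python
# Hypothetical sub-second polling loop that alerts on packet loss.
import time

POLL_INTERVAL_S = 0.1          # 100 ms, per the sub-second guidance
LOSS_ALERT_THRESHOLD = 0.001   # alert above 0.1% loss (assumed threshold)

def check_loss(prev: dict, curr: dict) -> float:
    """Packet-loss ratio over one polling interval, from cumulative
    tx_packets / tx_drops counters."""
    sent = curr["tx_packets"] - prev["tx_packets"]
    dropped = curr["tx_drops"] - prev["tx_drops"]
    return dropped / sent if sent else 0.0

def poll_loop(get_interface_counters):
    """Poll a counter source forever, alerting when loss spikes."""
    prev = get_interface_counters()
    while True:
        time.sleep(POLL_INTERVAL_S)
        curr = get_interface_counters()
        loss = check_loss(prev, curr)
        if loss > LOSS_ALERT_THRESHOLD:
            print(f"ALERT: {loss:.4%} packet loss in last {POLL_INTERVAL_S}s")
        prev = curr
```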

Ultra Ethernet (and accelerator) support

When building fabrics, Gartner advises IT leaders to consider hardware suppliers that pledge to support the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UAL) specifications.

The UEC is developing an industry standard to support high-performance workloads on Ethernet. As of February 2025, there is no proposed standard available, but Gartner expects a proposal before the end of 2025. The need for a standard stems from the fact that suppliers currently use proprietary mechanisms to provide the high-performance Ethernet necessary for AI connectivity.

Long term, this reduces interoperability for customers, as it locks them into a single supplier’s implementation. The benefit of suppliers conforming to a consistent UEC standard is the ability to interoperate.

There is also a separate, but related, standards effort for a shelf/rack/row-optimised accelerator link, called UAL. The goal of UAL is to standardise a high-speed, scale-up accelerator interconnect technology aimed at addressing scale-up network bandwidth needs beyond what Ethernet and InfiniBand are currently capable of.

Reduce risk with co-certified implementations

Finally, because of the stringent performance requirements of AI workloads, connectivity between GPUs and network switches needs to be optimised and error-free from a hardware and software perspective. This can be increasingly challenging, given the rapid pace of change associated with both networking and GPU technology.

To mitigate the potential for implementation challenges, Gartner recommends following validated implementation guides that are co-certified (see box: Benefits of co-certification of networking and GPUs) by the networking and GPU suppliers. The value of following a co-certified design is that both suppliers should stand behind deployments carried out in accordance with the specification, ultimately lowering the likelihood of issues and reducing mean time to repair (MTTR) in the event of a problem.


This article is based on an excerpt from the Gartner report, Key networking practices to support AI workloads in the data centre. Andrew Lerner is a distinguished vice-president analyst at Gartner.