As a strong supporter of open standards, Jim Keller tweeted that Nvidia should have used the Ethernet protocol for chip-to-chip connectivity in its Blackwell-based GB200 GPUs for AI and HPC. Keller contends this could have saved Nvidia and users of its hardware a lot of money. It would also have made it a bit easier for those customers to migrate their software to different hardware platforms, which Nvidia doesn’t necessarily want.

When Nvidia introduced its GB200 GPU for AI and HPC applications, the company focused primarily on its AI performance and advanced memory subsystem, saying little about how the device was made. In fact, Nvidia’s GB200 comprises two compute processors stitched together using TSMC’s CoWoS-L packaging technology and NVLink interconnect technology, which relies on a proprietary protocol. This isn’t an issue for those who already use Nvidia’s hardware and software, but it poses a challenge for the rest of the industry when porting software away from Nvidia’s platforms.

(Image: Jim Keller and Raja Koduri)

There is a reason why Jim Keller, a legendary CPU designer and chief executive officer of Tenstorrent, an Nvidia rival, suggests that Nvidia should have used Ethernet instead of proprietary NVLink. Nvidia’s platforms use proprietary low-latency NVLink for chip-to-chip and server-to-server communications (which competes against PCIe with the CXL protocol on top) and proprietary InfiniBand connections for higher-tier communications. To maximize performance, software is tuned for the peculiarities of both technologies. For obvious reasons, this complicates porting that software to other hardware platforms, which is good for Nvidia and not so good for its competitors. (You can see his thread if you expand the tweet below.)

(Tweet embed: pic.twitter.com/RXMO7bRwEh, April 11, 2024)
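To make the porting friction concrete, here is a minimal, hypothetical Python sketch (not from Keller’s thread or the article) using PyTorch’s distributed API: the NCCL backend exploits NVLink and InfiniBand on Nvidia systems, while the generic Gloo backend runs over ordinary TCP/Ethernet, so tuning work done against one interconnect stack does not automatically carry over to the other.

```python
# Illustrative sketch only: shows how a training script's choice of collective
# backend ties it to a particular interconnect stack. NCCL (Nvidia's collective
# library) takes advantage of NVLink / InfiniBand when present; Gloo is a
# generic backend that runs over TCP sockets, i.e., plain Ethernet.
import torch
import torch.distributed as dist


def init_collectives(prefer_nvlink: bool) -> None:
    # Pick NCCL only when CUDA hardware is actually available; otherwise fall
    # back to Gloo. Expects the usual torchrun-style environment variables
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) to be set.
    backend = "nccl" if prefer_nvlink and torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)


def all_reduce_gradients(tensor: torch.Tensor) -> torch.Tensor:
    # The call looks identical on either backend, but achievable bandwidth and
    # latency (and therefore how you overlap communication with compute) differ
    # drastically between interconnects.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor


if __name__ == "__main__":
    init_collectives(prefer_nvlink=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    grads = torch.ones(1024, device=device)
    all_reduce_gradients(grads)
    dist.destroy_process_group()
```

The API surface stays the same, but the performance characteristics, and thus all the tuning around them, do not, which is the lock-in Keller is pointing at.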


There is a catch, though. Ethernet is a ubiquitous technology at both the hardware and software level, and it is a competitor to Nvidia’s low-latency, high-bandwidth (up to 400 Gb/s today) InfiniBand interconnect for data centers. Performance-wise, Ethernet (particularly next-generation 400 GbE and 800 GbE) can compete with InfiniBand.

However, InfiniBand still has some advantages for AI and HPC, including specialized features and superior tail latencies, so some might argue that Ethernet’s capabilities don’t fully cater to emerging AI and HPC workloads. Meanwhile, the industry, spearheaded by AMD, Broadcom, Intel, Meta, Microsoft, and Oracle, is developing the Ultra Ethernet interconnection technology, poised to offer higher throughput and features tailored to AI and HPC communications. Once it arrives, Ultra Ethernet will be a more viable competitor to Nvidia’s InfiniBand for these sorts of workloads.


Nvidia also faces challenges to its CUDA software platform dominance, hence the advent of the widely supported Unified Acceleration Foundation (UXL), an industry consortium that includes Arm, Intel, Qualcomm, and Samsung, among others, and that is intended to provide an alternative to CUDA.

Of course, Nvidia needs data center platforms it can deploy here and now, which is probably at least part of the reason it is willing to spend billions on proprietary technologies. If open-standard technologies like PCIe with CXL and Ultra Ethernet outpace Nvidia’s proprietary NVLink and InfiniBand in performance and capabilities, Nvidia will have to redevelop its platforms, which is why Keller advises (or trolls) that Nvidia should adopt Ethernet. However, that day may be years away, so for now Nvidia’s designs continue to leverage proprietary interconnects.


Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.