Nvidia has reportedly broken another AI world record, surpassing the 1,000 tokens per second (TPS) per user barrier with Meta’s Llama 4 Maverick large language model, according to Artificial Analysis in a post on LinkedIn. The breakthrough was achieved with Nvidia’s latest DGX B200 node, which features eight Blackwell GPUs.
Nvidia outperformed the previous record holder, SambaNova, by 31%, achieving 1,038 TPS/user versus the AI chipmaker’s prior record of 792 TPS/user. According to Artificial Analysis’s benchmark report, Nvidia and SambaNova are well ahead of everyone else in this performance metric. Amazon and Groq achieved scores just shy of 300 TPS/user; the rest (Fireworks, Lambda Labs, Kluster.ai, CentML, Google Vertex, Together.ai, Deepinfra, Novita, and Azure) all scored below 200 TPS/user.
Blackwell’s record-breaking result was achieved using a plethora of performance optimizations tailor-made to the Llama 4 Maverick architecture. Nvidia reportedly made extensive software optimizations using TensorRT and trained a speculative decoding draft model using Eagle-3 techniques, which are designed to accelerate inference in LLMs by predicting tokens ahead of time. These two optimizations alone achieved a 4x performance uplift compared to Blackwell’s best prior results.
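The core idea behind speculative decoding can be sketched in a few lines: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, committing every proposal that matches its own prediction. The toy "models" below are hypothetical stand-ins (simple arithmetic functions), not Nvidia's actual Eagle-3 draft model; the sketch only illustrates the accept-or-correct verification loop.

```python
# Toy stand-ins for the real models: each maps a context (a list of
# token ids) to its next predicted token. Purely illustrative.
def target_model(context):
    return (sum(context) * 31 + 7) % 100

def draft_model(context):
    # A cheaper model that agrees with the target most of the time.
    t = target_model(context)
    return t if t % 5 != 0 else (t + 1) % 100

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them with the target model.

    The target accepts the longest prefix of draft tokens that matches
    its own predictions, so several tokens can be committed per
    (expensive) target-model pass instead of just one.
    """
    # 1. The draft model proposes k tokens autoregressively.
    draft = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target model checks the proposals: keep tokens while the
    #    predictions agree; on the first mismatch, the target's own
    #    token is committed instead and the step ends.
    accepted = []
    ctx = list(context)
    for tok in draft:
        expected = target_model(ctx)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

print(speculative_step([1, 2, 3]))  # [93, 76, 32, 24]
```

When the draft model agrees with the target, as it does here, four tokens are produced for the price of one verification pass, which is where the inference speedup comes from.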
Performance was also boosted by using FP8 data types (rather than BF16) for Attention operations and for the Mixture of Experts technique, the approach popularized by the DeepSeek R1 model. Nvidia also shared a variety of other optimizations its software engineers made to the CUDA kernels, including techniques such as spatial partitioning and GEMM weight shuffling.
TPS/user is an AI performance metric that stands for tokens per second per user. Tokens are the foundation of LLM-powered software such as Copilot and ChatGPT; when you type a question into ChatGPT or Copilot, your words are broken into tokens (words or word fragments). The LLM takes these tokens and generates an answer, one token at a time, based on its training.
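A toy sketch of what tokenization means in practice. Real LLMs use subword tokenizers such as BPE rather than whitespace splitting; this simplified version only illustrates how a prompt becomes a sequence of tokens to be counted.

```python
# Toy tokenizer: splits on whitespace. Real tokenizers (e.g. BPE) split
# words into subword pieces, so token counts differ in practice.
def toy_tokenize(text):
    return text.lower().split()

prompt = "What is the fastest GPU?"
tokens = toy_tokenize(prompt)
print(tokens)       # ['what', 'is', 'the', 'fastest', 'gpu?']
print(len(tokens))  # 5
```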
The user part (of TPS/user) refers to single-user-focused benchmarking, as opposed to batched throughput. This method of benchmarking matters to AI chatbot developers because it tracks the experience of an individual person: the faster a GPU cluster can process tokens per second per user, the faster an AI chatbot will respond to you.
Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.