Nvidia co-organizes a contest to help build an AI dataset to accelerate GPU design

Despite their impressive capabilities in generating content, large language models (LLMs) are not so great at designing hardware. Believing this weakness is due to a lack of hardware design data to train the models, Nvidia, Georgia Institute of Technology, and others have organized a contest to help create the needed open-source public dataset.

Nvidia’s director of design automation research, Haoxing (Mark) Ren, recently announced the collaboration on X (formerly known as Twitter). Ren said the shortage of high-quality data specific to hardware design was “one of the bottlenecks for LLM-Assisted Hardware Design.” To address these shortcomings, Nvidia and others organized the ICCAD Contest on LLM-Assisted Hardware Code Generation.

AI developing GPUs to develop AI

Current efforts to design GPUs and other hardware with LLM assistance require “extensive human interaction.” The designs created by the LLMs are often either non-synthesizable or non-functional, or they are too simplistic or impractical. Researchers believe this is because of insufficient exposure to high-quality hardware design data during pretraining.

The lack of high-quality data is commonly regarded as one of the bottlenecks for LLM-Assisted Hardware Design. To develop an open-source, large-scale, high-quality dataset, we are co-organizing the LLM4HWDesign contest @ @ICCAD with EIC Lab @GaTech. https://t.co/qvwLZUlOVh July 8, 2024

Jeff Butts

After noting one Nvidia project’s success using an in-house large-scale Verilog code dataset, the organizers decided to enrich the current Verilog code dataset. The contest aims to build a large-scale, high-quality hardware design code dataset that will eventually be open-source.

The LLM4HWDesign contest runs in two phases. The first, data sample collection, ends July 20, 2024. The second, which runs from August 20 until October 1, will improve and fine-tune the datasets collected in Phase I. During Phase I, contest participants will begin with the existing Verilog dataset and expand it.

In Phase II, participants will use data filtering to remove low-quality data and develop techniques that automatically generate more accurate descriptions for the collected data samples. Finally, they’ll create labeling strategies to help the learning process for LLMs.
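To make the filtering step more concrete, here is a minimal Python sketch of what removing low-quality Verilog samples could look like. It is not the contest's actual pipeline: the function names, the 40-character minimum, and the heuristics (balanced module/endmodule keywords, duplicate removal after stripping comments and whitespace) are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical heuristics; the contest does not prescribe these rules.
MODULE_RE = re.compile(r"^\s*module\b", re.MULTILINE)
ENDMODULE_RE = re.compile(r"\bendmodule\b")


def strip_comments(code):
    """Remove /* block */ and // line comments from Verilog source."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)


def looks_usable(code, min_chars=40):
    """Cheap structural check: balanced module/endmodule and non-trivial size."""
    body = strip_comments(code)
    n_mod = len(MODULE_RE.findall(body))
    n_end = len(ENDMODULE_RE.findall(body))
    return n_mod > 0 and n_mod == n_end and len(body.strip()) >= min_chars


def filter_samples(samples):
    """Drop samples that fail the check or duplicate an earlier sample."""
    seen = set()
    kept = []
    for code in samples:
        if not looks_usable(code):
            continue
        # Normalize whitespace so formatting-only variants count as duplicates.
        normalized = " ".join(strip_comments(code).split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(code)
    return kept
```

A real filter would likely go further, for example by running each sample through a synthesis or lint tool, since a text-level check like this cannot tell whether a design is actually synthesizable or functional.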


Jeff Butts has been covering tech news for more than a decade, and his IT experience predates the internet. Yes, he remembers when 9600 baud was “fast.” He especially enjoys covering DIY and Maker topics, along with anything on the bleeding edge of technology.