
Artificial intelligence is often tested on academic exams and coding challenges, but those do not always reflect what people actually do at work. OpenAI is trying something different with the launch of GDPval, a new benchmark designed to measure how well AI models perform on economically valuable, real-world tasks across 44 occupations.
The name “GDPval” is a play on Gross Domestic Product. OpenAI started by identifying the industries that contribute most to U.S. GDP and then pulled tasks from occupations with the largest wage impact in those sectors. The result is a test set of 1,320 specialized tasks, with 220 open-sourced in a gold set, drawn from fields ranging from nursing and law to software development and journalism.
Unlike traditional academic-style benchmarks, GDPval is built around real deliverables. These are not just prompts to answer in text: they include things like a legal brief, an engineering blueprint, a nursing care plan, or a slide presentation. Each task was designed by seasoned professionals with an average of 14 years of experience, and each went through multiple rounds of expert review.
Early results show that today’s top AI models are starting to approach the quality of work produced by industry experts. In blind evaluations, professional graders compared outputs from several models, including GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, against human deliverables. Claude Opus 4.1 performed best overall, particularly on polished presentation and formatting, while GPT-5 stood out for accuracy and depth of domain-specific knowledge. OpenAI notes that performance more than doubled between GPT-4o in 2024 and GPT-5 in 2025.
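The headline numbers come from those blind pairwise comparisons: a grader sees two deliverables for the same task, without knowing which one the model produced, and records a preference. As a minimal sketch of how such verdicts roll up into a score (the verdict labels and the sample data here are hypothetical, not from OpenAI's dataset):

```python
from collections import Counter

def win_tie_rates(verdicts):
    """Compute a model's win and tie rates from blinded pairwise verdicts.

    Each verdict is 'model', 'human', or 'tie', recording which deliverable
    the expert grader preferred without knowing which was AI-generated.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    return counts["model"] / total, counts["tie"] / total

# Hypothetical grader verdicts for one model across ten tasks.
verdicts = ["model", "human", "tie", "model", "human",
            "model", "tie", "human", "model", "human"]
win, tie = win_tie_rates(verdicts)
print(f"win rate: {win:.0%}, win-or-tie rate: {win + tie:.0%}")
# → win rate: 40%, win-or-tie rate: 60%
```

Reporting "win or tie" alongside outright wins matters because a tie still means the model's deliverable was indistinguishable in quality from the expert's.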
There is also a stark efficiency gap. Models can complete GDPval tasks up to 100 times faster and cheaper than human experts, though OpenAI concedes this figure excludes the human oversight, iteration, and integration that real workplaces require. Still, the numbers hint at where companies may turn to AI first: routine but time-consuming assignments that do not demand as much judgment or creativity.
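The 100x claim reduces to two simple ratios: wall-clock time and dollar cost for the same task. A rough illustration of that arithmetic (all inputs below are hypothetical, not figures from OpenAI):

```python
def speed_cost_advantage(expert_hours, expert_hourly_wage,
                         model_minutes, model_api_cost):
    """Return (speedup, cost ratio) of a model run versus an expert
    completing the same task. All inputs are illustrative."""
    speedup = (expert_hours * 60) / model_minutes
    cost_ratio = (expert_hours * expert_hourly_wage) / model_api_cost
    return speedup, cost_ratio

# Hypothetical task: 7 hours of expert work at $75/hour versus a
# 5-minute model run costing $4 in API usage.
speedup, cost_ratio = speed_cost_advantage(7, 75, 5, 4)
print(f"~{speedup:.0f}x faster, ~{cost_ratio:.0f}x cheaper")
# → ~84x faster, ~131x cheaper
```

Note what this arithmetic leaves out, as OpenAI itself does: the cost of a human reviewing, correcting, and integrating the model's output, which can erode much of the apparent advantage.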
The benchmark spans 44 knowledge work occupations across nine industries, including health care, law, finance, government, media, real estate, and manufacturing. By starting with roles where AI can realistically help, OpenAI argues GDPval provides a clearer picture of how AI might complement professionals rather than replace them outright.
Personally, I think this is really cool. I know from experience that ChatGPT is great, but it is nice to see data backing that up. Concrete evidence will also help people compare models and track how they improve over time, instead of relying on marketing claims or hype.
Like any early framework, GDPval has limitations. The current version only measures one-shot performance, so it does not capture workflows that involve multiple drafts or ongoing client feedback. Real-world jobs are often messy and ambiguous, and GDPval still simplifies that complexity. Future iterations will expand to more interactive and context-rich scenarios.
OpenAI is also inviting industry experts and organizations to contribute to the project, signaling that GDPval is meant to evolve through community input. The company has released part of the dataset and a public grading service at evals.openai.com so other researchers can build on it.
The takeaway is that AI models are no longer just passing trivia tests. They are beginning to compete with skilled professionals on real-world tasks. Whether that leads to productivity boosts, job reshaping, or broader economic shifts remains to be seen, but GDPval offers a more grounded way to measure that progress.