OpenAI just exposed how bad AI still is at real science

Artificial intelligence companies love to talk about how their models are transforming science. Depending on who you ask, AI is on the verge of curing diseases, discovering new drugs, or completely reinventing research itself. OpenAI’s newly announced LifeSciBench benchmark, however, offers a much more grounded reality check.

LifeSciBench is a new evaluation framework designed to test how well AI systems handle actual life science research tasks. Rather than relying on simple multiple-choice questions or textbook biology trivia, the benchmark focuses on the sort of work scientists perform every day. That includes analyzing evidence, interpreting experimental results, designing studies, validating findings, communicating conclusions, and making decisions when the available data is incomplete or uncertain.

In other words, LifeSciBench attempts to measure whether AI can contribute to real scientific work rather than simply regurgitate facts.

The results are both encouraging and humbling.

OpenAI says its newest model, GPT-Rosalind, achieved the highest overall performance on the benchmark. That sounds impressive until you look at the numbers. The model posted a pass rate of just 36.1 percent. While that is a meaningful improvement over GPT-5.5’s 25.7 percent score, a gain of 10.4 percentage points, it also means the best-performing model failed nearly two-thirds of the benchmark’s tasks.

That matters because these aren’t trick questions. The benchmark was created by 173 scientists with Ph.D.-level training and industry experience in biotechnology and pharmaceuticals. More than 450 additional experts reviewed the tasks. Many require models to interpret figures, PDFs, spreadsheets, scientific data files, and other materials commonly used by researchers.

The benchmark also revealed a familiar weakness. AI systems generally perform better when everything is presented as text. Once they are forced to work with supporting documents, figures, or complex datasets, performance drops noticeably. GPT-Rosalind’s pass rate fell from 45.1 percent on text-only tasks to 28.1 percent on tasks involving artifacts or URLs.
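For readers who want to check the arithmetic behind these figures, here is a minimal sketch. It uses only the four pass rates quoted in this article; the variable names and the `summarize` helper are made up for illustration and are not part of LifeSciBench or any OpenAI tooling.

```python
# Pass rates (in percent) as quoted in this article.
# All names below are illustrative, not taken from the benchmark itself.
ROSALIND_OVERALL = 36.1
GPT55_OVERALL = 25.7
ROSALIND_TEXT_ONLY = 45.1
ROSALIND_ARTIFACTS = 28.1


def summarize() -> dict:
    """Derive a few headline numbers from the quoted pass rates."""
    absolute_gain = ROSALIND_OVERALL - GPT55_OVERALL
    return {
        # Share of tasks the best model did not pass.
        "failure_rate": round(100 - ROSALIND_OVERALL, 1),
        # Improvement over GPT-5.5 in percentage points.
        "absolute_gain_points": round(absolute_gain, 1),
        # The same improvement relative to GPT-5.5's own score.
        "relative_gain_percent": round(absolute_gain / GPT55_OVERALL * 100, 1),
        # Drop when tasks involve artifacts or URLs instead of text only.
        "modality_gap_points": round(ROSALIND_TEXT_ONLY - ROSALIND_ARTIFACTS, 1),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Put plainly, the best model fails 63.9 percent of tasks, beats GPT-5.5 by 10.4 points (roughly a 40 percent relative improvement), and loses 17 points when it has to work with files or links rather than plain text.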

To be fair, the benchmark is not intended to suggest AI is useless in research. Quite the opposite. OpenAI found that models are becoming increasingly capable of scientific communication, evidence synthesis, and translating research findings into practical explanations. Those are valuable skills, particularly for researchers drowning in information.

But LifeSciBench serves as a useful reminder that today’s AI systems are still far from autonomous scientists. They can assist, and they can sometimes provide surprisingly useful insights. What they cannot reliably do, however, is replace the expertise, judgment, and skepticism that real scientific research requires.

Ironically, OpenAI may have created one of the strongest arguments against some of the industry’s more ambitious claims. If the company’s own benchmark shows that its best model struggles with a majority of realistic scientific tasks, perhaps we’re not quite as close to AI-powered scientific breakthroughs as some headlines would have you believe.

Written by

Brian Fagioli

Technology journalist and founder of NERDS.xyz

Brian Fagioli is a technology journalist and founder of NERDS.xyz. A former BetaNews writer, he has spent over a decade covering Linux, hardware, software, cybersecurity, and AI with a no-nonsense approach for real nerds.
