OpenAI’s new GeneBench-Pro benchmark exposes AI’s biggest weakness

Artificial intelligence has gotten pretty good at writing code, answering questions, and cranking out summaries. Actual scientific research? Well, that’s a different animal entirely.

OpenAI today introduced GeneBench-Pro, a new benchmark designed to test whether AI models can handle the messy judgment calls that define real computational biology work. The company isn’t interested in whether a model can follow instructions or execute a predefined workflow. It wants to know whether AI can make the kinds of decisions scientists make every day when the data doesn’t cooperate.

That sounds simple enough until you remember that scientific data rarely comes with an instruction manual attached.

You see, researchers constantly have to decide whether a signal represents real biology or random noise, whether a dataset is good enough to answer the question being asked, and whether a surprising result means the experiment worked or something went horribly wrong. OpenAI refers to these chains of decisions as “research taste,” which is the sort of intuition and judgment that separates experienced researchers from people simply following a recipe.

GeneBench-Pro attempts to measure exactly that.

The benchmark contains 129 research-level problems spanning statistical genetics, cancer genomics, proteomics, pharmacogenomics, diagnostics, microbial genomics, population genetics, and several other disciplines. Instead of handing AI models neat datasets and step-by-step instructions, GeneBench-Pro throws them into situations that more closely resemble actual research projects.

The model receives messy data, a brief description of the problem, and a target outcome. From there, it has to figure out what questions the data can realistically answer, clean the data, choose the proper methods, identify problems, revise assumptions, and eventually arrive at a conclusion that could influence a scientific or clinical decision.

Interestingly, OpenAI built the entire benchmark using synthetic datasets rather than historical studies. That gives the company full visibility into the underlying causal structure while avoiding a common benchmark problem where multiple valid analytical approaches exist but only one receives credit from the grader.

Perhaps the most interesting idea in the paper is what OpenAI calls the “notice-act gap.”

According to the company, weaker AI models often recognize that something in the data looks wrong but fail to change their analysis as a result. A model might spot confounding variables, poor quality data, or questionable assumptions, only to continue down the exact same path anyway. Stronger models are getting better at carrying those observations forward and adjusting their methods accordingly.
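To make the idea concrete, here is a minimal sketch of how one might put a number on that gap. To be clear, this is my own illustration and not OpenAI's metric: the `Run` record, its `noticed` and `acted` fields, and the sample data are all hypothetical. It simply asks, of the runs where a model flagged a problem, what fraction carried on without changing the analysis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    """One model attempt at a problem (hypothetical annotation scheme)."""
    noticed: bool  # assumption: the model flagged a data issue in its notes
    acted: bool    # assumption: the model changed its analysis because of it


def notice_act_gap(runs: List[Run]) -> Optional[float]:
    """Share of 'noticed' runs where the model did not act on what it saw.

    Returns None when no run noticed anything, since the gap is undefined.
    """
    noticed_runs = [r for r in runs if r.noticed]
    if not noticed_runs:
        return None
    ignored = sum(1 for r in noticed_runs if not r.acted)
    return ignored / len(noticed_runs)


# Made-up example data, not results from the benchmark.
sample_runs = [
    Run(noticed=True, acted=True),
    Run(noticed=True, acted=False),
    Run(noticed=True, acted=False),
    Run(noticed=False, acted=False),
    Run(noticed=True, acted=True),
]

gap = notice_act_gap(sample_runs)
print(f"Notice-act gap on sample data: {gap:.0%}")
```

A lower number would mean the model is following through on its own observations more often, which is the behavior OpenAI says stronger models are starting to show.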

The results show there is still a long way to go.

OpenAI says GPT-5.6 Sol achieved a pass rate of 28.7 percent at its highest reasoning setting, rising to 31.5 percent with Pro mode enabled. Claude Opus 4.8 was the strongest non-GPT competitor at 16 percent, while several other frontier models struggled to break into double digits.

In other words, OpenAI’s best model still failed more than two-thirds of the benchmark’s problems. Frankly, that’s probably reassuring.
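For anyone who wants to check the math, here is a quick back-of-the-envelope sketch using the figures quoted above. One assumption to flag: converting a pass rate into a count of solved problems treats the rate as a simple share of the 129 problems, which may not match how OpenAI actually scores the benchmark (for example, if rates are averaged over repeated attempts). Treat the counts as rough.

```python
# Pass rates as quoted in this article, in percent.
PASS_RATES = {
    "GPT-5.6 Sol (highest reasoning)": 28.7,
    "GPT-5.6 Sol (Pro mode)": 31.5,
    "Claude Opus 4.8": 16.0,
}
TOTAL_PROBLEMS = 129  # benchmark size quoted in this article


def failure_rate(pass_rate_pct: float) -> float:
    """Percent of problems not passed, rounded to one decimal place."""
    return round(100.0 - pass_rate_pct, 1)


def approx_solved(pass_rate_pct: float, total: int = TOTAL_PROBLEMS) -> int:
    """Rough count of solved problems.

    Assumption: the pass rate is a plain fraction of all problems.
    """
    return round(pass_rate_pct / 100.0 * total)


TWO_THIRDS_PCT = 200.0 / 3.0  # about 66.7 percent

for name, rate in PASS_RATES.items():
    failed = failure_rate(rate)
    print(
        f"{name}: failed {failed}% "
        f"(roughly {approx_solved(rate)} of {TOTAL_PROBLEMS} solved), "
        f"more than two-thirds failed: {failed > TWO_THIRDS_PCT}"
    )
```

Even with Pro mode enabled, the failure rate works out to 68.5 percent, which sits just above the two-thirds mark.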

If an AI system suddenly started acing research tasks that human experts estimate require anywhere from 20 to 40 hours of work, scientists everywhere would have good reason to be nervous. Instead, OpenAI’s own findings suggest today’s models behave more like junior researchers than seasoned veterans. They can spot issues in the data, but they often struggle to understand what those issues mean for everything that comes next.

That doesn’t make these systems useless. Far from it.

OpenAI estimates that solving one of these problems with human experts can cost thousands of dollars in labor, while an AI model can attempt the same work for only a few dollars in inference costs. Even imperfect automation could become enormously valuable for drug companies, biotech firms, and research institutions looking to move faster.

If all of this sounds familiar, it should. Earlier this month, I wrote about how frontier AI models still struggle with real scientific work once problems move beyond textbook answers and tidy datasets. GeneBench-Pro doesn’t really contradict that conclusion. If anything, OpenAI’s own benchmark reinforces it. The models are improving quickly, but there is still a massive difference between producing an answer and demonstrating the judgment needed to trust it.

Look, folks, I’ve said it before and I’ll say it again: AI is becoming an incredibly powerful tool for experts, but replacing experts is a much taller order. Knowing how to press the button is easy. Knowing whether you should press it in the first place is where the real work begins.

Written by

Brian Fagioli

Brian Fagioli is a technology journalist and founder of NERDS.xyz. A former BetaNews writer, he has spent over a decade covering Linux, hardware, software, cybersecurity, and AI with a no nonsense approach for real nerds.
