
Questions Raised Over OpenAI o3 Benchmark Result Transparency

A noticeable gap between OpenAI’s internal benchmarks and third-party testing results for its o3 AI model is stirring debate over the company’s transparency and evaluation methods.

When o3 was first introduced in December, OpenAI reported that the model solved over 25% of the problems on FrontierMath—a notoriously difficult math benchmark. That figure dramatically outpaced competitors, none of which crossed the 2% mark at the time.

“Today, all offerings out there have less than 2% [on FrontierMath],” said OpenAI Chief Research Officer Mark Chen during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

But that impressive number appears to reflect a highly optimized version of o3—one powered by significantly more compute resources than the publicly released model.

Epoch AI, the organization behind FrontierMath, published its independent benchmark of o3 on Friday. Its tests pegged o3’s performance closer to 10%, considerably lower than OpenAI’s top-end claim.

That doesn’t necessarily mean OpenAI misrepresented the results. The company’s December benchmark disclosure did include a lower-bound score that aligns with Epoch’s findings. Additionally, differences in evaluation setups and problem sets likely contributed to the gap.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath,” Epoch explained, referencing two versions of the benchmark set.

The ARC Prize Foundation, which tested an early version of o3, added further context in a post on X. According to the foundation, the public o3 model is a distinct variant tailored more for chat and product integration than for raw benchmark performance.

“All released o3 compute tiers are smaller than the version we [benchmarked],” ARC wrote, noting that larger compute configurations tend to fare better in testing.

During a livestream last week, OpenAI’s Wenda Zhou acknowledged the performance gap. He explained that the production o3 model was tuned for speed and real-world efficiency, which could lead to “disparities” in benchmarking outcomes.

“We’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

Still, some might view the discrepancy between internal and external benchmarks as a reflection of a broader industry trend—where benchmark results are often optimized for press attention rather than practical accuracy. In this case, the debate may be academic, as OpenAI’s o3-mini-high and o4-mini already outperform the public o3, and the upcoming o3-pro promises another step up in capability.

Yet the episode underscores a persistent issue in AI: benchmark scores should be approached with caution, especially when they come from companies with a product to promote.

Benchmark drama is becoming increasingly common as AI firms jostle for the spotlight with each model release.

Back in January, Epoch was criticized for not disclosing OpenAI funding until after o3’s announcement. Many researchers involved in FrontierMath reportedly learned of OpenAI’s involvement only after the fact.

More recently, Elon Musk’s xAI was accused of using skewed benchmarks to promote Grok 3, while Meta admitted to citing benchmark scores from a model variant different from the one it ultimately released.
