OpenAI’s recently released o3 and o4-mini reasoning models deliver serious advances in AI problem-solving while paradoxically exhibiting an unexpected increase in hallucinations: fabricated information presented as fact.
The release comes at a pivotal time for the AI pioneer: CEO Sam Altman is looking to rival the likes of Elon Musk with a new social media network, and the company has just received a huge $40 billion investment boost from SoftBank.
Mathematical Marvel, Fact-Checking Nightmare
The new models excel at complex reasoning tasks across mathematics, coding, and scientific domains.
According to OpenAI’s technical report, o3 achieves state-of-the-art performance on prestigious benchmarks including Codeforces, SWE-bench, and MMMU. Notably, o3 can now integrate images directly into its reasoning process, analyzing visual inputs like whiteboards, textbook diagrams, or hand-drawn sketches—even when they’re blurry or low-quality.
o3 and o4-mini have full access to tools within ChatGPT, including web search, Python coding, and image generation.
As OpenAI stated in their official announcement:
Our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images.
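For developers, the same tool-use pattern is exposed through OpenAI’s API. Below is a minimal sketch, assuming the o3 model and the built-in web search tool are available to your account via OpenAI’s Python SDK and Responses API; tool names and model availability vary, so treat this as illustrative rather than definitive:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask o3 a question and let it decide whether to invoke the web search tool.
response = client.responses.create(
    model="o3",  # assumption: o3 is enabled for your API account
    tools=[{"type": "web_search_preview"}],  # built-in web search tool
    input="Summarize recent published findings on LLM hallucination rates.",
)

# output_text concatenates the model's final text output.
print(response.output_text)
```

Grounding answers in live search results like this is also one of the mitigations discussed further below.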
Fact vs Fiction: o3’s Fabrication Problem
Despite these impressive capabilities, o3 and o4-mini hallucinate significantly more than their predecessors.
On OpenAI’s internal PersonQA benchmark, o3 hallucinated in 33% of responses – approximately double the rate of predecessors like o1 (16%) and o3-mini (14.8%). The smaller o4-mini performed even worse, hallucinating in 48% of cases.
Testing by Transluce, a nonprofit AI research lab, revealed troubling behaviors, including o3 “claiming that it ran code on a 2021 MacBook Pro ‘outside of ChatGPT,’ then copied the numbers into its answer,” behavior entirely beyond the model’s actual capabilities.
More concerning still is OpenAI’s apparent uncertainty about the root cause. In their technical report, they simply state that “more research is needed to understand the cause of this result,” suggesting this wasn’t an anticipated tradeoff but a puzzling development.
The Double-Edged Sword: Balancing o3’s Pros and Cons
For Consumers
- Strengths: Enhanced problem-solving for math, coding, and scientific questions; improved visual understanding; creative tasks leveraging multiple tools simultaneously. See our ‘Prompting like a Pro’ guide for ChatGPT.
- Cautions: Double-check factual claims, especially regarding people, places, and specific references; be skeptical of any links provided (see the link-checking sketch after these lists)
For Businesses
- Opportunities: More powerful document analysis, data visualization capabilities, and creative content generation
- Risks: Potential factual errors in business-critical content; hallucinated capabilities that might compromise reliability in high-stakes applications like legal, medical, or financial contexts
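One lightweight safeguard, given the models’ tendency to produce plausible but broken citations, is to verify any URLs in a response before trusting them. Here is a minimal sketch in Python; the regex and timeout are illustrative choices, not a complete defense against fabricated sources:

```python
import re

import requests

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")

def check_links(model_output: str, timeout: float = 5.0) -> dict[str, bool]:
    """Map each URL found in the model's output to whether it resolves."""
    results: dict[str, bool] = {}
    for url in set(URL_PATTERN.findall(model_output)):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

# Example: flag any cited link that does not resolve.
# for url, ok in check_links(answer_text).items():
#     if not ok:
#         print(f"Unverified citation: {url}")
```

Note that a URL that resolves can still be misattributed or misquoted, so this catches only the crudest fabrications; high-stakes use cases still need human review.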
OpenAI’s Search for Answers
OpenAI acknowledges the challenge, with spokesperson Niko Felix stating:
Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.
One mitigating approach may be leveraging o3’s web search capabilities, which could potentially improve accuracy by grounding responses in verifiable external information.
However, this introduces privacy considerations, as Enterprise customers might be reluctant to expose sensitive prompts to third-party search providers.
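A privacy-conscious alternative is to ground the model in material the caller supplies directly, so sensitive prompts never reach a search provider. A minimal sketch, again assuming the OpenAI Python SDK and Responses API; the instruction wording and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def grounded_answer(question: str, source_text: str) -> str:
    """Answer strictly from caller-supplied source text (no web search)."""
    response = client.responses.create(
        model="o3",  # assumption: o3 is enabled for your API account
        input=(
            "Answer the question using ONLY the source below. "
            "If the source does not contain the answer, say you cannot tell.\n\n"
            f"SOURCE:\n{source_text}\n\n"
            f"QUESTION: {question}"
        ),
    )
    return response.output_text
```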
The hallucination increase also raises questions about the sustainability of current approaches to AI reasoning.
If scaling up reasoning models inherently worsens hallucinations, as this release suggests, AI developers may need to fundamentally rethink their approach rather than simply applying more computing power to the problem.
Intelligence and Dishonesty Rising Together
This setback comes at a critical moment for the AI industry, which has increasingly pivoted toward reasoning models as a path forward.
There’s also the matter of transparency in benchmark reporting, highlighted by a separate controversy in which the publicly released o3 scored significantly lower on the FrontierMath benchmark (10%) than OpenAI had initially claimed in its December announcement.
This came after Mark Chen, OpenAI’s chief research officer, had stated during a livestream:
Today, all offerings out there have less than 2% [on FrontierMath]. We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.
These issues emerge as OpenAI expands its ambitions beyond AI models. Recent reports suggest the company is developing a social network built around ChatGPT’s image generation capabilities, potentially putting CEO Sam Altman in direct competition with social media titans like Elon Musk and Mark Zuckerberg.
With OpenAI’s recent $40 billion funding round boosting its valuation to approximately $300 billion, the company has resources to pursue multiple fronts, but the hallucination challenges with its flagship models suggest foundational technical hurdles remain unresolved.
For now, users should approach o3 and o4-mini with informed expectations: appreciating their expanded capabilities while maintaining appropriate skepticism about factual claims, especially in domains where accuracy is paramount.