OpenAI's latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the notoriously difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While the achievement in ARC-AGI is impressive, it still does not prove that the artificial general intelligence (AGI) code has been cracked.
Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system's ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require an understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles from just a few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.
ARC is designed so that it cannot be cheated by training models on millions of examples in the hope of covering all possible combinations of puzzles.
The benchmark is composed of a public training set that contains 400 simple examples. The training set is complemented by a public evaluation set that contains 400 more challenging puzzles as a means of assessing the generalizability of AI systems. The ARC-AGI Challenge also contains private and semi-private test sets of 100 puzzles each, which are not shared with the public. They are used to evaluate candidate AI systems without running the risk of leaking the data to the public and contaminating future systems with prior knowledge. Furthermore, the competition sets limits on the amount of compute participants can use to ensure that the puzzles are not solved through brute-force methods.
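For readers unfamiliar with the puzzle format, each ARC task is distributed as a small JSON record containing a handful of demonstration pairs and one or more test inputs, where every grid is a 2D array of color codes. The sketch below is only an illustration: the toy puzzle and file-loading helper are made up for this article, not taken from the actual dataset.

import json

# A minimal, illustrative ARC-style task: grids are 2D lists of color codes (0-9).
# The "train" pairs are the few demonstrations a solver sees; "test" holds the
# inputs it must generalize to. (Toy example, not an actual ARC puzzle.)
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]}  # the solver must predict the output grid
    ],
}

def load_task(path: str) -> dict:
    """Load one ARC task file (train/test pairs of input/output grids)."""
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    print(json.dumps(example_task, indent=2))

Because a solver sees only a few demonstration pairs per task, memorizing the public training set offers little help on the hidden evaluation puzzles.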
A breakthrough in solving novel tasks
o1-preview and o1 previously scored a maximum of 32% on ARC-AGI. Another approach, developed by researcher Jeremy Berman, used a hybrid method combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.
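The general idea behind such hybrid approaches can be illustrated with a short evolutionary loop: a language model proposes candidate transformation programs, a code interpreter runs them against the demonstration pairs, and the best-scoring candidates are mutated for the next round. The sketch below is a generic illustration under those assumptions; the propose and mutate callables stand in for LLM calls and are not Berman's actual code.

import random

def fitness(program, demos):
    """Fraction of demonstration pairs a candidate program reproduces exactly."""
    correct = 0
    for demo in demos:
        try:
            if program(demo["input"]) == demo["output"]:
                correct += 1
        except Exception:
            pass  # broken candidates simply score zero on this pair
    return correct / len(demos)

def evolve(propose, mutate, demos, population=20, generations=10):
    """Toy evolutionary search: propose candidate programs, score them on the
    demos, keep the best quarter, and fill the pool with mutated variants."""
    pool = [propose() for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda p: fitness(p, demos), reverse=True)
        if fitness(pool[0], demos) == 1.0:
            return pool[0]  # solves every demonstration pair
        survivors = pool[: population // 4]
        pool = survivors + [mutate(random.choice(survivors))
                            for _ in range(population - len(survivors))]
    pool.sort(key=lambda p: fitness(p, demos), reverse=True)
    return pool[0]

if __name__ == "__main__":
    # Tiny demo with hand-written candidate "programs" instead of LLM output.
    demos = [{"input": [[1, 2]], "output": [[2, 1]]}]
    flip = lambda grid: [row[::-1] for row in grid]  # a correct candidate
    noop = lambda grid: grid                         # an incorrect candidate
    best = evolve(propose=lambda: random.choice([flip, noop]),
                  mutate=lambda p: p, demos=demos)
    print(fitness(best, demos))  # 1.0 once the flip program is found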
In a blog post, ARC's creator François Chollet described o3's performance as “a surprising and significant step-function increase in AI capabilities, showing novel task adaptation capabilities never before seen in GPT-family models.”
It is important to note that these results could not have been achieved by simply applying more compute to previous generations of models. For context, it took models four years to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don't know much about o3's architecture, we can be confident that it is not orders of magnitude larger than its predecessors.
“This is not merely an incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
It is worth noting that o3's performance on ARC-AGI comes at a steep cost. In the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as the cost of inference continues to fall, we can expect these figures to become more reasonable.
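Taking the reported figures at face value, a quick back-of-the-envelope extrapolation shows why the high-compute configuration is impractical today. The numbers below simply scale the low-compute cost by the stated 172X multiplier and assume cost grows roughly linearly with compute, so treat them as rough estimates rather than OpenAI's actual pricing.

# Back-of-the-envelope extrapolation from the reported figures (rough estimate).
low_cost_per_puzzle = 20            # USD, upper end of the reported $17-$20 range
low_tokens_per_puzzle = 33_000_000  # ~33 million tokens in the low-compute setting
compute_multiplier = 172            # reported ratio of high- to low-compute budgets

high_cost_per_puzzle = low_cost_per_puzzle * compute_multiplier
high_tokens_per_puzzle = low_tokens_per_puzzle * compute_multiplier

print(f"High-compute cost per puzzle: ~${high_cost_per_puzzle:,}")                   # ~$3,440
print(f"High-compute tokens per puzzle: ~{high_tokens_per_puzzle / 1e9:.1f}B tokens") # ~5.7B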
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine those programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs, but they lack compositionality, which prevents them from figuring out puzzles that fall outside their training distribution.
Unfortunately, there is very little information about how o3 works under the hood, and scientists' opinions diverge here. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
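OpenAI has not confirmed Chollet's hypothesis, but the general pattern he describes, sampling many reasoning chains and letting a reward model pick the best one, can be illustrated with a minimal best-of-N sketch. The generate_cot and reward_model functions here are placeholders standing in for a reasoning model and its verifier, not OpenAI's actual components.

from typing import Callable, List, Tuple

def best_of_n_cot(
    prompt: str,
    generate_cot: Callable[[str], str],         # samples one chain-of-thought + answer
    reward_model: Callable[[str, str], float],  # scores a candidate solution
    n: int = 16,
) -> Tuple[str, float]:
    """Minimal best-of-N search: sample N reasoning chains for the prompt and
    keep the one the reward model scores highest. A stand-in for the richer
    search over CoT programs that Chollet hypothesizes o3 may perform."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        chain = generate_cot(prompt)
        candidates.append((chain, reward_model(prompt, chain)))
    return max(candidates, key=lambda c: c[1])

In practice, more elaborate variants search over partial chains or candidate programs rather than whole completions, but the trade-off is the same: more samples and more scoring passes at inference time in exchange for better answers.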
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 may actually be just forward passes from one language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”
On the same day, Denny Zhou from Google DeepMind's reasoning team called the combination of search and current reinforcement learning approaches a “dead end.”
“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g., MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.
While the details of how o3 reasons might seem insignificant in comparison to the breakthrough on ARC-AGI, they could very well define the next paradigm shift in training LLMs. There is currently a debate over whether scaling LLMs through more training data and compute has hit a wall. Whether test-time scaling depends on better training data or on different inference architectures could determine the next path forward.
Not AGI
The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet stresses that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI is not equivalent to achieving AGI, and in fact, I still don't consider o3 to be AGI,” he wrote. “o3 still fails at some very simple tasks, indicating fundamental differences with human intelligence.”
Furthermore, he points out that o3 cannot learn these skills autonomously and relies on external verifiers during inference and human-labeled reasoning chains during training.
Other scientists have pointed to flaws in OpenAI's reported results. For example, the model was fine-tuned on the ARC training set to achieve state-of-the-art results. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” scientist Melanie Mitchell wrote.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposed “seeing whether these systems can adapt to variants on specific tasks or to reasoning tasks using the same concepts, but in domains other than ARC.”
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even on a high-compute budget, while humans would be able to solve 95% of the puzzles without any training.
“You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet wrote.