Thanks for sharing, @ricpruss!
It looks a bit like other intrinsic reward approaches (like curiosity-driven learning; see the Large-Scale Study of Curiosity-Driven Learning, https://arxiv.org/pdf/1705.05363), but with a more explicit task setting. Definitely an interesting direction, as humans certainly don’t learn like supervised DNNs: they can set themselves goals and learn from this kind of self-supervised exploration.
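Very roughly, the curiosity-style intrinsic reward idea could be sketched like this (a toy stand-in, not the paper’s actual architecture: the linear “forward model” and 4-dimensional observations here are hypothetical, chosen just to show the prediction-error bonus shrinking as the model learns):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "forward model": a linear map the agent fits online to predict the
# next observation from the current one. Real curiosity-driven agents use
# learned neural dynamics models; this is only an illustration.
W = np.zeros((4, 4))
lr = 0.1

def intrinsic_reward(obs, next_obs):
    """Curiosity bonus = squared prediction error of the forward model."""
    pred = W @ obs
    return float(np.sum((pred - next_obs) ** 2))

def update_forward_model(obs, next_obs):
    """One gradient step on the squared prediction error."""
    global W
    err = (W @ obs) - next_obs
    W -= lr * np.outer(err, obs)

# Hypothetical environment dynamics the agent doesn't know.
true_A = np.eye(4) * 0.9

rewards = []
for _ in range(200):
    obs = rng.normal(size=4)
    nxt = true_A @ obs
    rewards.append(intrinsic_reward(obs, nxt))
    update_forward_model(obs, nxt)

# As familiar transitions become predictable, the bonus decays,
# steering exploration toward transitions the agent can't yet predict.
print(rewards[0] > rewards[-1])
```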
It doesn’t seem to contain many of Monty’s core principles (structured models, learning and inference through movement, a general repeatable computational unit), but it does look at one interesting aspect of learning without external supervision.
You wouldn’t be able to do this with Monty today, as Monty has not been used with language yet (I wrote a bit more on our thoughts on that here: Abstract Concept in Monty - #4 by vclay). However, the general concept of being able to set your own goals, using internal models to achieve them, and then using the outcomes to update those internal models is one of the ways Monty learns. For more details, you could look at the goal state generator in each LM and our documentation on model-based/hypothesis-driven policies: Policy.
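To make the “set your own goal, act toward it with an internal model, update the model from the outcome” loop concrete, here is a minimal toy sketch. Everything in it (the 4-state ring world, the `propose_goal`/`world_step` helpers) is hypothetical and invented for illustration; it is not Monty’s goal state generator or policy code:

```python
# Toy world: a ring of 4 states; actions move +1 or -1 (mod 4).
ACTIONS = (+1, -1)

def world_step(state, action):
    """Ground-truth dynamics the agent must discover."""
    return (state + action) % 4

model = {}  # the agent's internal model: (state, action) -> next_state

def propose_goal():
    """Self-generated goal: try a (state, action) the model hasn't seen."""
    for s in range(4):
        for a in ACTIONS:
            if (s, a) not in model:
                return (s, a)
    return None  # model is complete; nothing left to explore

state = 0
while (target := propose_goal()) is not None:
    goal_state, goal_action = target
    # Navigate to the goal state, preferring transitions the model
    # already knows; fall back to a default action when it doesn't.
    while state != goal_state:
        action = next(
            (a for a in ACTIONS if model.get((state, a)) == goal_state),
            ACTIONS[0],
        )
        nxt = world_step(state, action)
        model[(state, action)] = nxt  # update the model from experience
        state = nxt
    # Execute the goal transition itself and record the outcome.
    nxt = world_step(state, goal_action)
    model[(state, goal_action)] = nxt
    state = nxt

# The full transition model is learned with no external supervision.
print(len(model) == 8)
```

The point of the sketch is just the loop structure: goals come from gaps in the agent’s own model, and every action, whether exploratory or goal-directed, feeds back into that model.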