Absolute Zero: Reinforced Self-play Reasoning with Zero Data

This is the most adjacent to Monty’s sensorimotor-inspired loop that I have seen.
Absolute Zero Reasoner (AZR) is a system that self-evolves its training curriculum and reasoning ability by using a code executor both to validate proposed code reasoning tasks and to verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, the authors demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
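
To make the loop concrete, here is a minimal sketch of the idea in Python. This is illustrative, not the paper’s code: `propose_task` and `solve_task` are hypothetical stand-ins for the learned proposer/solver roles, and I’ve omitted the paper’s learnability reward that additionally steers the proposer toward tasks of intermediate difficulty.

```python
# Minimal sketch of the AZR self-play loop (illustrative, not the paper's code).
# One model plays two roles: PROPOSE a code-reasoning task, then SOLVE it.
# A Python executor is the single source of verifiable reward for both roles.

def run_program(program: str, x):
    """Execute a proposed single-argument function and return its output."""
    env = {}
    exec(program, env)        # fine in a sketch; the real system sandboxes this
    return env["f"](x)        # assumes the proposer names its function `f`

def self_play_step(model, task_buffer):
    # 1. PROPOSE: the model writes a program plus an input (a candidate task).
    program, x = model.propose_task(task_buffer)    # hypothetical API

    # 2. VALIDATE: the executor grounds the task -- it must actually run.
    try:
        y = run_program(program, x)
    except Exception:
        return 0.0                                  # invalid task, no reward

    # 3. SOLVE: the same model predicts the output from (program, x) alone.
    y_hat = model.solve_task(program, x)            # hypothetical API

    # 4. VERIFY: the executor's result is the ground truth for the reward.
    solver_reward = 1.0 if y_hat == y else 0.0
    task_buffer.append((program, x, y))             # the curriculum grows itself
    return solver_reward
```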

I wonder if something like this can be done with Monty.

Thanks for sharing @ricpruss
It looks a bit like other intrinsic-reward approaches (like curiosity-driven learning; see “Large-Scale Study of Curiosity-Driven Learning”, https://arxiv.org/pdf/1705.05363) but with more explicit task setting. Definitely an interesting direction, as humans certainly don’t learn like supervised DNNs: they can set themselves goals and learn from this kind of self-supervised exploration.
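
For reference, the heart of that curiosity approach fits in a few lines: the intrinsic reward is just the forward model’s prediction error, so transitions the agent can’t yet predict are the rewarding ones. A minimal sketch, where `encode` and `forward_model` stand in for the paper’s learned networks:

```python
import numpy as np

def curiosity_reward(encode, forward_model, state, action, next_state):
    """Intrinsic reward = error in predicting the next state's features.
    High error means the transition is novel/surprising, so it is rewarded."""
    predicted = forward_model(encode(state), action)
    actual = encode(next_state)
    return float(np.mean((predicted - actual) ** 2))
```
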
It doesn’t seem to contain many of Monty’s core principles (structured models, learning and inference through movement, a general repeatable computational unit), but it looks at one interesting aspect of learning without external supervision.
You wouldn’t be able to do this with Monty today, as Monty has not been used with language yet (I wrote a bit more on our thoughts on that here: Abstract Concept in Monty - #4 by vclay). However, the general concept of setting your own goals, using internal models to achieve them, and then using the outcome to update those internal models is one of the ways Monty learns. For more details, you could look at the goal state generator in each LM and our documentation on model-based/hypothesis-driven policies (Policy).
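
To caricature that loop in pseudocode (the names below are invented for illustration, not Monty’s actual API; see the goal state generator documentation for the real thing):

```python
# Invented names -- a caricature of hypothesis-driven sensing, not Monty's API.

def hypothesis_driven_step(lm, motor, sensor):
    # The learning module sets its own goal, e.g. "reach a pose that would
    # best disambiguate between my current object hypotheses".
    goal_state = lm.generate_goal_state()           # hypothetical

    # Internal models are used to act toward that goal...
    action = lm.plan_action_toward(goal_state)      # hypothetical
    observation = sensor.observe(motor.execute(action))

    # ...and the resulting observation updates the same internal models.
    lm.update_models(observation)
```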

It is an interesting paper, but I wonder about the “zero knowledge” descriptor.

First, it builds on the crystallized dataset already memorised by Llama 3.1; second, the three task structures (deduction, abduction, and induction) are hard-coded by the authors. The problem self-creation is human-bootstrapped, steered by hard-coded reward metrics and filters, and limited to single-argument functions.
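
To be concrete about the hard-coding: every validated (program, input, output) triplet is recycled into three fixed task formats, roughly like this (an illustrative sketch, not the authors’ implementation):

```python
# One verified triplet yields the paper's three fixed task formats.
program = "def f(x):\n    return x * 2 + 1"
x, y = 3, 7

deduction = {"given": (program, x), "predict": y}        # run f(x): what results?
abduction = {"given": (program, y), "predict": x}        # which input yields y?
induction = {"given": [(x, y)],     "predict": program}  # write f from examples

# Validity is checked by actually executing the single-argument function:
env = {}
exec(program, env)
assert env["f"](x) == y
```
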

Ultimately this is a fascinating Llama LLM fine-tuning mechanism, but I think it falls well short of the hype it is currently getting on YouTube and elsewhere.

Thanks Vivian, I read your piece on language and its lack of grounding. But is there some mechanism in Monty now that forms shallow hierarchies of reusable structure? And what are your thoughts on how abstractions get self-taught in Monty?

Hey @ricpruss, great questions! We are close to publishing a paper on how we think the brain learns those shallow hierarchies of reusable structures and uses them to model compositional objects. On the Monty implementation side, we already have the infrastructure set up for stacking LMs hierarchically and have run some preliminary experiments with two levels of hierarchy. Getting all the intricacies to work and demonstrating Monty’s ability to model compositional objects is something our team is actively working on, so hopefully we’ll have more concrete results and demos of this soon! Regarding language and abstractions, this is currently still only conceptual and something we have discussed in our research meetings.

@vclay,

That would be a pretty cool paper. Would you guys be willing to post something about it here once you publish?

Of course! We will definitely do that 🙂
