Hello, and thoughts about leveraging LLMs in Monty

Thanks for adding some thoughts on this, Sergey! I agree with you: in the long term, we would want some kind of mentor to teach Monty about the world. As noted earlier, this relates to the concept of joint attention, which is believed to be central to language acquisition in humans.

In terms of what form the tutor takes, I think there are a few options, although many core problems need to be solved before we can add this.

One aspect that will influence our approach is Monty’s efficiency of learning. In particular, Monty learns extremely quickly, much faster than LLMs, and potentially even faster than humans. For example, Monty is not necessarily constrained, as humans are, by reliance on the hippocampus for fast learning, and model sharing could easily be done in Monty, even though this is not possible in humans. This model sharing can happen across LMs (learning modules), as well as, like you say, between instances of Monty.
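To make the model-sharing idea concrete, here is a minimal sketch of two learner instances pooling their learned object models. The dict-of-models representation, the `share_models` function, and the conflict rule are all illustrative assumptions, not Monty's actual implementation.

```python
# Hypothetical sketch: two learner instances merge their learned object
# models by contributing entries the other lacks. The flat dict
# representation is an assumption for illustration only.
def share_models(models_a: dict, models_b: dict) -> dict:
    """Return a combined model store; on a name conflict, keep A's model."""
    combined = dict(models_b)   # start from B's models
    combined.update(models_a)   # A's entries override on conflict
    return combined

# Usage: instance A knows "mug", instance B knows "bowl" and its own "mug".
instance_a = {"mug": "graph_A_mug"}
instance_b = {"bowl": "graph_B_bowl", "mug": "graph_B_mug"}
shared = share_models(instance_a, instance_b)
# A's mug model wins the conflict; B contributes the bowl model.
```

In a real system the merge would presumably reconcile conflicting models rather than simply preferring one instance, but the point stands that transferring learned structure between instances is a straightforward copy, something with no analogue in human brains.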

As such, the more direct tutoring by humans that you mention might actually be quite possible! On the other hand, the main benefit of LLMs (being able to scale up) may not be necessary. In addition, I would be concerned about the risk of passing LLM biases on to Monty. Finally, it’s worth noting that the “environments” LLMs are generally good at labelling (static images) are quite different from those in which Monty learns (dynamic, interactive environments).

Another possibility to consider is a disembodied, environment-wide tutor in a virtual environment. For example, if Monty is learning in a video-game-style environment with virtual assets, then we might have ground-truth access to what those assets are. It might then be sufficient that, when Monty-the-child looks at something, a “disembodied voice” tells it what it is seeing, or a written label pops up next to the object. This could serve as a preliminary learning environment, before Monty interacts with the real world. It would be similar to how supervision currently works in Monty: we know which object is rendered and in what pose, so we can simply give Monty those ground-truth labels during supervised learning. The main difference would be feeding the label in via a sensory system that processes language.
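As a rough illustration of the "disembodied voice" idea, here is a sketch of a tutor that attaches the simulator's ground-truth label to an observation as it is passed to the learner. Every name here (`Observation`, `tutor_step`, the asset registry) is a hypothetical stand-in, not Monty's actual API.

```python
# Hypothetical sketch: a disembodied tutor in a virtual environment.
# The simulator knows which asset is rendered, so it can "speak" the
# label alongside the raw sensory observation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    """A sensory patch plus an optional language-channel label."""
    features: list                # e.g. depth/color features of the patch
    pose: tuple                   # sensor pose when the patch was sampled
    label: Optional[str] = None   # ground-truth label from the tutor, if any

def tutor_step(asset_id: str, asset_registry: dict, raw_obs: Observation) -> Observation:
    """Attach the environment's ground-truth label to the observation,
    mimicking a disembodied voice naming what the learner is looking at."""
    raw_obs.label = asset_registry.get(asset_id)
    return raw_obs

# Usage: the environment's asset registry supplies the supervision signal.
registry = {"asset_042": "coffee mug"}
obs = Observation(features=[0.1, 0.7], pose=(0.0, 0.0, 0.0))
obs = tutor_step("asset_042", registry, obs)
```

The key design point is that the label rides in on its own channel, so it could later be routed through a sensory system that processes language rather than being injected directly as supervision.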

Ultimately this is all a long way in the future, so I think the most appropriate solutions are hard to anticipate at this point. It’s definitely interesting to discuss though.
