Hello, and thoughts about leveraging LLMs in Monty

Hello ThousandBrains community!

To briefly introduce myself, I have been following the work of Jeff Hawkins and Numenta since I read ‘On Intelligence’ many years ago. I had the honor of Skyping with Jeff once when I was a young dev consultant, advising a project to try Numenta’s HTM (that was before Deep Learning was a thing).

15 years later, I am thinking about topics for a postdoc after my Ph.D. on the participatory design of telerobotic puppets. My experience in machine learning is limited (mostly using models, with some training and fine-tuning), but it seems that there is a dire need for more human-centered and sustainable approaches to creating AI.

In my vision, I imagine a participatory workshop where participants try to explain their deep thoughts about life to a ThousandBrains robot (or puppet), maybe teaching it how to behave in their culture. I realize that this is a stretch, even for a long-term project, but maybe the first step could be to try to leverage existing LLMs to fast-track Monty into understanding language. Specifically, I’ve been wondering (and, of course, discussing with LLMs) whether it’s possible to hook up some middle layer of a pre-trained transformer (maybe Llama?) and convert its output to Monty’s Cortical Messaging Protocol (CMP), so that Monty understands semantics but learns context and action predictions on its own. I have seen some discussions of similar ideas in this forum but didn’t find any concrete implementation suggestions.
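
To make the idea a bit more concrete, here is a very rough sketch of the kind of bridge I have in mind. To be clear, `CMPMessage` and `to_cmp` are names I made up for illustration, not Monty’s actual API; as I understand it, real CMP messages pair features with a pose, and what ‘pose’ should mean for a language feature is exactly the open question:

```python
# Hypothetical sketch: treat a transformer's middle-layer activation
# for one token as the "feature" part of a CMP-style message.
# CMPMessage and to_cmp are illustrative names, not Monty's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class CMPMessage:
    features: np.ndarray     # pooled hidden-state vector from the LLM
    location: np.ndarray     # 3D location (placeholder: text has no pose)
    orientation: np.ndarray  # rotation as a quaternion


def to_cmp(hidden_state: np.ndarray) -> CMPMessage:
    """Wrap one token's middle-layer activation as a CMP-style message."""
    return CMPMessage(
        features=hidden_state,
        location=np.zeros(3),                        # placeholder pose
        orientation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity quaternion
    )


message = to_cmp(np.random.randn(4096))  # e.g. one LLaMA token activation
```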

My hope is that this approach could be sustainable, since the heavy pre-training of LLMs is already done, and the Monty integration might even run on edge devices? Also, I feel that providing an interactive learning experience that uses language could showcase the main thing that Monty can do and LLMs cannot: actually learning concepts and behavior, not just maintaining a stack of text prompts as ‘memory’ (like in the film Memento).

Looking forward to your thoughts!
/Avner


Hi @avner.peled, welcome to the forums!

It sounds like an interesting project. I don’t want to put you off combining Monty with LLMs - there could certainly be scope for some interesting demos there - but here are a few thoughts that might be helpful.

Re. a system learning about someone’s culture through conversation: as you might be aware, deep learning systems (including LLMs) are notoriously bad at continual learning, and at learning from limited amounts of data. While there are approaches that can be taken with fine-tuning or in-context learning, having a system that learns dynamically and quickly from a conversation with a person is not something that current technology (Monty or deep learning) would support well. However, it is the kind of thing we think Monty would excel at in the long term.

On the note of long-term research and language, @vclay made a really nice post about how we think about language and Monty. As described there, we think it would be a mistake to try to shortcut language understanding with LLMs, which do not have grounded, structured language concepts, but rather have become statistical text-prediction systems. Of course, I don’t want to stop you from exploring whether Monty could be controlled with voice commands using an LLM as an interface - that definitely sounds like it could make for some interesting demos. However, in terms of a long term solution to the problem you’re describing, I think it will unfortunately have to wait until we have language capabilities in a thousand-brains system like Monty.

If you’d like to contribute to our roadmap (including during your post-doc) to get us there quicker, please do check out our How You Can Contribute Page. We’d love to have you involved.


Thank you, Niels, for the thorough answers and references.
I want to refine my question and get your feedback on the specific idea I presented, that is, using only a middle layer of an LLM as a gateway for Monty to learn a linguistic model.
For example, here is one paper analyzing the roles of different layers in LLaMA. What I was hoping to do is shortcut only the fundamental parts of language: basic syntax and the basic meanings of words and sentences (maybe also through audio). Then, Monty could develop real structural concepts and nuances that combine the LLM’s basic processing of speech with Monty’s higher-level spatial reasoning. Does that make sense?
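
To show what I mean in code, tapping a middle layer with Hugging Face transformers might look something like this (the model name and layer index are placeholders; which layer actually carries the ‘basic’ linguistic structure is what that paper probes):

```python
# Minimal sketch of extracting per-token hidden states from one
# middle layer of a pre-trained LLM. Model name and layer index
# are illustrative choices, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM would do here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()


def middle_layer_features(text: str, layer: int = 16) -> torch.Tensor:
    """Return the per-token hidden states of one middle layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple: the embedding layer plus one tensor per
    # transformer layer, each of shape (batch, seq_len, hidden_dim).
    return outputs.hidden_states[layer][0]


features = middle_layer_features("the cup is on the table")
print(features.shape)  # (num_tokens, hidden_dim), e.g. (7, 4096)
```

Each row of `features` could then be wrapped as a CMP-style message, as in my earlier sketch.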


No worries - yeah, I think that taking an approach like this could have a variety of interesting use cases.

To clarify: without grounding from the beginning, the system is unlikely to develop a robust mapping between language and sensorimotor concepts as diverse as “on”, “below”, “within”, “open”, etc. However, I can imagine how it could map between language and things like object labels.

If this enables you to explore an application you are interested in, like issuing a voice command to have a Monty robot retrieve a named object, then that is great! (Bearing in mind that manipulating the environment is also something that is still in our research pipeline.)

Overall I think your project sounds really interesting; I’m just concerned it might be a bit ambitious given the capabilities of both Monty and LLMs today. If you are interested in an interactive voice+Monty demo, I would suggest simplifying at least the first version to something like getting Monty to point to a named object. We are close to supporting multiple objects in Monty (see for example this policy that would need to be implemented first), so I think that would be doable.
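
To give a sense of how simple that first version could be, here is a rough sketch. None of this is a current Monty API: `monty.point_at` is hypothetical, and the label matching just uses off-the-shelf sentence embeddings:

```python
# Illustrative sketch of the simplified demo: a transcribed voice
# command is matched against object labels Monty already knows, and
# Monty is asked to point at the best match. monty.point_at is a
# hypothetical future API, not something Monty supports today.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")


def resolve_object(command: str, known_objects: list[str]) -> str:
    """Match a command against known object labels by embedding similarity."""
    command_emb = encoder.encode(command, convert_to_tensor=True)
    label_embs = encoder.encode(known_objects, convert_to_tensor=True)
    scores = util.cos_sim(command_emb, label_embs)[0]
    return known_objects[int(scores.argmax())]


# e.g. a transcript from an off-the-shelf speech-to-text model:
target = resolve_object("point to the mug", ["mug", "spoon", "bowl"])
# monty.point_at(target)  # hypothetical call into a future Monty API
print(target)  # "mug"
```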

Thank you, Niels.
The last line in that policy description, about creating a combined policy for reducing uncertainty, sounds to me like a promising path. It’s the intersection point with the Free Energy Principle and Active Inference (perhaps ‘reducing complexity’ should also be considered in the policy?).
Thinking about curiosity and sensorimotor concepts, I wonder if the model could be given the basic ability to ask. For example (a toy sketch follows the list):

  • “Monty, look up”
  • “Where is up?”
  • (manually move Monty’s camera to look up) “There.”
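
As a toy sketch of that loop (entirely hypothetical - Monty has no dialogue interface today, and every name below is made up):

```python
# Toy sketch of the "ability to ask": an unknown word triggers a
# question, and the mentor's demonstration (manually moving the
# camera) grounds the word in a motor action.
known_concepts: dict[str, str] = {}  # word -> grounded motor association


def execute(action: str) -> None:
    print(f"performing: {action}")


def wait_for_demonstration() -> str:
    # Stand-in for recording the mentor manually moving Monty's camera.
    return "tilt_camera_up"


def handle_command(word: str) -> None:
    if word in known_concepts:
        execute(known_concepts[word])
    else:
        print(f"Where is {word}?")       # Monty asks
        demo = wait_for_demonstration()  # mentor shows: "There."
        known_concepts[word] = demo      # the word is now grounded


handle_command("up")  # Monty asks, the mentor demonstrates
handle_command("up")  # now grounded: Monty looks up
```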

Oh, and congratulations to you, @vclay, and the others on publishing the paper! Looking forward to reading it.


Thank you!

And yeah, lots of interesting avenues to explore! What you describe fits well with the concept of joint attention in humans, which is thought to be critical for language learning.


Hi @avner.peled, @nleadholm (as well as @vclay), and everyone involved in this great conversation! I have also been thinking about the problem of language quite intensely and have come up with an idea that intersects with what has been discussed here and in other topics.
In my view, regarding language, we should first separate this problem into two areas: 1) language as a reference structure for the real world, and 2) language as a command interface. For now, I will leave the second area for future discussion/development and focus on the first one.

Regarding the first problem, I believe we need to somehow tie Monty’s inner representations to spoken and written words. I agree here with @vclay and @ak90 in the Abstract Concept in Monty thread, referencing the real human experience of internalising language through interaction with the real world. I consider this statement to be key:

To understand what words mean, we need to connect them to phenomena known to us. So, we need something that points certain sounds and words to those phenomena. In real life, it is usually other people, and since this development is most intense in childhood, those people are usually parents.
Now, let’s imagine (not for the first time, I guess :smile: ) that Monty is a child. So, we need to attach a kind of mentor to it. The child will show that person what it has in its memory, and the mentor will name that thing as it sounds or appears in written form.
Of course, it is impossible to do that literally and assign a real person to each instance of Monty. But we can achieve this goal by exploiting (sorry for the slight cynicism :wink: ) an LLM. Monty can show its representations to it.

Now, Monty can reliably assign the words that an LLM presents to its memorized objects, and the same principle will apply to sounds. Also, I think this process can be scaled if we use many instances of Monty and then merge their new experiences into a single model, which we might consider analogous to the foundation models used in transformers. That’s the idea in a nutshell. What do you think?
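
To sketch the mechanism (`render_object` and `ask_vlm` are stubs I invented, standing in for a Monty rendering hook and any off-the-shelf vision-language model):

```python
# Sketch of the "LLM as mentor" idea: Monty shows a rendering of each
# object in its memory to a vision-language model, which names it, and
# the name is bound to the object ID. All names here are hypothetical.
def render_object(object_model) -> bytes:
    # Stand-in: render Monty's learned object model to an image.
    return b"<png bytes>"


def ask_vlm(image: bytes, prompt: str) -> str:
    # Stand-in for a call to any vision-language model API.
    return "mug"


def label_memory(monty_objects: dict) -> dict:
    """Bind a word from the LLM 'mentor' to each memorized object."""
    labels = {}
    for object_id, object_model in monty_objects.items():
        image = render_object(object_model)
        labels[object_id] = ask_vlm(image, prompt="Name this object in one word.")
    return labels


print(label_memory({"obj_0": None}))  # {'obj_0': 'mug'}
```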


Thanks for adding some thoughts on this Sergey! I agree with you - in the long term, we would want some kind of mentor to teach Monty about the world. As noted earlier, this relates to the concept of joint attention, which is believed to be central to language acquisition in humans.

In terms of what form the tutor takes, I think there are a few options, although we have a lot of core things that need to be solved before we can add this.

One aspect that will influence our approach is Monty’s efficiency of learning. In particular, Monty learns extremely quickly - much more quickly than LLMs, and potentially even more quickly than humans. For example, we are not necessarily bottlenecked the way humans are by relying on the hippocampus for fast learning, and model sharing can be done easily in Monty, even though this is not possible in humans. This model sharing can happen across learning modules (LMs), as well as, like you say, between instances of Monty.
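
As a toy illustration of why sharing is easy: because Monty’s learned object models are explicit structures rather than entangled weights, merging the memories of two instances can, in the simplest case, be a dictionary union (names hypothetical, and conflict resolution is hand-waved here):

```python
# Toy sketch: merging object memories from several Monty instances.
# Real merging would need to reconcile the same object learned under
# different IDs; this naive version just keeps the first writer.
def merge_memories(instances: list[dict]) -> dict:
    merged: dict = {}
    for memory in instances:
        for object_id, model in memory.items():
            merged.setdefault(object_id, model)
    return merged


shared = merge_memories([{"mug": "model_A"}, {"bowl": "model_B"}])
print(shared)  # {'mug': 'model_A', 'bowl': 'model_B'}
```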

As such, the more direct tutoring by humans that you mention might actually be quite possible! On the other hand, the main benefit of LLMs (being able to scale up) may not be necessary. In addition, I would be concerned about the risk of passing LLM biases to Monty. Finally, it’s worth noting that the “environments” that LLMs are generally good at labelling things in (static images) are quite different from the environments within which Monty learns (dynamic, interactive environments).

Another possibility to consider is a disembodied, environment-wide tutor in a virtual environment. For example, if Monty is learning in a video game style environment with virtual assets, then we might have ground-truth access to what those assets are. It might then be sufficient that, when Monty-the-child looks at something, a “disembodied voice” tells it what it is seeing, or a label pops up in writing about the object. This could serve as a preliminary learning environment, before Monty interacts in the real world. This would be similar to how supervision currently works in Monty. We know which object is rendered in what pose so we can simply give Monty those ground truth labels during supervised learning. The main difference would be feeding the label in via a sensory system that processes language.
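
A rough sketch of what that could look like (the class and method names are hypothetical, not Monty’s actual supervision API):

```python
# Sketch of a "disembodied voice" tutor: in simulation we have
# ground-truth asset labels, so whenever Monty fixates an object the
# environment can emit its name through a language channel.
from typing import Optional


class DisembodiedTutor:
    def __init__(self, ground_truth: dict):
        self.ground_truth = ground_truth  # asset_id -> label

    def on_fixation(self, asset_id: int) -> Optional[str]:
        """Return the label for whatever asset Monty is looking at."""
        return self.ground_truth.get(asset_id)


tutor = DisembodiedTutor({42: "coffee mug"})
label = tutor.on_fixation(42)
# feed `label` into a language-processing sensory module alongside the
# visual observation, analogous to today's ground-truth supervision
print(label)  # "coffee mug"
```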

Ultimately this is all a long way in the future, so I think the most appropriate solutions are hard to anticipate at this point. It’s definitely interesting to discuss though.
