The road to a Generally-Intelligent Monty?

Note: This post is a bit dense and speculative; it’s more of an exploratory topic than a formal proposal!

I wanted to follow up on the question I asked during the meetup Q&A and flesh out the details more.

Many users have been talking about multiple aspects of what a full-scale Thousand Brains system would entail, especially in terms of the software stack and associative connections. One thing that’s been simmering in my mind since I discovered TBP is what a complete system would look like at the big-picture, architectural level in order to achieve all project goals.

The team hasn’t deeply explored that angle; there are some hints in Long-Term Goals and Principles, Capabilities of the System, and the TBP Future Applications video, but those focus more on the “what” (goals) than the “how” (components). So, I took some time to try imagining what TBP might end up evolving into over the long term.

The interactions between the human brain regions are very complex, but fortunately, the Sensor and Learning Modules paradigm that the team adopted makes it easier to break it down into tractable components.

For instance, Monty’s current modules are an abstract, partial implementation of the following pathways, roughly speaking:

  • Visual Sensor Modules = Retina → lateral geniculate nucleus (LGN) → V1 → V2 → V4;
  • Object Learning Modules = Inferior temporal gyrus (IT) / lateral occipital cortex (LOC) → perirhinal cortex (PRC);
  • Action Spaces, Policies, Goal States = rudimentary version of V5/MT → posterior parietal cortex (PPC) → posteromedial cortex (PMC), in addition to basal ganglia.

From my understanding, the team is also doing early work on the following modules:

  • 2D Vision Sensor and Learning Modules for shape and texture detection, learning to read from scratch, and decreasing reliance on depth data [V1, V2, V4, visual word form area];
  • Touch Sensor and Learning Modules for prehensile capabilities [parietal lobe];
  • Motor Modules for physical movement and proprioception [PMC, M1, cerebellum].

For TBP to achieve most of its goals and perhaps reach general intelligence (which the team has hinted at many times), I suspect more modules would be required for full “bootstrapping”, as I mentioned in the Q&A. If we apply the TBP paradigm to the entire cortex, its other main features might also have to be implemented as modules, such as (but not limited to):

  • Visual Motion Sensor and Learning Modules for live change detection and object behavior modeling [V3, V5/MT];
  • Audio Sensor and Learning Modules to learn spoken language from scratch [auditory cortex, Broca’s and Wernicke’s areas, temporal gyri];
  • Scene Learning Module for simultaneous localization and mapping (SLAM) [parahippocampal place area, retrosplenial & entorhinal cortices, place cells];
  • Social Module for affinity to human social cues and alignment [fusiform face area, extrastriate body area];
  • Attention Module for cross-module focus management [dorsal attention network];
  • Saliency Module for cross-module stimuli management [salience network];
  • Workspace Module for live data consolidation to address the binding problem [frontoparietal network];
  • Thinking Module for higher-level cognition, meta-association, simulation, and planning [frontal cortex, default mode network];
  • A distributed, compositional, hierarchical associative database of SDRs, as a form of associative memory [hippocampus];
  • Other optional modules, e.g. Digital Learning Modules to learn binary data, text encodings, and communication protocols from scratch, enabling text chat, agentic tool use, and machine interfacing.
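To make the associative-memory bullet a bit more concrete, here is a minimal toy sketch of overlap-based SDR association. Everything here (the `SDRMemory` class and its methods) is made up for illustration; it is not Monty or HTM code, just the core idea that SDRs can be linked and later recalled from a noisy cue by bit overlap:

```python
# Toy overlap-based SDR associative memory (hypothetical, not Monty/HTM code).
# An SDR is modeled as a set of active bit indices.

class SDRMemory:
    def __init__(self):
        self.links = []  # list of (cue_sdr, target_sdr) pairs

    def associate(self, cue, target):
        """Store a Hebbian-style link between two SDRs."""
        self.links.append((frozenset(cue), frozenset(target)))

    def recall(self, cue, min_overlap=2):
        """Return targets whose stored cue overlaps the query enough."""
        cue = set(cue)
        return [t for c, t in self.links if len(c & cue) >= min_overlap]

mem = SDRMemory()
mem.associate({1, 5, 9, 12}, {3, 7, 20})  # e.g. a "mug" SDR -> a "coffee" SDR
print(mem.recall({1, 5, 9, 40}))          # a noisy cue still recalls the target
```

A real implementation would of course be distributed and far more scalable, but the lookup-by-overlap property is the part that makes SDRs attractive for associative memory.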

In theory, this could all run within a single system process for minimal overhead (just like a videogame executable), with each module having one or more threads. Then again, maybe that’s a bit beyond Monty and leaning more into “Vernon Operating System” territory… :laughing:

While it may seem like yet another “hardcoded” cognitive architecture at first glance, all of these modules could operate with some form of cortical voting. It could thus be characterized as a targeted scaffolding of the human developmental priors we acquired through both genetic evolution and self-domestication. Such a broad scaffolding might prove necessary to truly reach the threshold of emergent general intelligence.

(Well, maybe not all of that, since born-blind and born-deaf people can still be very smart, so I suppose getting a “deaf” Monty to reach the threshold would be a good indicator of success)

The most interesting point here that can be acted upon in the shorter term is definitely the associative memory (hippocampus). I have a few ideas for proofs of concept about that, but I won’t dive into the subject here; it deserves its own separate thread.


The ideas described above are loosely based on the Global Neuronal Workspace model (aka Dehaene–Changeux model):

(source)

Within the realm of Thousand Brains theory, (A) can be imagined as Hierarchical Temporal Memory, (B) as Monty modules of course, and (C) as system-wide interactions of all the modules.

The exact intermodule pathways remain to be determined, but what I’m thinking is that the Workspace Module would be the centerpiece, where live SDRs from other modules are streamed to and associated together in Hebbian fashion. The Thinking Module would be latched on top of the Workspace Module, for meta-association capabilities, multi-SDR prediction, and among other things, affordances. That’s what I was alluding to when I talked about an “associative engine” in this post.
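As a toy illustration of the Hebbian association I have in mind (purely hypothetical, not Monty code): co-active bits from two module streams get their connection weights strengthened, so one stream’s pattern can later partially reactivate the other’s. A minimal sketch with dense vectors standing in for SDRs:

```python
# Hebbian co-activation between two hypothetical module streams.
import numpy as np

def hebbian_update(W, pre, post, lr=0.1):
    """Hebb's rule: strengthen weights between co-active bits."""
    return W + lr * np.outer(post, pre)

n = 8
W = np.zeros((n, n))
visual = np.zeros(n); visual[[1, 4]] = 1.0  # pattern from a visual LM
audio = np.zeros(n);  audio[[2, 6]] = 1.0   # pattern from an audio LM

for _ in range(10):                          # repeated co-activation in the workspace
    W = hebbian_update(W, visual, audio)

# The visual cue alone now partially reactivates the associated audio pattern:
print(W @ visual)
```

In a full system this would be sparse and bidirectional, but the principle is the same: the Workspace Module only needs co-occurrence statistics, not any knowledge of what the upstream modules represent.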

So far, no other project, system, or AI out there has properly conceptualized this kind of systemic approach. I think the closest would be BrainCog, but unfortunately it’s all deep learning with narrow-AI modules and computationally expensive biological neuron models.


The Question

My Q&A question was unfortunately cut short, and now that I’ve explained myself a bit, I ask the team again:

I’m aware this represents a monumental workload with plenty of unknowns to tackle, but I was wondering if this is the direction that TBP might be headed in, and if you’ve already started thinking about these kinds of longer-term technical requirements and the roadmap to gradually augment your framework toward a full-scale system?

Just trying to figure out where you draw the line in the sand. :wink:


The goal is to build a platform that everyone can use to create anything they can imagine that they can find or make the parts for. Like DIY 3D-printed yardwork robots that know what dogs are and how to clean up dog poop. I’m guessing that’s here by 2030. With different attachments it could do housekeeping and laundry, and it could swap them out on its own. It could know what an approaching hairball sounds like and what to do. That’s up to us, not the TBP.

In ’94 I found Linux 0.99pl7 because I was looking for a cheap Unix clone to play with, and my brother told me about the Internet and this “Linux thing” he’d heard about. In April I founded the Portland Linux Users Group, and in June I started a dialup ISP, literally in my garden shed, for $3,000, running on Linux 1.2.13. It soon became an IT consulting and custom software company.

That’s what’s about to happen. Someone working a telephone hardware tech support job where they have a Unix system support department wants a promotion, then finds Linux and creates a business instead. Substitute Monty-powered robots.


Monty is sorta like a partial collection of “rocket parts”, and future applications are “payloads” that need to reach “orbit”. What I tried doing here is to think about what remains to be built of the “rocket” before it becomes capable of reaching the “orbit” of generalization, if that makes sense.

I’m sure everyone has a different opinion about these things, and that’s okay, this is exactly what I’m trying to explore here. Also, I enjoy pushing ideas to the limit; sometimes, new pathways emerge from the thinking process.


The TBP won’t be creating applications. It will be enabling applications. Like Jeff said, there’s not that much left in the research; it’s implementation now. Solve motion in space and everything else falls into place.

In a very simple way the MIDI standard is analogous to Monty. It established a common protocol and method for electronic musical instruments from different manufacturers to communicate with each other that remained unchanged for decades until new features were recently added. Nearly the entire electronic music industry is based around that primitive communication protocol.


Yep, I know, everything I talked about would be at the platform level, in support of third-party applications / “payloads”. Think of it as a cortical robotics middleware. Richer core components translate to better developer appeal and thus broader adoption potential.


In the case of Monty, all the rich features will be in modules that follow the specifications of the CMP. There could be a calculator module that physically uses a calculator, visually reads the numbers and returns the results. Or does that virtually and returns the results. Monty doesn’t care about what happens in the module, only that it conforms to the protocol and plays nice with the other modules. The simplicity of the brain at the lowest level producing the complexity it has at scale is pretty amazing.

So, the post below partially answers my original question, i.e. the project might get there at some point, but it’s probably too early to approach the subject the way I did. That said, my focus is more on the overall heterarchy than on causal modeling, so the topic might eventually become relevant.


Hi @AgentRev ,

sorry for taking so long to respond to your post! Those are some good questions and thoughts. I think it may be useful to clarify a few potential misunderstandings first:

The analogies you mention here

    1. Visual Sensor Modules = Retina → lateral geniculate nucleus (LGN) → V1 → V2 → V4;
    2. Object Learning Modules = Inferior temporal gyrus (IT) / lateral occipital cortex (LOC) → perirhinal cortex (PRC);
    3. Action Spaces, Policies, Goal States = rudimentary version of V5/MT → posterior parietal cortex (PPC) → posteromedial cortex (PMC), in addition to basal ganglia.

are not exactly how we would think about the mapping of Monty onto the brain. Instead, everything that is in the neocortex would be learning modules (including columns in V1, V2, V4, as well as MT, IT, PPC,…). Sensor modules are whatever sensory processing happens before information goes to the learning modules. This processing does not require structured models and is much more basic. The thalamus (which includes LGN) is a place where all the information is initially routed through before it reaches the cortical columns. This is where we propose that the reference frame transformations happen (more details in this paper: [2507.05888] Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain particularly section 3.2). In Monty, this piece is included in each learning module, similarly to how the Thalamus is sometimes conceptually described as the seventh layer of the neocortex.

The whole premise that the TBT is built on is that no matter where in the neocortex a cortical column is located, it works the same way. The only thing that determines what kind of models it learns is where it gets its input from (I can talk in a lot more depth about this part if this is confusing). So it’s not like we plan to write custom learning modules or other components for the different areas of the neocortex. They will all work the same and just connect to different sensor modules and actuators.

Take the example of learning 2D surface models (one of our ongoing research projects): the LM is untouched, and we simply write a custom sensor module that extracts different features and 2D movement along the surface of objects.

Or learning behavior models (how objects move and change over time): one of the big insights we had last year was that we can use exactly the same mechanism we use for modeling static objects (plus a timing input). The LM that learns behaviors simply receives changes instead of static features as input. This is described in a bit more detail here: Object Behaviors

Here is one of our research meetings where we talk about how the exact same column structure could lead to learning various types of models simply by providing different input: https://youtu.be/nWRWb5c3zJk?si=gLBu8sjpCyqYdIB0

The hippocampus is outside the neocortex and does very fast associative learning (for example, quickly building up a map of the room you are in). One of our proposals is that the basic mechanism that cortical columns are built on first evolved in the entorhinal cortex and was then generalized into the cortical columns. We will have to see if some special modifications are necessary to model the role of the entorhinal cortex, but for now, our thinking is that we can just use a learning module that learns very quickly and forgets quickly again. Here is a research meeting video where we explore the role of the hippocampus a bit more: https://www.youtube.com/watch?v=tmljWGLgM70 (more videos on this topic will be released in the coming weeks).
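A crude sketch of that “learn fast, forget fast” idea (names are hypothetical, not Monty code): one-shot strengthening of a trace on observation, plus per-step decay that drops traces which aren’t refreshed.

```python
# Hypothetical fast-learning, fast-forgetting associative store.
class FastEpisodicStore:
    def __init__(self, decay=0.5, floor=0.1):
        self.strength = {}            # trace key -> association strength
        self.decay, self.floor = decay, floor

    def observe(self, key):
        # Rapid one-shot learning: a single observation creates a strong trace.
        self.strength[key] = self.strength.get(key, 0.0) + 1.0

    def tick(self):
        # Decay every trace; drop the ones that fade below the floor.
        self.strength = {k: v * self.decay for k, v in self.strength.items()
                         if v * self.decay >= self.floor}

store = FastEpisodicStore()
store.observe("chair-by-the-window")
for _ in range(4):
    store.tick()
print(store.strength)  # {} — the unrefreshed trace has fully faded
```

Repeatedly observed traces would stay strong, which is roughly the property you want for a quickly built, quickly discarded map of the current room.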

So in Monty, we don’t plan to add fundamentally new components to the system. Instead, we will need to add a few more tweaks to the general-purpose learning module algorithm (such as adding a temporal component to its models and using the models more for action outputs and planning). As @evanuno mentioned, the large range of applications would then be enabled by people using this general-purpose learning unit (LMs) and providing it with different data in different settings. The idea is that when someone has a custom type of sensor, all they have to do is write a new sensor module that translates the raw sensor data into the CMP (the language that any learning module can understand) and potentially a custom motor system that translates LM output goal states into the raw action format the specific actuators understand.
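To illustrate that division of labor with a toy sketch: the only custom piece is a thin sensor module wrapping the raw hardware. Note that the `CMPMessage` fields below are invented purely for illustration; they are NOT the actual CMP schema.

```python
# Toy illustration of the sensor-module idea: translate raw sensor data into
# a common message format that any learning module can consume.
# The CMPMessage fields are made up; this is not the real CMP schema.
from dataclasses import dataclass

@dataclass
class CMPMessage:
    location: tuple   # where the feature was sensed, in a common reference frame
    features: dict    # arbitrary sensed features, keyed by name

class ThermalSensorModule:
    """Hypothetical SM wrapping a raw thermal sensor."""
    def step(self, raw):
        # Translate device-specific raw readings into the common format.
        return CMPMessage(location=(raw["x"], raw["y"], raw["z"]),
                          features={"temperature_c": raw["t_raw"] / 10.0})

sm = ThermalSensorModule()
msg = sm.step({"x": 0.1, "y": 0.2, "z": 0.0, "t_raw": 215})
print(msg.features)  # {'temperature_c': 21.5}
```

The learning module never sees `t_raw` or the device units; it only sees features at locations, which is what makes the LMs general-purpose.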

In terms of fundamental capabilities that we will need to add to Monty, this video might be useful: https://youtu.be/Iap_sq1_BzE?si=sRAVtWKmm_qjfYp6 The TL;DR is that there are 3 more big items on our roadmap:

  • Modeling compositional objects (including scenes): We have a solid theory and built prototypes for this last year and are integrating them into Monty at the moment. Monty has a basic ability to learn compositional models already, and the first half of this year is dedicated to making it more robust and performant.
  • Modeling object behaviors: We developed a good theoretical understanding of how the brain could do this last year, as well as a concrete plan for how we would implement it in Monty. This year, we plan to prototype those ideas.
  • Learning causal models & producing actions to achieve goals: This is the biggest open question in our theory at the moment and our main focus for brainstorming in the next year.

There are a couple of other capabilities that will need to be added (like scale invariance, learning associative connections between LMs, attention, and object deformations), but we have some good ideas around those as well, and hope that they will fully clarify once the main 3 items above are also solved. We wouldn’t expect those to be new Monty components either, but instead tweaks to the learning modules or the information routing between them. Those topics frequently come up in our research meetings when we talk about the big open questions, and we’ve also made some progress on them in the past year.

I can go into more detail on how any of the bullet points you listed are planned to be tackled in Monty and how each LM (or specialized SMs/motor systems) should implement them if you like.

I hope this makes sense and maybe simplifies the big picture view for you a bit more.

Best wishes,

Viviane
