Modelling goal-defined behavior through model-based policy recursion

Hey there, this is going to be a little messy, but hopefully I can come back later to clean it up/fill in extra details.

Outlined below is a series of potential model-based policies, as well as how they might interact to drive motor-behavioral outputs. This topic will essentially be split into three sections: diagram one will show policy interactions; the section immediately following it will give a rough breakdown of the policy ideas; the second diagram will introduce something called the “Salience-Behavioral Chain,” which will then be used to highlight how the aforementioned policies interact and drive system behavior. So without further ado…

Modelling goal-defined behavior through model-based policy recursion:

Some light policy details:

Hierarchical Organization and Goal Decomposition

The Hierarchical Planning Policy, analogous to the prefrontal cortex, could initiate complex tasks and decompose them into sub-goals. These sub-goals could then be managed by lower-level LMs (or functional LM groups), which then deploy other more specific policies. For instance, a high-level goal of “making a cup of coffee” might be broken down into sub-goals like “find the coffee machine,” “add water,” and “add coffee,” each of which could be managed by different LMs employing different policies.

This policy sits at the top, initiating complex tasks and breaking them down into simpler goals. It also receives feedback from all the lower levels in the hierarchy.
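
To make this concrete, here is a minimal sketch of the decomposition idea in plain Python. Nothing here is Monty code; the `Goal`, `decompose`, and `execute` names are hypothetical and only illustrate how a high-level goal could be recursively handed off to lower-level handlers:

```python
# Hypothetical sketch of hierarchical goal decomposition (not Monty code).
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str
    sub_goals: list["Goal"] = field(default_factory=list)


def decompose(goal: Goal) -> list[Goal]:
    """Return a goal's sub-goals, or the goal itself if it is atomic."""
    return goal.sub_goals or [goal]


def execute(goal: Goal, depth: int = 0) -> None:
    """Recursively walk the goal tree, dispatching atomic goals.

    Each level of recursion stands in for a lower-level LM (or LM group)
    receiving a sub-goal from above and reporting its outcome back up.
    """
    for sub in decompose(goal):
        if sub is goal:
            print("  " * depth + f"executing atomic goal: {sub.name}")
        else:
            print("  " * depth + f"delegating sub-goal: {sub.name}")
            execute(sub, depth + 1)


coffee = Goal(
    "make a cup of coffee",
    [Goal("find the coffee machine"), Goal("add water"), Goal("add coffee")],
)
execute(coffee)
```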

Salience Modulation

The Salience Mapping Policy, similar to the salience network in the brain, could dynamically adjust the importance of different features based on the working context. This policy could modulate the influence of sensory inputs and allocate processing resources accordingly. It would, for example, modulate the information processed by the Hypothesis-Testing Policy, focusing on the features deemed most relevant for object recognition in a specific goal-defined task.
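
As a toy illustration of what that modulation might look like (the feature names and weights below are invented, not anything from the TBP codebase), salience can be treated as a goal-dependent re-weighting applied to sensory features before they reach downstream policies:

```python
# Hypothetical salience modulation: goal-dependent re-weighting of features.

# Feature channels coming off a sensor patch (names are made up).
features = {"color": 0.9, "curvature": 0.4, "texture": 0.7}

# Goal-dependent salience map: a shape-driven task might care more about
# curvature than color.
salience = {"color": 0.2, "curvature": 1.0, "texture": 0.5}

# Modulated features: what a downstream policy would actually receive.
modulated = {k: v * salience[k] for k, v in features.items()}

# Normalize so the weights act as a soft attention distribution.
total = sum(modulated.values())
attention = {k: v / total for k, v in modulated.items()}
print(attention)
```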

Predictive Coding and Error Minimization

The Predictive Coding Policy would continuously compare predicted sensory inputs with actual inputs, guiding model updates and exploration. This policy could work in conjunction with the Hypothesis-Testing Policy, driving actions to minimize those errors. For instance, mismatches between predicted and actual sensory input could lead the Hypothesis-Testing Policy to direct the sensor towards areas of an object that would reduce uncertainty.

In short, this policy continuously predicts future inputs and minimizes errors, guiding the system’s learning and exploration. It interacts with the default policy, using it to further refine its models, learning generalized representations of them. These lower-dimensional representations could then be used to “categorically align” models which once seemed disparate.
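
A minimal sketch of that error-minimization loop is below (illustrative only; the update rule, threshold, and the hand-off to hypothesis testing are assumptions made for the example, not a description of Monty's internals):

```python
# Minimal predictive-coding-style loop: keep a running prediction, nudge it
# toward each observation, and use the error size to gate exploration.
import numpy as np

rng = np.random.default_rng(0)
prediction = np.zeros(3)      # predicted feature vector
update_rate = 0.3             # how quickly predictions track observations
error_threshold = 0.5         # above this, hand control to hypothesis testing

for step in range(10):
    observation = np.array([1.0, 0.5, -0.2]) + rng.normal(0, 0.05, size=3)
    error = observation - prediction
    prediction += update_rate * error   # reduce prediction error over time

    if np.linalg.norm(error) > error_threshold:
        # Large mismatch: the Hypothesis-Testing Policy would be engaged to
        # move the sensor somewhere that reduces the uncertainty.
        print(f"step {step}: high error {np.linalg.norm(error):.2f} -> explore")
    else:
        print(f"step {step}: low error {np.linalg.norm(error):.2f} -> exploit")
```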

Central Executor Integration

The Hypothesis-testing Policy uses the system’s internal models to actively disambiguate the identity and pose of an object. It is designed to move a sensor to a location that will minimize uncertainty about a currently observed object. Its outputs are bifurcated in two directions: (1) a motor-behavior command signal, which is received by downstream modules; (2) an efferent signal relayed back up to the Predictive Coding Policy, so as to further refine its predictive functioning.
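
Here is a rough sketch of what “move to the location that minimizes uncertainty” could mean in practice. The object hypotheses, candidate locations, and expected beliefs are all made-up numbers; the point is only the entropy-minimizing action choice:

```python
# Hypothetical uncertainty-minimizing action selection over object hypotheses.
import numpy as np


def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


# Current belief over three object hypotheses (mug, bowl, can).
belief = np.array([0.5, 0.3, 0.2])
print("current entropy:", round(entropy(belief), 3))

# Assumed expected beliefs after sensing each candidate location.
expected_beliefs = {
    "handle": np.array([0.9, 0.05, 0.05]),  # a handle is very diagnostic of "mug"
    "rim":    np.array([0.45, 0.35, 0.2]),
    "base":   np.array([0.4, 0.3, 0.3]),
}

# Choose the movement whose expected outcome is least uncertain.
best = min(expected_beliefs, key=lambda loc: entropy(expected_beliefs[loc]))
print("move sensor to:", best)  # (1) motor-behavior command to downstream modules
# (2) an efference copy of this choice would also be relayed back up to the
# Predictive Coding Policy, per the description above.
```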

**The Hypothesis-testing Policy is arguably going to be one of the most important policies in the TBP framework, apart from perhaps the Regulatory Policy (which ultimately is what drives system behavior). As such, I want to spend some time exploring how the policy may use its mechanisms to navigate/explore not only three-dimensional physical space, but any space.

Applying the Hypothesis-testing Policy to abstract spaces (ex: linguistic space):

The TBP framework is designed to model any space where “features” can be extracted and where movement through that space yields new “observations.” Therefore, TBP can be applied to non-physical spaces by appropriately defining the “features” and “movements” observed within those spaces. As an example, let’s look at linguistic space (a rough sketch follows the list below).

  • Defining features within Linguistic space:

    • In linguistic space, features could be words, phrases, grammatical structures or perhaps even semantic concepts

    • A sentence then can be modelled as an object composed of these linguistic features, where the relative location (or position) of a word can be represented within a common frame. Think of how a limb might be represented within the broader reference frame of the body, or a hand in reference to that limb. The effect is similar.

  • Defining movement:

    • Movement through a linguistic space could involve transitions between related concepts or words, analogous to moving a sensor over a physical object. For example, moving from a general term to a specific instance, such as “fruit” to “apple,” could be considered such a movement.
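
To make the analogy a bit more tangible, here is a toy sketch of “features at locations” and “movements” in a linguistic space. The sentence model and the tiny concept graph are invented for the example and are not part of any existing TBP code:

```python
# Toy "features at locations" for a sentence, plus a concept graph that defines
# what a "movement" through linguistic space could look like.

sentence = "the red apple fell"
# Features at locations: word -> position within the sentence's reference frame.
sentence_model = {word: idx for idx, word in enumerate(sentence.split())}
print(sentence_model)   # {'the': 0, 'red': 1, 'apple': 2, 'fell': 3}

# A tiny concept graph: moving from a general term to a specific instance
# ("fruit" -> "apple") is one possible movement through the space.
concept_graph = {
    "fruit": ["apple", "pear"],
    "apple": ["granny smith"],
}


def move(current: str) -> list[str]:
    """Return the 'locations' reachable from the current concept."""
    return concept_graph.get(current, [])


print(move("fruit"))   # ['apple', 'pear'] -- new observations after the movement
```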

Default Mode Integration

The Default Policy, drawing a parallel to the default mode network, could operate during periods of low external sensory input, allowing an LM to consolidate models and prepare for future tasks. This policy would, in a sense, create a background for processing when external input is low, and could make way for other policies when needed. For example, after an episode has terminated due to an object being recognized, an LM could enter a resting state of manifold learning, allowing it to extract and reduce the dimensionality of its most recently-learned object(s).
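
As a loose illustration of that consolidation idea (nothing here reflects how Monty actually consolidates models; a plain SVD projection stands in for “manifold learning”):

```python
# Illustrative consolidation step: during a low-input period, compress recently
# learned object features into a lower-dimensional representation via SVD/PCA.
import numpy as np

rng = np.random.default_rng(1)
# Pretend these are feature vectors gathered while learning a recent object.
recent_features = rng.normal(size=(50, 8))

# Center the data and project onto the top-2 principal directions.
centered = recent_features - recent_features.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
low_dim = centered @ vt[:2].T

print(low_dim.shape)   # (50, 2): a compact representation kept for later alignment
```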

Modulation of Learning Policies

The Regulatory Policy could influence the entire system by associating sensory experiences with goal-state values. This could modify goal-directed behavior, giving priority to sensory inputs or internal representations associated with specific goal-defined significance. In a sense, this policy would modulate all other policies by changing their underlying states.

**This policy may also possess the ability to migrate consistently effective motor-behavioral responses into “model-free” storage. This will likely be the most complex policy to engineer.

My intuition wants to compare ML learning rates and regularization to a nervous system’s glutamatergic and GABAergic systems, where regularization is analogous to GABA (regularization pulls weights toward 0; GABA pulls neuronal behavior toward inhibition), and learning rates are analogous to glutamate-based systems (learning rates scale how strongly each update pushes the weights; Glu excites). The delta between the two then represents the system’s total level of “stress.” This compressive push-pull interaction could be used as an observable metric which helps drive system behavior.
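
A bare-bones version of that intuition, just to show the kind of observable metric I mean (the two drive values are arbitrary stand-ins, not quantities Monty exposes):

```python
# Sketch of the "stress" intuition above: an excitatory drive (learning-rate /
# Glu analogue) versus an inhibitory drive (regularization / GABA analogue),
# with their gap exposed as a signal the Regulatory Policy could act on.
excitatory_drive = 0.8   # how strongly the system is being pushed to update
inhibitory_drive = 0.3   # how strongly updates are being damped

stress = abs(excitatory_drive - inhibitory_drive)
print(f"system stress: {stress:.2f}")

# A regulatory rule might then, for example, dampen behavior when stress is high.
if stress > 0.4:
    print("high stress -> inhibit current behavior, consider new approaches")
else:
    print("stress within bounds -> sustain current behavior")
```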

The Salience-Behavioral Chain:
The above policies interact with one another, and with their shared environment, in a cyclical fashion. To help visualize this I would like to introduce something called the “Salience-Behavioral Chain” (SBC). The SBC is a looping process that is initialized by the inception of an allocentric stimulus. Typically, the chain follows a sequence of Stimulus → Salience Detection → Attention → Response Selection → Behavioral Execution. Each rotation in the chain is called a step.

With respect to our original policy diagram, we can imagine a single step to look something like this (a rough pseudocode sketch follows the list):

  1. The Regulatory Policy receives initial sensory input and notices something in the environment that it wants. This desire then generates an end-goal of reaching said object. The system’s current position is a known value (a starting position), and the desired object is the desired end state. This goal information is propagated system-wide.
    [Following along in the SBC, this would be steps 0 - 1]

  2. The Hierarchical Planning Policy receives this goal state and decomposes it into potential sub-goals. These sub-goals are then passed on to the Salience Mapping Policy.
    [Steps 1 - 2]

  3. The Salience Mapping Policy then further decomposes the goal states so as to establish a rank ordering of states. In a biological context we can imagine that the system has so far provided not only salience (what’s important), but valence as well (how important). Personally, I like to imagine that we’ve established a kind of layered state-space, in which a series of concentric rings encircle the desired end-goal, which sits at the center of that space. Each concentric ring is a step within the SBC; that is, a potential behavioral response which may or may not drive us closer to our ultimate goal.
    [Again, steps 1 - 2]

  4. In any case, the new desired features are passed along to the Predictive Coding Policy, which will then engage the Hypothesis-testing Policy. Ordinarily, the Predictive Coding Policy collaborates with the Default Policy to generalize its own internal models. However, it will inhibit this functioning in favor of hypothesis testing at the behest of top-down stimulation.
    [Steps 2 - 3]

  5. The Hypothesis-testing Policy will then execute a behavioral-motor command based on the top-down information provided to it. This command may be decomposed further by lower-level learning modules, or it may signal directly to a motor output path, driving movement.
    [Steps 3 - 4]

  6. This motor command passes outbound through the Regulatory Policy to the motor system. The outbound signal is compared against the initially generated goal state, and the delta between the two helps inform the system of future goal states. If the delta between the desired state and the current (updated) state diminishes, then we can be considered “closer” to our desired end-goal, and the next step in the SBC is encouraged/sustained. However, should the delta remain as it was, or even increase, then the behavior should be inhibited, and new approaches considered.
    [Outputs to environment, leading to next step in the chain]
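
Here is the pseudocode-style sketch promised above: one pass around the SBC, with the delta-based gating from step 6. Every quantity and update rule is a stand-in chosen for the example (the “motor command” is just a noisy step toward the goal), not real Monty code:

```python
# One toy pass around the Salience-Behavioral Chain (steps 1-6 above).
import numpy as np

rng = np.random.default_rng(2)
goal_state = np.array([5.0, 5.0])       # desired end state (step 1)
current_state = np.array([0.0, 0.0])    # known starting position
prev_delta = np.linalg.norm(goal_state - current_state)

for step in range(10):
    # Steps 2-4: decompose the goal, rank sub-goals, pick a motor command.
    # Here the "command" is simply a small move toward the goal plus noise.
    command = 0.3 * (goal_state - current_state) + rng.normal(0, 0.1, 2)

    # Steps 5-6: execute, then compare the updated state against the goal.
    current_state = current_state + command
    delta = np.linalg.norm(goal_state - current_state)

    if delta < prev_delta:
        print(f"step {step}: delta shrank to {delta:.2f} -> sustain behavior")
    else:
        print(f"step {step}: delta did not shrink -> inhibit, try a new approach")
    prev_delta = delta
```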

Alrighty, that’s it for now. Like I said, this was a little rushed, but I plan on revisiting the post later today/this week to expand some of my thoughts. But anyways, I’d love to hear what the rest of you guys think. Be sure to let me know your thoughts, concerns, and critiques. Until then, have a good day! :slight_smile:

Thank you @HumbleTraveller. While I haven’t had a chance to look through your entire proposal, I thought it would be helpful to paraphrase the approach we are currently taking. In particular, we are seeking to find a universal algorithm that is reused by cortical columns across the brain, from primary sensory cortex all the way up to and including prefrontal cortex. One familiar aspect of this is that we believe cortical columns throughout the brain are each modeling entire “objects,” that is, discrete, structured entities composed of other objects. These could be everything from the physical model of a coffee mug to the conceptual model of how you plan a day.

As part of this, we believe that every learning module contains model-based policies, and is able to generate goal-states based on the objects (i.e. models) it knows about. As such, we don’t think there will be a single part of the brain responsible for model-based policies (like PFC or motor cortex), but rather that these will be found throughout the brain. This is why a Goal-State Generator (GSG) exists within each Learning Module, where the GSG may map onto layer 5 in a cortical column, although that is speculative. This is also an important aspect of how complex tasks like making coffee can be broken down by learning modules that know about different objects (day planning, kitchen layout, coffee machines, power buttons, etc.), where each learning module can be sequentially recruited when necessary.

I can recommend checking out our recently posted videos on compositional policies if you haven’t already seen them:
Part 1
Part 2
Part 3

Hope that makes sense and thanks for your ongoing interest in the Thousand Brains Project!

Hey there @nleadholm, thank you for this!

You guys had begun posting those compositional policy videos a day or two after I posted this. I’m slowly making my way through them now and will revise what I have here shortly thereafter.

That said, I do believe I’m adhering to your design principles. None of these model-based action policies would be specific to any one type of LM. Rather, all LMs would be capable of executing all of the policies described above.

For instance, in the middle section of the policy flow chart, we see the Hypothesis-testing Policy. Obviously this policy is something already usable by all LMs, regardless of their “location.” The same applies to the other policies.

I’m currently trying to work through how the LMs might coordinate their efforts, however. My current thinking is to look at oscillatory coupling for inspiration. With this in mind, the LMs could synchronize policy executions to specific spatiotemporal windows. For example, global ‘goal-state’ information might be broadcast/updated within the opening tick of a “Delta Window,” wherein further work can be conducted by other “lower-level” policies. For example (a rough scheduling sketch follows the list):

  • Hierarchical planning/Salience mapping policies might be executed within the opening of a theta window.
  • Hypothesis-testing might be conducted with gamma windows, et cetera.
  • All LM timekeeping would then be sync’d to the processor’s HW clock cycles, allowing coordinated behavior to occur.
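
Here is the rough scheduling sketch mentioned above. The window lengths, the policies bound to each window, and the idea of keying everything off a simulated clock tick are all assumptions for illustration, not anything implemented in Monty:

```python
# Toy oscillatory scheduler: each simulated clock tick only runs the policies
# whose (nested) time window opens on that tick. Window lengths are arbitrary.
WINDOWS = {
    "delta": 16,   # broadcast/update global goal-state information
    "theta": 8,    # hierarchical planning / salience mapping
    "gamma": 2,    # hypothesis testing
}

POLICIES_BY_WINDOW = {
    "delta": ["broadcast goal state"],
    "theta": ["hierarchical planning", "salience mapping"],
    "gamma": ["hypothesis testing"],
}

for tick in range(16):   # stand-in for the processor's HW clock
    for window, period in WINDOWS.items():
        if tick % period == 0:
            for policy in POLICIES_BY_WINDOW[window]:
                print(f"tick {tick:2d} [{window}]: run {policy}")
```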

That said, I do have a question I was hoping you could help me with. I was thinking of modelling GABA-Glu dynamics into the Regulatory policy (what you guys might describe as the GSG). I know Monty doesn’t use backprop, but I was curious, does it use traditional learning rate techniques and/or regularization?


Ah ok, thank you for clarifying that - I think it would be helpful in that case to update the first figure to make it clearer that the model-based policies you are describing exist within any given learning module, rather than interfacing with them. And I think it is interesting to think about how different policies can work together, so it’s cool you are thinking about that.

Re. regularization of learning, Monty learns via Hebbian-like associative weights. Currently, Monty just learns one-shot and accumulates information without bound. However, we will soon be making use of the GridObjectModel (found in our code here). This is a way of learning objects that gradually accumulates information and refines the internal models. Once we have tested how it performs, I imagine we might experiment with hyper-parameters that have similar effects to learning rates and regularization.


I would have thought that was implied, just given Monty’s broader principles. But I see your point. To be honest, I probably shouldn’t have made a post on this quite yet; it was all just fresh in my mind and I wanted to get it out somewhere. I’ll revise that first diagram like you suggested.

Re. We will soon be making use of the GridObjectModel
Nice! I like the move to the voxel-based system. Looking at it now, everything appears to be in Cartesian-grid format. Have you guys considered using a hexagonal grid system?

I know that reworking the code would be a bit of a pain (I assume all your sensors output in Cartesian coordinates), but hex coordinates would net you quite a few nice benefits: more uniform distances between grid spaces, more efficient sampling, et cetera. Plus, it’s more biologically aligned.


That’s a cool idea about using hexagonal grid systems! We currently have a lot of other tasks on our roadmap that we want to tackle first but I will definitely keep that in the back of my mind.
We actually regularly talk about space representations in the brain (see for instance this recent research meeting: https://www.youtube.com/watch?v=zRRazfFstvY) and whether we should make adjustments to the way space is represented in Monty. We started with Cartesian coordinates since they are the easiest to visualize and debug, and operations on that space are very well suited to current hardware and software implementations. It’s served us really well so far to iron out a lot of the other aspects of the Monty framework, but we may eventually transition to something more akin to how grid cells represent space if we notice that this gives us significant advantages.
If you have some concrete ideas, you could play around with creating a custom ObjectModel class, similar to the GridObjectModel. You can then just switch it out in the existing experiments and see how well it does.


Yeah that’s a cool idea @HumbleTraveller about using hexagons for the voxel system! Since we are tiling 3D space, we might have to use something like a truncated octahedron, unless there are other approaches that you had in mind? Stacking hexagonal prisms would force us to make a decision about treating one dimension differently from others, although this might not be that different from entorhinal grid cells which have a 2D bias. As Viviane says it would be really cool to see one of these in action if you ever take a stab at implementing it.


Drat. My biggest weakness: Actual concrete work! :stuck_out_tongue:

In all seriousness, I was thinking of spinning up my old home lab this weekend and messing around with some Monty stuff. Perhaps I’ll tinker with the spatial coordinate system.

I’m not sure that you would need to use truncated octahedra, though I could be missing something. Hexagonal latticing actually translates into x, y, z coordinate space pretty cleanly, since it naturally operates on a three-axis system.

Instead of relying on x, y, z coordinates, you just use q, r, s (though they functionally represent the same thing).

I guess what I’m trying to say is that regular hexagons can handle all your typical vector operations pretty nicely, even in a three dimensional space. So a more complicated tiling method may be overkill.

Heck, if you constrain the hex-grid spaces such that ‘q + r + s = 0,’ you can even do 2D array storage (via an axial coordinate system). There was an amazing (albeit older) guide outlining everything from pathfinding and nearest neighbors to line drawing and distance calculation. You can find it here: Hexagonal Grids
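
For reference, here is a minimal cube-coordinate sketch following the conventions in that guide (the `Hex` class and `hex_distance` are just the standard textbook formulation, not an ObjectModel implementation):

```python
# Cube-coordinate hexagons: each cell is (q, r, s) with q + r + s == 0, and
# distance is half the L1 norm of the componentwise difference.
from dataclasses import dataclass


@dataclass(frozen=True)
class Hex:
    q: int
    r: int
    s: int

    def __post_init__(self):
        assert self.q + self.r + self.s == 0, "cube coordinates must sum to 0"


def hex_distance(a: Hex, b: Hex) -> int:
    return (abs(a.q - b.q) + abs(a.r - b.r) + abs(a.s - b.s)) // 2


# Axial storage: since s = -q - r, a 2D array indexed by (q, r) is sufficient.
a = Hex(0, 0, 0)
b = Hex(2, -3, 1)
print(hex_distance(a, b))   # 3
```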

Thanks for sharing that page, those are some cool demos. And that’s a really interesting possibility to think about. Please do let us know if you look into this more, it would be awesome to see how it might improve Monty’s coordinate system.
