Hey there, this is going to be a little messy, but hopefully I can come back later to clean it up/fill in extra details.
Outlined below is a series of potential model-based policies, as well as how they might interact to drive forward motor-behavioral outputs. This topic will essentially be split into three sections: diagram one will show policy interactions; the section immediately following it will give a rough breakdown of the policy ideas; the second diagram will introduce something called the "Salience-Behavioral Chain," which will then be used to highlight how the aforementioned policies interact and drive system behavior. So without further ado...
Modelling goal-defined behavior through model-based policy recursion:
Some light policy details:
Hierarchical Organization and Goal Decomposition
The Hierarchical Planning Policy, analogous to the prefrontal cortex, could initiate complex tasks and decompose them into sub-goals. These sub-goals could then be managed by lower-level LMs (or functional LM groups), which then deploy other, more specific policies. For instance, a high-level goal of "making a cup of coffee" might be broken down into sub-goals like "find the coffee machine," "add water," and "add coffee," each of which could be managed by different LMs employing different policies.
This policy sits at the top, initiating complex tasks and breaking them down into simpler goals. It also receives feedback from all the lower levels in the hierarchy.
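To make that a bit more concrete, here is a tiny sketch of what I imagine the decomposition/dispatch could look like. Everything here (the goal names, the `SUBGOALS` table, the `LearningModule` stand-in) is my own illustrative assumption, not anything from the actual framework:

```python
# Hypothetical sketch of hierarchical goal decomposition.
SUBGOALS = {
    "make_coffee": ["find_coffee_machine", "add_water", "add_coffee"],
}

class LearningModule:
    """Stand-in for a lower-level LM that handles one sub-goal."""
    def __init__(self, name):
        self.name = name

    def pursue(self, subgoal):
        print(f"{self.name}: pursuing sub-goal '{subgoal}'")
        return "done"  # feedback that flows back up the hierarchy

def hierarchical_planning_policy(goal, modules):
    """Decompose a high-level goal and fan the pieces out to lower-level LMs."""
    subgoals = SUBGOALS.get(goal, [goal])
    return [lm.pursue(sg) for sg, lm in zip(subgoals, modules)]

modules = [LearningModule(f"LM-{i}") for i in range(3)]
print(hierarchical_planning_policy("make_coffee", modules))
```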
Salience Modulation
The Salience Mapping Policy, similar to the salience network in the brain, could dynamically adjust the importance of different features based on the working context. This policy could modulate the influence of sensory inputs and dynamically allocate processing resources. This would, for example, modulate the information processed by the Hypothesis-Testing Policy, focusing on the features deemed most relevant for object recognition within a specific goal-defined task.
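As a rough, purely illustrative sketch (the `relevance` table, the feature dimensions, and the goal names are all made up by me), salience modulation could be as simple as re-weighting a feature vector based on the current goal context:

```python
import numpy as np

def salience_mapping_policy(features, goal_context):
    """Re-weight sensory features according to how relevant they are to the goal."""
    # Hypothetical relevance table: which feature dimensions matter for which task.
    relevance = {
        "recognize_mug": np.array([1.0, 0.2, 0.8]),  # e.g. curvature, color, surface normal
        "grasp_mug":     np.array([0.3, 0.1, 1.0]),
    }
    weights = relevance.get(goal_context, np.ones_like(features))
    return features * weights  # down-weighted features get fewer processing resources

observed = np.array([0.9, 0.5, 0.7])
print(salience_mapping_policy(observed, "recognize_mug"))
```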
Predictive Coding and Error Minimization
The Predictive Coding Policy would continuously compare predicted sensory inputs with actual inputs, guiding model updates and exploration. This policy could work in conjunction with the Hypothesis-Testing Policy, driving actions that minimize prediction errors. For instance, mismatches between predicted and actual sensory input could lead the Hypothesis-Testing Policy to direct the sensor towards areas of an object that would reduce uncertainty.
In short, this policy continuously predicts future inputs and minimizes errors, guiding the system's learning and exploration. It interacts with the Default Policy, using it to further refine its models, learning generalized representations of them. These lower-dimensional representations could then be used to "categorically align" models which once seemed disparate.
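A minimal sketch of the predict/compare/update loop, assuming the simplest possible generative model (the internal state itself is the prediction); the function name and learning rate are my own placeholders:

```python
import numpy as np

def predictive_coding_step(model_state, observation, learning_rate=0.1):
    """One cycle of predict -> compare -> update.

    The prediction error both updates the model and can be handed to the
    Hypothesis-Testing Policy to decide where to look next.
    """
    prediction = model_state                   # simplest possible generative model
    error = observation - prediction           # prediction error
    model_state = model_state + learning_rate * error
    return model_state, np.linalg.norm(error)  # error magnitude = remaining uncertainty

state = np.zeros(3)
for obs in [np.array([1.0, 0.5, 0.0])] * 5:
    state, err = predictive_coding_step(state, obs)
    print(f"error={err:.3f}")
```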
Central Executor Integration
The Hypothesis-testing Policy uses the system's internal models to actively disambiguate the identity and pose of an object. It is designed to move a sensor to a location that will minimize uncertainty about a currently observed object. Its outputs are bifurcated in two directions: (1) it outputs a motor-behavior command signal, which is then received by downstream modules; (2) an efferent signal is relayed back up to the Predictive Coding Policy, so as to further refine its predictive functioning.
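Here is one way I could imagine the "move to minimize uncertainty" part being sketched out: score each reachable sensor location by how much the current object hypotheses disagree there, and move to the most informative one. The hypothesis format, the variance-based disagreement score, and all names below are assumptions on my part:

```python
import numpy as np

def hypothesis_testing_policy(candidate_models, reachable_locations):
    """Pick the sensor location where the current object hypotheses disagree most.

    `candidate_models` maps object ids to functions that predict the feature
    expected at a location; the location with the highest variance across
    those predictions is the most informative place to move the sensor next.
    """
    def disagreement(loc):
        predictions = [predict(loc) for predict in candidate_models.values()]
        return np.var(predictions)

    best = max(reachable_locations, key=disagreement)
    motor_command = ("move_sensor_to", best)  # (1) downstream motor-behavior signal
    efference_copy = best                     # (2) relayed back to the Predictive Coding Policy
    return motor_command, efference_copy

# Two toy hypotheses that only differ at location 2.0:
models = {"mug": lambda x: 1.0 if x > 1.5 else 0.0,
          "bowl": lambda x: 0.0}
print(hypothesis_testing_policy(models, [0.5, 1.0, 2.0]))
```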
**The Hypothesis-testing Policy is arguably going to be one of the most important policies in the TBP framework, apart from perhaps the Regulatory Policy (which ultimately is what drives system behavior). As such, I want to spend some time exploring how the policy may use its mechanisms to navigate/explore not only three-dimensional physical space, but any space.
Applying the Hypothesis-testing Policy to abstract spaces (ex: linguistic space):
The TBP framework is designed to model any space where "features" can be extracted and where movement through that space yields new "observations." Therefore, TBP can be applied to non-physical spaces by appropriately defining these "features" and "movements" as observed within those spaces. As an example, let's look at linguistic space.
Defining features within linguistic space:

- In linguistic space, features could be words, phrases, grammatical structures, or perhaps even semantic concepts.
- A sentence can then be modelled as an object composed of these linguistic features, where the relative location (or position) of a word is represented within a common frame. Think of how a limb might be represented within the broader reference frame of the body, or a hand in reference to that limb. The effect is similar.
Defining movement:
- Movement through linguistic space could involve transitions between related concepts or words, analogous to moving a sensor over a physical object. For example, moving from a general term to a specific instance, such as "fruit" to "apple," could be considered such a movement. A rough sketch of both ideas is given below.
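Here is the promised sketch: a sentence treated as an "object" whose features sit at locations in a shared reference frame, plus a toy "movement" from a general term to a specific one. The concept table and every name below are invented purely for illustration:

```python
# Illustrative only: features of a sentence and their locations within
# the sentence's reference frame (here, just word index).
sentence = {
    "the":   0,
    "fruit": 1,
    "is":    2,
    "ripe":  3,
}

# A toy concept hierarchy standing in for semantic structure.
SPECIALIZATIONS = {"fruit": ["apple", "banana"]}

def move(current_feature):
    """'Move the sensor' from a general term to a specific instance of it."""
    options = SPECIALIZATIONS.get(current_feature, [])
    return options[0] if options else current_feature

print(sentence["fruit"])  # -> 1: where the feature sits in the sentence frame
print(move("fruit"))      # -> "apple": a new observation in linguistic space
```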
Default Mode Integration
The Default Policy, drawing a parallel to the default mode network, could operate during periods of low external sensory input, allowing an LM to consolidate models and prepare for future tasks. This policy would, in a sense, create a background for processing when external input is low, and could make way for other policies when needed. For example, after an episode has terminated due to an object being recognized, an LM could enter a resting state of manifold learning, allowing it to extract and reduce the dimensionality of its most recently-learned object(s).
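As a stand-in for that resting-state consolidation, here is a toy sketch that just projects recently stored feature vectors onto their top principal components (nothing more sophisticated than an SVD; the function name and shapes are my own assumptions):

```python
import numpy as np

def default_policy_consolidation(recent_object_features, n_components=2):
    """During low external input, compress recently learned object features.

    A crude stand-in for 'manifold learning': project the stored feature
    vectors onto their top principal components to get lower-dimensional
    representations that could later be categorically aligned across objects.
    """
    X = recent_object_features - recent_object_features.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T  # reduced representation of each stored feature

stored = np.random.default_rng(0).normal(size=(20, 10))  # 20 features, 10 dims each
print(default_policy_consolidation(stored).shape)        # -> (20, 2)
```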
Modulation of Learning Policies
The Regulatory Policy could influence the entire system by associating sensory experiences with goal-state values. This could modify goal-directed behavior, giving priority to sensory inputs or internal representations associated with specific goal-defined significance. In a sense, this policy would modulate all other policies by changing their underlying states.
**This policy may also possess the ability to migrate consistently effective motor-behavioral responses into "model-free" storage. This will likely be the most complex policy to engineer.
My intuition wants to compare ML learning rates and regularization to a nervous system's Glutamate and GABAergic systems, where regularization is synonymous with GABA (regularization pulls weights towards 0; GABA pulls neuronal behavior towards inhibition), and learning rates are synonymous with Glutamate-based systems (learning rates pull weights towards 1; Glutamate excites). The delta between the two then represents the system's total level of "stress." This compressive push-pull interaction could then be used as an observable metric which helps drive system behavior.
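Just to show how that push-pull could become an observable scalar, here is the crudest possible sketch of the idea; the function, its arguments, and the scaling are entirely made up:

```python
def system_stress(learning_rate, weight_decay, scale=1.0):
    """Toy version of the Glutamate/GABA analogy described above.

    Treat the learning rate as the excitatory drive and the regularization
    strength as the inhibitory drive; their (scaled) difference is read out
    as the system's current 'stress' level. Entirely an assumption on my part.
    """
    excitation = learning_rate * scale
    inhibition = weight_decay * scale
    return excitation - inhibition

print(system_stress(learning_rate=0.01, weight_decay=0.001))  # small positive "stress"
```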
The Salience-Behavioral Chain:
The above policies interact with one another, and with their shared environment, in a cyclical fashion. To help visualize this I would like to introduce something called the "Salience-Behavioral Chain" (SBC). The SBC is a looping process that is initialized by the onset of an allocentric stimulus. Typically, the chain follows a sequence of Stimulus → Salience Detection → Attention → Response Selection → Behavioral Execution. Each rotation in the chain is called a step.
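For intuition, here is a bare-bones skeleton of one pass around the chain. The stage functions are throwaway placeholders of mine; only the ordering (stimulus → salience → attention → response selection → behavioral execution) is meant to match the description above:

```python
def sbc_step(stimulus, policies):
    """One rotation of the Salience-Behavioral Chain."""
    salience = policies["salience"](stimulus)           # Salience Detection
    focus    = policies["attention"](salience)          # Attention
    response = policies["response_selection"](focus)    # Response Selection
    outcome  = policies["behavior"](response)           # Behavioral Execution
    return outcome                                      # feeds the next step of the chain

policies = {
    "salience": lambda s: max(s, key=s.get),            # pick the most salient feature
    "attention": lambda feature: feature,
    "response_selection": lambda feature: f"orient_towards_{feature}",
    "behavior": lambda action: f"executed:{action}",
}
print(sbc_step({"cup": 0.9, "table": 0.2}, policies))
```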
With respect to our original policy diagram, we can imagine a single step to look something like this:
- The Regulatory Policy receives initial sensory input and notices something in the environment that it wants. This desire then generates an end-goal of reaching said object. The system's current position is a known value (a starting position), and the desired object is the desired end state. This goal information is propagated system-wide.
[Following along in the SBC, this would be steps 0 - 1]
- The Hierarchical Planning Policy receives this goal state and decomposes it into potential sub-goals. These sub-goals are then passed on to the Salience Mapping Policy.
[Steps 1 - 2]
- The Salience Mapping Policy then further decomposes the goal states so as to establish a rank ordering of states. In a biological context we can imagine that the system has so far provided not only salience (what is important), but valence as well (how important). Personally, I like to imagine that we've established a kind of layered state-space, in which a series of concentric rings encircle the desired end-goal, which sits at the center of that space. Each concentric ring is a step within the SBC; that is, a potential behavioral response which may or may not drive us closer to our ultimate goal.
[Again, steps 1 - 2]
- The new desired features are then passed along to the Predictive Coding Policy, which will in turn engage the Hypothesis-testing Policy. Ordinarily, the Predictive Coding Policy collaborates with the Default Policy to generalize its own internal models; however, it will inhibit this functioning in favor of hypothesis testing at the behest of top-down stimulation.
[Steps 2 - 3]
- The Hypothesis-testing Policy will then execute a behavioral-motor command based on the top-down information provided to it. This command may be decomposed further by lower-level learning modules, or it may signal directly to a motor output path, driving movement.
[Steps 3 - 4]
- This motor command passes outbound through the Regulatory Policy, to the motor system. The outbound signal is compared against the initial goal state that was generated, and the delta between the two helps inform the system of future goal states. If the delta between the desired state and the current (updated) state diminishes, then we can be considered "closer" to our desired end-goal, and the next step in the SBC is encouraged/sustained. However, should the delta remain as it was, or even increase, then the behavior should be inhibited, and new approaches considered. A rough sketch of this check is given below.
[Outputs to environment, leading to the next step in the chain]
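And here is the sustain/inhibit check mentioned in the last step, sketched as a simple comparison of the distance to the goal before and after the latest motor command (all names and the distance metric are assumptions of mine):

```python
import numpy as np

def regulate_step(goal_state, previous_state, current_state, tolerance=1e-6):
    """Compare progress towards the goal before and after the last motor command.

    If the distance to the goal shrank, the current behavior is sustained;
    otherwise it is inhibited and an alternative approach should be considered.
    """
    previous_delta = np.linalg.norm(goal_state - previous_state)
    current_delta = np.linalg.norm(goal_state - current_state)
    if current_delta + tolerance < previous_delta:
        return "sustain"  # we moved closer; encourage the next step in the chain
    return "inhibit"      # no progress (or regression); try a different approach

goal = np.array([1.0, 1.0])
print(regulate_step(goal, previous_state=np.array([0.0, 0.0]),
                    current_state=np.array([0.5, 0.5])))  # -> "sustain"
```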
Alrighty, that's it for now. Like I said, this was a little rushed, but I plan on revisiting the post later today/this week to expand some of my thoughts. But anyways, I'd love to hear what the rest of you think. Be sure to let me know your thoughts, concerns and critiques. Until then, have a good day!