2025/07 - Hierarchy or Heterarchy?

If the Clay theory is correct about sight, then the thalamus is very useful for many other modalities:

• For sound, it transforms auditory inputs so as to, e.g., let us recognise a song in any key or follow a conversation in a noisy room (see the sketch after this list).

• For taste, it transforms inputs from the tongue into fundamental taste components.

• For touch, it allows us to have basic categories of touch and temperature sensations plus variations.

• For interoception, it transforms the very rich and detailed inputs from internal organs into the simpler information required for homeostasis.
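To make the song example concrete, here is a toy sketch (my own illustration, nothing from the paper; the note numbers are made-up MIDI values) of why a melody survives transposition: the sequence of pitch intervals is identical in every key, so a key-invariant representation only needs those intervals.

```python
# Toy illustration: successive pitch intervals are unchanged by transposition,
# so a melody can be recognised in any key from its interval sequence alone.
twinkle_in_c = [60, 60, 67, 67, 69, 69, 67]   # C C G G A A G (MIDI note numbers)
twinkle_in_d = [62, 62, 69, 69, 71, 71, 69]   # the same tune, a whole tone higher

def intervals(notes):
    """Return the key-invariant sequence of semitone steps between notes."""
    return [b - a for a, b in zip(notes, notes[1:])]

assert intervals(twinkle_in_c) == intervals(twinkle_in_d)  # [0, 7, 0, 2, 0, -2]
```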

All of this tremendous usefulness gives me hope that the Clay theory is either correct or is very close to it.

5 Likes

Hi @Trung_Doan yes this is a great observation. We have often discussed the question of how we recognize songs in any key, or when played by any instrument, and how this would fit in with these theories. You might find this recording particularly interesting, as it includes some discussion of music.

While we’ve often discussed it, we have never quite nailed down and agreed on how it should work. Hopefully this is something that we finally figure out as we continue working on object “behaviors” (i.e. time-varying models).

2 Likes

Yes, those are great observations. How to think about the thalamic transformations a bit more broadly/generally comes up every once in a while in our research meetings. Yes, they might represent orientation transforms, but they could learn other useful transforms that map observations from one space into another.

Btw, I wouldn’t refer to it as the Clay theory. The heterarchy paper is an extension of the Thousand Brains Theory, which was largely developed by Jeff Hawkins :slight_smile:

2 Likes

When the thalamus transforms a visual object, how could it also retain that object’s spatial relationships with other objects in the scene that are not currently transformed?

It seems to me that it can’t, which is why I worry that this theory is close but not yet the last word.

In my ATB book review on Amazon, I wrote that Hawkins was up there with Einstein. But it’s Ok to have many other theories to broaden ATB. Besides, a short name makes it easier to talk and think.

1 Like

Hi @Trung_Doan I hope I am understanding your question right. Maybe it helps if I clarify the transformation process as we hypothesize it:

  • The relative spatial relationship of features (or objects) is maintained within a cortical column. This column has a model of features relative to each other and while it can only sense one of them at any time, it can integrate their spatial relationships using the sensor movement between sensations.
  • The thalamus transforms the incoming sensor movement and sensed features from egocentric coordinates to object-centric coordinates. For example, it would translate sensed locations and orientations relative to the body into locations and orientations relative to the model of a coffee mug (which is learned in the column it connects to); see the sketch after this list
  • The L6b connection from the cortical column to the thalamus informs the thalamus which transformation needs to be applied to the input. This is the pose hypothesis that the column has (how the object is rotated in the world → how incoming sensations need to be rotated to match the learned model of the object)
  • This transform will be the same, no matter where on the object the sensor currently senses (there is one orientation that needs to be applied to all features of the object)
  • Spatial relationships of complete objects (like a scene) are represented in a column different from the individual object model. For temporary spatial arrangements, this is likely just temporarily modeled in the hippocampus. Inputs to columns representing a scene are transformed by the thalamus using the same mechanism.
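As a very rough sketch of the idea in the middle bullets (made-up names and values, not how Monty actually implements it), you can picture a single pose hypothesis, supplied by the column, being applied identically to every sensed location and orientation to bring them into the object’s reference frame:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Hypothetical pose hypothesis fed back from the column via L6b:
# "the mug is rotated 90 degrees about z relative to the learned model".
pose_hypothesis = R.from_euler("z", 90, degrees=True)

def to_object_centric(sensed_location, sensed_orientation, object_origin, pose):
    """Map one body-centric sensation into the object's reference frame.

    The same pose is applied no matter where on the object the sensor
    currently is (one orientation for all features of the object).
    """
    relative_location = np.asarray(sensed_location) - np.asarray(object_origin)
    model_location = pose.inv().apply(relative_location)
    model_orientation = pose.inv() * sensed_orientation
    return model_location, model_orientation

# Example: a fingertip senses a feature 10 cm in front of the body while the
# mug sits 8 cm in front of it; all numbers are invented for illustration.
loc, ori = to_object_centric(
    sensed_location=[0.10, 0.0, 0.0],
    sensed_orientation=R.identity(),
    object_origin=[0.08, 0.0, 0.0],
    pose=pose_hypothesis,
)
```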

Does this clarify things?

1 Like

Insightful work as always, and fascinating!

Without contesting anything you present, I sometimes feel the very granular, bottom-up approach used in your work and discussions may miss key aspects of normal brains that I think play an inherent role… when building compositional objects, for example.

Because our brain resides in a body full of sensors that grew and expanded with it, it necessarily had to start with* building an intimate “knowledge” of joints, levers, hinges, rotations, deformations, springs, etc. at the same time the sensors began to interface with the world.

All of these local behaviors, in their abstract (top level) form, are therefore available when assembling external compositional objects.

…

Although compositional objects can sometimes only be learned location by location, the rich tapestry of top-level behaviors, including deformation (for example, separating skin movement from muscle contraction), would allow rapid formulation of an expected end result… even if a little approximate (faster than location by location).

…

Also, I have read and heard much about building up compositional objects, but did I miss your thoughts on the reverse process (divisional composition)? Countless objects around us are first apprehended as singular, but then we discover “it” does more. It opens (hinge, slide, screw…), it lights up, warms up, emits sound, etc. Your stapler that started as 1 and became 2, 3, 4… distinct elements, each with a specific behavior at an object-specific location. In such cases, not everything will have to be re-learned location by location. Surely, the representation for the full object still exists. Understanding the parts will require some initial “effort”, but the process should generally be faster and more efficient than starting from the separate elements. At the same time, these child objects are more complex because they originate from a behavior.

*The TBP discussions started with exploring external objects; perhaps dedicating a learning phase to exploring how the tools and sensors behave would be fruitful at some point.

2 Likes

Hi @Jacques

great questions and thoughts!

One important thing to clarify is that (as the paper title suggests) we don’t think of the brain as modeling objects in the world in a strict hierarchy. The same object is learned at several levels of the hierarchy (at different resolutions, as the levels receive different input). For instance, we can learn a non-compositional model of the stapler at the lowest level of the hierarchy and also a compositional model at the second level. We can also learn a non-compositional model at the second level using lower-frequency direct sensory input.

While the world can have a near-infinite depth of composition (think of how the universe is made of solar systems, which have planets, which have continents, which have countries,… and you can keep zooming in), the brain does not need the same number of hierarchical levels to model this. There is no prescribed level at which a certain object or concept needs to be modeled. In past research meetings, we actually tried to think through whether there is a theoretical need for more than two levels, since you would be able to model any level of composition simply by shifting your attention between different parent and child objects.

So, to get back to the last part of your question, you can certainly learn a rough model of a tree first and then later on associate a model of leaves with certain locations on the tree. Also, during inference, you can first recognize the basic shape and features of the tree based on direct sensory input (which does not just go to the lowest hierarchical region) without the need to recognize individual leaves first.
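As a toy illustration of that last point (just a sketch of a possible data structure with invented names, not Monty’s actual representation), a higher-level compositional model can simply reference child models at locations in its own reference frame, and those references can be added after a rough parent model already exists:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectModel:
    """A toy stand-in for a learned object model in one column."""
    name: str
    # child model name -> location (x, y, z) in this object's reference frame
    children: dict[str, tuple[float, float, float]] = field(default_factory=dict)

# Learn a rough, non-compositional model of the tree first...
tree = ObjectModel("tree")

# ...and only later associate child models with locations on it.
tree.children["trunk"] = (0.0, 0.5, 0.0)
tree.children["leaf_cluster"] = (0.4, 2.1, 0.0)
```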

If you haven’t seen it yet, I recommend having a look at our recent preprint: [2507.05888] Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain

Regarding the first part of your question, can you elaborate some more on what you mean by “local behaviors, in their abstract (top level) form”? What makes models of body parts and how they move abstract, and what requires them to be modeled at the top level?

-Viviane

1 Like

Hi Viviane

Your response triggered an immediate question in my mind - I wonder to what extent recognition of an object is supplemented by the understanding of what it does or what it is used for.

Perhaps we easily recognise a cup shaped object because of the fundamental significance of cup shaped objects being able to hold water and food. Although a stapler may not have quite such primal significance, we need to know a lot about paper and stapling in order to recognise an object as a stapler.

If we land on an alien planet and none of the objects are familiar, all of our attempts at classification would fail and our attempts to recognise objects would be in disarray. We would likely require a period of learning what things do before we could begin to recognise them.

Another aspect of this is the difference between recognising a cup versus recognising my cup. Are there different processes at work when categorising an object that may or may not have been seen before versus a familiar object or personal possession, or is it just a matter of adding finer detail to the recognition process?

Alex

I understand the same object is learned across different sensory modalities, and also that within a modality, inference can happen out-of-order or in parallel across multiple hierarchical regions, based on the sensory inputs reaching each level.

The next sentence you offer is challenging: “we tried to think through whether there is a theoretical need for more than two levels to capture the world’s near infinite depth of composition.” I probably understand what you mean but did not realize that’s where you suggest it stops.

Speculations: I would have thought the limit of just 2 levels is met when attempting to model dynamic systems in one’s brain. Concrete examples would be high-level sailing, skiing, etc. (where multiple interacting “behaviors” must be optimised). I do not know if even more abstract dynamic thinking works along the same path or if a more mechanical and mathematical approach is always required.

Because of the need to use biological resources efficiently, there remain limits to the depth of thinking structure that can be mobilised at once, even within the most intelligent individuals.

…

Because the models built in the lower level are assembled from sensory inputs that are narrowly focused, they are unlikely to succeed in capturing behavior, as was pointed out in one of your discussions (possibly “Brainstorming around behavior and deformation”). Therefore at least one higher level is involved.

What remains less clear to me is how one object (let’s say a stapler) becomes 2 objects: the original bottom half + the original top half + 1 behavior. The point-by-point requirement initially makes sense, especially upon discovering the original object loses cohesion (while a portion remains unchanged, another portion is no longer “there” but in movement, within object-centric space). Point-by-point comparison establishes that the top half moves as a block, and that this block moves according to a specific type of transformation. An SDR for just the top block and just the bottom one is assumed (?), and an SDR for that type of movement (pivoting hinge) is then applied at the correct object-specific location.

In what follows, I will be over-reaching with the theory, in part because that is how I think, and in part to be able to answer your question:

As in the above example, I think the brain also finds patterns in object behaviors. Even if the hinged behavior of a door, a stapler, or a book differs in the specifics, at the top level there is a pure concept.

For the human animal, much of the first 2 years is dedicated to learning to mobilize the body appendages.

“Local behaviors” are what you can do with your arms, legs, mouth… In the abstract, there are hinges, levers, pivot points… in action throughout the body.

If compositional behavior is possible, the top-level “pivot” or “hinge” concept is readily available to extract from some external stimulus.

1 Like

Hi @Alex and @Jacques

I love the discussion and in-depth thoughts here! Sorry I didn’t get around to replying earlier. I’ll try to address all your questions, but if I skip some parts it is because we also haven’t fully figured out all of the points you touch on and are still actively brainstorming on those topics.

Perhaps we easily recognise a cup shaped object because of the fundamental significance of cup shaped objects being able to hold water and food.

I agree. We don’t just categorize objects by their shape but also by how we interact with them and how they behave. For example, think about all the crazy shapes chairs can take, yet we can tell with all of them that they are places we can sit down on. I don’t think we would have one unified model for all possible chairs or coffee cups; instead, if there is a lot of variety, we would have multiple models that cover that variety, and at a higher level in the hierarchy they can be interpreted in similar ways. We may also use short mental simulations for some of those categorizations (like thinking about what would happen if you poured water into a novel-shaped object, or how you would sit down on a chair with a shape you have never encountered before).

If we land on an alien planet and none of the objects are familiar, all of our attempts at classification would fail and our attempts to recognise objects would be in disarray. We would likely require a period of learning what things do before we could begin to recognise them.

I don’t think we would be completely unable to learn about the alien planet right away without knowing how things work. We could already start recognizing the shapes of new objects. But I agree that to be able to act intelligently in this new world, one would also have to test hypotheses and interact with objects (like children do in our world) to figure out how things work and learn the object behaviors and similarities beyond shape.

Another aspect of this is the difference between recognising a cup versus recognising my cup. Are there different processes at work when categorising an object that may or may not have been seen before versus a familiar object or personal possession, or is it just a matter of adding finer detail to the recognition process?

The idea is that you don’t just have one model of a mug in your brain but many models. Some columns might model a generic shape of a mug while others model your favorite mug with the specific print and dent in the corner. They can be associatively connected (when you see your favorite mug you also recognize the generic mug model since it is consistent with that as well).
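As a very loose sketch of that idea (toy code with invented features, nothing like Monty’s actual learning or voting), you can picture several models living side by side, with an associative link from the specific model to the generic one:

```python
# Several coexisting models of "mug"; recognizing the specific one also
# activates the generic one it is associatively linked to.
models = {
    "generic_mug": {"shape": "cylinder_with_handle"},
    "my_favorite_mug": {
        "shape": "cylinder_with_handle",
        "print": "sunflowers",
        "dent": "rim, back left",
    },
}
associations = {"my_favorite_mug": ["generic_mug"]}

def recognize(observation):
    """Return every model whose stored features all match the observation."""
    hits = [name for name, feats in models.items()
            if all(observation.get(k) == v for k, v in feats.items())]
    for name in list(hits):  # follow associative links
        hits.extend(m for m in associations.get(name, []) if m not in hits)
    return hits

print(recognize({"shape": "cylinder_with_handle", "print": "sunflowers",
                 "dent": "rim, back left"}))  # ['generic_mug', 'my_favorite_mug']
```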

On number of hierarchical levels:

I should clarify that we don’t claim that the human brain only has 2 levels of hierarchy; neuroanatomy clearly indicates that there are more than two. However, many mammals have a much shallower hierarchy (like 2-3 levels), and even the hierarchy in the human neocortex has far fewer hierarchical levels than, for instance, most DNNs (a bit more discussion on this topic here: What sets Monty apart? - #5 by vclay ). The two levels are more of a minimal requirement we talk about, and that might be supported by attentional constraints (it seems like we don’t ever pay attention to a deeper nesting than a parent-child relationship). But, as I think you are getting at as well, it may require more levels to be able to learn models of more complex/abstract things that are not necessarily rooted in direct sensory input but are instead compositions of other objects recognized at lower levels and hence contain more condensed and processed information.

In terms of complex behaviors like skiing or even walking, a lot of this would likely be shifted to subcortical regions like the cerebellum once we get good at it. Doing a complex, potentially nested, sequence of actions initially requires a lot of attention and conscious thought (recruiting models in the neocortex), but as they get more practiced, you don’t have to think about them anymore, and they can work more as a model-free policy.

How one object becomes two:

This is one of those topics we are still actively discussing and don’t have a conclusion on yet. One idea is that, initially, an attentional mask could be applied (like masking the bottom of the stapler as the top moves) to continue making correct predictions when part of the object starts moving for the first time. This mask could be based on model-free information, such as areas where movement is detected. Over time, the masked area might then be learned as a separate model.
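A very rough sketch of that masking idea (invented function names and thresholds, not an implementation we have): compare point-wise observations across two moments, mark the points that moved, and keep predicting with the static remainder while the moving region becomes a candidate for a new model.

```python
import numpy as np

def moving_mask(points_before, points_after, threshold=0.01):
    """Mark points that moved more than `threshold` between two observations."""
    displacement = np.linalg.norm(points_after - points_before, axis=1)
    return displacement > threshold

# Example: the stapler's top half pivots up while the bottom half stays put.
before = np.array([[0.00, 0.00, 0.0], [0.10, 0.00, 0.0], [0.10, 0.05, 0.0]])
after  = np.array([[0.00, 0.00, 0.0], [0.10, 0.00, 0.0], [0.10, 0.08, 0.03]])

mask = moving_mask(before, after)          # array([False, False,  True])
static_part = before[~mask]                # keep predicting with this
moving_candidate = before[mask]            # could be learned as a separate model
```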

On learning general behavior models & models of how the agent moves:

I hope I understand everything you are saying correctly but I think I agree with all those points. We are definitely thinking of the system as being capable of learning general behavior models (like hinge or ball-and-socket-joint) and applying them to different object morphologies (like laptop, door, stapler, arm). I also agree that we would learn behavior and morphology models of our own body and how it can move.
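To illustrate what I mean by a general behavior model (again, just a toy sketch with made-up pivots, axes, and angles), the same abstract hinge can be reused across morphologies by binding it to an object-specific pivot point and axis:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def hinge(points, pivot, axis, angle_deg):
    """Generic hinge behavior: rotate `points` about the line through `pivot` along `axis`."""
    rot = R.from_rotvec(np.radians(angle_deg) * np.asarray(axis, dtype=float))
    return rot.apply(np.asarray(points, dtype=float) - pivot) + pivot

# The same behavior model, bound to two very different morphologies.
stapler_top = [[0.00, 0.02, 0.0], [0.12, 0.02, 0.0]]
door_leaf   = [[0.00, 0.00, 0.0], [0.80, 0.00, 0.0], [0.80, 2.00, 0.0]]

opened_stapler = hinge(stapler_top, pivot=[0.0, 0.02, 0.0], axis=[0, 0, 1], angle_deg=25)
opened_door    = hinge(door_leaf,   pivot=[0.0, 0.00, 0.0], axis=[0, 1, 0], angle_deg=90)
```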

I hope this all makes sense. Let me know if I missed any critical points or questions!

-Viviane

1 Like

Hi Viviane

Thank you for your answers; it looks like we are thinking along the same lines. I am keen to test these theories in the real world. To this end I have been designing a flexible sensorimotor architecture on which to experiment. I read your thesis ‘Learning as an Active Sensorimotor Process’. An almost overwhelming number of possibilities are covered there.

At first I was doubtful that I could connect my sensorimotor architecture to Monty, but having read more about Monty I think it is worth a try if it means I am able to build on all of the great progress the TBP team have achieved. I am more of a hardware engineer than a software engineer by profession, so it will be challenging; I will need help along the way!

Coincidentally my sensorimotor architecture supports all of the sensors mentioned in the TBP Overview video and more. I am just about to build the first artificial creature using the architecture. Assuming I hard code some basic motor control for movement I think the first learning task for the creature should be to try to learn its surroundings (initially based on the lidar data). As mentioned in the video, Monty is currently focussed on object recognition but I am hoping that the lidar image of the surroundings can somehow be treated as an object? Lots to learn on my part.

I am happy to make the sensorimotor architecture available to others if it proves useful. My emphasis has been on making it low cost so that it is accessible to anyone wanting to experiment. The mobile part is based on raspberry pi and custom circuit boards but the sensory data will be transferred via wifi to a PC running Monty.

Any thoughts (or doubts) on the viability of this project are welcome.

Alex

2 Likes

Hi @Alex

that sounds cool! I’m definitely looking forward to hearing more about what you are building. Feel free to open another thread in the Projects category to elaborate more and get specific feedback & input.

Also, I am not sure if you have seen this yet, but we have a tutorial on Monty for robotics here: Using Monty for Robotics, plus some example projects on our project showcase page from our recent robot hackathon. Maybe this is useful.

Best wishes,

Viviane

3 Likes