Inferring and learning non-rigid objects

The Monty approach to modeling object morphology is very cool. I love the connection to environment localization, that an object is like a tiny piece of environment that you are localizing yourself against. I suppose another way to think of it is that an environment is a giant object that you are inside of. :laughing: Anyway…

Everything I’ve understood so far about object inference seems to lean pretty heavily on the rigidity of object morphology. Even the behavioral examples, like a stapler, seem to involve only limited deviations from rigidity.

Of course there are many things that are not rigid and could have a huge number of distinct shapes. I’m thinking of objects like:

  • a long, flexible rope
  • a napkin hanging off the side of a table
  • a liquid that takes the shape of its container

Perhaps even tougher are behavioral things, like a waterfall, a campfire, or a wriggling worm. But I’ll leave aside behavior/movement for now. Let’s say we want to identify an object that is sitting still while you observe it.

How might the brain handle it? How about Monty? I’m having trouble seeing how point-clouds will be of much use here.

1 Like

@carver great question, yes this is an important problem that we still need to resolve. In some of our videos on object behaviors, you will see us discuss examples like this, for example folding a T-shirt, filling a vessel with fluid, etc. We’ve even started calling it the “can of worms” problem as a sort of meta reference (which is apt given your own example :sweat_smile:).

The short answer is we don’t know for sure how to resolve this, but we think a few things will be important.

One is that there is likely a way to augment existing models with some form of displacement or movement. So while the baseline, point-cloud-like model may have some degree of rigidity, different signals might enable us to deviate from it, or rather, to appropriately integrate deviations from that rigidity.

Another point is that it is very unlikely that the brain has detailed models of these kinds of objects in all their possible forms - rather, intelligent behavior likely relies on quickly constructing a temporary, “good enough” model of the current morphology, after which we can use this to inform actions. For example, the brain likely has certain models of a T-shirt in canonical states (like when it is lying completely flat, and when it is folded neatly). When you see a crumpled t-shirt, you do not have an existing, detailed model that enables you to make predictions about exactly what part of a T-shirt you will see when you move your sensors across it. Rather, you would rapidly learn a temporary, coarse model of that T-shirt in its current pose - enough of a model that you can identify the location of key sub-objects like the collar or the sleeve. This is a sufficient model to then pick it up and start interacting with it. After picking it up, the just-learned model is already irrelevant, but you would then rapidly learn a new temporary model to support the next necessary action, etc.
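To make that a bit more concrete, here is a very rough sketch of the “temporary, good-enough model” idea (purely illustrative Python; none of these names correspond to Monty’s actual learning modules):

```python
# Throwaway model of one object's current morphology, built from a handful
# of observations - just enough to locate key sub-objects and act.
import numpy as np

class TemporaryModel:
    """Coarse, short-lived model of one object in its current pose."""

    def __init__(self):
        self.points = []  # observed 3D locations
        self.labels = []  # recognized sub-objects at those locations

    def add_observation(self, location, sub_object=None):
        self.points.append(np.asarray(location, dtype=float))
        self.labels.append(sub_object)  # e.g. "collar", "sleeve", or None

    def locate(self, sub_object):
        """Return the rough location of a recognized sub-object, if seen."""
        hits = [p for p, l in zip(self.points, self.labels) if l == sub_object]
        return np.mean(hits, axis=0) if hits else None

# A few glances at the crumpled T-shirt are enough to plan the next action:
shirt_now = TemporaryModel()
shirt_now.add_observation([0.10, 0.02, 0.00], sub_object="collar")
shirt_now.add_observation([0.25, 0.05, 0.01])
shirt_now.add_observation([0.31, 0.08, 0.02], sub_object="sleeve")
grasp_target = shirt_now.locate("collar")  # good enough to pick it up
```

Once the shirt is picked up, this model would simply be discarded and a new one learned for the new configuration.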

This still leaves lots of open questions, but hopefully that gives a sense of the direction our thinking is going.

2 Likes

[stares at can opener]

So, the visual cortex is insanely complex; you have two high-resolution retinas transmitting massive data streams into several brain regions (V1 to V5) that detect orientations, spatial frequencies, texture, depth, color, shape, location, motion direction, speed, etc. and produce sets of SDR “fingerprints” / sparse codes of what they’re currently perceiving, possibly involving columnar voting at each step of the way.

All this info is forwarded into the ventral and dorsal streams, which bind the fingerprints together to form a multi-sensorial, compositional representation of the environment, allowing the neocortex to infer and learn higher-order identities. In theory, the cortex votes on these sets of fingerprints to form hypotheses. It’s a huge saliency-driven segmentation and prediction pipeline.

For the sake of simplicity and prototyping, Monty approximates the visual cortex as a color-biased 3D mesh recognition system. In the recent video meetings, we can see that the team has been throwing a lot of ideas around on how to model object behaviors (e.g. stapler movement). Their work on compositionality also involves identifying simple deformations like a distorted logo.

Identifying deformed objects and modeling possible object distortions / behaviors are separate (but related) problems. The vision learning module should be able to identify a closed/open stapler, a normal/distorted logo, and a coiled/extended rope, but I believe the modeling should be handled in one or more separate modules, because it’s a higher-order reasoning task more than a vision one. There’s also the whole question of whether the modeling itself should be emergent or handcrafted.

Let’s focus on identification. How could the current version of Monty identify a rope regardless of its shape? We have access to two fingerprints: 3D shape and color.

Suppose you train it on a straight piece of rope. Monty remembers the shape and color. Now, you rearrange the rope into a “Z” shape, i.e. three straight segments with two bends. During inference, since each column only sees part of the object and not the whole thing, the columns aimed at the bends will be confused and have low confidence, but the columns that see the straight parts will choose “rope” as their most likely hypothesis with high confidence.

Thus, the beauty of the voting process (and the columnar resolution) is that it would enable emergent generalization, even if still rudimentary in this form. That example would translate well to the closed/open stapler too: columns aimed at the hinge would be confused, but the columns aimed at the two halves would have high confidence. It has limitations; it won’t work if you tie the rope into a pretzel shape, for example. But it would be a good test to try with minimal effort.
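To make the intuition concrete, here’s a toy version of that vote (the confidence numbers are made up, and this is nothing like Monty’s actual evidence updates; it only illustrates how a few confident columns can outvote the confused ones):

```python
# Toy vote over per-column hypotheses for the "Z"-shaped rope.
from collections import defaultdict

column_outputs = [
    {"rope": 0.90},                   # column on a straight segment
    {"rope": 0.85},                   # another straight segment
    {"rope": 0.20, "unknown": 0.50},  # column aimed at a bend: low confidence
    {"rope": 0.25, "unknown": 0.45},  # the other bend
    {"rope": 0.88},                   # third straight segment
]

tally = defaultdict(float)
for hypotheses in column_outputs:
    for obj, confidence in hypotheses.items():
        tally[obj] += confidence

winner = max(tally, key=tally.get)
print(winner, dict(tally))  # "rope" wins despite two confused columns
```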

But how to push it further?

To truly reach human-level identification capabilities, there’s certainly a need for high-definition, dimensionally reduced visual feature detection. The brain is very nimble at dimensionality reduction, and this capability must be replicated for maximum generalization.

One question that looms in my mind is how an SDR could encode dimensionally reduced info without brittleness. And what form would the reduction pipeline take? Maybe some sort of content-aware scaling algorithm that “unfurls” shapes along detected edges and only preserves salient color transitions? Hmm, probably not biologically plausible.

A crude way to maybe perform this would be with the Shazam method I described here, applied to a Sobel-filtered image or something along those lines. However, hashes are also brittle by design, and not biologically plausible.
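For what it’s worth, this is roughly the kind of thing I had in mind (a numpy-only toy: Sobel edge magnitude plus a crude grid hash; brittle, as noted, and not claimed to be biologically plausible):

```python
# Crude illustration of "Shazam on a Sobel image": edge-filter a grayscale
# patch, then hash the coarse layout of edge energy.
import numpy as np

def sobel_magnitude(img):
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            window = img[i:i + 3, j:j + 3]
            out[i, j] = np.hypot((window * kx).sum(), (window * ky).sum())
    return out

def edge_fingerprint(img, grid=4):
    """Hash which grid cells have above-average edge energy."""
    edges = sobel_magnitude(img)
    h, w = edges.shape
    cells = [edges[i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid].mean()
             for i in range(grid) for j in range(grid)]
    bits = tuple(int(c > np.mean(cells)) for c in cells)
    return hash(bits)

patch = np.random.rand(32, 32)  # stand-in for a grayscale texture patch
print(edge_fingerprint(patch))
```

A tiny shift or rotation of the patch changes the hash completely, which is exactly the brittleness problem.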

The docs suggest neural networks as a potential pathway for feature extraction, but that’s backprop-based and too rigid… There has to be something more flexible. Among other things, I’ve been going over the work of Bruno Olshausen to gather clues; I’ll keep digging. He seems to know a thing or two about that kind of stuff.

I also stumbled upon Sparsey, an experimental vision AI based on column theory. It doesn’t seem to have been pushed beyond the proof-of-concept stage, but there are definitely some worthwhile ideas in there that could be applied more broadly. Its author, Gerard Rinkus, might even be interested in contributing to Monty; he mentions in his papers having spoken with Hawkins a few times.

Sparsey GIF

Additional thoughts regarding visual range

The higher-order identity of a “rope” is bound to many sensorial fingerprints, but most especially texture, I believe, and shape to a lesser extent. At close visual range, we recognize ropes by their braided fiber patterns and how light interacts with them. Imagine a 2-inch-long piece of 1-inch-diameter hemp rope; its shape and color are very similar to those of a wine cork, but you can still recognize that it is a piece of rope via its texture.

But now, imagine that a skilled artist took an actual wine cork and meticulously painted it to look like a piece of rope. The neocortex might initially predict that it’s a piece of rope, but then you touch it and realize it’s actually a rigid object that doesn’t feel like a rope, and you’re confused. You look closer, noticing subtle cues like cork pores and paintbrush strokes, and that it weighs and feels like a cork between your fingers.

The brain then resettles the object’s associative memory trace to a set of fingerprints like “wine cork, paint layer, rope texture, located at XYZ art gallery, year 20XX”. You’ve never seen anything like this before, which results in high saliency and causes those fingerprints to be recorded as a unique oddity, distinct from all other corks or ropes you might remember.

At long visual range, imagine you have a car tire hanging from a tree branch. You can see that the tire is connected to the branch via a thin line, but you might not be able to see the texture because you’re too far away, or even the color, if the sun is setting behind the tree for example. In this scenario, the brain is making a prediction that it is likely a rope. But maybe it’s not a rope, it could theoretically be a fixed metal pole. The prefrontal cortex will most likely not even consider “metal pole” as a viable prediction, because it was probably never exposed to such a possibility.

If the tree is just a silhouette on the edge of your peripheral vision while you’re talking to someone in front of you, the visual cortex will quickly assess “tree with tire swing” based on the overall shape, and not really pay any more attention to it because it’s not salient and it’s just a background fixture.

So, this whole example highlights how object identification isn’t strictly a visual process; it also involves various context-dependent degrees of reasoning. New kinds of modules would be needed to properly address the can of worms.

Regarding liquids: a one-year-old kid can recognize a surprising number of objects and animals, but you better have a mop nearby if any unsealed container is within arm’s reach. :wink:


P.S.: If anything I said seems confusing or deviates from TBT, please let me know!

2 Likes

@AgentRev yeah it’s helpful to distinguish between recognition and modeling. In the example I gave of the T-shirt, previously learned models are going to be sufficient to recognize the T-shirt, but the temporary model built on the fly (together with previously learned models) is important in order to “model” (i.e. predict the results of sensory movements + actions) the particular T-shirt that is in front of you.

I agree that things like texture and local morphology would be more important for classification the more that an object deviates from its standard shape/form. One other important element for classification which we sometimes discuss but are a ways from implementing is affordance/context. E.g. fruit do not share common morphology, nor do many chairs or vessels. Something that is rope-like/a cord would also have elements of affordance tied up (sorry :sweat_smile:) in its classification. We think it should be possible to introduce this kind of context through associative or top-down connections and mental simulation, but it’s not something we have a concrete model for yet (we need to figure out affordances / actions first!).

Re. dimensionality reduction, I’d be curious to hear more about some of the dimensions you had in mind that might emerge. For example, this is the kind of thing that DNNs are good at, but they are notoriously non-human in their recognition patterns (not taking sufficient account of shape, etc.). What kinds of dimensions are you thinking of for a human/primate? Things like heaviness/brittleness/sliminess?

2 Likes

Just brainstorming: I think we could work on some simpler, preliminary steps before tackling the stapler behavior, steps that might make the can of worms easier to solve.

Imagine back to when you saw a stapler for the first time. You didn’t know what it was. But you did know something at a glance.

  1. The first thing you noticed was that it was an object. But the second thing you noticed was that it was a “COMPOSITE OBJECT”.

What if we get the columns to also vote on whether the object is a solid object made of a single material, or whether it’s made of other objects? (A toy sketch of what that extra vote could look like follows this list.)

  2. The second step would be for the columns to create a 3D internal model of the different parts just from viewing the stapler. The columns could predict the volume of the composite parts just by seeing the stapler from the outside.

But I think this goes back to texture. The different components of a stapler are made of plastic and metal. And the columns would need to identify the different textures in the same object.

  3. Now it could learn how the separate parts can move depending on which part you put pressure on and which part of the stapler you hold fixed. Different forces on the stapler would cause different behaviors, from stapling a paper to opening the stapler to refill it.
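Here’s a toy sketch of that extra vote (entirely made up; nothing in Monty works this way): each column reports the texture it observes, and disagreement on material is taken as evidence that the object is composite.

```python
# Toy "composite vs. single material" vote driven by observed textures.
column_reports = [
    {"object": "stapler", "texture": "plastic"},
    {"object": "stapler", "texture": "plastic"},
    {"object": "stapler", "texture": "metal"},  # the stapling mechanism
    {"object": "stapler", "texture": "metal"},
]

textures_seen = {report["texture"] for report in column_reports}
is_composite = len(textures_seen) > 1  # more than one material observed

print("composite object" if is_composite else "single-material object",
      sorted(textures_seen))
```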
1 Like

When I wrote “dimensionality reduction” above, I was more referring to sparse coding, as described in this paper: Neural correlates of sparse coding and dimensionality reduction

(My wires probably got crossed because of that title!)

So I’m thinking more about raw visual features, rather than abstract things like “heaviness/brittleness/sliminess”. Stuff I listed at the beginning of my previous post like spatial frequency, texture, etc., but learned and inferred through columnar voting rather than DNNs. Think of it as compositional features / categories to help further disambiguate objects or entities. The idea is not fully formed yet in my mind as to how this would be computed efficiently, especially in an orientation-agnostic way, however one thing is certain:

It’s kinda normal that DNNs don’t take “shape” into account because the datasets are curated to be orientation-agnostic, which dilutes shape representation altogether. And I’m thinking that animals probably deal with shapes at a higher cognitive level beyond what DNNs can achieve. So if you successfully mix texture recognition with point clouds, you’d theoretically achieve better results than both DNNs and 3D-only Monty. I’m not certain that even this would go far enough to qualify as human-level, but it would be a great stepping stone.
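To clarify what I mean by sparse coding as the “dimensionality reduction”, here’s a minimal toy in the spirit of the Olshausen-style work (random dictionary and greedy matching pursuit; purely illustrative and not a proposal for how Monty should do it):

```python
# Represent an input patch as a sparse combination of dictionary elements,
# so only a handful of units are active - a sparse "fingerprint".
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))  # 256 dictionary atoms for 8x8 patches
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms

def sparse_code(x, n_active=8):
    """Greedy matching pursuit: keep only a few active coefficients."""
    residual = x.astype(float).copy()
    code = np.zeros(D.shape[1])
    for _ in range(n_active):
        k = int(np.argmax(np.abs(D.T @ residual)))  # best-matching atom
        coeff = D[:, k] @ residual
        code[k] += coeff
        residual -= coeff * D[:, k]
    return code

patch = rng.standard_normal(64)  # stand-in for an 8x8 image patch
code = sparse_code(patch)
print(np.count_nonzero(code), "active units out of", code.size)
```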

About affordances, is it realistic to think about them at this stage? To me, that would be more of a high-level reasoning task to come later down the road, since it requires a more advanced understanding of environmental relationships and world physics. You’d have to build an associative predictive coding engine for the whole Monty machinery; once you get that engine running smoothly, then you can power the affordances drivetrain.

Unless you meant that you intend to start tackling this in the short term, or you’re taking another approach? :thinking:

@AdamLD I think composite objects are also a higher-level task that Monty might not have the prerequisites for right now. In my opinion, the way we think in terms of “parts” is more of a cultural skill than an evolutionary cortical function. Does it make sense to try “hardcoding” it rather than teaching it?

1 Like

I think “parts” are a subjective thing. And different cultures and different people can create different “parts” for the same thing.

I believe this is a central mechanism of human learning that is not talked about. I call it “chunking”. The best example for understanding this is learning to drive a car. At first it feels like you have to learn many different things: watch the road, put on the seat belt, watch the mirrors, change gears, watch the road signs, etc. But after a while you “chunk” all these things together and the whole thing becomes just “driving”.

In a way, I think this probably creates a new circuit (or model) in the brain that manages all the other little parts of driving. So all the little parts got chunked together into one single thing.

This would be a hierarchical model, I think. Maybe some columns can specialize in the different small parts of driving, then higher-level cortical columns can learn to manage them.
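A toy sketch of what I mean, purely illustrative and not a claim about cortical circuitry: a higher-level “chunk” simply wraps and sequences the lower-level skills, so the caller only deals with one thing (“driving”).

```python
# Toy "chunking": low-level skills get wrapped by one higher-level model
# that sequences them, so the whole thing becomes a single unit.
class Skill:
    def __init__(self, name):
        self.name = name

    def run(self):
        return f"doing: {self.name}"

class Chunk:
    """A higher-level model that manages previously separate sub-skills."""

    def __init__(self, name, sub_skills):
        self.name = name
        self.sub_skills = sub_skills

    def run(self):
        # Once chunked, the sequencing of the little parts is internal.
        return [skill.run() for skill in self.sub_skills]

driving = Chunk("driving", [
    Skill("fasten seat belt"),
    Skill("watch mirrors"),
    Skill("change gears"),
    Skill("watch road signs"),
])
print(driving.run())
```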

I think this goes back to the stapler because I get the feeling that everyone is subconsciously trying to solve the “can of worms” problem at the first level of the cortical columns.

But maybe some problems can only be solved using hierarchy, hence the lack of progress.