[stares at can opener]
So, the visual cortex is insanely complex; you have two high-resolution retinas transmitting massive datastreams into several brain regions (V1 to V5) that detect orientations, spatial frequencies, texture, depth, color, shape, location, motion direction, speed, etc., and produce sets of SDR "fingerprints" / sparse codes for what they're currently perceiving, possibly involving columnar voting at each step of the way.
All this info is forwarded into the ventral and dorsal streams, which bind the fingerprints together to form a multi-sensorial, compositional representation of the environment, allowing the neocortex to infer and learn higher-order identities. In theory, the cortex votes on these sets of fingerprints to form hypotheses. It's a huge saliency-driven segmentation and prediction pipeline.
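For anyone unfamiliar with the jargon: an SDR "fingerprint" is just a large, mostly-zero binary vector, and two fingerprints are compared by counting their shared active bits. A minimal illustrative sketch in Python (sizes and sparsity picked arbitrarily, nothing Monty-specific):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_sdr(size=2048, active=40):
    """A sparse distributed representation: a large binary vector
    with only a few active bits (here ~2% sparsity)."""
    sdr = np.zeros(size, dtype=bool)
    sdr[rng.choice(size, active, replace=False)] = True
    return sdr

def overlap(a, b):
    """Similarity = number of shared active bits; the basis for
    comparing fingerprints during voting."""
    return int(np.count_nonzero(a & b))

rope = random_sdr()
cork = random_sdr()
print(overlap(rope, rope))  # 40: identical fingerprints overlap fully
print(overlap(rope, cork))  # ~0-1: unrelated fingerprints barely overlap
```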
For the sake of simplicity and prototyping, Monty approximates the visual cortex as a color-biased 3D mesh recognition system. In recent video meetings, we can see that the team has been throwing around a lot of ideas on how to model object behaviors (e.g. stapler movement). Their work on compositionality also involves identifying simple deformations, like a distorted logo.
Identifying deformed objects and modeling possible object distortions / behaviors are separate (but related) problems. The vision learning module should be able to identify a closed/open stapler, a normal/distorted logo, and a coiled/extended rope, but I believe modeling should be handled in one or more separate modules, because it's a higher-order reasoning task more than a vision one. There's also the whole question of whether the modeling itself should be emergent or handcrafted.
Let's focus on identification. How could the current version of Monty identify a rope regardless of its shape? We have access to two fingerprints: 3D shape and color.
Suppose you train it on a straight piece of rope. Monty remembers the shape and color. Now, you rearrange the rope into a "Z" shape, i.e. three straight segments with two bends. During inference, since each column only sees part of the object and not the whole thing, the columns aimed at the bends will be confused and have low confidence, but the columns that see the straight parts will choose "rope" as their most likely hypothesis, with high confidence.
Thus, the beauty of the voting process (and the columnar resolution) is that it would enable emergent generalization, even if still rudimentary in this form. That example would translate well to the closed/open stapler too: columns aimed at the hinge would be confused, but columns aimed at the two halves would have high confidence. The approach has limitations (it won't work if you tie the rope into a pretzel shape, for example), but it would be a good test to try with minimal effort.
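To make the voting intuition concrete, here's a toy sketch (hypothetical structures, not Monty's actual API) where the confident straight-segment columns outvote the confused bend columns:

```python
from collections import defaultdict

# Each column reports (hypothesis, confidence) pairs for the patch it sees.
# Columns on the straight segments are confident; columns on the bends
# spread low confidence across several guesses.
column_reports = [
    [("rope", 0.9)],                    # straight segment
    [("rope", 0.85)],                   # straight segment
    [("rope", 0.8)],                    # straight segment
    [("rope", 0.2), ("hook", 0.25)],    # bend: confused
    [("rope", 0.15), ("handle", 0.3)],  # bend: confused
]

def vote(reports):
    """Confidence-weighted vote across columns."""
    tally = defaultdict(float)
    for report in reports:
        for hypothesis, confidence in report:
            tally[hypothesis] += confidence
    return max(tally, key=tally.get)

print(vote(column_reports))  # -> "rope"
```

In a real system the tally would be over sets of fingerprints rather than strings, but the principle is the same: a few confused columns don't sink the consensus.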
But how to push it further?
To truly reach human-level identification capabilities, there's certainly a need for high-definition, dimensionally-reduced visual feature detection. The brain is very nimble at dimensional reduction, and this capability must be replicated for maximum generalization.
One question that looms in my mind is how an SDR could encode dimensionally-reduced info without brittleness. And what form would the reduction pipeline take? Maybe some sort of content-aware scaling algorithm that "unfurls" shapes along detected edges and only preserves salient color transitions? Hmm, probably not biologically plausible.
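On the brittleness half of that question, at least, overlap-based matching degrades gracefully where exact hashes shatter. A quick self-contained demo (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
size, active = 2048, 40

# Original fingerprint: 40 active bits out of 2048
sdr = np.zeros(size, dtype=bool)
on = rng.choice(size, active, replace=False)
sdr[on] = True

# Corrupt it: move a quarter of the active bits elsewhere
noisy = sdr.copy()
noisy[rng.choice(on, active // 4, replace=False)] = False
noisy[rng.choice(np.flatnonzero(~sdr), active // 4, replace=False)] = True

# Overlap survives the corruption; an exact hash of the vector would not
print(np.count_nonzero(sdr & noisy))  # 30 of the 40 original bits still match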
A crude way to perform this reduction might be the Shazam method I described here, applied to a Sobel-filtered image or something along those lines. However, hashes are also brittle by design, and not biologically plausible.
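Roughly, a compressed toy version of the idea (landmark-constellation hashing on Sobel edges), which also demonstrates the brittleness: shift the image by a single pixel and the digest is unrecognizable.

```python
import hashlib
import numpy as np
from scipy.ndimage import sobel

def edge_fingerprint(image, n_peaks=16):
    """Sobel gradient magnitude, then hash the strongest edge
    locations Shazam-style (a 'constellation' of landmarks)."""
    gx, gy = sobel(image, axis=0), sobel(image, axis=1)
    magnitude = np.hypot(gx, gy)
    # Take the n strongest edge pixels as landmarks
    flat = np.argsort(magnitude, axis=None)[-n_peaks:]
    landmarks = np.column_stack(np.unravel_index(flat, image.shape))
    # Exact hash of the landmark constellation: brittle by design --
    # move any landmark by one pixel and the digest changes entirely
    return hashlib.sha256(landmarks.tobytes()).hexdigest()[:16]

img = np.zeros((64, 64))
img[20:40, 20:40] = 1.0                          # a bright square
print(edge_fingerprint(img))
print(edge_fingerprint(np.roll(img, 1, axis=1)))  # shifted copy: different hash
```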
The docs suggest neural networks as a potential pathway for feature extraction, but that's backprop-based and too rigid… There has to be something more flexible. Among other things, I've been going over the work of Bruno Olshausen to gather clues; I'll keep digging. He seems to know a thing or two about that kinda stuff.
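For context, the classic Olshausen & Field result is that learning a dictionary of image-patch features under an L1 sparsity penalty yields V1-like edge detectors, with no deep backprop involved. A bare-bones sketch of that alternating loop (ISTA for the sparse code, a Hebbian-like gradient step for the dictionary; every parameter here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
patch_dim, n_atoms, lam = 64, 128, 0.1   # 8x8 patches, overcomplete dictionary

D = rng.normal(size=(patch_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms

def sparse_code(x, D, steps=50):
    """ISTA: minimize reconstruction error + lam * L1 penalty on the code."""
    L = np.linalg.norm(D, ord=2) ** 2    # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        a -= D.T @ (D @ a - x) / L                            # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft-threshold
    return a

def learn(patches, D, lr=0.01, epochs=10):
    """Alternate sparse coding with a residual-driven dictionary update."""
    for _ in range(epochs):
        for x in patches:
            a = sparse_code(x, D)
            D += lr * np.outer(x - D @ a, a)  # push atoms toward the residual
            D /= np.linalg.norm(D, axis=0)    # keep atoms unit-norm
    return D

patches = rng.normal(size=(200, patch_dim))   # stand-in for image patches
D = learn(patches, D)
```

Random noise obviously won't produce Gabor-like atoms; you'd feed it whitened natural-image patches as in the original papers. But the point stands: sparse feature detectors can emerge from local updates, no backprop through a deep stack required.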
I also stumbled upon Sparsey, an experimental vision AI based on column theory. It doesn't seem to have been pushed beyond the proof-of-concept stage, but there are definitely some worthwhile ideas in there that could be applied more broadly. Its author, Gerard Rinkus, might even be interested in contributing to Monty; he mentions in his papers having discussed these ideas with Hawkins a few times.

Additional thoughts regarding visual range
The higher-order identity of a "rope" is bound to many sensorial fingerprints, but texture most especially, I would think, and shape to a lesser extent. At close visual range, we recognize ropes by their braided fiber patterns and how light interacts with them. Imagine a 2-inch-long piece of 1-inch-diameter hemp rope; its shape and color are very similar to a wine cork's, but you can still recognize that it is a piece of rope via its texture.
But now, imagine that a skilled artist took an actual wine cork and meticulously painted it to look like a piece of rope. The neocortex might initially predict that it's a piece of rope, but then you touch it and realize it's actually a rigid object that doesn't feel like a rope, and you're confused. You look closer, noticing subtle cues like cork pores and paintbrush strokes, and that it weighs and feels like a cork between your fingers.
The brain then resettles the object's associative memory trace to a set of fingerprints like "wine cork, paint layer, rope texture, located at XYZ art gallery, year 20XX". You've never seen anything like this before, resulting in high saliency and causing those fingerprints to be recorded as a unique oddity, distinct from all other corks or ropes you might remember.
At long visual range, imagine a car tire hanging from a tree branch. You can see that the tire is connected to the branch via a thin line, but you might not be able to make out the texture because you're too far away, or even the color, if the sun is setting behind the tree, for example. In this scenario, the brain predicts that it is likely a rope. But maybe it's not a rope; it could theoretically be a fixed metal pole. The prefrontal cortex will most likely not even consider "metal pole" as a viable prediction, because it was probably never exposed to such a possibility.
If the tree is just a silhouette on the edge of your peripheral vision while you're talking to someone in front of you, the visual cortex will quickly assess "tree with tire swing" based on the overall shape, and not pay any more attention to it, because it's not salient and it's just a background fixture.
So, this whole example highlights how object identification isn't strictly a visual process; it also involves varying, context-dependent degrees of reasoning. New kinds of modules would be needed to properly address that can of worms.
Regarding liquids: a one-year-old kid can recognize a surprising number of objects and animals, but you'd better have a mop nearby if any unsealed container is within arm's reach.
P.S.: If anything I said seems confusing or deviates from TBT, please let me know!