Hi @Alex and @Jacques
I love the discussion and in-depth thoughts here! Sorry I didn't get around to replying earlier. I'll try to address all your questions, but if I skip some parts, it is because we haven't fully figured out all of those points you touch on either and are actively brainstorming on those topics.
> Perhaps we easily recognise a cup-shaped object because of the fundamental significance of cup-shaped objects being able to hold water and food.
I agree. We don't just categorize objects by their shape but also by how we interact with them and how they behave. For example, think about all the crazy shapes chairs can take, yet we can tell for all of them that they are places we can sit down on. I don't think we would have one unified model for all possible chairs or coffee cups; instead, if there is a lot of variety, we would have multiple models that cover that variety, and at a higher level in the hierarchy, they can be interpreted in similar ways. We may also use short mental simulations for some of those categorizations (like thinking about what would happen if you poured water into a novel-shaped object, or how you would sit down on a chair with a shape you have never encountered before).
> If we land on an alien planet and none of the objects are familiar all of our attempts at classification would fail and our attempts to recognise objects would be in disarray. We would likely require a period of learning what things do before we could begin to recognise them.
I don't think we would be completely unable to learn about the alien planet right away without knowing how things work; we could already start recognizing the shapes of new objects. But I agree that to act intelligently in this new world, one would also have to test hypotheses and interact with objects (like children do in our world) to figure out how things work and to learn object behaviors and similarities beyond shape.
> Another aspect of this is the difference between recognising a cup versus recognising my cup. Are there different processes at work when categorising an object that may or may not have been seen before and a familiar object or personal possession, or is it just a matter of adding finer detail to the recognition process?
The idea is that you don't just have one model of a mug in your brain but many models. Some columns might model the generic shape of a mug, while others model your favorite mug with its specific print and the dent in the corner. They can be associatively connected: when you see your favorite mug, you also recognize the generic mug model, since the input is consistent with that as well.
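To make the idea concrete, here is a minimal toy sketch (not Monty's actual API; all names and the feature-set representation are invented for illustration) of several models being consistent with the same observation, with an associative link so that recognizing the specific mug also activates the generic one:

```python
# Toy "models" keyed by name; each lists the features it predicts.
models = {
    "generic_mug": {"handle", "cylinder", "open_top"},
    "my_favorite_mug": {"handle", "cylinder", "open_top", "cat_print", "dent"},
}

# Associative links: recognizing the specific model also evokes the generic one.
associations = {"my_favorite_mug": ["generic_mug"]}

def recognize(observed_features):
    """Return all models whose predicted features are present, plus associated ones."""
    active = {name for name, feats in models.items() if feats <= observed_features}
    for name in list(active):
        active.update(associations.get(name, []))
    return active

# Seeing your favorite mug activates both the specific and the generic model:
print(recognize({"handle", "cylinder", "open_top", "cat_print", "dent"}))
```

Of course, real models in the brain (and in Monty) are structured, sensorimotor models rather than flat feature sets; this only illustrates the "many models, associatively connected" point.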
On the number of hierarchical levels:
I should clarify that we don't claim the human brain has only two levels of hierarchy; neuroanatomy clearly indicates that there are more than two. However, many mammals have a much shallower hierarchy (around 2-3 levels), and even the hierarchy in the human neocortex has far fewer levels than, for instance, most DNNs (a bit more discussion on this topic here: What sets Monty apart? - #5 by vclay ). Two levels are more of a minimal requirement we talk about, and that might be supported by attentional constraints (it seems like we never pay attention to deeper nesting than a parent-child relationship). But, as I think you are getting at as well, it may require more levels to learn models of more complex/abstract things that are not necessarily rooted in direct sensory input but are instead compositions of other objects recognized at lower levels, and hence contain more condensed and processed information.
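The composition idea above can be sketched in a few lines. This is a deliberately simplified toy (invented names, flat feature sets, nothing like Monty's actual learning modules): a lower level recognizes parts from raw features, and a higher level treats those recognized parts as its input and recognizes compositions of them:

```python
# Lower-level models map raw sensory features to parts.
low_level = {
    "wheel": {"round", "rubber"},
    "frame": {"tubes", "metal"},
}

# A higher-level model is a composition of lower-level outputs,
# not of raw sensory features.
high_level = {
    "bicycle": {"wheel", "frame"},
}

def recognize(level_models, inputs):
    """Return every model at this level whose required inputs are all present."""
    return {name for name, required in level_models.items() if required <= inputs}

sensed = {"round", "rubber", "tubes", "metal"}
parts = recognize(low_level, sensed)      # parts recognized from raw features
objects = recognize(high_level, parts)    # compositions recognized from parts
print(parts, objects)
```

The point is only that each added level works on more condensed, already-processed information; deeper hierarchies would stack more such steps.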
In terms of complex behaviors like skiing, or even walking, a lot of this would likely be shifted to subcortical regions like the cerebellum once we get good at them. A complex, potentially nested sequence of actions initially requires a lot of attention and conscious thought (recruiting models in the neocortex), but as it gets more practiced, you don't have to think about it anymore, and it can work more like a model-free policy.
How one object becomes two:
This is one of those topics we are still actively discussing and don't have a conclusion on yet. One idea is that an attentional mask could initially be applied (like masking the bottom of the stapler as the top moves) to continue making correct predictions as part of the object starts moving for the first time. This mask could be based on model-free information, such as areas where movement is detected. Over time, the masked area might then be learned as a separate model.
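A rough sketch of the masking step, under strong simplifying assumptions (2D arrays standing in for observations, an invented motion threshold, nothing resembling Monty's actual prediction machinery): movement detection provides a model-free mask, and prediction error is only evaluated outside the masked region, so the existing single-object model keeps "working" on the static part while the moving part is excluded:

```python
import numpy as np

def movement_mask(frame_prev, frame_curr, threshold=0.1):
    """Model-free motion signal: mark locations where change exceeds a threshold."""
    return np.abs(frame_curr - frame_prev) > threshold

def masked_prediction_error(predicted, observed, mask):
    """Score prediction error only where no movement was detected."""
    static = ~mask
    return float(np.abs(predicted[static] - observed[static]).mean())

prev = np.zeros((4, 4))
curr = np.zeros((4, 4))
curr[0, :] = 1.0              # the "top of the stapler" starts moving
predicted = np.zeros((4, 4))  # the single-object model still predicts everything static

mask = movement_mask(prev, curr)
print(masked_prediction_error(predicted, curr, mask))  # 0.0: static part still fits the model
```

The separate step of then learning the masked region as its own model is exactly the part that is still open, so it is not sketched here.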
On learning general behavior models & models of how the agent moves:
I hope I understand everything you are saying correctly, but I think I agree with all those points. We are definitely thinking of the system as being capable of learning general behavior models (like a hinge or a ball-and-socket joint) and applying them to different object morphologies (like a laptop, door, stapler, or arm). I also agree that we would learn behavior and morphology models of our own body and how it can move.
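To illustrate the separation between a behavior model and the morphologies it attaches to, here is a minimal sketch (class and field names are invented; this is not how Monty represents behaviors): one generic 1-DOF hinge behavior, parameterized per object and reused across very different shapes:

```python
import math
from dataclasses import dataclass

@dataclass
class Hinge:
    """A generic one-degree-of-freedom rotational behavior, independent of shape."""
    min_angle: float
    max_angle: float

    def move(self, angle, delta):
        """Rotate by delta, clamped to the joint's limits."""
        return max(self.min_angle, min(self.max_angle, angle + delta))

# The same behavior model attached to very different morphologies:
laptop_lid = Hinge(0.0, math.radians(135))
door = Hinge(0.0, math.radians(90))
stapler = Hinge(0.0, math.radians(45))

print(door.move(0.0, math.radians(120)))  # clamped at the door's 90-degree limit
```

The morphology model (laptop, door, stapler) would then only need to reference which behavior applies and where, rather than re-learning "hinge-ness" for every object.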
I hope this all makes sense. Let me know if I missed any critical points or questions!