2021/11 - Continued Discussion of the Requirements of Monty Modules

Part two of the video, where Jeff provides definitions of terms: Pose, Body, Sensor, Feature, Module (now called Sensor Module), and Objects.
He explains how voting enables one-shot Object recognition, and discusses how Modules are arranged in a hierarchy.

Additional discussion focuses on open questions around understanding hierarchy, motor behavior, the “stretchy” graph, states, models in “where” columns, and feature discrepancies.

3 Likes

Thanks for sharing. The audio quality is not that great, but together the videos provide good context on the design thought process that went into this, which helps a lot :slight_smile:

Best
Anitha

Thanks Anitha! Yes, the sound quality is hit or miss on some of the videos. The videos do have captions, though, and while the captions were generated by AI, they were manually checked and corrected by a human, so they’re a decent way of increasing clarity.

Very interesting discussion!

“I think we should take this very abstract view of it and say, first of all, determine the location, then take these bits, run them through some sort of algorithm that does classification of some sort, like a spatial pooler or something else, to quantize them, the input, and then perhaps, perhaps even at the first level here, the orientation is unknown.” (at 31’)

Have you tried SOM-like algorithms (Self-Organizing Maps) to classify and quantize/discretize the “features at a pose” data at the level of a single learning module (LM)? It would allow for two desired properties discussed in the video (a small code sketch follows the list below):

  • Continuous (vs discrete) representations: neighboring buckets represent similar things
  • Accurate vs fuzzy representations: if only a few buckets match the inputs well, the output can be considered accurate. Conversely, the output is fuzzy when many buckets each match the inputs a little.
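
For concreteness, here is a minimal sketch of what I have in mind (plain Python/NumPy; names like `TinySOM` are purely illustrative and not anything in Monty): the winner bucket’s 2D grid coordinates give a topographically organized code, and the quantization error acts as the fuzziness proxy.

```python
import numpy as np

class TinySOM:
    """Illustrative self-organizing map: a grid_h x grid_w grid of prototype
    vectors ("buckets") that quantizes incoming feature vectors."""

    def __init__(self, grid_h=8, grid_w=8, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.grid_h, self.grid_w = grid_h, grid_w
        # One prototype ("bucket") per grid cell.
        self.weights = rng.normal(size=(grid_h, grid_w, dim))

    def quantize(self, x):
        """Return the 2D coordinates of the best-matching bucket and a
        fuzziness proxy in [0, 1] (0 = exact match, ~1 = very fuzzy)."""
        dists = np.linalg.norm(self.weights - x, axis=-1)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        # Fuzziness proxy: winner distance relative to the mean distance
        # over all buckets (close to 1 when many buckets match equally well).
        fuzziness = float(dists[i, j] / (dists.mean() + 1e-9))
        return (int(i), int(j)), fuzziness

    def train_step(self, x, lr=0.1, sigma=1.5):
        """One online SOM update: pull the winner and its grid neighbors
        toward the input, so neighboring buckets learn similar features."""
        (i, j), _ = self.quantize(x)
        ys, xs = np.mgrid[0:self.grid_h, 0:self.grid_w]
        grid_dist2 = (ys - i) ** 2 + (xs - j) ** 2
        neighborhood = np.exp(-grid_dist2 / (2 * sigma ** 2))
        self.weights += lr * neighborhood[..., None] * (x - self.weights)

# Usage: train on random stand-ins for "features at a pose", then quantize one.
som = TinySOM()
rng = np.random.default_rng(1)
for _ in range(200):
    som.train_step(rng.normal(size=16))
coords, fuzziness = som.quantize(rng.normal(size=16))
```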

An interesting property of such coding is that each LM could communicate its output to the other LMs in a very lightweight manner. The message could be composed of (a sketch of such a message follows the list):

  • The 2D coordinates of the winner bucket in the map of this LM (represented as two scalars or a single complex number). This bucket would represent the object ID of the given LM in the common reference frame (the reference frame transformation is already done at this stage).
  • The ID of the LM communicating the message (could be coded as a single scalar, or as a 2D coordinate to take into account the topography between LMs)
  • The level of accuracy/fuzziness given by a proxy called the SOM quantization error (a scalar between 0 and 1).
  • An intensity level representing the weight/confidence we should attribute to this message, so that LMs using this message can weigh this information compared to the other messages (a scalar between 0 and 1).
  • Motor/action & pose-related information (???)
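
To make that concrete, here is a hypothetical container for those fields (the names are mine, not Monty’s); how the confidence weight is derived, e.g. from the fuzziness, is left open:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LMMessage:
    """Hypothetical lightweight output message of a single learning module."""
    winner_coords: Tuple[int, int]  # 2D coordinates of the winning bucket (the LM's "object ID")
    lm_id: int                      # which LM is speaking (could also be a 2D coordinate for topography)
    fuzziness: float                # SOM quantization error proxy, in [0, 1]
    confidence: float               # weight receivers should give this message, in [0, 1]
    pose_info: Optional[object] = None  # placeholder for motor/action & pose-related information

# Example: a fairly confident message from LM #42 whose winner bucket is (3, 5).
msg = LMMessage(winner_coords=(3, 5), lm_id=42, fuzziness=0.12, confidence=0.9)
```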

In this sense, the output of an LM would mainly be a 2D coordinate, not a full SDR. Obviously, it would mean that a single LM is only able to represent a few dozen different objects in a continuous manner (representing a coffee cup would then be out of reach for LMs in V1). Representing complex objects would rely on composition between different LMs. Still, we can use the SDR abstraction at another level: each LM could sample only a subset of other LM outputs, and this subset can be seen as an SDR. The subset can evolve a bit through time, but it is mainly learned at the beginning, during the critical periods, and it needs to crystallize at some point to stabilize the learning of downstream LMs. LMs with similar SDRs will represent similar objects within their own perspective/specialization.
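
A rough sketch of that last idea, assuming (my assumption, not something from the video) that each LM fixes its subset of upstream LMs at initialization, during the critical period, and that the subset’s binary membership vector is what plays the role of the SDR:

```python
import numpy as np

def sample_input_subset(n_lms: int, n_sampled: int, seed: int) -> np.ndarray:
    """Choose which upstream LMs this LM listens to; the boolean membership
    vector over all LMs plays the role of a fixed, sparse SDR."""
    rng = np.random.default_rng(seed)
    sdr = np.zeros(n_lms, dtype=bool)
    sdr[rng.choice(n_lms, size=n_sampled, replace=False)] = True
    return sdr

def sdr_overlap(sdr_a: np.ndarray, sdr_b: np.ndarray) -> int:
    """LMs whose subsets overlap heavily sample similar inputs and should
    tend to represent similar objects from their own perspective."""
    return int(np.count_nonzero(sdr_a & sdr_b))

# Two LMs out of a population of 1000, each sampling 40 upstream LMs.
lm_a = sample_input_subset(1000, 40, seed=7)
lm_b = sample_input_subset(1000, 40, seed=8)
print(sdr_overlap(lm_a, lm_b))
```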

Not sure we are on the same wavelength here. I need to think more about this and watch the upcoming videos. Anyway, it is already very refreshing and helpful to see your approach.

1 Like