2023/01 - A Comprehensive Overview of Monty and the Evidence-Based Learning Module

@vclay presents a comprehensive overview of how Monty currently works and goes into depth on how the EvidenceLM works in particular. She also explains how voting is implemented using the EvidenceLM.


Thanks again for sharing this presentation. The drawings explain your algorithm really well, and your ideas about reference frames are very inspiring. I have a couple of questions that I hope will fuel discussion.

1/ Are you trying to model a V1 cortical column with your artificial visual LM that processes small patches?

I am surprised you use RGBD data with depth as input. Taking depth into account gives this artificial visual LM a superpower compared to its biological counterpart! I guess that the depth information is key to extracting the 3D curvature values in the patch, and more generally to getting a point-cloud model of the object. Are you hypothesizing that biological V1 columns get depth information from somewhere (top-down inputs?), and that you use it here directly as a bottom-up input in your first experiments as a temporary shortcut before replacing it in the future? If not, are you hypothesizing that biological V1 columns are able to infer depth on their own (by mixing inputs from the two eyes, if you consider that a single LM spans two ocular dominance columns)?

I think these points are relevant for assessing the level of 3D awareness a single canonical LM (vs. a network of multiple LMs) can handle. Maybe we don’t need that much 3D computation inside the canonical LM algorithm?

That being said, I agree we could and should directly leverage this depth information in the primary visual LMs of a future artificial brain built on your principles (just as for UV and IR vision). My question is about the minimum requirements for the canonical LM and its 3D capabilities.

2/ Do you expect a single visual LM to model rotation and scale-invariant objects?

You said that your graph-based LMs are not scale-invariant. Is this something you would like to include inside each visual LM, or will you rely on hierarchy to handle this invariance?

Same question for rotation invariance. Even if your graph-based LMs are already rotation-invariant, the current explicit testing of different rotations to recognize an object (if I understand correctly) does not sound biologically plausible. Is this something you would like to improve at the LM level to make it biologically plausible (and maybe more computationally efficient)? Or is it an option to rely on hierarchy to handle this invariance?

3/ LM memory representation

It seems that the memory of each LM uses structured data representations to store the graph of objects. Have you found a way to flatten those representations into an SDR from which the graph of objects required by the LM algorithm can be reconstructed? Ideally, do you see the LM memory as a collection of SDRs that could be directly shared with other LMs to vote on object IDs? If so, is there a maximum number of SDRs each LM memory can store?

Related to this, how would you represent a continuum of similar objects?

4/ Do LMs share all the local evidence of each potential object instead of only object IDs when they vote? What is transmitted between LMs during voting?

In the presentation, I get the feeling that the voting process communicates all the detailed evidence for every potential object (plus object pose, so that receiving LMs can reconstruct this data in their own reference frame). This is a lot of information. I was expecting to see voting directly on object IDs and global pose, but I may have missed something here. On the other hand, voting directly on object IDs would imply an additional process to map object IDs between LMs. This is something you don’t need to do with your current approach, right?


Hi @mthiboust,
Thanks for all the detailed questions, those are amazing! I could talk about this stuff all day, but I’ll try to keep it brief:

1. Are you trying to model a V1 cortical column with your artificial visual LM that processes small patches?

Good questions! One of the requirements for any sensor that connects to Monty is that it can infer its pose (location and orientation) in space. The pose is extracted by the sensor module and then sent to the learning module using the CMP. For many sensors, like touch, lidar, echolocation, etc., figuring out the location of the sensor patch in space is pretty straightforward. But for vision it is not that easy, and it requires depth perception. We know that the brain uses many different cues (besides stereopsis) to infer depth from the inputs that hit the retina. Where exactly this happens is unclear. One interesting thing about columns in primate V1 is that they have some extra sublayers in L4. One might speculate that these could be used to extract depth information. But this is just speculation.
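To make the pose requirement concrete, here is a minimal sketch of the kind of message a sensor module might pass along. The class and field names are my own illustration, not Monty’s actual CMP interface; the only assumption is that every observation arrives tagged with a location and orientation, which for vision is what depth makes recoverable.

```python
# Hypothetical sketch of a sensor-module output: a feature observation tagged
# with the sensor patch's pose (location + orientation) in a common coordinate
# system. Field names are illustrative, not Monty's actual API.
from dataclasses import dataclass, field
import numpy as np


@dataclass
class SensedState:
    location: np.ndarray          # 3D location of the sensed patch
    orientation: np.ndarray       # e.g. surface normal + principal curvature directions
    features: dict = field(default_factory=dict)


# Depth is what makes the 3D location recoverable from a camera patch:
# without it, only the 2D image position and viewing direction are known.
obs = SensedState(
    location=np.array([0.02, 0.11, 0.43]),   # meters, derived from RGBD depth
    orientation=np.eye(3),                    # point normal / tangent frame
    features={"rgba": (120, 80, 60, 255), "log_curvature": -1.2},
)
```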

2. Do you expect a single visual LM to model rotation and scale-invariant objects?

Yes, we expect any LM to be able to recognize objects invariant of rotation, location, and scale. We definitely don’t want to have to rely on hierarchy for this (although LMs at different levels of the hierarchy may have different spatial resolutions and limits on the size of objects they can model). The current testing of rotation hypotheses could be accomplished by the L6b feedback projection to the thalamus communicating the rotation hypothesis. The thalamus would use this to rotate the incoming sensory information into the LM’s object reference frame, and the LM can then directly check whether it is consistent with its model of the object. For scale, we think the brain may be using theta frequencies in the grid cell mechanism (we have a whole meeting recording on this; I’ll see about pulling it forward in the release schedule), but we are not sure yet how to translate this to Monty. If you have any ideas on how to achieve scale invariance in Monty, I’d love to talk more about that.
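As a rough illustration of what testing rotation hypotheses can look like computationally, here is a hedged sketch (not Monty’s implementation) that rotates a single observation into many candidate object frames at once and scores each hypothesis by how close the prediction lands to a stored model. The point-cloud stand-in, the number of hypotheses, and the distance-based evidence are all assumptions for the example.

```python
# Minimal sketch of parallel rotation-hypothesis testing: each hypothesis maps
# the sensed location into the candidate object's reference frame and checks
# whether a learned point of the model lies nearby.
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

model_points = np.random.rand(500, 3)          # stand-in for a learned object graph
model_tree = cKDTree(model_points)

hypotheses = Rotation.random(64)               # 64 candidate object rotations
sensed_location = np.array([0.2, 0.5, 0.1])    # observed location relative to a reference point

# Rotate the observation into every hypothesized object frame at once.
candidate_locations = hypotheses.inv().apply(sensed_location)
distances, _ = model_tree.query(candidate_locations)

# Hypotheses whose prediction lands close to the stored model gain evidence.
evidence = -distances
best_rotation = hypotheses[np.argmax(evidence)]
```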

3. LM memory representation

No, we don’t think that graphs should be encoded/compressed as SDRs. We want to have explicit reference frame representations that can be used to path integrate through. We would easily lose this property if we compressed entire graphs into SDRs. In the brain we hypothesize that there are grid cells (or a grid-cell-like mechanism) in L6. These can be used to produce unique location SDRs for each location on each object, which can then be associated with features in L4.
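A toy example of why keeping explicit locations matters: phases in a few grid-like modules can be path-integrated by simply adding a movement vector, while still yielding a unique code per location that could be associated with features. The module scales and the 1D setting are assumptions purely for illustration, not Monty’s code.

```python
# Toy illustration of path integration over explicit location codes.
import numpy as np

module_scales = np.array([0.3, 0.42, 0.58, 0.81])   # spatial period per module (arbitrary units)


def phases(location_1d, scales=module_scales):
    """Location -> per-module phase in [0, 1)."""
    return (location_1d / scales) % 1.0


def path_integrate(current_phases, movement_1d, scales=module_scales):
    """Update phases directly from a movement -- no need to decode the location."""
    return (current_phases + movement_1d / scales) % 1.0


loc = 1.7
p = phases(loc)
p_moved = path_integrate(p, movement_1d=0.25)
assert np.allclose(p_moved, phases(loc + 0.25))      # the phase code tracks the true location
```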

Another important point is that LMs never communicate their models to other LMs. They only communicate the model ID (SDR in layer 2/3). Models might be very modality specific whereas the model ID SDR can be associated with any other arbitrary SDR in other LMs from other modalities.

For representing a continuum of similar objects, one could use the accumulated evidence to determine object similarity (similar objects will have similar amounts of evidence given a sequence of observations). We implemented this approach here: https://thousandbrainsproject.readme.io/docs/evidence-as-a-similarity-measure But we are not sure whether brains can actually robustly encode similarities using SDRs or whether we need to come up with a different approach.
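To illustrate the idea behind that link, here is a small hedged sketch: after several observation sequences, each known object has an accumulated evidence score, and objects whose evidence profiles co-vary can be treated as similar. The object names and numbers below are made up for the example.

```python
# Evidence-as-similarity sketch: correlate per-object evidence across episodes.
import numpy as np

object_ids = ["mug", "cup", "bowl", "fork"]
# Rows: observation sequences (episodes); columns: final evidence per object.
evidence = np.array([
    [9.1, 8.7, 4.2, 0.3],
    [7.8, 8.2, 3.9, 0.5],
    [2.1, 2.4, 8.8, 0.2],
])

similarity = np.corrcoef(evidence.T)    # objects that accumulate evidence together correlate
print(dict(zip(object_ids, similarity[0].round(2))))   # similarity of each object to "mug"
```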

4. Do LMs share all the local evidence of each potential object instead of only object IDs when they vote? What is transmitted between LMs during voting?

Voting currently happens on object ID and pose. We could vote only on object ID, but it is a lot more informative to also vote on where on the object we are and how the object is rotated. We do this because object and pose recognition are one and the same process in Monty: there is no object detection without pose detection. So when we vote, each LM communicates its possible objects and poses and then, like you said, these are translated into the receiving LM’s reference frame, since that LM is likely sensing a different part of the object (pose detection is not just about rotation but also location). You are right that, technically, we would also need to learn object ID associations. Right now we simply assume that all LMs have associated the same object ID with the same object, but we don’t want to keep relying on this assumption in the future. This is something we plan to add to our implementation.
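Here is an illustrative sketch of such a vote; the field names and the transform are assumptions for the example (and it assumes shared object IDs, as described above), but it shows why pose has to travel with the object ID: the receiver needs the hypothesized rotation to translate the sender’s on-object location into a prediction for its own patch.

```python
# Illustrative vote message and reference-frame translation, not Monty's API.
from dataclasses import dataclass
import numpy as np
from scipy.spatial.transform import Rotation


@dataclass
class Vote:
    object_id: str
    rotation: Rotation        # hypothesized object rotation in a shared body frame
    location: np.ndarray      # hypothesized location of the sender's patch on the object
    evidence: float


def receive_vote(vote: Vote, sender_to_receiver_displacement: np.ndarray) -> Vote:
    """Translate the sender's on-object location into the receiver's expected location."""
    # The displacement between patches is known in the body frame; map it into
    # the object's reference frame using the hypothesized rotation.
    my_location = vote.location + vote.rotation.inv().apply(sender_to_receiver_displacement)
    return Vote(vote.object_id, vote.rotation, my_location, vote.evidence)
```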

I hope this helps! Let me know if you have more questions :slight_smile:


Many thanks Viviane for your thorough reply (especially knowing that you all already had a crazy week)! This is very valuable information for me. I need to take some time to digest it a bit before answering, but I will definitely do so later.


Thanks for the clarification, I understand your reasoning better now. It seems that motion parallax would be a good candidate for computing the depth information inside a visual LM without relying on other LMs. I guess this capability will come later with the integration of temporal dynamics into your algorithm, but I understand that you can already use a temporary shortcut by directly feeding in the depth information as a first step.

As for your speculation about where this computation happens in the brain, I would not target the extra sublayers in V1 L4 if we want to extend this capability to mice, which lack this layer subdivision but still have many depth-selective neurons in V1 L2/3 (a recent reference that you probably already know about: "A depth map of visual space in the primary visual cortex"). Said differently, the depth information could be a product of the canonical LM algorithm itself. That being said, maybe the extra sublayers in L4 enhance this process, but it would be nicer if they were not strictly required.

Personally, I feel that recognizing complex objects (like the shape of a coffee cup) in a rotation- and scale-invariant way at the level of a single LM is a strong bet. On my side, I have no clue how a cortical column could achieve this by itself (in fact, this is the main reason why I am so keen on relying on interactions among "less capable" LMs to recognize rotation- and scale-invariant cup-like objects). Still, I am curious to see where this path leads.

You mention some hypotheses about how this could be implemented in the brain. I currently have different speculations for the L6b modulatory feedback and for a grid-cell-like phase coding mechanism in the neocortex:

  • Thalamo-cortical projections to L4 convey information about upcoming motor commands (either explicit, like "contract a given neck muscle", or implicit, when encoded as desired goals / expected outcomes like "turn the head 30° left"; the thalamus gets those signals from other cortical areas via their L5 PT projections, from the cerebellum, or from other subcortical motor nuclei like the superior colliculus), and also sensory stimuli in the case of first-order thalamic nuclei. The L6 cortico-thalamic feedback of a cortical column dynamically adapts the gain of those motor-command-related thalamo-cortical projections in order to keep its reference frame in sync with the upcoming changes. When the reference frame is in sync with the upcoming changes (the prediction is accurate), the thalamo-cortical activity is gated; if there is a difference, only the delta is transmitted to the cortex.

  • The grid cell phase precession phenomenon (where grid cells fire at progressively earlier phases of the local theta rhythm as an animal moves through a grid cell's spatial field) could be at play in neocortical columns as well: imagine an LM sequentially outputting 4 object IDs in each cycle (4 gamma periods inside an alpha period for the cortex, instead of 6-7 gamma periods inside a theta period for the medial entorhinal cortex, where grid cells are). Those 4 sequential object IDs could represent dynamic trajectories of the past, present, and future of the matched object. There is already some evidence of this in the PFC, but I haven't found such evidence for other cortical areas yet (wrong speculation, or maybe not yet fully tested by experimentalists?).

I'm not sure I understand what you mean by a grid-cell-like mechanism in L6. For me, the analogy between the mEC and the neocortex is as follows: grid cells (primarily located in L2 of mEC) directly represent object IDs, similar to how neurons in L2/3 of an LM represent object IDs. The "object IDs" of mEC represent allocentric locations, whereas the "object IDs" of other cortical areas represent objects. Maybe I should use the term "concept ID" instead of "object ID" to make this clearer. L2/3 concept IDs related to allocentric, egocentric, and arm-centric locations (computed in the temporal and parietal lobes) are then used by other LMs in their deep layers as cues for their reference frame computations. I know that the mEC is an evolutionarily ancient cortex that differs from the neocortex, but I hypothesize that the main framework is still at play here (if you don't agree, where would you put the limit in this "cortical continuum" from mesocortex to neocortex?). Also, I am keen on seeing an analogy between the potential "phase pinwheels" of grid cells in mEC and the orientation pinwheels in primate V1, but that is very speculative and I am going off-topic here.

Generally speaking, I am very interested in your understanding and vision of how the biological cortex works at a macro & micro level and the biological evidence that supports it. I know it is a lot to ask (obviously!), but I hope that we can have some insightful discussions about it here. Thanks again for doing this research in the open!


Thanks for the follow-up Matthieu, I might not get to all of your points, but just wanted to make sure we responded to some of them.

Re. emergence of 3D representations.

It is possible that a degree of 3D representation emerges in V1 column models. The key point about a Monty LM is that its inputs are defined by an orientation and a location, but the space doesn't necessarily need to be 3D - it could be lower dimensional, which we would indeed expect for certain properties of e.g. sound, or even higher dimensional in a non-biological system.
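As a small illustration of that dimensionality point, the same nearest-neighbor matching against a stored model works unchanged whether locations live in 3D, 1D, or any other space. This is a generic sketch, not Monty's matching code.

```python
# Matching an observation to a stored model is dimension-agnostic.
import numpy as np
from scipy.spatial import cKDTree


def match(model_locations: np.ndarray, observed_location: np.ndarray) -> float:
    """Distance of an observation to the closest stored point, in any dimension."""
    return cKDTree(model_locations).query(observed_location)[0]


match(np.random.rand(100, 3), np.random.rand(3))   # 3D object model
match(np.random.rand(100, 1), np.random.rand(1))   # 1D (e.g. pitch) model
```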

One other point that might be interesting is that there is evidence that the lower layers of L3 (i.e. L3B) receive information from both eyes, and so this may be a particular source of depth information for a low-level column. In particular, these cells receive direct ascending input and also project to L6 (discussed in e.g. Shipp 2007 and Grossberg 2021). Any feed-forward information that is passed to the higher column (L3 -> L4) might be more like a "skip connection", so overall it could be acting something like an "L4+" in terms of providing depth-based features to the parent column.

Re. invariance in a single column:

It’s worth noting that we believe many of these properties would require a degree of serial computation in real neurons, while in Monty we can do much of this in parallel. For example, a particular rotation hypothesis would transform the incoming sensory input, but a column might only be able to test one rotation hypothesis (or a small subset, e.g. with different phases) at a given point in time. This would be consistent with the fact that humans recognize objects in their typical orientation faster than in atypical orientations.

Re. the location of grid-cells/reference frames

There are a variety of reasons, based on the intracortical and intercortical connections, why we believe reference frames would be located in L6. These would be implemented by grid cells or some other neural mechanism for path integration. If you’ve not already come across it, a great place to read about why we believe L6 specifically is the location of reference frames is the "Mapping to Biology" section of Lewis et al., 2019.


Thanks Niels for your reply. The terms "reference frames" and "grid cell" may be ambiguous here, but I guess I am following a different path. I'll write down my ideas and reasoning in a more comprehensive form in the coming days/weeks so that we can mutually learn from our respective views.

In short, I think that each cortical module receives 2 main kinds of inputs (features & context) and learns to perform both:

  • Recognition on stable representations (object IDs in TBP terminology) materialized in superficial layers
  • Path integration on those stable representations via interaction between superficial & deep layers using the context input

I hypothesize that the path integration mechanism at play for grid cells in the entorhinal cortex is the same as the one used in neocortical modules (a coupling between superficial and deep layers in both cases).

In the entorhinal cortex, the stable representations would be encoded by grid cells in the superficial layers, the input features would be sensory cues such as landmarks, the input context would be allocentric head direction and body velocity in the deep layers, and the reference frame would be a toroidal allocentric one.


Thanks Matthieu, yeah that would be great if you manage to write some of it up. At a certain point it becomes difficult to discuss these kinds of things without figures haha, so looking forward to revisiting it then just so we can be 100% clear on where we agree and where our approaches might differ or be able to borrow from one another.
