About reference frames

Following a discussion on Twitter about this nice blog post, “Mapping Reality: From Ancient Navigation to AI’s Spatial Innovation”

I’m a bit confused about what you mean by reference frame in the TBT. Are those reference frames at the cortical-column level or are they global? Are these reference frames classic continuous 3D coordinate systems? Do they have an explicit origin and fixed axes? Do they have to be 3D? Are they learned or imposed?

From my understanding, it seems that your reference frames are classic 3D coordinate systems with an origin, and you use that term to refer to both:

  • The object-centric reference frames that are specific to each object modeled by a cortical column. In that sense, there are as many object-centric reference frames as objects a cortical column models.

  • A common reference frame (either ego/body-centric or allocentric) that is used to share information between cortical columns in a “shared language”. This one is explicitly enforced, not learned by the system.

Each column uses internal object-centric reference frames to infer an object and track pose, then translates its current object identity and pose into a common reference frame when exchanging information with other columns.

Is it correct?


If I map the term reference frame onto my ideas (see my previous post), I also differentiate two levels (cortical column vs cortical area) but in a different manner:

Cortical column level

The low‑dimensional latent space of a cortical column is the column’s reference frame. Only one per column; it’s the most important thing a column learns, acting as a knowledge scaffold. This latent space holds stable representations that support:

  • Lateral coupling with neighbors (bilateral voting in your terminology)
  • Path integration inside the column (update position in the latent from movement info)
  • Reverse path integration (which movement to produce a desired change in latent)
  • “Sensory” inputs for other columns.

Rather than a classic continuous 3D coordinate system, this reference frame is a bounded, finite‑resolution map (my bet: it can be modelled as a 2D map with ~100 nodes/minicolumns but this is a different topic). This latent frame is learned, not enforced. Example: for a grid‑cell module (akin to a cortical column in my view), the reference frame would be its finite set of phases at a given scale/orientation - useful, but insufficient alone to represent continuous 2D space.
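To make the latent-map idea concrete, here is a toy sketch under my own assumptions: a single column modeled as a ~100-node (10×10) toroidal 2D map, like one grid-cell module with a given scale and orientation. The class and method names are hypothetical illustrations, not from any existing codebase.

```python
import numpy as np

class LatentMap2D:
    """Toy model of a column's latent space: a bounded, finite-resolution
    2D map with wrap-around, standing in for one grid-cell-like module."""

    def __init__(self, size=10, scale=1.0, orientation=0.0):
        self.size = size            # nodes per axis (~100 total)
        self.scale = scale          # spatial period of the module
        c, s = np.cos(orientation), np.sin(orientation)
        self.rot = np.array([[c, -s], [s, c]])  # module orientation
        self.phase = np.zeros(2)    # current position in the latent map

    def path_integrate(self, movement_xy):
        """Forward path integration: movement info -> latent update."""
        delta = self.rot @ np.asarray(movement_xy, dtype=float) / self.scale
        self.phase = (self.phase + delta * self.size) % self.size

    def reverse_path_integrate(self, target_phase):
        """Reverse path integration: desired latent change -> movement,
        taking the shortest step on the torus."""
        diff = (np.asarray(target_phase, dtype=float) - self.phase
                + self.size / 2) % self.size - self.size / 2
        return self.rot.T @ (diff / self.size * self.scale)

m = LatentMap2D(scale=0.5)
m.path_integrate([0.1, 0.0])                 # move 0.1 along x
move = m.reverse_path_integrate([0.0, 0.0])  # movement back toward origin
```

Note that the phase wraps around: positions one spatial period apart are indistinguishable, which is exactly why a single module is useful but insufficient alone to represent continuous 2D space.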

Cortical area level

At a larger scale, combining neighboring columns’ related latents yields a more consistent, near‑continuous frame for the area (e.g., ego‑retinotopic, ego‑head‑centric, ego‑body‑centric, ego‑right‑arm‑centric, allocentric, …). These frames are learned as well. Continuing the grid‑cell example: multiple grid‑cell modules in the entorhinal cortex with different scales and orientations together provide an allocentric 2D coordinate system, but still not a classical one (no explicit origin; overlapping variables rather than two fixed axes). Importantly, these reference frames are not always spatial, depending on the cortical area (e.g. objectness vs animacy, rounded vs sharp shapes in some parts of IT cortex).
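A minimal 1D sketch of the "multiple modules disambiguate" point, with assumed module periods (0.3, 0.5, 0.7): each module alone reports only position modulo its own scale, but combined they pin down position over a much larger range. All numbers are illustrative, and the brute-force decoder is just for demonstration.

```python
import numpy as np

# Hypothetical spatial periods of three grid-cell-like modules.
scales = np.array([0.3, 0.5, 0.7])

def encode(x):
    """Each module reports only the phase of position x at its scale."""
    return x % scales

def decode(phases, x_max=5.0, step=0.001):
    """Brute-force search for the position whose phases best match.
    Unique within [0, x_max) because the combined period is much larger."""
    candidates = np.arange(0.0, x_max, step)
    errs = [np.sum((encode(c) - phases) ** 2) for c in candidates]
    return candidates[int(np.argmin(errs))]

x = 2.345
x_hat = decode(encode(x))   # recovered position from phases alone
```

No single module could recover x (each only sees it modulo 0.3, 0.5, or 0.7), yet the population code is unambiguous, with no explicit origin needed beyond an arbitrary shared zero-phase.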

One fundamental difference with TBT is that I limit bilateral voting to neighboring cortical columns (consistent with evidence that monosynaptic reciprocal L2/3 connections are predominantly intra‑areal). Because neighboring columns learn similar latent spaces, they can communicate effectively without a global common reference frame.

I’m curious if you consider the global voting scale (supported by a common reference frame used in the CMP) as a strong hypothesis, or a working assumption you might revise in TBT?

4 Likes

I’m no expert (although I am the author of the Medium post), just learning my way through the finer details. I think you’re right that reference frames operate at multiple levels in TBT. There are object-centric reference frames specific to each object a column models, and there’s also a more common reference frame for inter-column communication. As you point out, columns use internal object-centric frames for inference and pose tracking, then translate to a common frame when sharing information.

As far as I understand, reference frames are not classic continuous 3D coordinate systems with explicit origins and fixed axes; since they are based on grid cells, they don’t have an origin. Also, TBT suggests both local and long-range voting, so voting can occur between modules that might classically be at different hierarchical levels. Your distinction between spatial and non-spatial reference frames sounds spot on to me. The location-based framework can be applied to concepts and high-level thought in the same way it can to physical objects (one of my next blog posts). And I think your idea of limiting bilateral voting to neighboring columns with similar latent spaces is very interesting!

Perhaps someone else with a deeper understanding of reference frames can chime in with better information for you.

4 Likes

Scott here, researcher at TBP. Nice writeup, and I think you’ve got the right idea.

  • Reference frames, as implemented in Monty right now, are classical 3D Euclidean spaces, but this doesn’t have to be true in general. Reference frames could be 2D, and in principle, they could be just about any navigable space, like a tree or graph with nodes that can be navigated to through movement. But R^n, and 3D in particular, is the most well-understood and fleshed-out system we work with right now.

  • Origins in these reference frames are somewhat arbitrary, but it is important for voting that an object’s reference frame for column A has the same origin and orientation as the object’s reference frame in column B. This is basically guaranteed by having both object models learned at the same time.

  • We have discussed whether there might be shortcuts for neighboring columns in places like V1, where RF offsets should be consistent and would shake out automatically from retinotopic organization. However, we work mostly in the “hard” regime where we can’t rely on this fact. For example, two fingertips on different hands need a full transformation that maps between sensors that move independently and have no fixed displacement between them.
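The two-fingertip case in the last bullet can be illustrated with standard homogeneous rigid-body transforms: relating a feature seen by sensor A to sensor B's frame means composing full poses through a shared body frame, since there is no fixed displacement to exploit. This is a generic sketch, not Monty code; the pose values and helper names are made up.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform: sensor frame -> body frame."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def Rz(a):
    """Rotation about the z-axis by angle a (radians)."""
    return np.array([[np.cos(a), -np.sin(a), 0],
                     [np.sin(a),  np.cos(a), 0],
                     [0, 0, 1]])

# Hypothetical current poses of two independently moving fingertips.
T_body_from_A = make_pose(Rz(np.pi / 2), [0.1, 0.0, 0.0])
T_body_from_B = make_pose(Rz(0.0),       [0.0, 0.2, 0.0])

# Map a point observed in A's frame into B's frame via the body frame.
T_B_from_A = np.linalg.inv(T_body_from_B) @ T_body_from_A
p_A = np.array([0.05, 0.0, 0.0, 1.0])   # homogeneous point in A's frame
p_B = T_B_from_A @ p_A
```

Because both poses change whenever either hand moves, the composed transform must be recomputed continuously, which is what makes this the "hard" regime rather than a fixed retinotopic offset.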

While I’m personally pretty interested in what kinds of shortcuts or simplifications we might get from neighboring columns in a retinotopic/tonotopic/etc. array, it’s more of a special case we haven’t really delved into much. Since we do have to solve the issue of voting between different modalities and sensors with non-fixed displacements, I’d say common reference frames are a strong hypothesis.

Note also that we’ve been discussing voting in our latest research meetings, and those will be uploaded shortly. There are also some new docs regarding voting and reference frames which will also be merged and available soon.

Cheers,

Scott

5 Likes

Hi @mthiboust Just a quick update that the documentation on reference frame transformations in Monty is now live 🙂 Here is the link: Reference Frame Transformations

I tried to add a couple of examples as well as references to where these transforms happen in the code. I hope this helps!

5 Likes

Thanks @sknudstrup and @vclay. I’m impressed by the quality of your documentation, I’ll take a closer look at it.

4 Likes