Following up on a Twitter discussion about this nice blog post, “Mapping Reality: From Ancient Navigation to AI’s Spatial Innovation”:
I’m a bit confused about what you mean by reference frame in the TBT. Are those reference frames at the cortical-column level or are they global? Are these reference frames classic continuous 3D coordinate systems? Do they have an explicit origin and fixed axes? Do they have to be 3D? Are they learned or imposed?
As I understand it, your reference frames are classic 3D coordinate systems with an origin, and you use the term to refer to both:
- The object-centric reference frames that are specific to each object modeled by a cortical column. In that sense, a cortical column has as many object-centric reference frames as it has object models.
- A common reference frame (either ego/body-centric or allocentric) that is used to share information between cortical columns in a “shared language”. This one is explicitly enforced, not learned by the system.
Each column uses internal object-centric reference frames to infer an object and track pose, then translates its current object identity and pose into a common reference frame when exchanging information with other columns.
Is that correct?
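To make my reading concrete, here is a minimal sketch of the translation step as I understand it. It is not taken from the TBT code: the `Pose2D` class, the `compose` helper and all the numbers are hypothetical, and I use 2D poses instead of 3D only to keep it short.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    """Hypothetical 2D pose: position plus heading (radians)."""
    x: float
    y: float
    theta: float


def compose(a_from_b: Pose2D, b_from_c: Pose2D) -> Pose2D:
    """Express frame c in frame a, given c in b and b in a."""
    c, s = math.cos(a_from_b.theta), math.sin(a_from_b.theta)
    return Pose2D(
        a_from_b.x + c * b_from_c.x - s * b_from_c.y,
        a_from_b.y + s * b_from_c.x + c * b_from_c.y,
        a_from_b.theta + b_from_c.theta,
    )


# Assumed values: the column has inferred the object's pose in the common
# (e.g. body-centric) frame, and tracks a location in the object's own frame.
common_from_object = Pose2D(2.0, 1.0, math.pi / 2)
object_from_feature = Pose2D(1.0, 0.0, 0.0)

# What the column would broadcast to other columns in the "shared language".
common_from_feature = compose(common_from_object, object_from_feature)
print(common_from_feature)  # x and y are both approximately 2.0
```

If this matches your picture, then the object-centric frame stays private to the column and only the composed pose is ever exchanged.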
If I map the term reference frame onto my ideas (see my previous post), I also differentiate two levels (cortical column vs cortical area) but in a different manner:
Cortical column level
The low‑dimensional latent space of a cortical column is the column’s reference frame. Only one per column; it’s the most important thing a column learns, acting as a knowledge scaffold. This latent space holds stable representations that support:
- Lateral coupling with neighbors (bilateral voting in your terminology)
- Path integration inside the column (updating the position in the latent space from movement information)
- Reverse path integration (finding which movement produces a desired change in the latent space)
- “Sensory” inputs for other columns.
Rather than a classic continuous 3D coordinate system, this reference frame is a bounded, finite‑resolution map (my bet: it can be modelled as a 2D map with ~100 nodes/minicolumns but this is a different topic). This latent frame is learned, not enforced. Example: for a grid‑cell module (akin to a cortical column in my view), the reference frame would be its finite set of phases at a given scale/orientation - useful, but insufficient alone to represent continuous 2D space.
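Here is a toy sketch of what path integration and reverse path integration could look like on such a bounded, finite‑resolution map. The 10 x 10 size, the wrap‑around (toroidal) borders and the integer movement steps are my simplifying assumptions for illustration, not claims about the cortex or about TBT.

```python
GRID = 10  # assumed 10 x 10 map, i.e. about 100 nodes


def path_integrate(node, move):
    """Update the latent position from a movement; wraps at the borders."""
    return ((node[0] + move[0]) % GRID, (node[1] + move[1]) % GRID)


def reverse_path_integrate(current, target):
    """Return the shortest movement on the wrapped map from current to target."""
    def shortest(delta):
        delta %= GRID
        return delta - GRID if delta > GRID // 2 else delta

    return (shortest(target[0] - current[0]), shortest(target[1] - current[1]))


start = (8, 3)
pos = path_integrate(start, (3, -4))       # lands on node (1, 9)
move = reverse_path_integrate(pos, start)  # movement needed to get back
assert path_integrate(pos, move) == start
print(pos, move)
```

Note that nothing here has a privileged origin: the map only supports relative updates, which is what I mean by a frame that is not a classic coordinate system.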
Cortical area level
At a larger scale, combining neighboring columns’ related latents yields a more consistent, near‑continuous frame for the area (e.g., ego‑retinotopic, ego‑head‑centric, ego‑body‑centric, ego‑right‑arm‑centric, allocentric, …). These frames are learned as well. Continuing the grid‑cell example: multiple grid cell modules in the entorhinal cortex with different scales and orientations together provide an allocentric 2D coordinate system, but still not a classical one (no explicit origin; overlapping variables rather than two fixed axes). Importantly, depending on the cortical area, those reference frames are not always spatial (e.g. objectness vs animacy, rounded vs sharp shapes in some parts of IT cortex).
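A toy 1D version of the grid‑cell argument, to show why one module is ambiguous while several together are not. The integer periods 3, 4 and 5 are an assumption chosen to keep the arithmetic exact; real modules are 2D, with non‑integer scale ratios and different orientations.

```python
PERIODS = (3, 4, 5)  # assumed module scales (hypothetical, integer for simplicity)
RANGE = 60           # 3 * 4 * 5: positions are unique up to this range


def encode(position):
    """Phase of each module at this position."""
    return tuple(position % p for p in PERIODS)


def decode(phases):
    """All positions in range that are consistent with the given phases."""
    return [x for x in range(RANGE) if encode(x) == phases]


true_position = 47
# One module alone: many positions share the same phase.
single_module = [x for x in range(RANGE) if x % PERIODS[0] == true_position % PERIODS[0]]
# All modules together: the overlapping phases pin down a single position.
combined = decode(encode(true_position))
print(len(single_module), combined)  # 20 candidates vs. [47]
```

The combined code is made of overlapping periodic variables rather than fixed axes, which is the sense in which I call the area‑level frame non‑classical.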
One fundamental difference with TBT is that I limit bilateral voting to neighboring cortical columns (consistent with evidence that monosynaptic reciprocal L2/3 connections are predominantly intra‑areal). Because neighboring columns learn similar latent spaces, they can communicate effectively without a global common reference frame.
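A very rough sketch of what neighbor‑limited voting means operationally. Everything here is an illustrative assumption: columns sit on a 1D chain, each holds a single scalar evidence value for one hypothesis, and voting is plain averaging with immediate neighbors.

```python
def vote_round(beliefs):
    """One round of lateral voting: average with immediate neighbors only."""
    updated = []
    for i in range(len(beliefs)):
        group = beliefs[max(0, i - 1): i + 2]
        updated.append(sum(group) / len(group))
    return updated


# Only column 0 starts with evidence for the hypothesis.
beliefs = [1.0, 0.0, 0.0, 0.0, 0.0]

after_one = vote_round(beliefs)
after_four = after_one
for _ in range(3):
    after_four = vote_round(after_four)

# After one round the evidence has only reached column 1;
# after four rounds it has spread along the chain to column 4.
print(after_one)
print(after_four)
```

Evidence still spreads across the area, but only through successive local exchanges between columns whose latents are similar enough to be compared directly.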
I’m curious if you consider the global voting scale (supported by a common reference frame used in the CMP) as a strong hypothesis, or a working assumption you might revise in TBT?