A chat about vision systems

I’ve been playing for a while with the question of how to map available cameras (etc) into the sorts of patches that Monty prefers. In pursuit of this, I just asked ChatGPT for some feedback on my ideas.

Some of the answers tended to go on for a while, covering a range of background material. I found most of it matched my understanding, but I still found some of the details quite interesting. Along the way, we landed on some resources that may be worth investigating. Anyway, here’s a (small) excerpt…


The TBP folks assert that their model relies on sensorimotor interactions with the environment. They would also like to process small patches of images, rather than chewing on the whole image at once. Meanwhile, most digital (e.g., RGBD, IR) cameras return a set of image planes. The images may or may not be in the desired direction, focal plane, or whatever, but there can certainly be lots of them. I’m wondering about the point (or, really, surface) of transition between these (ahem) world views.

You’re pointing at a real tension between biologically inspired sensorimotor models (like the Thousand Brains Project view) and the engineering conventions of modern cameras. The issue is essentially:

Where does the system transition from “camera frames” to “sensorimotor samples”?

That boundary is not fixed; it can be placed at several different layers. Thinking about it this way clarifies the design space. …

When I look around, I have the impression that I’m seeing a complete image of (at least part of) my surroundings. However, if I really want to “see” something, I use my muscles to point my eyes at the object in question. This is all well and good, but motorized camera mounts are expensive, heavy, and fragile. Fortunately, a camera can produce lots of images every second, and most objects tend to stay in place. So, given infinite storage and processing, one could have lots of candidate images for any direction one chose to “look”. Nu?

Yes — and what you’re noticing touches a deep issue in active vision vs passive sensing. Your idea is actually very close to how several modern perception systems are starting to think about vision.

The architecture … is sometimes called active perception over a visual memory. The key idea is simple but powerful:

Instead of moving sensors to acquire data, you maintain a persistent visual memory and allow the perception system to query it as if it were moving the sensor.

This sits exactly between biological sensorimotor perception and modern camera pipelines.
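To make the “query it as if it were moving the sensor” idea concrete, here is a minimal Python sketch. All names (`VisualMemory`, `look`, the direction-binning scheme) are illustrative, not any real API: the point is just that a stream of frames, indexed by gaze direction, can answer a “look over there” query without any motor at all.

```python
class VisualMemory:
    """Toy visual memory: keeps the latest frame seen in each gaze
    direction, so a perception system can 'look' without moving a sensor.
    All names and the binning scheme are illustrative."""

    def __init__(self, bin_degrees=10):
        self.bin = bin_degrees
        self.frames = {}  # (azimuth_bin, elevation_bin) -> frame payload

    def _key(self, azimuth, elevation):
        # Quantize a gaze direction into a coarse bin.
        return (round(azimuth / self.bin), round(elevation / self.bin))

    def store(self, azimuth, elevation, frame):
        # A camera producing many frames per second keeps this fresh.
        self.frames[self._key(azimuth, elevation)] = frame

    def look(self, azimuth, elevation):
        # A "virtual saccade": query memory as if pointing the eye.
        return self.frames.get(self._key(azimuth, elevation))

mem = VisualMemory()
mem.store(42.0, 3.0, "frame_A")
print(mem.look(40.0, 0.0))  # nearby direction falls in the same bin: frame_A
print(mem.look(90.0, 0.0))  # never looked there: None
```

Given “infinite storage and processing,” `store` would keep history per bin rather than overwriting; the overwrite here is just the simplest policy.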

Are there any relevant MCP resources?

Yes—there are quite a few Model Context Protocol (MCP) resources now, and the ecosystem is growing quickly.

A connection to your earlier discussion

Your visual-memory / perception architecture idea actually maps well to MCP.

For example:

You could expose a perception system through an MCP server:

camera_memory_mcp
tools:
get_patch(x,y,scale)
find_edges()
inspect_surface()

Then an LLM or agent could query the visual buffer like a virtual eye.

So MCP could become a sensorimotor interface layer.
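Here is a sketch of what one of those tools might do behind the server boundary. This is not the real MCP SDK — just a plain dictionary standing in for the server’s tool table, with a toy `get_patch` implementation, to show the shape of a “virtual eye” call:

```python
def get_patch(x, y, scale, image):
    """Return a square patch of side (2*scale + 1) centred at (x, y).
    'image' is a list of rows; a real server would read its frame buffer."""
    return [row[x - scale:x + scale + 1]
            for row in image[y - scale:y + scale + 1]]

# Stand-in for an MCP server's advertised tools (name -> handler).
TOOLS = {
    "get_patch": get_patch,
}

# An agent's tool call then amounts to dispatch by name:
image = [[10 * r + c for c in range(8)] for r in range(8)]
patch = TOOLS["get_patch"](x=3, y=3, scale=1, image=image)
print(patch)  # 3x3 patch around (3, 3)
```

The real protocol adds JSON-RPC framing, schemas, and discovery around this, but the agent-facing contract is essentially the same: named tools with typed arguments.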

There is a small but interesting cluster of MCP projects aimed at robotics, simulation, and sensor access. None of them yet form a mature “robotics standard,” but they strongly suggest how MCP could become a general-purpose sensor/actuator bus for AI agents—which aligns closely with the architecture you’ve been thinking about for Monty and TBP.

The most obvious connection point is ROS [Robot Operating System], since it already provides a publish/subscribe robotics bus. … Several experimental MCP servers expose ROS topics and services.

Example architecture:

LLM agent

│ MCP

ROS MCP server

├── camera topic
├── lidar topic
├── robot pose
└── motor commands
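One way such a bridge could present ROS to an agent is by mechanically turning topics into tool names. The topic names and message types below are standard ROS conventions; the tool-descriptor shape is my own illustration, not any existing server’s schema:

```python
# Topics a ROS MCP server might bridge (real ROS message types;
# the descriptor format itself is hypothetical).
ROS_TOPICS = {
    "/camera/image_raw": {"direction": "subscribe", "type": "sensor_msgs/Image"},
    "/scan":             {"direction": "subscribe", "type": "sensor_msgs/LaserScan"},
    "/odom":             {"direction": "subscribe", "type": "nav_msgs/Odometry"},
    "/cmd_vel":          {"direction": "publish",   "type": "geometry_msgs/Twist"},
}

def topics_as_tools(topics):
    """Derive a read_*/write_* tool name for each topic."""
    tools = {}
    for name, meta in topics.items():
        verb = "read" if meta["direction"] == "subscribe" else "write"
        tools[f"{verb}_{name.strip('/').replace('/', '_')}"] = name
    return tools

print(topics_as_tools(ROS_TOPICS))
# e.g. read_camera_image_raw, read_scan, read_odom, write_cmd_vel
```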

The Thousand Brains Project emphasizes:

(feature, location)

An MCP perception server could expose exactly that:

tool: get_feature_patch()
returns:
feature_vector
location

Multiple cortical-column actors could then query the world independently.
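A toy version of that tool might look like this. The feature extractor here (mean intensity plus a crude horizontal gradient) is a stand-in, not a real Monty feature extractor; only the returned (feature, location) shape is the point:

```python
def get_feature_patch(image, x, y, size=3):
    """Return a (feature_vector, location) observation for a small patch.
    The features are deliberately trivial placeholders."""
    half = size // 2
    patch = [row[x - half:x + half + 1]
             for row in image[y - half:y + half + 1]]
    flat = [v for row in patch for v in row]
    mean = sum(flat) / len(flat)                      # mean intensity
    grad_x = sum(row[-1] - row[0] for row in patch) / size  # crude x-gradient
    return {
        "feature_vector": [mean, grad_x],
        "location": (x, y),
    }

image = [[c for c in range(10)] for _ in range(10)]  # intensity ramps left-to-right
obs = get_feature_patch(image, x=5, y=5)
print(obs)  # {'feature_vector': [5.0, 2.0], 'location': (5, 5)}
```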

Here’s a clean way to picture the “each cortical column is a client querying a shared sensory world” idea—using MCP as the boundary layer.

The core sketch

Instead of columns ingesting frames, they issue “looks” into a shared, queryable world state:

       +------------------------------+
       |   Shared Sensory World       |
       |  (one or more MCP servers)   |
       +------------------------------+
            ^      ^        ^      ^
            |      |        |      |
   MCP tools/resources (patches, poses, events)
            |      |        |      |
+-------------------+  +------------------+  +---
|  Column actor 1   |  | Column actor 2   |  |
| (object modeler)  |  | (object modeler) |  |
+-------------------+  +------------------+  +---
                 \        |        /
                  \       |       /
                +-------------------+
                | message fabric    |
                | (your UDP/Actors) |
                +-------------------+

MCP’s role here is exactly what it advertises: a standard way for an AI host/client to talk to external “servers” that expose tools + context/resources.
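To show the “independent looks into a shared world” part of the diagram, here is a sketch using a thread pool. The shared state is just a dict standing in for the MCP server(s), and all names (`column_actor`, `fixations`) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared world state: location -> observed value (stand-in for MCP servers).
WORLD = {(x, y): 10 * y + x for x in range(8) for y in range(8)}

def column_actor(column_id, fixations):
    # Each column follows its own fixation sequence over the shared world.
    return [(loc, WORLD[loc]) for loc in fixations]

routes = {
    "column_1": [(0, 0), (1, 0), (2, 0)],  # sweeping along the top edge
    "column_2": [(5, 5), (5, 6)],          # probing a different region
}

with ThreadPoolExecutor() as pool:
    futures = {cid: pool.submit(column_actor, cid, fix)
               for cid, fix in routes.items()}
    results = {cid: f.result() for cid, f in futures.items()}

print(results["column_2"])  # [((5, 5), 55), ((5, 6), 65)]
```

The columns never see frames, only the answers to their own queries — which is the whole point of putting the boundary at “looks” rather than images.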

What “Shared Sensory World” actually is

Think multiple MCP servers, each owning a slice of reality:

  1. Visual memory server

Maintains a rolling buffer of camera frames and/or feature maps.
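A rolling buffer like that can be sketched with a bounded deque; the record fields (timestamp, pose, frame) and class name are illustrative:

```python
from collections import deque

class RollingFrameBuffer:
    """Toy rolling buffer: fixed-capacity history of (t, pose, frame)
    records, oldest evicted first."""

    def __init__(self, capacity=100):
        self.frames = deque(maxlen=capacity)

    def push(self, pose, frame, t):
        self.frames.append((t, pose, frame))

    def latest(self):
        return self.frames[-1] if self.frames else None

    def since(self, t0):
        # Frames newer than t0, e.g. for a "what changed?" query.
        return [f for f in self.frames if f[0] >= t0]

buf = RollingFrameBuffer(capacity=3)
for i in range(5):
    buf.push(pose=(0, 0, i), frame=f"frame_{i}", t=float(i))

print(len(buf.frames))                  # 3: capacity bound, oldest evicted
print(buf.latest()[2])                  # frame_4
print([f[2] for f in buf.since(3.0)])   # ['frame_3', 'frame_4']
```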


Have you come across our Monty Meets World demo yet? There we have a fixed camera and then move a small patch over the static image. Here is a bit more info on it: “Using Monty in a Custom Application”. The related part of the code in Monty is the SaccadeOnImageEnvironment. Of course, you could also use multiple patches of the image at once, with voting, to do faster inference.
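The idea behind that demo — a fixed image, a moving patch — is easy to sketch. This is an illustration of the concept, not Monty’s actual SaccadeOnImageEnvironment code:

```python
def saccade_on_image(image, start, steps, patch_size=3):
    """Move a small window over a static image, returning the
    (location, patch) observation after each saccade step."""
    half = patch_size // 2
    x, y = start
    observations = []
    for dx, dy in steps:
        x, y = x + dx, y + dy  # the "saccade": move the patch, not a camera
        patch = [row[x - half:x + half + 1]
                 for row in image[y - half:y + half + 1]]
        observations.append(((x, y), patch))
    return observations

image = [[10 * r + c for c in range(8)] for r in range(8)]
obs = saccade_on_image(image, start=(3, 3), steps=[(1, 0), (0, 1)])
for loc, patch in obs:
    print(loc, patch[1][1])  # centre pixel of each patch: (4, 3) 34, (4, 4) 44
```

Running several of these windows in parallel, each feeding its own column, gives the multi-patch voting variant mentioned above.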