Using MCP as the cognitive layer in a robotics stack

From some postings by members of the TBP team, I gather that the team has no plans to work on robotics issues, leaving that to the folks who have that as their specialty. This makes perfect sense to me, but it means that the project will need ways to leverage existing work in the field. The best way to do this, IMHO, is to adopt popular and/or emerging standards for exchanging control and detected information.

So, I asked ChatGPT to tell me about possibilities, starting with standards such as Model Context Protocol (MCP) and JavaScript Object Notation (JSON). It said:

1. What MCP Actually Is (and Isn’t)

Model Context Protocol (MCP) is not a robotics control bus or a real-time protocol.

Instead, MCP is:

  • A tool- and data-exchange protocol
  • Designed to let models, tools, and environments share structured context
  • Strongly oriented around:
    • JSON (canonical form)
    • Tool invocation
    • Schema-defined inputs/outputs
    • Language neutrality

Think of MCP as operating at the cognitive / coordination layer, not the servo loop.

MCP is closer to “how do intelligent components talk about the world and ask for actions” than “how do motors move”.

That makes it very relevant to modern robotics architectures, just not at the lowest levels.


2. Where MCP Fits in a Robotics Stack

A useful mental model is a four-layer robotics stack:

┌──────────────────────────────────────┐
│ Cognitive / Planning / Reasoning     │  ← MCP lives here
├──────────────────────────────────────┤
│ Behavioral / Task Coordination       │
├──────────────────────────────────────┤
│ Middleware / Messaging               │  (ROS 2, DDS, MQTT)
├──────────────────────────────────────┤
│ Real-Time Control / Fieldbus         │  (CAN, EtherCAT)
└──────────────────────────────────────┘

MCP’s natural role:

  • Context exchange
  • Tool discovery
  • Semantic commands
  • Cross-language orchestration
  • Human ↔ Robot ↔ Model interfaces

3. MCP + JSON in Robotics: Why This Is Appealing

Strengths MCP brings

1. Semantic clarity

Instead of:

{"cmd":42,"x":1.2}

You get:

{
  "tool": "move_to_pose",
  "arguments": {
    "frame": "map",
    "position": [1.2, 0.4, 0.0],
    "tolerance_mm": 5
  }
}

That’s gold for:

  • Multi-agent robotics
  • Debuggability
  • Human-in-the-loop systems
  • Model-based reasoning (Monty-esque systems)

2. Language-neutral, actor-friendly

MCP meshes very cleanly with:

  • Python (AI, perception)
  • Elixir (orchestration, supervision)
  • Rust / Go (performance nodes)
  • Web UIs

Which mirrors your interest in heterogeneous actor systems.


3. Tool abstraction over implementation

A robot exposes capabilities, not drivers:

{
  "name": "scan_surface",
  "input_schema": {
    "resolution_mm": "number",
    "area": "polygon"
  }
}

Whether that’s backed by:

  • ROS 2 nodes
  • A CNC controller
  • A Maslow sled
  • A simulator

…is hidden behind MCP.
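
For concreteness, here's a minimal Elixir sketch of what that abstraction might look like; the backend modules (Ros2Backend, CncBackend, SimBackend) and the :my_robot config key are hypothetical names, not real libraries:

```elixir
defmodule Capabilities do
  @moduledoc """
  Sketch only: one logical capability ("scan_surface") routed to whatever
  backend is configured. Ros2Backend, CncBackend, and SimBackend are
  hypothetical module names.
  """

  # Callers see only the capability name and its (schema-shaped) arguments.
  def call("scan_surface", %{"resolution_mm" => res, "area" => area}) do
    backend().scan_surface(res, area)
  end

  # Backend selection is a configuration detail, invisible to clients:
  # it could be Ros2Backend, CncBackend, or SimBackend.
  defp backend do
    Application.get_env(:my_robot, :scan_backend, SimBackend)
  end
end
```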


4. MCP Is Not a Replacement for ROS / DDS

This is the key architectural insight:

MCP complements robotics middleware — it does not replace it.

What MCP should NOT do:

  • Real-time control loops
  • High-rate sensor streaming
  • Deterministic motion control
  • Safety-critical actuation

JSON alone makes it unsuitable for:

  • 1 kHz servo loops
  • Tight latency budgets
  • Fieldbus replacement

5. MCP as a “Cognitive Bus” for Robotics

Where MCP shines is above ROS, not instead of it.

Pattern: MCP ↔ ROS Bridge

[MCP Client / Model]
        │
        │ MCP (JSON)
        ▼
[MCP↔ROS Adapter Node]
        │
        │ ROS 2 / DDS
        ▼
[Navigation, Motion, Sensors]

The adapter:

  • Translates MCP tools into ROS actions/services
  • Translates ROS state into MCP-readable context
  • Enforces safety & constraints

This mirrors how OPC UA gateways wrap PLCs — but for cognition.
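
As a sketch of the adapter idea (in Elixir; `RosBridge.call_action/2` is a hypothetical stand-in for a real ROS 2 client binding, and the workspace limits are invented), a GenServer could validate each tool call before forwarding it:

```elixir
defmodule McpRosAdapter do
  @moduledoc """
  Sketch of an MCP <-> ROS adapter. RosBridge.call_action/2 is a
  hypothetical function standing in for a real ROS 2 client binding.
  """
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # MCP side: a decoded tool call arrives as a plain map.
  def handle_tool_call(%{"tool" => name, "arguments" => args}) do
    GenServer.call(__MODULE__, {:tool, name, args})
  end

  @impl true
  def init(opts), do: {:ok, opts}

  @impl true
  def handle_call({:tool, "move_to_pose", args}, _from, state) do
    with :ok <- check_limits(args) do
      # ROS side: translate the semantic command into an action goal.
      result = RosBridge.call_action("/navigate_to_pose", args)
      {:reply, {:ok, result}, state}
    else
      {:error, reason} -> {:reply, {:error, reason}, state}
    end
  end

  # Safety/constraint enforcement lives in the adapter, not in the model.
  defp check_limits(%{"position" => [x, y, _z]}) when abs(x) < 10 and abs(y) < 10, do: :ok
  defp check_limits(_), do: {:error, :outside_workspace}
end
```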


6. Comparison with Existing Robotics Standards

Aspect           MCP                    ROS 2         DDS
Level            Cognitive / Semantic   Middleware    Transport
Serialization    JSON                   Binary msgs   Binary
Real-time        No                     Limited       Yes
Human-readable   Yes                    No            No
Tool discovery   Yes                    Partial       No
LLM-native       Yes                    No            No

MCP is LLM-native robotics glue.


7. MCP vs Other “High-Level” Protocols

MCP vs gRPC

  • gRPC: tight coupling, codegen, binary
  • MCP: loose coupling, schema-first, introspectable

For robotics:

  • gRPC is good for services
  • MCP is better for reasoning systems and dynamic tools

MCP vs JSON-RPC

MCP adds:

  • Tool metadata
  • Context accumulation
  • Strong conventions around intent

It’s effectively JSON-RPC + semantic discipline.
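
For example, here's the earlier move_to_pose call riding in the JSON-RPC 2.0 envelope that MCP tool calls use, written as an Elixir map before encoding; the use of the Jason library is an assumption, and the field values are illustrative:

```elixir
# MCP builds on JSON-RPC 2.0; a tool call is a "tools/call" request.
request = %{
  "jsonrpc" => "2.0",
  "id" => 1,
  "method" => "tools/call",
  "params" => %{
    "name" => "move_to_pose",
    "arguments" => %{
      "frame" => "map",
      "position" => [1.2, 0.4, 0.0],
      "tolerance_mm" => 5
    }
  }
}

# Jason is a common Elixir JSON library (an assumption here, not required by MCP).
json = Jason.encode!(request)
```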


8. Likely Near-Future Robotics Architectures (2025–)

What’s emerging now:

:brain: Cognitive Plane (NEW)

  • MCP
  • JSON / JSON Schema
  • Context graphs
  • Model-driven planning

:puzzle_piece: Coordination Plane

  • Actor systems (Elixir/Erlang)
  • Task graphs
  • Event sourcing

:vertical_traffic_light: Execution Plane

  • ROS 2 + DDS
  • Behavior trees
  • Motion planners

:gear: Control Plane

  • EtherCAT
  • CAN
  • Vendor motor drives

MCP lives at the top two planes.


9. Why MCP Is Especially Relevant to Your Interests

Given your background:

  • Monty
  • Large numbers of actors
  • Language-neutral messaging
  • JSON + UDP curiosity
  • Avoiding invasive rewrites

MCP offers:

  • A semantic contract between actors
  • A way to layer cognition on top of existing robotics stacks
  • A future-proof interface for AI-enhanced robotics

You can treat robots as tool-bearing agents, not just devices.


10. A Concrete Example

Imagine a Maslow CNC + laser + vision system:

{
  "tool": "generate_tactile_surface",
  "arguments": {
    "source_image": "map.png",
    "depth_mm": 2.5,
    "material": "foam"
  }
}

MCP:

  • Doesn’t care how it happens
  • Allows reasoning, planning, and validation
  • Delegates execution to robotics middleware

That’s the architectural sweet spot.


Bottom Line

  • MCP + JSON will not replace ROS, DDS, or EtherCAT
  • MCP will become a dominant standard for cognitive-level robotics
  • Think of MCP as:

The semantic nervous system above the motor cortex

MCP might be useful beyond v1, probably once the team gets some prehensile capabilities going, but you need a stable internal API before adding MCP hooks, and there are still too many breaking changes being made for that to be viable right now. The platform is in an experimental stage, after all!

I’ve been speculating about modules (i.e., support services) that might be useful for Monty in a robotic context. Some of these will be found in (or adapted from) Elixir or MCP archive offerings; others will need to be created from scratch. Comments and suggestions welcome…

Modules

Image Grabber

Monty will use a digital camera for image acquisition. The Image Grabber can collect either single images or timed sequences (e.g., video).

The returned output will be sets of rectangular pixel arrays for specific image planes, e.g.: RGBD (red, green, blue, depth), infra-red, ultra-violet, …

Temporal Manager

This serves as an indexable queue (FIFO) and/or time-series database, allowing Monty to request images taken at specified times. One use for this is to emulate cortical transmission delays, but it can also remove glitches caused by asynchronous message handling.
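
A minimal sketch of the idea in Elixir, using an Agent to hold frames keyed by capture time; the millisecond timestamps and "nearest frame" lookup policy are just placeholder choices:

```elixir
defmodule TemporalManager do
  @moduledoc """
  Sketch: store frames keyed by capture time (ms) and answer
  "give me the frame nearest to time T" queries. A real version would
  bound memory and handle missing data.
  """
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  # Record a frame under its capture timestamp.
  def put(timestamp_ms, frame) do
    Agent.update(__MODULE__, &Map.put(&1, timestamp_ms, frame))
  end

  # Fetch the frame whose timestamp is closest to the requested time,
  # e.g., to emulate a fixed cortical transmission delay.
  def nearest(timestamp_ms) do
    Agent.get(__MODULE__, fn frames ->
      if map_size(frames) == 0 do
        nil
      else
        {_t, frame} = Enum.min_by(frames, fn {t, _} -> abs(t - timestamp_ms) end)
        frame
      end
    end)
  end
end
```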

Patch Grabber

Monty isn’t prepared to handle large arrays of pixels, so the Patch Grabber will retrieve small patches of pixels at designated locations from (much larger) pixel arrays.
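
Here's a tiny sketch of the core operation, assuming the Nx library for the pixel arrays (plain nested lists would also work); bounds checking is omitted:

```elixir
defmodule PatchGrabber do
  @moduledoc """
  Sketch using Nx (an assumption, not a requirement): pull a small patch
  out of a much larger pixel tensor.
  """

  # image: an Nx tensor of shape {height, width}; {row, col} is the
  # top-left corner of the requested patch.
  def grab(image, {row, col}, size \\ 32) do
    Nx.slice(image, [row, col], [size, size])
  end
end

# Example: a 32x32 patch whose top-left corner is at pixel (128, 256).
# patch = PatchGrabber.grab(image, {128, 256})
```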

Sensor Module

This is a coordination and interface module, allowing Monty’s Learning Modules to access desired data in their preferred manner (e.g., CMP).

Transform Manager

There are all sorts of data transformations that could make the input images more usable by Monty. These include Fourier transforms, limiting, log scaling, smoothing, etc. The Transform Manager can construct and manage a data transformation pipeline for any needed processing.
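
A sketch of one way to represent such a pipeline in Elixir: an ordered list of functions, applied in turn. The particular transforms shown are placeholders:

```elixir
defmodule TransformManager do
  @moduledoc """
  Sketch: a transformation pipeline is just an ordered list of
  single-argument functions applied to the image data in turn.
  """

  # Build a pipeline once...
  def pipeline do
    [
      &clamp/1,
      &log_scale/1
      # ...smoothing, Fourier transforms, etc. would slot in here
    ]
  end

  # ...then run any image through it.
  def run(image, transforms \\ pipeline()) do
    Enum.reduce(transforms, image, fn transform, acc -> transform.(acc) end)
  end

  # Placeholder transforms operating on a flat list of pixel values.
  defp clamp(pixels), do: Enum.map(pixels, &min(&1, 255))
  defp log_scale(pixels), do: Enum.map(pixels, &:math.log(1 + &1))
end
```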

Discussion

Back in Mermaid musings: simple graphs of actors, I presented this diagram:

graph LR;
  EP_MM["Eye Position<br>Motor Module"];
  LE_SH["Left Eye<br>Sensor Hardware"];
  LE_SM["Left Eye<br>Sensor Module"];
  RE_SH["Right Eye<br>Sensor Hardware"];
  RE_SM["Right Eye<br>Sensor Module"];
  BV_LM["Binocular Vision<br>Learning Modules"]

  LE_SH <-- Raw --> LE_SM;
  RE_SH <-- Raw --> RE_SM;
  
  LE_SM <-- CMP --> BV_LM;
  RE_SM <-- CMP --> BV_LM;
  
  BV_LM <-- CMP --> EP_MM;

Let’s decompose an Eye Sensor Module, using the modules described above:

graph LR;
  E1a["<br>"];
  E1b["<br>"];
  IG["Image<br>Grabber"];
  PG["Patch<br>Grabber"];
  SH["Sensor<br>Hardware"];
  SM["Sensor<br>Module"];
  TM1["Temporal<br>Manager"];
  TM2["Transform<br>Manager"];

  SH --> IG --> TM1 --> E1a;
  E1b --> TM2 --> PG --> SM;

Now, let’s add some training wheels. The basic idea is that we can have some (non-TBT) image recognition code pay attention to the same input images as Monty, then provide hints as to what (e.g., cup, logo) might be “in the frame”.

Diagram

graph LR;
  E1["...<br>"];
  AT["Attention<br>Tracker"];
  IG["Image<br>Grabber"];
  IT["Image<br>Tagger"];
  PG["Patch<br>Grabber"];
  SM["Sensor<br>Module"];

  E1 --> IG
  IG --> PG
  IG --> IT
  IT --> SM
  PG --> SM

Modules

Attention Tracker

This module collects and redistributes information on which part(s) of the image are receiving attention. For example, it lets the Patch Grabber and the Image Tagger each know what the other is examining.

Image Tagger

This module uses non-TBT methods (e.g., edge detection, folding) to recognize and “tag” objects. Its results are fed (via Sensor Modules) to Learning Modules to annotate and ground their results.


The diagrams above obviously leave a lot of questions unanswered. For instance, how should the messages be addressed, transmitted, etc? Since a need for addressing is inescapable (and transmission modes may vary), let’s look at some existing and vaguely possible addressing APIs.

Elixir - Nodes, PIDs, etc.

I expect (or at least hope) that Elixir will play a role in at least some Monty implementations, but its addressing model is more robust than semantically convenient. So, we’ll probably want to “paint over it”, rather than using it directly.

Still, it can be useful to keep this model in mind, if only because its design reflects the typical architectures and constraints of systems and networks:

  • Millions of Elixir processes can run on a single BEAM instance.
  • Hundreds of OS processes can run on a single processor.
  • An arbitrary number of processors can be networked together.

As you might expect, sending messages within a single BEAM instance is very fast. Messages to other BEAM instances take substantially longer, and messages to BEAM instances on other processors can take arbitrarily longer. So, as in the cortex, we probably want to keep highly related activity close at hand.

In any event, Elixir messages are addressed using a “PID” (Process ID), which is displayed something like <0.123.0>. The PID is a largely opaque integer triple which identifies (directly or indirectly) the intended recipient:

  • the (lightweight) Elixir process
  • the “node” (i.e., BEAM instance, OS process)
  • the processor (e.g., DNS name, IP address)

Given a valid PID, addressing and transmission are handled automagically and robustly. There are also ways to make things a bit less clunky. For example, BEAM processes can be “registered” under specified names, either locally or within a cluster.
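
For example (the names here are invented, purely for illustration):

```elixir
defmodule Toy do
  # A minimal process loop that just prints whatever it receives.
  def loop do
    receive do
      msg -> IO.inspect(msg)
    end

    loop()
  end
end

# Local registration: processes on this BEAM node can use the name
# instead of the raw PID.
pid = spawn(&Toy.loop/0)
Process.register(pid, :left_eye_sm)
send(:left_eye_sm, {:patch_request, {128, 256}})

# Cluster-wide registration: :global makes a name visible on every
# connected node (at the cost of some coordination overhead).
:global.register_name(:left_eye_sm_global, pid)
send(:global.whereis_name(:left_eye_sm_global), :hello)
```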

However, as noted above, this isn’t a semantically convenient way for Monty’s modules to refer to each other. So, let’s start from the other end, asking ourselves about Monty’s message addressing needs.

Leveraging Locality and Infrastructure

AIUI, the neurons in cortical columns mostly direct (really, broadcast) messages to nearby targets. However, messages may also be sent to distant targets and/or in response to a received message from a more distant neuron. So, the target neuron might be in:

  • a specific level in the current (or a nearby) column or minicolumn
  • a “higher level” column or a “lower level” service in the vicinity
  • a distant target, in response to a coincident event and/or message

Clearly, it would be handy to have a convenient and flexible way to address messages. To simplify things, let’s consider a topology that roughly matches brain structures and some naming conventions that we can leverage with pattern matching.

As a starting point, we can use the rectangular grid of LMs that I proposed over in Pre-wiring Patches of Pixels?:

For simplicity, let’s consider only a single grayscale camera, generating square (1024x1024 pixel) images. We’d like each LM to concentrate on a relatively small (32x32 pixel) patch. This gives us 1024 LMs, which we can arrange as a 32x32 array.

Conceptually, each LM receives data from the patch “below” it. The LM then exchanges information with other (nearby) LMs, in a manner that diminishes over distance. This allows model (etc) sharing, while keeping the overall connectivity manageable.

This “array of LMs” is a very simplistic model of a cortical region. It also assumes the presence of assorted infrastructure (e.g., distant targets, “higher level” columns, “lower level” services).

Let’s set up a few (Sci-Fi!) addressing patterns to see how this might work, using Bash-style brace expansion:

all_around:     "{N,NE,E,SE,S,SW,W,NW}"
northerly:      "{N,NE,NW}"
southerly:      "{S,SE,SW}"

# Send a vote to 24 neighboring LMs,
# up to three (constant) hops away.

ex1_all_around:
  directions:   "{all_around}/{1..3}"
  message_type: "vote"
  target_type:  "LM"

# Send a vote to nine neighboring LMs,
# up to three (constant, northerly) hops away.

ex2_northerly:
  directions:    "{northerly}/{1..3}"
  message_type:  "vote"
  target_type:   "LM"

# Send a vote to nine neighboring LMs,
# up to three (increasing, southerly) hops away.

ex3_southerly:
  directions:    "{southerly}/{1,3,6}"
  message_type:  "vote"
  target_type:   "LM"

# Send a vote to the neighboring LM "above".

ex4_LM_1_above:
  directions:    "above/1"
  message_type:  "vote"
  target_type:   "LM"

# Send a vote to the neighboring SM "below".

ex5_SM_1_below:
  directions:    "below/1"
  message_type:  "vote"
  target_type:   "SM"

Note: Careful Reader may observe that a pattern could easily “fall off the edge” of the region. I’m assuming that this can be handled in a “reasonable” manner…
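
To make these Sci-Fi patterns a bit more concrete, here's an Elixir sketch that expands a direction/hop pattern into {column, row} targets on the 32x32 LM grid. The representation is entirely made up, and its only "reasonable" edge handling is to drop any target that falls off the grid:

```elixir
defmodule LmAddressing do
  @moduledoc """
  Sketch: expand a direction/hop pattern (e.g., the northerly set with
  hops 1..3) into concrete {col, row} targets on a 32x32 LM grid.
  Targets that fall off the edge are simply dropped.
  """

  @grid 32

  # Unit offsets for the eight compass directions, as {dx, dy}.
  @offsets %{
    "N" => {0, -1}, "NE" => {1, -1}, "E" => {1, 0}, "SE" => {1, 1},
    "S" => {0, 1},  "SW" => {-1, 1}, "W" => {-1, 0}, "NW" => {-1, -1}
  }

  def expand(origin, directions, hops) do
    directions
    |> Enum.flat_map(fn dir -> Enum.map(hops, &step(origin, dir, &1)) end)
    |> Enum.filter(&in_grid?/1)
  end

  defp step({x, y}, dir, hop) do
    {dx, dy} = Map.fetch!(@offsets, dir)
    {x + dx * hop, y + dy * hop}
  end

  defp in_grid?({x, y}), do: x in 0..(@grid - 1) and y in 0..(@grid - 1)
end

# Example: ex2_northerly from LM {5, 1} -- anything north of row 0 is dropped.
# LmAddressing.expand({5, 1}, ["N", "NE", "NW"], [1, 2, 3])
```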


I just took a peek and Elixir is a very readable language.


I think so, as well (and have opined at length on the topic <g>). If you’re interested, search this forum for references to Elixir. Meanwhile, here’s an (opinionated) overview.

For obvious reasons, my posts in this forum tend to promote the Elixir features (e.g., actor model, concurrency, distribution, fail-soft robustness) that seem most relevant to Monty. However, there are many other features (e.g., first-class documentation, extensibility, readability, syntax) that I find attractive. Read on, for a sampling…

Extensibility

Elixir has first-class (Lisp-style) syntactic macros, which provide flexible and powerful ways to manipulate Elixir’s Abstract Syntax Tree (AST). Indeed, much of the language itself is implemented as a set of macros (!).

This compile-time preprocessing allows the language to support all sorts of cool features.

More generally, well-controlled metaprogramming is pervasive in the Elixir ecosystem. For example, frameworks such as Ecto, ExUnit, and Phoenix use macros to provide embedded domain specific languages (eDSLs) for convenience, safety, etc.

Syntax, etc.

Elixir’s programming syntax is based on Ruby, which follows Matz’s taste (rather than Python, which follows Guido’s). As a long-time opponent of semantically-significant whitespace, I approve.

I also like the way that José Valim has made functional programming (e.g., pure functions, immutability, and the avoidance of side effects) consumable (and even enjoyable) by mortals like me. More generally, I simply find Elixir easy (nay, pleasant!) to read and write.

'nuff said…

MCP (or more precisely, the JSON it is built on) is far from the only available encoding format; let’s explore the neighborhood a bit…

Map-based data structures (e.g., Python objects) are very flexible and forgiving. Indeed, they are a poster child for semi-structured data representations. Per the Goog:

Semi-structured data is data with some organizational properties (like tags or markers) but lacks the rigid, fixed schema of traditional relational databases, existing between structured (tables) and unstructured (text, media) data. Common examples include JSON, [MCP,] XML, and emails, which use key-value pairs or tags to define hierarchies and relationships, offering flexibility for data exchange and modern applications.

If Monty’s messages are designed as semi-structured data, various possibilities become feasible. Some of these might help us bridge the semantic gap between expression of attention (and/or intent) and the generation of actionable, specific module addresses. They might also serve as a way to model the brain’s built-in and emergent wiring patterns.

Speculations…

Let’s assume that each Monty message is constructed as a collection of tags and values. By playing games with cosine vectors and such (e.g., following social network links, using genetic algorithms), we could generate measures of topic commonality between a message and its possible recipients. Although this can (and should!) be done dynamically, it needn’t be performed for every message.

For example, sample messages could be constructed and evaluated (i.e., mapped into addresses) during Monty’s initialization and (perhaps) on an ongoing, occasional basis. The address mappings could then be saved (and later, used) by sending modules. Indeed, receiving modules could also construct sample messages that they would be interested in getting, ala the publish-subscribe (pub/sub) pattern.
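
To show the shape of such a computation, here's a toy Elixir sketch that scores topic commonality as cosine similarity over bag-of-tags vectors; the tags, module names, and choice of measure are all invented for illustration:

```elixir
defmodule TopicMatch do
  @moduledoc """
  Toy sketch: score how well a message's tags overlap with each candidate
  recipient's declared interests, using cosine similarity over
  bag-of-tags vectors.
  """

  # Turn a list of tags into a tag => count map (a sparse vector).
  def vectorize(tags), do: Enum.frequencies(tags)

  # Cosine similarity between two sparse tag vectors.
  def cosine(a, b) do
    dot =
      a
      |> Map.keys()
      |> Enum.reduce(0, fn tag, acc -> acc + Map.get(a, tag, 0) * Map.get(b, tag, 0) end)

    norm = fn v -> v |> Map.values() |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
    denom = norm.(a) * norm.(b)

    if denom == 0, do: 0.0, else: dot / denom
  end

  # Rank candidate recipients by how closely their interests match.
  def rank(message_tags, recipients) do
    msg = vectorize(message_tags)

    recipients
    |> Enum.map(fn {name, interests} -> {name, cosine(msg, vectorize(interests))} end)
    |> Enum.sort_by(fn {_name, score} -> score end, :desc)
  end
end

# Example mapping, done at initialization time and then cached by the sender:
# TopicMatch.rank(["vision", "patch", "vote"],
#   %{"BV_LM" => ["vision", "vote"], "EP_MM" => ["motor", "pose"]})
```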