Automagical annotation of Monty's results?

I’m a big fan of adding annotations (e.g., data, tags) to the runtime data stored in Monty’s modules. These annotations (e.g., object_type: "cup") could be sourced in various ways (e.g., configuration file, independent analysis) and added during either production or training runs.

I believe that these annotations could be useful both for interoperability (e.g., with LLMs) and observability (e.g., by researchers). So, my question is whether they can be created and added automagically.
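To make the idea concrete, here is a minimal sketch of what such an annotation record might look like. All field names here are my own invention, not anything defined by Monty:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """A single annotation attached to a Monty model (hypothetical schema)."""
    key: str                 # e.g., "object_type"
    value: str               # e.g., "cup"
    source: str              # e.g., "config_file", "llm_harness"
    confidence: float = 1.0  # how much to trust this source

# A model's annotations could simply be a list of such records...
annotations = [
    Annotation(key="object_type", value="cup",
               source="llm_harness", confidence=0.9),
    Annotation(key="material", value="ceramic", source="config_file"),
]

# ... with a dict view for quick lookups by key.
tags = {a.key: a.value for a in annotations}
```

Keeping the `source` and `confidence` fields around would let a consumer (human or LLM) decide how seriously to take each tag.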

LLM-based harnesses (take 1)

In support of this (over in Mermaid musings: simple graphs of actors), I introduced the notion of an “LLM-based harness”:

Let’s add some LLM-based harnesses (LHs) to our instance. The idea is that the LHs will “hear” the same (raw) things as Monty’s sensor modules (SMs) do, then report their analysis of the sounds. This information could be used as a form of supervised learning, to annotate (i.e., “tag”) and/or tune Monty’s models with textual descriptions (e.g., “coin on glass”), directional information, etc.:

```mermaid
graph LR;
  AM_LM["Asst. Monty<br>(LM)"];
  EP_MM["Ear Position<br>(MM)"];
  LE_HW["Left Ear<br>(HW)"];
  LE_LH["Left Ear<br>(LH)"];
  LE_SM["Left Ear<br>(SM)"];
  RE_HW["Right Ear<br>(HW)"];
  RE_LH["Right Ear<br>(LH)"];
  RE_SM["Right Ear<br>(SM)"];
  SA_LM["Stereo Audio<br>(LM)"];

  LE_HW-- Raw -->LE_SM;
  LE_HW-- Raw -->LE_LH;
  RE_HW-- Raw -->RE_SM;
  RE_HW-- Raw -->RE_LH;

  LE_LH<-- MCP -->SA_LM;
  RE_LH<-- MCP -->SA_LM;

  LE_SM<-- CMP -->SA_LM;
  RE_SM<-- CMP -->SA_LM;

  SA_LM<-- CMP -->AM_LM;
  SA_LM-- CMP -->EP_MM;
```

These harnesses could use the Model Context Protocol (MCP) to inform Monty’s modules of the added (or perhaps changed) information. With appropriate substitutions (e.g., Eye for Ear), this approach could also be used for vision.
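As a rough illustration of the kind of report a harness might pass along: MCP itself is JSON-RPC-based, so the real wire format would differ, and none of these field names come from MCP or Monty — they are purely hypothetical:

```python
import json

# Hypothetical payload an LLM harness might send to a Monty module
# after analyzing the same raw audio the sensor modules heard.
report = {
    "harness": "left_ear_lh",        # which harness produced this
    "target": "stereo_audio_lm",     # which module should receive it
    "annotations": {
        "description": "coin on glass",  # textual tag
        "direction_deg": 35.0,           # directional information
    },
}

# Round-trip through JSON, as any RPC-style transport would.
message = json.dumps(report)
decoded = json.loads(message)
```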

However, this approach has serious limitations stemming from its use of “raw data”. For example, the LLM would have to extract the relevant object from the surrounding clutter and background noise before it could create an appropriate set of tags.

LLM-based harnesses (take 2)

To (partially) finesse these limitations, let’s have the LLM base its annotations on predictions made by the Learning Module’s model(s).

In the diagram below, the camera hardware feeds its information (somehow) to the Learning Module, but not to the LLM Harness. Instead, the Harness uses MCP to request sample images from the Learning Module, depicting the LM’s prediction(s) concerning the object(s) it is currently modeling.

Based on these images, the LLM makes its own assessment (e.g., object_type: "cup") and feeds the results back to the Learning Module. Because the images are based on a model’s predictions, they should be relatively free of background noise. This should make the LLM’s job easier and its results (at least potentially) more accurate.

```mermaid
graph LR;
  LC_HW["Left Camera<br>Hardware"];
  LC_LM["Left Camera<br>Learning Module"];
  LC_LH["Left Camera<br>LLM Harness"];

  LC_HW<-- ... -->LC_LM;
  LC_LM<-- MCP -->LC_LH;
```
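Here is a toy sketch of that request/assess/feed-back loop, with every call stubbed out. None of these function names exist in Monty or the MCP SDK; they just mark where the real calls would go:

```python
def request_prediction_image(learning_module):
    """Stand-in for an MCP request: 'render your current prediction'."""
    return learning_module["predicted_render"]  # e.g., bytes of a PNG

def ask_llm(image):
    """Stand-in for an LLM call that classifies the rendered prediction."""
    return {"object_type": "cup"}  # canned answer, for illustration only

def add_annotation(learning_module, annotation):
    """Stand-in for feeding the LLM's assessment back over MCP."""
    learning_module.setdefault("annotations", {}).update(annotation)

# The loop: fetch a prediction render, assess it, feed the result back.
lm = {"predicted_render": b"\x89PNG..."}
image = request_prediction_image(lm)
add_annotation(lm, ask_llm(image))
```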

Comments, clues, suggestions? Inquiring gnomes need to mine…

Apologia: I’m not a Monty (or even Python) programmer and certainly not an expert on MCP. Consequently, I’m not in a position to comment on how to:

  • make a Monty module into an MCP server
  • construct a Python-based “LLM Harness”

That said, it appears that there’s a full-blown MCP Python SDK. Enjoy…
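For flavor, here is a stdlib-only sketch of the two “tools” such an MCP server might expose. With the real MCP Python SDK one would register these on a server object via its tool decorator; here a hand-rolled registry stands in, and all names are hypothetical:

```python
TOOLS = {}

def tool(fn):
    """Minimal stand-in for an SDK tool-registration decorator."""
    TOOLS[fn.__name__] = fn
    return fn

# The Learning Module's state, reduced to a bare dict for the sketch.
_MODEL = {"annotations": {}}

@tool
def get_prediction_image() -> bytes:
    """Return a rendering of the LM's current prediction (stubbed)."""
    return b"\x89PNG..."

@tool
def add_annotation(key: str, value: str) -> dict:
    """Attach an externally supplied annotation to the model."""
    _MODEL["annotations"][key] = value
    return _MODEL["annotations"]

# A harness would discover and invoke these over MCP; locally:
result = TOOLS["add_annotation"]("object_type", "cup")
```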