This is really impressive work by Google. Definitely a step in the right direction for general availability.
It’s important to define what “learning” means in the context of their platform. I skimmed through their 2 papers to better understand the whole shebang:
The training regimen consists of:
- Gemini 2.0 multimodal data (text, images, video, web data) for reasoning and language;
- Large-scale, human-operated demonstrations on ALOHA robots, with between 2000 and 5000 episodes of high-quality video for each labeled task (cloth folding, object manipulation, etc.);
- Massive fine-tuning (1+ million steps) to bridge action prediction to reasoning capabilities.
I mean, this is a Herculean effort; props to them.
However, at the end of the day, what you’re left with is a rigid, cloud-based, latency-limited, power-hungry model with no long-term learning. (I’m not taking a jab at Google here, but at the harsh mistress that deep learning is!) They do use the terms “in-context learning” and “Motion Transfer” to describe how the model can “learn”, so let’s break it down.
In-context learning here essentially means putting videos of new human-operated demonstrations inside the model’s context window, which lets it use existing data as a baseline to try and replicate a new action. This isn’t really learning so much as temporary imitation. That approach would probably be useful for a limited-purpose robot, but it doesn’t really scale.
As for “Motion Transfer”, they don’t really dive into technical details; here’s what they say:
> In Gemini Robotics 1.5, we also introduce a new model architecture and training recipe for [Vision-Language-Action]. These enable the model to learn from different robots and data sources, to form a unified understanding of motions and the effects of physical interactions, enabling skills to transfer across very different robot embodiments.
>
> […]
>
> The efficacy of leveraging cross-embodiment data with Motion Transfer (MT) depends on the initial quantity of data available for a given robotic platform. For the ALOHA platform, which already possesses a large dataset, merely introducing data from other embodiments appears to be less effective; however, MT amplifies the positive transfer from this data by aligning the different embodiments and extracting commonalities, thereby aiding the learning process.
In other words, a large diversity of robot anatomies in the initial dataset enables faster training of new robots, as they can rely on existing anatomies most similar to theirs as a baseline.
So yeah, gargantuan amounts of data for small tasks.
Now, what about Monty?
At the moment, it’s object recognition software with learning capabilities. It has no high-level reasoning or understanding of language; it’s driven through Python calls. An AI agent could technically use Monty to enhance its object recognition capabilities, and it’s more flexible at dealing with object shapes independently of their texture.
A possible short-term application would be pairing it with a vision transformer (ViT) on a robot, allowing the robot to fall back on Monty for unknown objects while simultaneously training Monty in real time with labeled data. Example: the ViT sees objects A and B, recognizes A but not B. Monty learns A and its ViT label, then detects that B is actually a variant of A; it associates that label with B and feeds it back to the robot. A human reviewer could annotate new unknown objects recorded by Monty if deemed useful, and deploy live without any retraining.
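To make the fallback loop concrete, here’s a minimal sketch. Everything in it is hypothetical: `MontyStub` is a toy stand-in (cosine matching over stored labeled prototypes), not the real tbp.monty API, and `perceive` is an invented glue function representing the robot’s perception step.

```python
# Hypothetical sketch of the ViT + Monty fallback loop described above.
# MontyStub is a toy stand-in, NOT the real tbp.monty interface.
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MontyStub:
    """Toy model: stores one labeled feature prototype per object."""
    def __init__(self, threshold=0.8):
        self.prototypes = {}      # label -> feature vector
        self.threshold = threshold

    def learn(self, label, features):
        self.prototypes[label] = features

    def match(self, features):
        # Return the best-matching known label, or None if nothing is close.
        best_label, best_sim = None, 0.0
        for label, proto in self.prototypes.items():
            sim = cosine(features, proto)
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label if best_sim >= self.threshold else None

def perceive(vit_label, features, monty):
    """Trust the ViT when it recognizes the object; otherwise fall back on Monty."""
    if vit_label is not None:
        monty.learn(vit_label, features)   # keep Monty in sync with labeled data
        return vit_label
    return monty.match(features)           # unknown to the ViT: ask Monty
```

In this sketch, object A (recognized by the ViT) seeds Monty’s memory, and a later object B whose features are close enough to A’s inherits A’s label without any retraining step.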
This data could later be shared between other Monty-enabled robots, enhancing the recognition accuracy of an entire fleet over time. That’s very hard to efficiently and reliably achieve with black-box neural nets, but will eventually be possible with Monty.
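One reason sharing is plausible here, unlike with black-box weights, is that the learned object models are discrete and labeled, so combining them is closer to a dictionary merge than to model surgery. A hypothetical sketch, with an invented data layout (not the real tbp.monty format):

```python
# Hypothetical fleet-wide sharing: each robot's Monty instance holds a
# {label: model} dict; merging is a union where later (more trusted)
# entries win on conflicting labels. Data layout is invented for illustration.

def merge_object_models(fleet_models):
    """Combine per-robot object libraries into one shared library.

    fleet_models: list of {label: model} dicts, ordered least to most trusted.
    """
    shared = {}
    for models in fleet_models:
        shared.update(models)   # later robots override duplicate labels
    return shared

robot_a = {"mug": {"episodes": 12}, "plate": {"episodes": 4}}
robot_b = {"mug": {"episodes": 30}, "bowl": {"episodes": 7}}

library = merge_object_models([robot_a, robot_b])
# library covers mug, plate, and bowl; "mug" comes from robot_b.
```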
TBP is still very much in early development and its current capabilities are limited, but the beauty of the cortical voting system and reference frames is that they can technically be adapted to virtually anything. Right now, the team’s focus is on vision, but with additional development, I’m sure it can and will be adapted to movement, sound, text, language, reasoning, multi-dimensional data, and more.
Blend in more features like pattern recognition, memory scaffolds, simulation buffers, a prediction engine, saliency modulation, a curiosity drive, facial perception. Grow and optimize it alongside hardware advances and new insights about the brain. Diligently sew it together with scientific rigor and utmost patience. Over time, things will click together. 
Anyhow, here’s a better breakdown of what Monty can do at the moment, and what the team is cooking: Capabilities of the System