Monty vs Gemini Robotics and other LLM-based approaches

Hi everyone,

I’m creating this thread to discuss the release of Gemini Robotics and other future LLM-based approaches to robotics:
Gemini Robotics 1.5 brings AI agents into the physical world - Google DeepMind

My understanding is that the LLM is responsible for the higher-level “thinking” and general knowledge. It then provides the robotic arm with the necessary commands. Such a system can complete complex tasks without the need for online learning and other features of Monty, albeit with some limitations. Any thoughts?

Another point I was thinking about: Monty currently lacks high-level thought, and research is still needed in that area. Meanwhile, an LLM-based agent could potentially provide commands to Monty about something it needs Monty to do.

I’m new to the framework so please excuse me if I’m wrong with my assumptions.

Thank you,
Alex Kamburov


Hi @Aleksandar_Kamburov, such an interesting article you shared here.

Firstly I’m no expert.

My thought is that it doesn’t need online learning because it’s an LLM; it’s simply not designed for online learning, and I’m pretty sure you know this. Still, the multi-agent system works without learning, and I believe that’s the interesting point.

My analogy is this: imagine the LLM is your “high-level” thought process, making decisions from other inputs. That LLM’s knowledge has an expiry. IOW, some of its knowledge will no longer apply due to a dynamically changing, even chaotic, environment, which is the reality. When will that knowledge expire? IDK, it really depends on how generalized the LLM’s model is. For this experiment, I can see it’s simply moving, relocating, and identifying objects; convert these to language and their triviality becomes clear. But once that knowledge expires, the LLM needs to be at least fine-tuned offline.

For example, as an LLM: “I see a pen holder, a pen, an eraser, and a crumpled paper. Sorting means putting them in the right place (thinking). OK, I’ll put the pen in the pen holder, the crumpled paper in the bin. Where’s the bin (thinking)…” and so on; the task is trivial in language terms. Now, I’m not downplaying this, as I know it’s difficult to achieve with real physical robots. In fact, I am amazed by this advancement from Google.

Fast forward: kids are playing and threw a ball at the pen holder, then another kid took the pen, all while the LLM was thinking or the robot was moving. This is the challenge for offline thinkers, IMHO: how will the LLM react? IDK, but I’m pretty sure that in robotics, a rapidly changing environment such as the real world is one of the biggest challenges. Of course, robots would have a limited scope with respect to the tasks they are allowed to do.

Something related to online/offline learning is autonomous cars. I’m interested in how they are able to navigate a very chaotic environment, or at least a somewhat chaotic one. Do they have an online learning module?


This is really impressive work by Google. Definitely a step in the right direction for general availability.

It’s important to define what “learning” means in the context of their platform. I skimmed through their two papers to better understand the whole shebang:

The training regimen consists of:

  • Gemini 2.0 multimodal data (text, images, video, web data) for reasoning and language;
  • Large-scale, human-operated demonstrations on ALOHA robots, with between 2000 and 5000 episodes of high-quality video for each labeled task (cloth folding, object manipulation, etc.);
  • Massive fine-tuning (1+ million steps) to bridge action prediction to reasoning capabilities.

I mean, this is a Herculean effort; props to them.

However, at the end of the day, what you’re left with is a rigid, cloud-based, latency-limited, power-hungry model with no long-term learning. (I’m not taking a jab at Google here, but at the harsh mistress that deep learning is!) They do use the terms “in-context learning” and “Motion Transfer” to describe how the model can “learn”, so let’s break it down.

In-context learning is essentially about putting videos of new human-operated demonstrations inside the context window of the model, which allows it to leverage existing data as a baseline to try and replicate a new action. This isn’t really learning, but rather temporary imitation. That approach would probably be useful for like, a limited-purpose robot; not really scalable.

As for “Motion Transfer”, they don’t really dive into the technical details that much; this is what they say:

In Gemini Robotics 1.5, we also introduce a new model architecture and training recipe for [Vision-Language-Action]. These enable the model to learn from different robots and data sources, to form a unified understanding of motions and the effects of physical interactions, enabling skills to transfer across very different robot embodiments.
[…]
The efficacy of leveraging cross-embodiment data with Motion Transfer (MT) depends on the initial quantity of data available for a given robotic platform. For the ALOHA platform, which already possesses a large dataset, merely introducing data from other embodiments appears to be less effective; however, MT amplifies the positive transfer from this data by aligning the different embodiments and extracting commonalities, thereby aiding the learning process.

In other words, a large diversity of robot anatomies in the initial dataset enables faster training of new robots, as they can rely on existing anatomies most similar to theirs as a baseline.

So yeah, gargantuan data for small tasks.

Now, what about Monty?

At the moment, it’s object recognition software with learning capabilities. It doesn’t have high-level reasoning or understanding of language, other than Python calls. An AI agent could technically use Monty to enhance its object recognition capabilities, since Monty is more flexible at dealing with object shapes independent of their texture.

A possible short-term application would be pairing it with a vision transformer (ViT) on a robot, allowing it to fall back on Monty for unknown stuff, while simultaneously training Monty in real time with labeled data. Example: ViT sees objects A and B, recognizes A but not B. Monty learns A and its ViT label, then detects that B is actually a variant of A. It associates that label to B, and feeds it back to the robot. A human reviewer could annotate new unknown objects recorded by Monty if deemed useful, and deploy live without any retraining.
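To make that pairing concrete, here is a minimal sketch of the fallback loop described above. Everything here is hypothetical: `MockViT` and `MockMonty` are stand-in classes I made up for illustration, and `MockMonty` does not reflect Monty’s actual API; it just captures the idea of learning labeled shape models from the ViT and matching shape variants the ViT misses.

```python
class MockViT:
    """Stand-in for a vision transformer that only knows some labels."""
    def __init__(self, known):
        self.known = dict(known)  # object_id -> label

    def classify(self, object_id):
        return self.known.get(object_id)  # None means "unrecognized"


class MockMonty:
    """Stand-in for Monty: learns labeled shape models, matches variants."""
    def __init__(self):
        self.models = {}  # label -> set of object ids in that shape family

    def learn(self, object_id, label):
        self.models.setdefault(label, set()).add(object_id)

    def match(self, object_id, same_shape):
        # Return the label of any learned object sharing this one's shape.
        for label, ids in self.models.items():
            if any(same_shape(object_id, known) for known in ids):
                return label
        return None  # genuinely unknown -> candidate for human annotation


def recognize(object_id, vit, monty, same_shape):
    """Try the ViT first; teach Monty its answers, fall back on Monty otherwise."""
    label = vit.classify(object_id)
    if label is not None:
        monty.learn(object_id, label)  # online: Monty learns A and its ViT label
        return label
    return monty.match(object_id, same_shape)  # B recognized as a variant of A
```

In the example from the post, the ViT recognizes object A (`"mug_red"` here) but not B (`"mug_blue"`); after Monty learns A and its label, it can associate the same label to B because the two share a shape, with no retraining of the ViT.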

This data could later be shared between other Monty-enabled robots, enhancing the recognition accuracy of an entire fleet over time. That’s very hard to efficiently and reliably achieve with black-box neural nets, but will eventually be possible with Monty.

TBP is still very much in early development, and current capabilities are limited, but the beauty of the cortical voting system and reference frames is that they can technically be adapted to virtually anything. Right now, the team’s focus is on vision, but with additional development, I’m sure it can and will be adapted to movement, sound, text, language, reasoning, multi-dimensional data, and more.

Blend in more features like pattern recognition, memory scaffolds, simulation buffers, a prediction engine, saliency modulation, a curiosity drive, facial perception. Grow and optimize it alongside hardware advances and new insights about the brain. Diligently sew it together with scientific rigor and utmost patience. Over time, things will click together. :eye:

Anyhow, here’s a better breakdown of what Monty can do at the moment, and what the team is cooking: Capabilities of the System


Thank you, @pepedocs and @AgentRev. I think @AgentRev answered all my questions, and @pepedocs’s as well.

I forgot to mention, Monty does have some limited movement capability, although it’s very sensor-centric. There are hints of future support for object manipulation, but nothing concrete for now.