Dynamic point clouds as an embodiment agnostic representation

Hi everyone,

Here is a recently released paper by INSAIT/Sofia University and ETH Zurich about point clouds as an embodiment agnostic representation. As the authors put it:

In this work, we approached the challenge of learning a generalist manipulation policy from a mix of labeled and unlabeled video and proposed MotoVLA together with a two-stage training procedure. By establishing dynamic point clouds as an embodiment agnostic representation, our approach successfully transfers knowledge from video to manipulation motion priors. Using simulation and real-world experiments, we demonstrate a consistently improved model performance in in- and out-of-domain settings and showcase the direct transfer from human demonstration to robot actions.

Source: Generalist Robot Manipulation Beyond Action Labeled Data

I’m sharing this because there is a slight resemblance between this approach and TBP/Monty: namely, generalizing points of interest (a cloud) and their movement in a human hand, and transferring that behavior “at a high level” to a robot hand. To me it looks like a light, simplified version of what Monty is built for and capable of, but an interesting approach nonetheless.

Let me know if you see something else in this article which might be interesting to pick at.

Thank you,

Alex Kamburov


Nice find! This is similar to Gemini Robotics and Skild AI, but open source, with the authors actually explaining what they’re doing and acknowledging its limitations. :wink:

I don’t think Monty has the capability to learn movement and interact with things yet; the focus is very much on object recognition right now, which is a different problem space altogether.

What’s also really cool is the algorithm they use for converting 2D photos to 3D: GitHub - microsoft/MoGe: [CVPR'25 Oral] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

This could probably be leveraged to use Monty with regular cameras that don’t have depth sensors. The team used Depth Anything V2 for this during their robot hackathon, but MoGe seems to have better accuracy overall!
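For anyone curious what that leverage would actually involve: once a monocular model (MoGe, Depth Anything V2, or similar) predicts a per-pixel depth map, the remaining step is back-projecting it through the pinhole camera model to get the 3D points Monty would consume. A minimal sketch, assuming you already have a depth map and intrinsics (the function name and the toy intrinsics values below are my own, not from either library):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    # Stack per-pixel XYZ and drop pixels with no valid depth.
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Toy example: a flat 4x4 depth map, 2 m from the camera.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3)
```

The quality of the resulting cloud then depends entirely on the depth model's metric accuracy, which is where MoGe's claimed edge over Depth Anything V2 would matter.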


Hi Alex Kamburov, would you like to critique the paper on where it might or might not suit Monty?

Hi Trung Doan,

It’s interesting to review how motion of the cloud is observed in the paper, as I know analyzing movement is a future research target for Monty.

Still, I’m too new to Monty and would need to spend more time with the framework before being able to comment further.

Thank you,

Alex