Design of Final-Year Undergraduate Project with the Inclusion of the TBP!

Hi all, the following is a Gist I have created for my undergrad project idea that begins October/November 2025. It’s not complete yet, and I’m still conceptualising. Any comments are welcome! See here for the original message thread which led to the creation of this topic.

2 Likes

Hey all, long time no see!

I’ve finally had the chance to give a bit more depth to my project; see the new comment in the Gist here :smile:

I ended up going back on a previous comment I made in the Community Involvement thread and have set my sights on using Monty for object inference (as vclay suggested) with multiple sensors.

I noticed that @vclay released some tutorials back in April for custom Monty applications and using Monty for robotics, so I’ll need to check them out soon!

I also managed to secure a supervisor who specialises in robotics, programming and AI at my university!

The project isn’t set in stone yet and might change before its start date in early October, so any suggestions/alterations/criticisms are welcome.

I’m very much looking forward to this and hope I’m able to create something worthy of the Project Showcase page!

5 Likes

Project Update

I wanted to share a quick update on my dissertation project integrating Monty with a robotic setup.

At the moment I am still working through the literature review and continuing to learn the Monty framework through the tutorials and documentation. I still have more ground to cover but am making progress.

Following discussions with Niels, Ramy and Viviane, I have shifted my initial plan slightly. My original idea involved combining vision and tactile sensing with a robot arm, but given the added complexity of physical interaction (i.e. object movement when applying a tactile sensor), the current aim is to begin with a stereo vision setup alone and extend from there. The initial plan is to replicate the Monty Meets World experiment as a proof of concept, using a Zed2 stereo camera. This will help with camera calibration and give me a better understanding of the pipeline before I mount the camera on the robot arm, discover (and hopefully solve!) the difficulties of hardware calibration and begin experimenting with agent behaviours.

If time allows after that, I may explore adding a second sensing modality such as time of flight or ultrasound, depending on feasibility and available hardware.

The robot arm I have been provided with is the Ufactory Lite 6.

I am currently working on Windows through WSL for early development, but I will move to a native Linux environment once I start working directly with the hardware. I also plan to run one of the benchmark experiments on this setup to make sure everything is working correctly.

I will keep posting updates as things progress. Happy for thoughts from anyone! :slight_smile:

8 Likes

@vclay or @rmounir

I just wanted to clarify a current capability of Monty.

This may be a silly question, but I don’t think the current capabilities page mentions that Monty can learn brand-new objects (ones it hasn’t learnt or ever seen before). I believe the potential to learn new objects is mentioned in the Current Capabilities of the first TBP Implementation YouTube video, but I just wanted to clarify.

1 Like

Hi @Zachary_Danzig

Yes, Monty can learn new objects (or extend its knowledge of objects it has seen before) at any point. We evaluate Monty’s continual learning abilities (without catastrophic forgetting) in our recent preprint here: https://arxiv.org/pdf/2507.04494 (section 4.3.2 and figure 7)

You might also find this part of our documentation interesting; it talks about how learning and inference can be two intertwined processes in Monty.

And this tutorial on continual, unsupervised learning: Unsupervised Continual Learning

I hope this helps!

Best wishes,

Viviane

4 Likes

Hi @Zachary_Danzig , I’m back from paternity leave now - exciting to hear about the progress you made in planning the project.

Firstly, I just thought I would echo the advice that Ramy and Viviane gave you: starting by replicating MontyMeetsWorld, and focusing on a depth camera over other sensors to begin with, is a great idea.

Re. the tactile sensor you were interested in trying, and the issue of the target object moving when touched: I think in some sense this isn’t too big of an issue, as you could always fix the object at its base so that it is unlikely to move. I think the bigger issue is having a policy that controls the arm, i.e. one that will get the touch sensor onto the surface and be able to explore it. One option (similar in spirit to our ultrasound project) is that if you can remote-control the arm, you could use this to move it until it touches the object and then move over the surface; as long as you are recording the movement data and able to feed this to Monty, it doesn’t matter that the motion was carried out by a human. This is similar to how, in the ultrasound project, a human operator moved the probe, but we tracked its position so that Monty could use this to determine the change in location, and thereby perform object inference.
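
To illustrate, here’s a rough sketch (in Python) of the kind of movement logging I mean; `get_end_effector_pose` is just a stand-in for whatever call your arm’s SDK provides, and none of this is existing Monty code:

```python
import numpy as np

def log_displacements(get_end_effector_pose, num_steps):
    """Record end-effector positions while a human moves the arm, then
    return the displacement between consecutive readings.

    get_end_effector_pose is a placeholder for whatever call the arm's
    SDK provides to read the current (x, y, z) position in metres.
    """
    poses = [np.asarray(get_end_effector_pose(), dtype=np.float64)
             for _ in range(num_steps)]
    # The per-step displacements are the movement data Monty would need
    # alongside each sensor reading to update its location estimate.
    displacements = [b - a for a, b in zip(poses[:-1], poses[1:])]
    return poses, displacements
```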

I still think it makes sense to start simple and not worry about the tactile sensor, but just thought I’d make that point in case you have the time towards the end of the project.

Hi @nleadholm, welcome back! And sorry for the long delay in my response; it’s been very busy on the approach to the end of term next Friday. Alongside some other coursework, I’m currently drafting my presentation for Gateway 2. This is a slide deck formalising the project aim and objectives, background research and more. I submit it this Friday and then present it in early January.

In terms of project planning, there are some changes to what I’ve shared above. To prevent bottlenecks when it comes to calibrating the stereo camera attached to the robot, I’ve decided to mount the camera statically and use it with a distant agent.

I’m thinking of mounting a ToF array sensor to the robot, something like this, which will act similarly to the surface agent (though it will have different raw data input to the SMs and won’t actually touch the surface).

Any thoughts? Hopefully this is a reasonable alteration. If I find myself with more time later down the line I’ll add more complexity to the system, but for now this is my intended approach.

And thanks for your points on tactile sensing; I hadn’t considered having the motion carried out by a human, and I’ll keep it in mind as a potential approach - this might be something I use with the ToF sensor if I’m unsuccessful in making a policy for the arm.

What I really envisioned for the tactile sensing was something like cat’s whiskers that could be brushed over an object with negligible force, but this technology is still in its very early days.

Looking forward to the online Meetup on the 17th!

3 Likes

No worries at all, and that’s exciting. Best of luck with wrapping up your Gateway 2 slides.

Just to check, from the description you’ve given:

  • The statically mounted stereo camera: in terms of how movement will still exist - from your earlier descriptions, are you going to move a small sensory patch within a larger (depth-augmented) image, like in Monty Meets World?
  • The moving ToF sensor: this does indeed sound similar to the surface agent; for what it’s worth, we also don’t have collision detection in Monty’s Habitat simulator at the moment, so the “surface agent” is in fact a depth camera for which we just set a short maximum perceptual depth. That way it needs to be close to the surface to sense anything, but this works well with the policy it executes. If it gets even closer, you can imagine the finger “bending” to accommodate the shorter distance (see the sketch after this list). It sounds like you could do something similar, assuming the ToF still works at short distances. Overall I think this is definitely a reasonable approach. It will still be a different imaging modality from the stereo camera, and will have a different policy.
  • One other thing to be mindful of with the ToF sensor is how accurately you will be able to estimate surface normals and principal curvature from its depth readings. I think this is actually a good test of Monty’s robustness in the real world when these are noisy, but if you are able to get a sensor that is slightly higher resolution (e.g. 16x16), that might help.
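
As a concrete (though simplified) sketch of that short-maximum-depth idea, and not the actual Monty implementation, something like the following would make the sensor “blind” beyond a chosen distance; the threshold value is purely illustrative:

```python
import numpy as np

MAX_PERCEPTUAL_DEPTH = 0.05  # metres; an illustrative threshold only

def mask_far_readings(depth_patch):
    """Treat any reading beyond the maximum perceptual depth as 'nothing
    sensed', so the sensor only reports a surface when it is close,
    mimicking the touch-like behaviour described above."""
    depth_patch = np.asarray(depth_patch, dtype=np.float32)
    sensed = depth_patch <= MAX_PERCEPTUAL_DEPTH
    return np.where(sensed, depth_patch, np.nan), sensed  # NaN = no contact
```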

Whiskers would be cool! But yeah I’m not aware of any physical hardware that would do a good job of capturing that.

3 Likes

Thanks!

  • Yep, that was the intention for the stereo camera. I’m wondering if this will hinder recognition, as the stereo camera won’t be able to see the whole object, or as much of it as the ToF sensor can. Will voting still work as expected between the learning modules of the two different sensors?
  • Perfect, hopefully I’ll be able to adapt the surface agent for my use case then. I’m going to have a look and select a ToF sensor over Christmas.
  • Good point, I’ll have a look at getting one with a bit higher resolution at short distances.

Coincidentally, I came across this research the other day from the Human Brain Project with some similarities to the TBP. The robot in it appears to use some sort of tactile whiskers for sensing!

2 Likes
  • Ok nice. It will indeed prevent the surface agent from classifying objects where the distinguishing features are occluded; however, this need not be a bad thing. For example, what if the distant agent LM can then direct the surface agent LM to test a hypothesis on the other side of the object, thereby confirming or refuting its hypotheses? Voting should indeed work between these two LMs, so it could be a nice way of showing multi-modality in practice.
  • One thing to note is that Monty does not currently support multiple agents, simply because it is not implemented in the Habitat interface. In other words, all our experiments currently use one agent with potentially multiple sensors fixed to it. Implementing this is something we could look to fast-track (perhaps in collaboration with yourself) as it becomes necessary for your project.

Very cool video about the whiskers, thanks for sharing! Looks like Martin Pearson at UWE/Bristol has been working with whisker robots for a while; I was not aware of this research until now (https://royalsocietypublishing.org/rstb/article-abstract/366/1581/3085/21765/Biomimetic-vibrissal-sensing-for-robotsReview?redirectedFrom=fulltext)

2 Likes

Ok perfect.

I didn’t realise Monty currently only supports one agent at a time; I’m definitely happy to help with the implementation of multiple-agent support. My current project plan has me reaching that in mid-to-late February.

I’d not heard of it either, it was only when I saw a cat climb through a gap in a fence the other week that I thought to have a google about how they sense with their whiskers!

Good luck with the presentation shortly!

2 Likes

Thank you, and sounds good!

Love the cat inspiration.

1 Like

Hi,

I’ve now completed the tutorials and have been designing a few of my own experiments. I wanted to expand the Monty Meets World worldimages dataset with my own images, so I have taken .HEIC portrait selfie images with the TrueDepth camera on my iPad. From these I am able to extract the depth image and a PNG through Python.

Original .HEIC:

Depth Image:

PNG (with an added alpha channel; I found I had to add this channel to meet dimension requirements). This was added to a folder I created in tbp/data/worldimages, matching the existing folders:

I ran into a few issues with my Monty Meets World experiment; specifically, I don’t know the format of the “.data” files in the worldimages dataset. Are they depth images represented as NumPy arrays, saved as raw binary? (I might be overcomplicating things here.)

I suspect I’m going wrong somewhere, as the sensor depth image in the figure below doesn’t seem to detect any depth variations, and the sensor patch does not move whilst the step count increments very quickly. For the experiment figure below I used the PNG above and converted the depth image to a NumPy array before saving it with np_depth.astype(np.float32).tofile("depth_0.data")

Any thoughts are appreciated.

Wishing you all a Merry Christmas!

1 Like

Hey Zach, exciting to see you’re up and running with that. My immediate answer is that I’m not sure, and unfortunately I need to run, so I won’t be able to respond properly until Monday; at that point I can dig into what we did with the dataset files.

One suggestion, however, if you haven’t already tried it: it’s worth checking what your actual depth sensor values are in the patch. Sometimes normalization issues can make it look weird, but if you look at the raw numerical values, do you see that they are all 0s, 1s, or something else? Something as simple as an assumed range of 0:1 vs 0:255 vs 0:100 can make a big difference.
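
For example, a quick check along these lines (assuming the file really is a flat float32 buffer, and filling in your actual image dimensions) would show the raw range:

```python
import numpy as np

WIDTH, HEIGHT = 640, 480  # placeholders: substitute your image's dimensions

depth = np.fromfile("depth_0.data", dtype=np.float32)
print("num values:", depth.size, "expected:", WIDTH * HEIGHT)
print("min:", depth.min(), "max:", depth.max(), "mean:", depth.mean())
# If min/max come back as 0 and 255 (or everything is 0 or 1), that points
# to a range/normalisation mismatch rather than a file-format problem.
```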

2 Likes

Hey Niels, thanks, and good point. I seem to have my depth values at 0:255, but normalising to 0:1 seems to have no effect. I’m certain it’s an issue with how I’m forming the .data files, so I’ll do some more digging.

1 Like

Hey Zach, just to follow up, have you already had a chance to look through both of the following?

`uploadDepthData` in the Swift code for the iPad app, which you can find here

and

MontyRequestHandler in server.py, which you can find here

Hopefully, by replicating this pipeline (I’m not sure if you’re using the original Swift code or a custom iPad app that you’ve made), you can resolve the issue.

My understanding from this existing code (bearing in mind this was some time ago) is that:

  • The saved data is the HTTP payload written as raw bytes with no header
  • The format is 32-bit float after conversion from Float16
  • The layout is a flat 1D array of length width * height (depthDataSize), with dimensions derived from the CVPixelBuffer
  • The units are probably metres (a minimal read/write sketch follows below)
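
Putting those points together, here’s a minimal sketch of writing and reading a file in that assumed format; the file name and dimensions are placeholders:

```python
import numpy as np

HEIGHT, WIDTH = 480, 640  # placeholder dimensions (from the CVPixelBuffer)

# Writing: a flat, headerless buffer of float32 depth values in metres.
depth_m = np.random.uniform(0.2, 1.5, size=(HEIGHT, WIDTH)).astype(np.float32)
depth_m.ravel().tofile("depth_0.data")

# Reading back: same dtype, no header, reshape using the known dimensions.
loaded = np.fromfile("depth_0.data", dtype=np.float32).reshape(HEIGHT, WIDTH)
assert np.allclose(loaded, depth_m)
```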

Hope that helps, but don’t hesitate to let me know if you’re still stuck.

2 Likes

Thanks Niels, I spotted the MontyRequestHandler but didn’t see the uploadDepthData and uploadRGBData in the Swift code.

I’m unable to compile an app without Xcode, so I’ve been writing a Python script on WSL (translating the original Swift code) to take the TrueDepth .heic image as input, extract the necessary data and create the necessary files (.heic to .png and .data).

I believe I’m meeting the above points, except for the units being metres; I suspect the normalised 0:255 range I have is messing things up. This is probably due to image quantisation converting the absolute depth to 8-bit relative depth after the image is taken by the iPad.

I’m pausing for now, but might be able to work around this by:

  • Approximating back to metres using the iPad camera’s min/max range (though I think this comes with a significant loss of accuracy); see the sketch after this list
  • Using existing apps to get the data in a raw format. I’ve come across three iPad apps (1, 2, 3) that may have potential, though they use the LiDAR, so I’ll need to switch tack slightly. Most of the available apps I tested previously tended to use AI methods instead of the built-in sensors, but these claim to use the sensors.
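
Here’s the rough sketch of the first option; the min/max range values are placeholders I’d still need to confirm for the TrueDepth camera, and it assumes the 8-bit values map linearly onto that range (and aren’t disparity-like), which may well not hold:

```python
import numpy as np

# Placeholder working range; I'd need to confirm these for the TrueDepth camera.
MIN_DEPTH_M, MAX_DEPTH_M = 0.2, 5.0

def approx_metres(depth_8bit):
    """Linearly map 8-bit relative depth (0-255) back onto an assumed metric
    range. If the values are disparity-like (near = large), the mapping
    direction would need flipping; accuracy is limited to 256 levels."""
    relative = np.asarray(depth_8bit, dtype=np.float32) / 255.0
    return MIN_DEPTH_M + relative * (MAX_DEPTH_M - MIN_DEPTH_M)
```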

Worst case, if I can’t get it working, I’ll be back at uni in January and can get access to the actual stereo camera. The iPad method is my alternative for now, but if I do get it working I’ll post the method for others to use if they can’t access Xcode.

1 Like

Ah ok, interesting. Yeah, that seems like a reasonable approach. One other thought to complement approximating back to metres is that you could take images of some calibration items where you know the depth (the distance between the iPad camera and the object you’re imaging). You can then sanity check what value you are getting in your depth map for those areas, adjusting as necessary.
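
For instance (a sketch with made-up numbers), two calibration shots at known distances would be enough to fit a linear correction:

```python
import numpy as np

# Made-up example numbers: mean depth-map value in a small patch around each
# calibration object vs. its tape-measured distance from the camera in metres.
map_values = np.array([35.0, 180.0])
true_metres = np.array([0.30, 1.50])

# Fit depth_metres = scale * map_value + offset.
scale, offset = np.polyfit(map_values, true_metres, deg=1)

def to_metres(depth_map):
    return scale * np.asarray(depth_map, dtype=np.float32) + offset
```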

Definitely feel free to post again if you get close but not quite to what you’re hoping for before you’re back with your stereo camera.

1 Like

Hi Niels, Wishing you a Happy New Year!

I’ve been looking into selecting the ToF sensor to be attached to the robot arm and used with an adapted surface agent. Based on your previous comment, I’ve been looking for one with higher resolution than the 8x8 VL53L5CX I was considering before. I’ve narrowed it down to two that have potential and was wondering if you, @rmounir or any Discourse readers might be able to give me some insight:

  • Mikroe LightRanger 14 Click with the TMF8829 chip, measuring up to 48x32 zones with a range of 0.01-11m. However, this isn’t plug-and-play and will need some sort of bridge via a microcontroller or a USB-to-I2C adapter (as suggested by Gemini), though this might be worth it for the close range.
  • 3D TOF Sensor Camera - DFRobot, measuring 100x100 zones with a 0.15-1.5m range. This is said to be plug-and-play, with a direct connection to the PC as well as ROS support.

Sorry for coming in late (I only recently skimmed the thread :-). In any case, congratulations on your progress in defining the necessary components, approach, etc. Here are some random notions for your amusement and consideration…

More cameras

A few years ago I bought some small (~1.5" dia.), battery-powered, Wi-Fi-connected cameras. IIRC, they only cost $50 each; they may be available even more cheaply now. It might be useful to mount something like these on the robot arm itself, focused on the gripping region.

For example, this “Wi-Fi body camera” seems promising, at about $30:

Alternatively, something like a borescope / endoscope camera could let you mount sensors on the gripper(s) themselves. For example, this one is about $20:

Discussion

Using multiple, simple cameras could let you explore various (e.g., non-mammalian) sensing and evaluation scenarios. Connectivity and/or power could be provided via Bluetooth, USB, Wi-Fi, etc.

Note: It may be necessary to jump through some hoops to find out how to interact with the camera(s) from your “base” computer (e.g., cell phone, laptop, tablet). I’d recommend doing some research and/or making some picky inquiries before buying anything.

Turntable

An inexpensive turntable (i.e., lazy susan) would give the setup another degree of freedom (letting Monty view objects from many more angles, under precise control). This would let you explore a variety of sensorimotor-based activities.

This approach is commonly found in 3D scanners, using a motorized turntable and surrounding cameras. In your use case, the robotic arm could spin the turntable itself, reposition objects on the platform, etc. If need be, cogs on the table’s periphery could make things easier for the gripper.

Turtles (all the way down)

There are various ways to specify the actions and motions of robotic arms, etc. (The nice thing about standards is that there are so many to choose from. :-) I’d probably start with a textual (e.g., JSON-based) representation of a simple command set. Python’s turtle module might be an interesting starting point for an implementation.
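
For instance, here’s a minimal sketch of that idea: a JSON command list interpreted via Python’s turtle module (the command vocabulary is made up purely for illustration):

```python
import json
import turtle

# A toy textual command set; the vocabulary is purely illustrative.
program = json.loads("""
[
    {"cmd": "forward", "dist": 100},
    {"cmd": "turn",    "angle": 90},
    {"cmd": "forward", "dist": 50}
]
""")

bot = turtle.Turtle()
for step in program:
    if step["cmd"] == "forward":
        bot.forward(step["dist"])
    elif step["cmd"] == "turn":
        bot.left(step["angle"])

turtle.done()  # keep the window open until it is closed
```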

Use Case?

From time to time, I’ve speculated about crafting an intelligent robot for gardening. This might use a lightweight mobile platform (with motorized wheels), a robot arm and cameras, etc. It could be trained to detect, recognize, and report assorted plants. At some point, it might even be allowed to tend (e.g., weed) the garden.

As a small step in this direction, how about maintaining a few potted plants, letting the robot monitor their growth and emerging characteristics? Over the course of your project, you’d have time to observe and interact with an entire growth cycle.

2 Likes