Learning Categories of Objects

Hello,

I have been looking at Grid Object Models for Unsupervised Learning and how to learn categories of objects. Before jumping into making a dataset and testing, I had some conceptual questions the team might have already thought about.

I’m planning on starting with mugs or something easy like that, but when thinking about the future I inevitably thought about applying this to medical imaging, since that’s my area of work. Given a CT scan, we essentially have a 3D model of part of a human, with features (intensity, curvature) at each pose. The cool thing about it is that we get X-ray vision, so Monty could learn a model of a human body but, within that, know the objects that compose it, such as the heart, liver, etc. This is where generalizing to within-category objects comes in, since organs can have different sizes and structures.

Although mugs can also vary in structure and size, most of them vary only in color patterns. My main question is: do you think it would be better to start with an object that has more variance in its size and structure? If so, do you have any recommendations?

**Edit:** I think this is very interconnected with the Support Scale Invariance issue, but does the team consider them completely separate? One could separate categories with objects of the same structure and size but different color, but I think that’s only a subset of generalizing within categories; to truly do that, you would need scale invariance.


Hi @YiannosD , that’s great to hear you’re interested in this item.

You’re absolutely right that morphological/structural differences would be important for testing this. While mugs do often differ mostly by color, they can also have structural differences, like the (admittedly odd!) examples below show. You would want to reflect this to some degree in any dataset, to see whether canonical representations emerge that are more like the “Platonic” representation of a mug we often think of (a partly hollowed cylinder + a handle).

It certainly isn’t a requirement that you use mugs. One other idea, if you wanted to build a dataset from publicly available/open-source 3D assets, would be a dataset of 3D-modelled trees and boulders. Do you then get a fairly generic model of a tree (brown vertical center, bushy green top) and of a boulder (round, brown-grey object)? If you have different thresholds for detecting an object as unfamiliar, you may get models of different granularity (e.g. one LM learns a model for conifers vs. deciduous trees, while another has an even more fine-grained separation by species).

Medical data would definitely be interesting to explore in the longer term, but I think it would be best to stick to something simpler for now.

Re. scale invariance, I agree this is a related issue, in that categorization can either depend on or be totally invariant to the scale of an object (e.g. a toy car vs. an actual car). It’s probably simplest to avoid this for the moment by limiting scale differences between objects that we expect to share a category to within ~10% of one another. Once we have implemented a solution to scale invariance, we could consider more extreme deviations from this.
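As a rough illustration of that constraint (just a sketch, not anything from the Monty codebase; the function name and the choice of "size" measure are made up for the example), a ~10% scale check between two objects could look like:

```python
def within_scale_tolerance(size_a: float, size_b: float, tolerance: float = 0.10) -> bool:
    """Return True if two object sizes (e.g. bounding-box diagonals, in meters)
    differ by no more than `tolerance` relative to the larger one."""
    larger = max(size_a, size_b)
    return abs(size_a - size_b) / larger <= tolerance

# Two mugs at 0.10 m and 0.105 m: within the ~10% band.
print(within_scale_tolerance(0.10, 0.105))  # True
# A toy car (0.1 m) vs. an actual car (4.0 m): far outside it.
print(within_scale_tolerance(0.1, 4.0))     # False
```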

Hope that makes sense, let me know if I can clarify anything.


That makes sense, thank you! I think trees might be a very good example to start with. I’ll work on this and see where it takes me!


Ok great, sounds good! Let us know if there’s anything we can help with once you get started.


Although you may be able to find datasets of 3D-modeled trees and/or boulders, a problem could arise if you want to bring actual examples into your lab (:-). One advantage of mugs and staplers is that they are compact, lightweight, etc. So you might want to consider working with smaller plants and/or rocks instead.

On a vaguely related note, some years ago I took a blind friend to the San Diego Botanic Garden (Encinitas). We examined and discussed a number of plants, trying to find structural and morphological commonalities that would translate between sight and touch.


Once we need a real tree, it may actually be fun to work outside, haha! But yes, plants might be easier to work with, thanks.

That’s an interesting story, sounds fun!


@nleadholm Hi Niels,

Following up on your tree suggestion. I went ahead and built a prototype for unsupervised category learning using 10 tree species from Objaverse (birch, cedar, cypress, maple, oak, palm, pine, spruce, a generic tree, and willow).

The setup is `MontyObjectRecognitionExperiment` with `do_eval: false`. I used `EvidenceGraphLM` with a `MeshEnvironment` wrapping trimesh.

Some interesting findings along the way:

  • `max_match_distance` is the dominant knob for collapse. The default 0.01 m (1 cm) is way too tight for ~1 m trees: the median nearest-neighbor distance in stored models is about 1.7 cm, so query points almost never land within range and evidence stays negative. Bumping it to 0.03 m was the sweet spot between over-splitting and collapsing everything into one model.

  • `x_percent_threshold` and `object_evidence_threshold` barely matter in comparison. I ran a 12-point grid sweep and `max_match_distance` dominated everything.

  • `InformedPolicy` was a dead end for this task. It only does Look/Turn, so it captures roughly one hemisphere per episode. A tree seen from the front in epoch 0 looks completely different from the back in epoch 1, so cross-epoch recognition failed almost entirely. Switching to `SurfacePolicy` fixed this: cross-epoch recognition went from nearly zero to 7/9 trees being re-recognized.

  • Best result so far: 5 learned models for 9 tree species (willow ran out of steps). The two dominant models each absorbed 4–5 species: one grouped cedar/cypress/spruce/maple/birch, the other grouped palm/pine/tree_generic/maple. Oak/birch/cypress stayed as singletons. So we’re getting genuine category-like collapse, though not yet the conifer vs. deciduous split you hypothesized. This could be a geometry-based collapse (dense bushy canopy vs. tall/sparse trunk).

  • `desired_object_distance` also matters a lot: the default 0.025 m puts the camera basically inside the canopy. 0.3 m works for trees.
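For anyone reproducing the `max_match_distance` observation above: the median nearest-neighbor spacing of a stored model's points gives a natural lower bound on a useful matching radius. A minimal sketch of that check with a scipy KD-tree (this is just the diagnostic, not Monty's actual matching code):

```python
import numpy as np
from scipy.spatial import cKDTree

def median_nn_distance(points: np.ndarray) -> float:
    """Median distance from each stored model point to its nearest neighbor.

    If max_match_distance is much smaller than this value, query points
    will rarely land within range of any stored point, and evidence stays
    negative regardless of the other thresholds.
    """
    tree = cKDTree(points)
    # k=2: the closest hit is the point itself (distance 0), so take the second.
    distances, _ = tree.query(points, k=2)
    return float(np.median(distances[:, 1]))

# Toy example: points on a regular 1.7 cm grid have 1.7 cm nearest-neighbor
# spacing, so a 1 cm matching radius would miss almost every query point.
axis = np.arange(0.0, 0.17, 0.017)
grid = np.stack(np.meshgrid(axis, axis, axis), axis=-1).reshape(-1, 3)
print(median_nn_distance(grid))  # ~0.017
```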

This was all with 2 epochs and 2000 steps as a debug run. I’m planning to do a longer run next (3+ epochs, 5000 steps) and re-run the parameter sweep now that SurfacePolicy is working. Before I design the next round of experiments though, do you have any suggestions for what to try, or specific things you’d want to see tested?


Wow, that’s awesome to hear @YiannosD , thanks for making such great progress.

And really interesting results, some initial questions that come to mind:

  • Are you able to share visualizations of the tree objects you are using?

  • Similarly, can you share a visualization of the two main learned models you describe, plus the Oak/birch/cypress singletons? It would be particularly interesting to see what the grouped models look like.

  • How many model points are you allowing for each object? Maybe if you are able to share a repository with your configs, then I can also check some details like this directly.

  • One thing to keep in mind during your hyperparameter sweep is that, in a full Monty system, we would want multiple LMs, and these could focus on learning models at different levels of detail/abstraction. As such, there won’t necessarily be one perfect configuration; rather, we want e.g. one LM that learns conifer vs. deciduous (although that may not practically happen at the moment, for various reasons), one that learns them all as separate models, etc.

  • One thing you could consider is using something like a simple dendrogram (Dendrogram - Wikipedia), with mini-versions of the models visualized, to see how the breakdown happens as a function of different hyperparameters. You might find the below figure useful for this. In this case there would only be one level, based on which model each object belongs to.
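To make the dendrogram idea concrete, here is a sketch using scipy's hierarchical clustering. The per-species feature vectors below are entirely made up for illustration (stand-ins for whatever summary you extract from the learned models); the point is the bookkeeping: cutting the tree at different heights mirrors using different "unfamiliar object" thresholds, giving coarse or fine groupings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical per-species features (canopy density, height/width ratio,
# trunk fraction) -- invented numbers, not measurements from the models.
species = ["cedar", "cypress", "spruce", "pine", "oak", "maple", "birch", "palm"]
features = np.array([
    [0.9, 3.0, 0.2],  # cedar
    [0.9, 3.2, 0.2],  # cypress
    [0.8, 2.8, 0.3],  # spruce
    [0.7, 2.5, 0.4],  # pine
    [0.9, 1.2, 0.3],  # oak
    [0.8, 1.3, 0.3],  # maple
    [0.6, 1.5, 0.4],  # birch
    [0.3, 2.0, 0.7],  # palm
])

# Ward linkage builds the full merge tree over the feature vectors.
Z = linkage(features, method="ward")

# Cutting at different heights gives different granularities: a loose cut
# yields coarse categories, a tight cut splits species apart.
coarse = fcluster(Z, t=2, criterion="maxclust")
fine = fcluster(Z, t=5, criterion="maxclust")
print(dict(zip(species, coarse.tolist())))
print(dict(zip(species, fine.tolist())))

# dendrogram(Z, labels=species) draws the full tree (requires matplotlib),
# and you could place a mini model render at each leaf.
```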