Some thoughts about scale invariance and model composition

This is my first post, so first of all: hello to the Monty team, and thank you for such an interesting and promising project and for sharing it with the world! I’m not confident enough to believe I’m ready to contribute practically to these topics; for now I’m just trying to build a better understanding of the problem and organize things in my head. I mostly read and watch the available and recommended materials and have not yet really dived into the Monty code, so I may miss something that’s already there. Most probably the ideas below have already been considered by the Monty team in one way or another, and some may have been discarded for good reasons, in which case I apologize for reinventing the wheel, or at least a few of its spokes.

Recognizing an exact real object also means recognizing its scale, and at the start of recognition we have no strong evidence about the real object’s size. So however scale support is implemented, we will need to produce and check scale-aware hypotheses at some point in the recognition process. As mentioned in Support Scale Invariance, testing different scales of an object may be required, and the set of scales to test may be defined by heuristics based on low-level sensory input and past experience, which definitely makes sense to me. Sensory input should normally be available at any time, while experience builds up gradually and may be less useful in a sensorimotor system’s “infancy”.

My general understanding is that the notion of scale could be part of the model itself, as the relative scales of the model’s subparts. Of course, for this a model has to be composed of subparts rather than being monolithic, like a single point cloud or graph, where points have no size and relative scale therefore has no meaning. Another argument for subparts is that recognition will start by exploring some particular part of an object, and it may be useful to generate hypotheses about which part it is and what size it may have already at this early stage of recognition. So to me, the topics of model composition and scale invariance look intertwined.

In Some Questions from the Documentation - #21 by vclay the following is mentioned:

Related to HumbleTraveller response, the first case (info from multiple objects in one graph) is actually a desirable quality in some cases such as when we want to learn a model of the general object category (hot air balloon) instead of detailed models of specific instances.

To me it looks like it could be useful to have objects’ subparts, rather than complete objects, as the “basic” models, since multiple general category models will likely share some subparts, though not necessarily all of them. For example, both a mug and a can (or other vessel) may have the same flat round bottom, and the can may even reuse the same subpart for its top. In that case the graphs of several object models (with subparts as nodes) could be partly merged, allowing natural movement between somewhat similar models.
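
To make that a bit more concrete, here is a minimal sketch of what I have in mind (all names and data structures are hypothetical, not Monty code): several object models share subpart nodes, and an inverted index lets an observed subpart point to every model that contains it.

```python
from collections import defaultdict

# Hypothetical registry: each object model is just the set of subparts it is
# composed of (relative poses omitted for brevity).
object_models = {
    "mug": {"flat_round_bottom", "cylindrical_wall", "rim", "handle"},
    "can": {"flat_round_bottom", "cylindrical_wall", "flat_round_top"},
}

# Inverted index: a shared subpart points to every model containing it, so the
# partly merged graphs let us move between somewhat similar models.
models_containing = defaultdict(set)
for model_name, subparts in object_models.items():
    for subpart in subparts:
        models_containing[subpart].add(model_name)

print(models_containing["flat_round_bottom"])  # {'mug', 'can'}
```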

Subparts may also be useful for modeling object states, if we adjust the relative poses and scales of the subparts. Something like this seems to be considered in Implement & Test GNNs to Model Object Behaviors & States, but for points. To me, state changes at the level of points correspond more to stretching/squeezing/twisting etc. of a non-rigid object, while composite objects like a stapler have a few rigid subparts moving relative to each other. In the video 2023/02 - Speedup Discussions - Part 1 at 45:30+ there is a suggestion, if I understand it correctly, to treat areas with constant curvatures and gradients as standalone entities to be used instead of the raw points in the same area, to save some computation. It was also mentioned as a good topic for further discussion; I’m not sure whether that discussion happened or what the outcome was, but it may be related to the current topic. In 2023/04 - Outline of Constrained Object Models & Monty Mapping on Cortical Column there was some discussion of learning model parts, but it’s quite old and there may have been advances since then; in the code there is a class GridObjectModel, which looks related to that video, though I’m not sure about the current state of the approach. In any case, there need to be some criteria for splitting an object into subparts, and they had better be data-driven rather than artificial. Model composition is also mentioned in Use Models with Fewer Points, but it’s still about points.

At the moment each point sampled from an object’s surface has a kind of reference frame of its own: the point normal and the principal (min/max) curvature directions, while the curvature magnitudes k1 and k2 are stored as features of the point. It might be possible to cluster sampled points by their k1 and k2 (their differences and/or ratios) and poses. For example, a cylinder could be split this way into its side wall and the two flat circular areas of its top and bottom, or maybe even one circular area reused for both, if the top and the bottom do not differ in other features like color (we may need such features to distinguish an exact object from a set of morphologically identical ones, but that’s another topic). In the case of a mug, its cylindrical wall has some thickness, so its rim has its own monotonic curvature and may be treated as another subpart of circular shape, with sampled points located exclusively along the circle. For a mug handle the logic is the same, but it may consist of a few subparts. A subpart can keep its sampled points, not necessarily all of them but the most informative ones: points at the borders of the subpart’s surface area, so we can judge the subpart’s size, and points that carry additional distinct features, morphological (like texture) and non-morphological (like color), so we can judge the relative poses of these features on the subpart’s surface.
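
As a rough illustration of what I mean by curvature-based clustering (a sketch only; the synthetic data, feature choice and clustering algorithm are my assumptions, nothing from Monty):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic cylinder of radius 0.05 m: points on the side wall have principal
# curvatures (k1, k2) ~ (1/r, 0); points on the flat top/bottom have (0, 0).
wall = np.column_stack([np.full(200, 1 / 0.05), np.zeros(200)])
caps = np.zeros((100, 2))
curvatures = np.vstack([wall, caps]) + rng.normal(scale=0.5, size=(300, 2))

# Cluster in curvature space; differences/ratios of k1 and k2, or pose features,
# could be appended to the same feature vector.
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(curvatures)
print(np.unique(labels))  # ideally two clusters: curved wall vs. flat caps
```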

With noisy measurements and the non-ideal shapes of real-world objects, the clustering task may be non-trivial. I imagine the clustering algorithm should be able to define clusters on the fly, create new ones, and probably redefine existing ones if needed, while trying to keep the number of clusters reasonably(?) low and the points within a cluster relatively(?) similar with respect to their surface curvatures. More sophisticated statistical methods might allow distinguishing more complex surface curvature patterns, like non-monotonic ones. If we observe some complex surface, like a spiny hedgehog’s back, and cannot make sense of the back’s curvature at the scale of individual spines, we might encode the spines’ “unevenness” as another morphological feature – a surface texture (possibly related to Extract Better Features) – and sample the points more sparsely to capture the curvature at a larger scale.
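
A naive online variant of such clustering (again just my own sketch, with a hypothetical distance threshold in curvature space) could assign each new observation to the nearest existing cluster or open a new one:

```python
import numpy as np

def assign_online(curvature, clusters, max_dist=1.0):
    """Assign a (k1, k2) observation to the nearest cluster centroid,
    or start a new cluster if nothing is close enough.

    `clusters` is a list of dicts with a running 'mean' and 'count';
    `max_dist` is a made-up similarity threshold."""
    if clusters:
        dists = [np.linalg.norm(curvature - c["mean"]) for c in clusters]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            c = clusters[best]
            c["count"] += 1
            c["mean"] += (curvature - c["mean"]) / c["count"]  # running mean
            return best
    clusters.append({"mean": np.asarray(curvature, dtype=float), "count": 1})
    return len(clusters) - 1

clusters = []
for obs in [(20.0, 0.1), (19.5, -0.2), (0.0, 0.0), (0.3, 0.1)]:
    assign_online(np.array(obs), clusters)
print(len(clusters))  # 2: "curved wall" vs. "flat area"
```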

Each model subpart – or surface patch, let’s use this term from here on – can have its own reference frame originating, for example, at its centroid. The coordinate system may be Cartesian or spherical; I mention spherical here only because it is easier for me to think in its terms. The radius vectors of all sampled points in the patch’s reference frame are normalized so that the longest one is 1 – the patch’s unit radius vector, which serves as the patch’s own distance unit. With this, it is possible to place other patches (their reference frames) into this patch’s reference frame, giving them relative poses. The ratios of this patch’s unit vector to the other patches’ unit vectors, measured in the units of the outer, sensor reference frame, define the relative scales of the other patches. This can be done for every patch the object consists of, so the whole object model gets multiple representations – one from each patch’s perspective, so to say. The model itself then becomes a graph of patches with directed edges, because each patch uses its own distance unit to measure the other patches’ relative positions and its own relative scale factors. Such a graph probably does not have to be complete, but some strategy will be needed to decide which edges, in which directions, to establish while learning a model.
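
A minimal sketch of that idea (my own made-up types and numbers, orientation left out for brevity – only relative positions and scales are shown):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Patch:
    """Hypothetical surface patch: centroid and sampled points in the sensor frame."""
    name: str
    centroid: np.ndarray  # (3,) in sensor-frame units
    points: np.ndarray    # (N, 3) in sensor-frame units

    @property
    def unit_radius(self) -> float:
        # Length of the longest radius vector from the centroid, in sensor units;
        # this is the patch's own distance unit.
        return float(np.max(np.linalg.norm(self.points - self.centroid, axis=1)))

def relative_edge(src: Patch, dst: Patch):
    """Directed edge src -> dst: dst's position expressed in src's own distance
    unit, plus the relative scale of dst as seen from src."""
    offset_in_src_units = (dst.centroid - src.centroid) / src.unit_radius
    relative_scale = dst.unit_radius / src.unit_radius
    return offset_in_src_units, relative_scale

# Toy example: the flat bottom and the wall of a mug (made-up coordinates).
bottom = Patch("bottom", np.zeros(3),
               np.array([[0.04, 0.0, 0.0], [-0.04, 0.0, 0.0]]))
wall = Patch("wall", np.array([0.0, 0.0, 0.05]),
             np.array([[0.04, 0.0, 0.10], [0.04, 0.0, 0.0]]))
print(relative_edge(bottom, wall))
```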

During recognition, even a single sampled point’s curvatures may give some hypotheses about what kind of patch/subpart is being observed. Even within a single patch one can predict what the surface will look like at a certain displacement (if the patch is large enough). In general,

  • selecting known patches by currently observed curvatures,

  • checking to which models they belong,

  • following edges of the candidate models’ graphs to find neighboring patches and their relative poses (within a single patch we still need to work with points, the same as is done already)

can give testable predictions about which patches of which models we may be observing now and their (the patches’ and models’) possible poses. Estimates of the minimal observed patch size (in the sensor reference frame’s units) can be refined with newly sampled points and can yield hypotheses about the relative and absolute sizes of neighboring patches, making the testable predictions more concrete and shrinking the lists of candidate patches and models. A rough sketch of this loop follows below.
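
Here is that loop as a purely illustrative, self-contained sketch (the patch library, model contents, sizes and tolerances are all invented for the example):

```python
# Hypothetical library of known patches: curvature signature per patch type.
patch_library = {
    "flat_round_bottom": {"k1": 0.0, "k2": 0.0},
    "cylindrical_wall": {"k1": 20.0, "k2": 0.0},
}

# Which models contain which patches, and the patch size each model expects (meters).
model_patches = {
    "mug": {"flat_round_bottom": 0.04, "cylindrical_wall": 0.05},
    "can": {"flat_round_bottom": 0.03, "cylindrical_wall": 0.06},
}

def candidate_patches(k1, k2, tol=2.0):
    """Patches whose curvature signature is close to the observation."""
    return [name for name, sig in patch_library.items()
            if abs(sig["k1"] - k1) <= tol and abs(sig["k2"] - k2) <= tol]

def candidate_models(observed_k1, observed_k2, min_observed_size):
    """Models containing a matching patch at least as large as what we've seen;
    min_observed_size grows with new samples and prunes the hypothesis list."""
    hypotheses = []
    for patch in candidate_patches(observed_k1, observed_k2):
        for model, sizes in model_patches.items():
            if patch in sizes and sizes[patch] >= min_observed_size:
                hypotheses.append((model, patch))
    return hypotheses

# One noisy observation of a flat area, with at least 3.5 cm of surface seen so far.
print(candidate_models(0.3, -0.1, min_observed_size=0.035))  # mug survives, can is pruned
```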

None of this seems to require hierarchy (unless I’m missing something), so it could probably be done within a single LM. Of course, on top of that, learned statistics of the environment should be used, as mentioned in Future work/Support Scale Invariance – for example, giant mugs and tiny buildings are less common than the other way around.
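
Those statistics could enter as a simple prior over absolute size per category; for instance (toy numbers, and the log-normal form is just my assumption):

```python
import numpy as np

# Toy log-normal priors over absolute object size in meters: (mean, sigma) of log-size.
scale_priors = {"mug": (np.log(0.10), 0.3), "building": (np.log(20.0), 0.5)}

def scale_log_prior(category, size_m):
    """Log-prior (up to a constant) of a hypothesized absolute size for a category."""
    mu, sigma = scale_priors[category]
    z = (np.log(size_m) - mu) / sigma
    return -0.5 * z**2

# A 1 m "mug" hypothesis is heavily penalized relative to a 0.1 m one.
print(scale_log_prior("mug", 0.1), scale_log_prior("mug", 1.0))
```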

This is my basic understanding of how object composition and scale could be handled. It relies on the ability to identify curvature patterns and cluster surface patches based on them. That sounds simple, but I suspect the problem is deep and may contain something that makes it infeasible. Aside from that, there are probably some other elephants in the room.

  1. Does any of this make sense?

  2. What are the benefits of working with standalone points, as opposed to higher-level entities, within the scope of a single LM?

  3. Does Use Models with Fewer Points mean that at any hierarchy level a single input feature, be it a sensor-sampled point (with pose) or an object ID from the lower level (with pose), is represented by a single point? For example, the ID (+pose) of a mug from a lower level would be represented by a single point in a higher level’s bigger compositional model of a table with a mug on it.

Thank you.


Hi @artem, thank you for the thoughtful post! Just a quick reply for now, with more later – have you watched this video where Jeff suggests a way scale may be handled? 2022/03 - Scale Invariance in the Brain

Hi @brainwaves, thank you, yes, I watched the video and will definitely re-watch it (more than once, I suppose). It is more about biology and how the brain might do it, and I understand that the principles Monty relies on should be in sync with the brain. I have yet to build my understanding of that side, so I wrote the above without considering biology, which may be a problem of its own, of course.
