Some Questions from the Documentation

  1. How does the system handle false positive matches during the matching phase? If part of Object A matches a similar patch on Model B in memory, matching stops there, and we then explore and update the model of B, wouldn't updating the model with this incorrect match lead to corrupted object representations? What mechanisms are in place to prevent or correct such mismatches?

  2. While I understand that SM poses use point normals and principal curvature directions for local patches, what defines the pose in LMs when representing complete objects?

  3. During matching, we stop when the terminal condition is met. My point is: suppose Monty has learned a model of a cylinder and is then tested on a coffee mug, and the sensor happens to move only over the cylindrical part of the mug; it will then confuse the coffee mug with the cylinder. How do we make sure that for each object the sensor moves over all parts of the object, and how are these parts recognized or divided?

These questions might sound very naive, but I am finding it difficult to understand these points. Please help me out. Thanks a lot in advance.

Can someone please help here?

Thanks

Hi @ak90

Those are great questions! Sometimes it takes me a while to reply because, as you can imagine, I have a lot of other things on my plate at the moment. And since this project was just launched a few weeks ago, most other people here are still learning about it like you are, so please understand if it takes a bit longer to get a response to such in-depth questions.

  1. That's true, and this is indeed what we see happening right now. If you look at our benchmark experiment results for unsupervised learning (Benchmark Experiments), you will notice that, especially in the experiment where we show very similar objects, the LM learns graphs that merge observations of several objects (see the "mean objects per graph" column). Here is an example image of such a graph where points from a cup (no handle) were added into a graph of a mug (has a handle) because the LM thought it was the same object.

Note that for this merged graph to look like it does now, the LM has recognized the location and orientation of the cup such that it aligns nicely with the mug.

We think that this is actually a desirable property of the system. We don't want to create separate models for each instance of an object if this isn't necessary. However, it is on our roadmap to implement a process to separate out models and 'forget' points on models more dynamically so that, if in a certain application it is important to distinguish very similar instances of an object class, the system would be capable of doing so.

For now, we have several hyperparameters you can tweak to make the system more or less likely to merge points from similar objects into the same object model. For instance, we have max_match_distance and tolerances, which specify how closely the sensed points need to resemble the points stored in the model for them to be classified as a match. You can also tweak the x_percent_threshold and min_eval_steps to make the LM take more steps until it is certain enough about its classification. Lastly, you could use more efficient or sophisticated policies, such as taking larger steps or moving quickly to distinguishing features on the object, to make sure the LM sees the relevant parts of the object that make it different from other objects, such as the handle in the example above.
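To give a concrete (but rough) idea of what tweaking these looks like, here is a sketch of the kind of learning-module arguments involved. The values and the exact nesting are made up for illustration; check the benchmark and tutorial configs for the real structure and sensible defaults.

import numpy as np

# Rough sketch with made-up values, not a recommended setting: tightening the
# matching criteria so similar objects are less likely to be merged into one graph.
stricter_lm_args = dict(
    max_match_distance=0.005,  # sensed points must lie within 5 mm of stored points
    tolerances={
        "patch": {  # assuming the sensor module is called "patch"
            "hsv": np.array([0.1, 0.2, 0.2]),  # tighter color tolerance
            "principal_curvatures_log": np.ones(2) * 4.0,
        }
    },
    x_percent_threshold=15,  # require a larger evidence margin before terminating
)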

  2. Good question. In the output of an LM, the pose represents the location and orientation of the detected object. Each LM doesn't just perform object recognition but also determines the sensor's location on the object as well as the object's orientation relative to the orientation in which the object was learned. Object and pose recognition are closely intertwined processes in Monty. There is no object recognition without pose recognition and vice versa.
    For more details on the specific implementation and format of the pose output of an LM, you can have a look at the LM's get_output function (tbp.monty/src/tbp/monty/frameworks/models/evidence_matching.py at f93ff45936c055a7b1f8e15bc3ce16a7fa29868d · thousandbrainsproject/tbp.monty · GitHub)

  3. That's a good point! Like I briefly touched on in the response to question 1, this is something you could achieve using more sophisticated policies. For instance, we have the hypothesis-testing policy, which takes the LM's top two hypotheses and calculates which point on the models of these two objects would most distinguish them. So, if the LM thinks it may be sensing the mug or the cylinder, this policy would suggest moving to where it thinks the handle would be if it were sensing the mug (using its current pose hypothesis). This is just one policy we have implemented for this so far, so if you have other ideas, feel free to make suggestions here or even open a PR or RFC for it.
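To illustrate the core idea behind that policy, here is a small conceptual sketch (random points and standard numpy/scipy, not the actual policy implementation): pick the point on one candidate model that is farthest from any point on the other candidate.

import numpy as np
from scipy.spatial import cKDTree

# Conceptual sketch only: which location on model A would most distinguish it
# from model B? Here: the A point whose nearest neighbor in B is farthest away.
model_a = np.random.rand(500, 3)  # e.g. points of the mug hypothesis (random stand-ins)
model_b = np.random.rand(400, 3)  # e.g. points of the cylinder hypothesis

distances, _ = cKDTree(model_b).query(model_a)  # distance from each A point to its nearest B point
most_distinguishing_point = model_a[np.argmax(distances)]
print(most_distinguishing_point)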

I hope this helps!
-Viviane

1 Like

Hi @vclay

I completely understand, and I genuinely appreciate the time and effort you’ve dedicated to answering all my questions in such a detailed and comprehensive manner. Your explanations are incredibly helpful, and I’m truly grateful for your support.

Thanks a lot!!

1 Like

Hi @vclay

I am sorry, but I am back with a few more questions.

Q1: I ran the pretraining experiment (without matching). Here is a snippet of the logs:
"""
INFO:root:Logger initialized at 2024-12-14 18:25:23.339830
INFO:root:New primary target: {'euler_rotation': array([0, 0, 0]),
 'object': 'mug',
 'position': [0.0, 1.5, 0.0],
 'quat_rotation': array([0., 0., 0., 1.]),
 'rotation': quaternion(1, 0, 0, 0),
 'scale': [1.0, 1.0, 1.0],
 'semantic_id': 1}
INFO:root:Running a simulation to model object: mug
INFO:root:
---Updating memory of learning_module_0---
INFO:root:mug not in memory ()
INFO:root:Adding a new graph to memory.
INFO:root:init object model with id mug
INFO:root:built graph Data(
  x=[472, 30],
  pos=[472, 3],
  norm=[472, 3],
  feature_mapping={
    node_ids=[2],
    pose_vectors=[2],
    pose_fully_defined=[2],
    on_object=[2],
    object_coverage=[2],
    rgba=[2],
    min_depth=[2],
    mean_depth=[2],
    hsv=[2],
    principal_curvatures=[2],
    principal_curvatures_log=[2],
    gaussian_curvature=[2],
    mean_curvature=[2],
    gaussian_curvature_sc=[2],
    mean_curvature_sc=[2]
  }
)
"""

I tried to understand the numbers but am facing some issues. Please help me:

  • x=[472, 30] ------- 472 nodes are saved, but I am unable to make sense of the 30
  • pos=[472, 3] ------- 472 points with x, y, z as the 3 coordinates?
  • norm=[472, 3] ------- 3 curvature directions and the point normal for 472 points?
  • feature_mapping={node_ids=[2]} ------- I am unable to make sense of what the 2 is

Q2: In supervised learning the model is given the object ID and pose. I understand that the object ID would be used for saving the graph of the model, but I am unable to make sense of how the pose is used. As far as I can tell, we just take the points from the SM and plot them in 3-D space. Do we try to rotate the object to match the given pose at the end of graph building?

Q3: Given the vastness and complexity of the code base, can you please suggest an order in which I should go through it? Theoretically, I have understood the concepts by going through the tutorials and other resources that you suggested earlier. Now I want to actively work with the code, and for that I need help here.

Thanks a lot in advance.

1 Like

I can potentially help with Q3 if you haven’t seen this video yet in which the team goes through the code base - https://www.youtube.com/watch?v=x0e5SBY2nu8

1 Like

Hi Will. Thank you so much for the response.

I have already gone through the materials suggested by @vclay, including this video, but I am still finding it a little difficult to join all the threads together.

Hi @ak90, I authored a few sequence diagrams as I was reading the code in order to better visualize what was going on. Here’s an example from RFC 4 of what the code is doing when executing a HabitatEnvironment.step(action) method.

Is this sort of thing helpful? I have a few more in RFC drafts that I could share.

1 Like

Hi Tristan. Thanks a lot.
This will help. I request you to please share the others too.

Thanks again :smiley:

Hi @ak90
I'm happy to see that you went through the tutorials and want to dive deeper into the code! Here are some more specific answers to your questions:

Q1: What Information is Stored in the Graphs?
x refers to all the nodes in the graph (here 472). At each node we can store a number of features. Here, we store 30 feature values (some features are stored as multiple values; RGB, for example, takes up 3 values in the x vector). pos refers to the x, y, z locations, as you correctly guessed. norm is the point normal for each point (curvature directions are actually stored in the features in x). feature_mapping is a dictionary that tells you which indices in x correspond to which feature: each key corresponds to one of the features stored in x, and each value contains 2 numbers, the start and end index of that feature in the x vector.
You can look at the ObjectModel class for more details on how object models are defined. Particularly the build_adjacency_graph function may be useful as this is where the graph object you are referring to is created.
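As a small, self-contained illustration of how those start and end indices tie x to the individual features (the column ranges below are made up, not the exact layout Monty uses):

import numpy as np

# Illustrative only: slicing one feature out of the flat node-feature matrix x
# using feature_mapping-style [start, end] column indices.
num_nodes = 472
x = np.random.rand(num_nodes, 30)  # 30 feature values stored per node
feature_mapping = {
    "rgba": [6, 10],  # hypothetical column range: 4 values per node
    "hsv": [10, 13],  # hypothetical column range: 3 values per node
}

start, end = feature_mapping["hsv"]
hsv_per_node = x[:, start:end]  # shape: (472, 3)
print(hsv_per_node.shape)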

Q2: How is the Pose used in Supervised Learning?
Even though we supervise the learning in this setup by supplying the object label and orientation, the agent still needs to move over the object itself to collect observations. We don't just give the agent a complete point cloud of the object. This means that often not all parts of the object are seen in one episode. For example, the distant agent is fixed in one position and tilts up, down, left, and right like a camera mounted somewhere, which means it is impossible to see the backside of the object. This is why we usually show the object in multiple rotations during training (like the 14 rotations visualized in this tutorial: Pretraining a Model).
That means we first build a model of the object in one orientation. The next time we see this object, we may see it in a totally new location and orientation. So, if we want to extend our model of this object with newly observed points, we first need to rotate and translate those points into the object's reference frame. This is what the pose is used for (the same way it is also used during unsupervised learning, except that there the model has to infer the pose on its own). For a more detailed overview with some animations and plots, you could watch these two videos: https://www.youtube.com/watch?v=0Gcw1itpbWM
https://www.youtube.com/watch?v=bkwY4ru1xCg
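If it helps, here is a minimal standalone sketch of that rotate-and-translate step, with a made-up pose and made-up points (not the actual Monty code):

import numpy as np
from scipy.spatial.transform import Rotation

# Made-up example: the object was learned in a reference orientation at the
# origin; now it is detected at a new location and orientation in the scene.
object_location = np.array([0.0, 1.5, 0.0])
object_rotation = Rotation.from_euler("xyz", [0, 90, 0], degrees=True)

# Newly observed points, expressed in the body/environment reference frame.
observed_points = np.array([
    [0.02, 1.52, 0.01],
    [0.03, 1.48, 0.04],
])

# Undo the object's translation and rotation so the new points line up with
# the model learned in the original orientation (object reference frame).
points_in_object_rf = object_rotation.inv().apply(observed_points - object_location)
print(points_in_object_rf)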

Q3: How to best get started to get a deeper understanding of the code?
That’s a good question and the answer depends a bit on your personal preference for learning. Here are a few suggestions:

  • If you like to understand code by reading through it, I would suggest starting with the classes in the tbp/monty/frameworks/models folder.
  • You could also start by slowly stepping through an experiment, starting with the MontyObjectRecognition experiment class used in the tutorials and zooming in onto all the steps executed there.
  • One thing that I find very useful is visualizing what is happening. You could start by taking the learned graph (what you printed above) and visualizing it. You can find some starting points for that in the documentation on logging and analysis: Logging and Analysis. Then you could add the DetailedJSONHandler to your experiment and dig through the detailed statistics of one episode. We also have a lot of example scripts and visualizations in the monty_lab repository. A good folder to start with could be this one: monty_lab/graph_matching at main · thousandbrainsproject/monty_lab · GitHub (note that not all notebooks in there still work, as they are not actively maintained as the code evolves).
  • Lastly, you could go with learning-by-doing. You could pick a small improvement or idea you'd like to test and try to figure out how this would be done in the Monty code base. What to pick here really depends on your interests and skills. You could look into our project roadmap in the future work section (Project Roadmap) if you want a bigger, more involved project, or simply pick one of the many TODOs in the code. Of course, you can also come up with your own ideas or have a look at this list of other ways to contribute: Ways to Contribute. For me personally, this approach of picking a task and just getting started on it is the best way to learn about something, but of course, everyone is different in that regard.

I hope this helps!

- Viviane
2 Likes

Here you go @ak90.

These tbp.monty sequence diagrams came from my attempt at understanding what the different motor policies do and how different data loaders interact with different motor policies.

I would also add that some elements are outdated because the action, amount, and constraint tuple is no longer passed around piecewise. Implementing RFC 4 Action Object changed all the (action, amount) tuple interactions into a single Action object.

Also, action sampling changed. Instead of:

action = self.action_space.sample()
amount = self.get_next_amount(action)

We now do something like:

action = self.action_creator.sample(self.agent_id)
2 Likes

Thanks a ton @vclay.

This will be a lot of help, and I really appreciate you taking the time to answer my queries in such detail.

Thanks Again
Avinash

1 Like

Thank you, Tristan.

I will go through these resources.

Hi @vclay. A very happy New Year to you. I hope you enjoyed your holidays :smiley:

I have a few questions:

  1. Where can I get some resources on how reference frame transformations work in Monty? The source data and the final representation (graph) live in different spaces, so I am assuming there must be a reference frame transformation for this.

  2. I have an IMU device with me which outputs acceleration, orientation (gyroscope), and magnetometer data. I was thinking of using it to make Monty learn handwritten letters like a, b, c, d, etc. But I am not able to decide how to set up the environment, dataset, and dataloader here. Also, in this case there would be no morphological features and the input data would be 2-D. Can you please guide me on how I should approach this?

Thanks,
Avinash

Hi Avinash,

The reference frame transforms happen inside the learning module. There are several points where they are applied:

  • transforming incoming displacements using pose hypotheses of the LM (body RF → model RF)
  • transforming observations in the buffer given the detected pose to update the model of the object (body RF → model RF)
  • transforming poses stored in the graph, given our hypothesis about the object's pose, to suggest a movement of the agent (model RF → body RF)
  • transforming poses from one model’s reference frame into another when voting (LM1 model RF → LM2 model RF)

In the code, those instances are easiest to find if you search for ".apply(", which is called whenever we rotate a vector given a pose.
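As a tiny hypothetical example of that pattern (not an actual call site from the code base), rotating a sensed displacement from the body reference frame into a model's reference frame under one pose hypothesis looks something like this:

import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical pose hypothesis, expressed here as the rotation that takes
# body-frame vectors into the model's reference frame (45 degrees around z).
pose_hypothesis = Rotation.from_euler("xyz", [0, 0, 45], degrees=True)

displacement_body_rf = np.array([0.01, 0.0, 0.0])  # sensor movement in the body RF
displacement_model_rf = pose_hypothesis.apply(displacement_body_rf)  # the ".apply(" pattern
print(displacement_model_rf)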

In regard to your second question, you could have a look at the OmniglotDataLoader and OmniglotEnvironment, which we wrote to test handwritten character recognition on the Omniglot dataset. This is a simulation that does not use a real sensor, so you will have to make some more adjustments, like writing a custom sensor module for your sensor that outputs CMP-compliant messages (instances of the State class), but it may be a good starting point.
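To give a rough sense of the kind of information such a CMP-compliant message has to carry, here is a hypothetical stand-in for illustration only; the real State class in tbp.monty has its own fields and constructor, so check it before building against this:

from dataclasses import dataclass, field
import numpy as np

# Hypothetical stand-in, NOT the real tbp.monty State class: just sketching
# what a sensor module built around an IMU would need to output, i.e. a
# location in a common reference frame plus pose-defining and other features.
@dataclass
class ImuObservation:
    location: np.ndarray                          # x, y, z of the sensed point (e.g. pen tip)
    pose_vectors: np.ndarray                      # e.g. surface normal / movement direction
    features: dict = field(default_factory=dict)  # non-morphological features (speed, pressure, ...)
    sender_id: str = "imu_patch_0"                # which sensor module produced the message

obs = ImuObservation(
    location=np.array([0.01, 0.02, 0.0]),          # 2-D handwriting can keep z fixed at 0
    pose_vectors=np.array([[0.0, 0.0, 1.0]]),      # e.g. normal pointing out of the writing plane
    features={"stroke_speed": 0.1},
)
print(obs)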

I hope this helps!
Best wishes,
Viviane

4 Likes

Thank you @vclay. This will help.

1 Like

In the code I found this:
"Dictionaries to tell which objects were involved in building a graph
and which graphs correspond to each target object"
self.target_to_graph_id = dict()
self.graph_id_to_target = dict()

Now I was wondering, how is this possible?

  • Wouldn't there be different graphs for different objects? For example, a red mug and a yellow mug would have different graphs. Or is it that, since we do not have control in the exploratory step, the model might sometimes identify some of the objects and add points to its existing graph, and because of that we keep this mapping?

  • .graph_id_to_target ------ Also, how can one object have multiple graphs?
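Just to make my question concrete, here is roughly how I imagine these mappings could end up looking after unsupervised learning (made-up IDs and value types, purely to illustrate what a many-to-many mapping would mean):

# Made-up example, not taken from the code: one graph can contain points from
# several target objects, and one target object can contribute to several graphs.
target_to_graph_id = {
    "mug": {"graph_0"},
    "cup": {"graph_0", "graph_3"},  # cup points ended up in two different graphs
}
graph_id_to_target = {
    "graph_0": {"mug", "cup"},      # this graph merged points from two similar objects
    "graph_3": {"cup"},
}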

I’m curious about this as well.

As to your first question, here’s one way I’ve begun to conceptualize it:

Nodes within a graph may be analogous to neurons within the brain. Representations of objects in the brain may then be stored as stable firing patterns encoded across neuronal ensembles (think functional groups of neurons). With this in mind, you may have the same neurons involved in multiple, disparate ensembles. Think of how a person might be involved in multiple groups, even though they are in fact a single person. It's like that.

An example of how this might work (that is, how a single graph, or a portion thereof, comes to represent multiple objects) could be like this: you have two objects which share much in common, for instance their morphology, texture, etc., but they differ in color:

Hot air balloon #1:

Hot air balloon #2:

Obviously, both of these objects are "hot air balloons" in spite of the fact that there are clear differences between them. So why not re-use certain neuronal groups (aka graphs) to represent both? It's simply more efficient.

Hopefully this makes sense (I'm kind of writing it in between some tasks at work). Also, I hope I represented the TBP team's intent here correctly. Please correct me if this misses the mark.

1 Like

Thanks a lot for your response.

This makes sense to me, but I have one doubt. The way things are currently implemented here for object recognition, there is a one-to-one correspondence. So, even if there are points for a single object in multiple graphs, in the end we must find one object. In that case, how will we use this many-to-many correspondence?

Good question.

At this point, we might be getting a little speculative. Especially if my original conceptualization is anything less than accurate.

That said, in the brain, object recognition typically involves distributed patterns of activity spread across multiple neuronal populations. A single object activates multiple ensembles across different brain regions, and individual neurons within these ensembles can participate in representing multiple different objects. There's not really any direct one-to-one correspondence between a lone ensemble and a resulting object representation. To my knowledge, it's more like many-to-one.

However, it could be that their code is simply diverging from a strict neuro-plausible approach.

I don’t want us to get too far into the weeds here though and risk burying your original comment in the thread (I kind of want to see what the TBP team has to say). However, if you’re able to message me a link to the code you’re looking at, as well as a detailed description of your concern, I might be able to get you a more thorough answer sometime tomorrow.

1 Like