Evaluating Monty's performance on the Omniglot dataset

Hi,

I tried evaluating Monty’s performance on the Omniglot dataset.

I trained the model using all characters from the Bengali, Inuktitut, and Blackfoot sets (with version=2).

When I evaluated the trained model on the same data, I achieved an accuracy of 75.6%.

However, when I evaluated it on a different version of the characters (version=5), the model only produced 3 correct predictions out of 76 samples.
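
Roughly, the train/eval split looked like this (the parameter names below are just illustrative, not the exact tbp.monty config API):

```python
# Illustrative sketch of the split described above (names are hypothetical).
alphabets = ["Bengali", "Inuktitut", "Blackfoot"]

train_data = dict(alphabets=alphabets, version=2)    # drawings used for learning
eval_seen = dict(alphabets=alphabets, version=2)     # same drawings -> ~75.6% correct
eval_unseen = dict(alphabets=alphabets, version=5)   # new drawings of the same characters -> 3/76 correct
```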

This suggests that the Monty model only performs well on the data it was trained on.

Would Monty generalize better to unseen data if I changed the model configuration or training settings?

Also, how does Monty handle image backgrounds in 2-dimensional data? How does it distinguish between the background and the object?
It seems that datasets used for evaluating HTM and Monty (e.g., MNIST, YCB) have no background and consist only of objects.
So I am curious whether TBP’s new neural network system can handle other 2-dimensional image datasets that include complex backgrounds, such as ImageNet or VOC.

My final goal is to achieve 90% accuracy on the SVHN dataset.
Is it likely that Monty will be able to achieve strong object detection performance on 2D datasets once TBP’s work on compositional objects is integrated?

1 Like

Hi @skj9865 and welcome to our forum :slight_smile:

Those are great questions! I am happy to see that you got Monty to work on Omniglot (at least technically, if not with great performance yet).

The bad performance is to be expected at the moment. As you say, we expect to improve a lot on this dataset as we introduce hierarchy into Monty. These handwritten characters are fundamentally compositional objects (letters composed of strokes at relative locations and orientations), and to model compositional objects, two LMs stacked on top of each other are necessary. This is our current research focus. We have already implemented the basic infrastructure in Monty for stacking LMs, but there are still several outstanding items (see our project planning sheet, particularly 47F for preliminary work and 50F). We hope to achieve much better performance on the Omniglot dataset by this summer.

As to your more general questions:

Monty’s generalization capabilities

Monty is able to generalize to objects with a shape similar to the ones it has learned about, as well as to objects with a similar shape but different features (for some basic examples, see https://www.youtube.com/watch?v=lqFZKlsb8Dc&t=2807s). It can also recognize an object it has learned about in orientations and locations in which it has never seen it before.

The degree to which Monty generalizes to different shapes can be tuned using the tolerance parameters. However, if you increase the tolerances too much, you may get false positives and “recognize” objects that are nothing like the learned models and should instead be learned as new, separate objects.
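
As a rough illustration of where these parameters live (the exact feature keys depend on your sensor module setup, and the values below are placeholders, not recommendations):

```python
import numpy as np

# Rough illustration: per-feature tolerances in a learning module config.
learning_module_args = dict(
    tolerances={
        "patch": {  # ID of the sensor module providing the features
            "hsv": np.array([0.2, 1.0, 1.0]),             # how much color may deviate
            "principal_curvatures_log": np.ones(2) * 2,   # how much local shape may deviate
        }
    },
)
# Larger values -> more generalization across variations, but also a higher
# risk of false positives (matching objects that should be learned separately).
```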

The issue with using this approach for generalizing compositional objects like in the Omniglot dataset is that the low-level morphology of the letters can vary widely (which would require large tolerance parameters), but we want to constrain the high-level arrangement of strokes. For instance, an H can have its two vertical strokes at varying distances from each other, but they should always be roughly parallel to each other (and orthogonal to the stroke connecting them) and be made of solid strokes without gaps within them. That can be expressed as a compositional object.
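
As a purely conceptual sketch (this is not Monty's actual object-graph format), a compositional model of an H could look something like this:

```python
# Conceptual sketch only: a parent model stores child objects (strokes)
# at relative locations and orientations in the parent's reference frame.
h_model = {
    "children": [
        {"object": "stroke", "location": (-1.0, 0.0), "rotation_deg": 0},   # left vertical stroke
        {"object": "stroke", "location": (1.0, 0.0), "rotation_deg": 0},    # right vertical stroke
        {"object": "stroke", "location": (0.0, 0.0), "rotation_deg": 90},   # connecting horizontal stroke
    ],
}
# The spacing of the vertical strokes can vary (loose tolerance on child
# locations), while their relative orientations stay constrained (tight
# tolerance on child rotations).
```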

We have a meeting recording where I talk about this in more depth. I will see if we can move this forward in our video release queue and post it here.

Monty on datasets with different backgrounds

While most of our YCB benchmark experiments currently test object and pose recognition with single objects in a void, this was just a starting point for us to prove out and put together the basic algorithm. We are slowly transitioning to more complex environments and certainly want Monty to be able to perform well in settings with all kinds of backgrounds.

Two scenarios we have already looked at are:

  • Recognizing objects from images taken with an iPad camera. Here, we test several scenarios, including hand intrusion (hand covering part of the object) and multiple objects touching. Also, since those are images taken in the real world, there is naturally always a background as well. Results are reported in the Monty meets world benchmarks, and a demo is shown here: Project Showcase

Our performance in these more complex environments is not as good as we would like it to be, and we are actively working on this at the moment as prerequisite work for compositional objects.

To your last question: I would say that it is likely that Monty will do much better on tasks such as Omniglot character recognition after our work on compositional objects is completed.

However, I would like to highlight that Monty is not designed for learning from large, static image datasets. We looked at Omniglot because it is designed to test learning from a small number of examples, and there was a straightforward way of defining movement (i.e., following the strokes). You can use Monty on 2D image datasets by moving a small patch over the images. However, if the dataset doesn’t contain depth information and you are trying to recognize 3D objects, Monty will not be able to learn good representations of them.

Monty is a sensorimotor framework designed to learn the way humans do: by actively exploring, moving its sensors, testing hypotheses, and rapidly building up structured models of the world. It is not a drop-in replacement for ANNs; it is a fundamentally different approach with a different range of applications (Application Criteria). That is not to say that it won’t replace ANNs in many applications. In a lot of cases, applications have been artificially forced to fit the mold of what ANNs can do (collecting large static datasets, no option for continual learning, doing anything to make the data i.i.d., …), and in those cases our approach will be a much more natural fit to solve the tasks in an elegant and efficient way.

Hope this helps! Let me know if you have more questions.

  • Viviane
5 Likes

Here is the video I mentioned! 2023/01 - Hierarchy in the Neocortex Overview

1 Like

Thank you for your kind reply.

I’m wondering — for 2D image datasets like Omniglot, does Monty require time-based stroke or location data in order to operate effectively? Or is it also able to process static images without such information?

1 Like

Hi @skj9865

No, it doesn’t require time-based information such as stroke order. The important thing is that you can define how movement works in your dataset and how movement changes the observations. For example, you can have a look at the [SaccadeOnImageEnvironment](https://github.com/thousandbrainsproject/tbp.monty/blob/2518a246214d8a487e1054da8ac57269e5014399/src/tbp/monty/frameworks/environments/two_d_data.py#L256), where we move a small patch over a 2D image. This is used in the Monty Meets World demo and benchmark experiments but could be used for any 2D image dataset. In the Omniglot dataset we use the stroke information because it is a somewhat more principled and efficient way to explore the handwritten characters, but you could also just arbitrarily move a patch over those images.
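
In spirit, that environment does something like the following heavily simplified sketch (the real class has a different interface; see the link above):

```python
import numpy as np

def extract_patch(image: np.ndarray, loc: np.ndarray, size: int = 10) -> np.ndarray:
    """Return the size x size patch of the image centered at loc (row, col)."""
    half = size // 2
    r, c = loc
    return image[r - half : r + half, c - half : c + half]

image = np.zeros((105, 105))        # e.g. one Omniglot character image
loc = np.array([52, 52])            # current sensor location on the image
obs = extract_patch(image, loc)     # observation at that location
loc = loc + np.array([0, 5])        # an "action": saccade 5 pixels to the right
obs = extract_patch(image, loc)     # the movement changes the observation
```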

  • Viviane
2 Likes

We just posted another video where (in the second half) I talk more about the Omniglot dataset and why hierarchy will be required for generalizing to new versions of the characters. You can watch it here: https://youtu.be/-qPfBrTVoks?si=HXqZ0X7hNXsjYQ-x&t=3780

3 Likes

Thank you for the link.

It seems that the hierarchical structure of LMs allows lower-level models and higher-level models to work together to recognize objects effectively.

I’m trying to apply this hierarchical structure to 2D image datasets.

However, I couldn’t find an example of hierarchical LM usage in the tbp.monty code.

Also, customizing the Monty model to work with 2D datasets using SaccadeOnImageEnvironment has been quite challenging for me.

Are there any example implementations for testing hierarchical LMs on 2D datasets?

Or would it be better to wait for your upcoming work on the compositional dataset?

Using stacked LMs is something we are currently actively working on, and since it is not fully supported yet, it is not part of our tbp.monty benchmark experiments. It is technically possible to set up such configs (for example, see TwoLMStackedMontyConfig; for a full experiment config, see the examples in our monty_lab repository here: monty_lab/experiments/configs/graph_experiments.py at e36561ddf9875d2ba68ebbb2b9fbdcd7307102c9 · thousandbrainsproject/monty_lab · GitHub, or this unit test: tbp.monty/tests/unit/evidence_lm_test.py at 498030b5c51d0a8369586d02ce51d73f9c47a815 · thousandbrainsproject/tbp.monty · GitHub), but functionally it won’t work as well as it should yet.
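
Conceptually, the wiring of such a two-LM setup looks something like this (illustrative only, not the exact fields of TwoLMStackedMontyConfig):

```python
# Illustrative wiring of a two-LM hierarchy (not the exact config fields).
sm_to_lm_matrix = [[0], [1]]  # sensor module 0 feeds LM 0, sensor module 1 feeds LM 1
lm_to_lm_matrix = [[], [0]]   # LM 1 additionally receives the output of LM 0
# LM 0 models low-level parts (e.g. strokes); LM 1 models their arrangement
# at relative locations (e.g. a whole character composed of LM 0's objects).
```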

As for running Monty on a 2D dataset, I just opened a PR with two new tutorials here: docs: Customize monty tutorials by vkakerbeck · Pull Request #248 · thousandbrainsproject/tbp.monty · GitHub. The first one, about using Monty in custom applications, might be particularly useful. It also includes more details on how we used it for the Omniglot dataset and follow-along code snippets.

Hope this helps!

  • Viviane
1 Like

Hi @skj9865
I just wanted to mention that I added two new tutorials to our documentation (New Tutorials on Using Monty in Custom Applications). One of them walks through the Omniglot example in detail and also provides code to follow along.
You may also be interested in the tutorials more generally, as they describe how to apply Monty to new environments.

  • Viviane
1 Like

Thank you for the update.

I evaluated Monty’s performance on the Omniglot dataset using the new tutorial as a reference.

As you mentioned, Monty’s generalization performance is currently limited. I believe the hierarchical structure of LMs has the potential to improve generalization significantly.
I also noticed that the Omniglot tutorial relies on stroke data, which is typically not available in other datasets.

To explore broader applicability, I’m working with the saccadic environment for 2D object recognition, with the expectation that it may generalize better than the stroke-based approach.
For faster recognition, I’m also converting 3D mathematical operations to their 2D equivalents.
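
A rough sketch of the kind of simplification I mean (illustrative only):

```python
import numpy as np

# In 2D, a pose becomes a location (x, y) plus one in-plane angle, so rotating
# a displacement needs only a 2x2 matrix instead of a quaternion.
def rotate_2d(displacement: np.ndarray, angle_rad: float) -> np.ndarray:
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s], [s, c]]) @ displacement
```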

My goal is to achieve 90% accuracy and sub-second recognition latency on the MNIST dataset.

I’ll be happy to share the results once I reach those milestones.

Thank you again.

2 Likes

Hi @skj9865
that sounds great, thanks for the update!

You are right: in the Omniglot environment we use the stroke data, which isn’t available in other image datasets. But as you mentioned, using the saccade-on-2D-image environment should work perfectly fine too. We don’t give the stroke data to the learning modules; we just use it for more efficient movement across the characters. It should be similar to using our curvature-following policy on the images. In that policy, we use the sensed features to move such that the sensor follows the detected principal curvatures (which in this case would mean following the strokes without having data about the stroke sequence). The main difference is that you may have to add some mechanism to go from one stroke to the next when they are not connected. But even just moving randomly over the image should work fine (it might just require a few more steps).
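
As a very rough sketch of that idea (not the actual policy implementation in tbp.monty):

```python
import numpy as np

def next_location(loc, stroke_direction=None, step=3, rng=None):
    """Move along the sensed stroke direction, or in a random direction otherwise.

    Very rough sketch of the curvature-following idea; the actual policy in
    tbp.monty is more involved.
    """
    rng = rng or np.random.default_rng()
    if stroke_direction is None:
        direction = rng.standard_normal(2)       # nothing sensed: pick a random direction
    else:
        direction = np.asarray(stroke_direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    return np.asarray(loc, dtype=float) + step * direction
```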

Those goals sound great, and I’m excited to hear how it goes! Also, you are welcome to make a PR to our repository with a learning module version written for 2D instead of 3D environments. I could imagine that this would be generally useful for others (and for us, as we test hierarchy in simpler environments like Omniglot).

Best wishes,
Viviane

1 Like