Hi @skj9865, and welcome to our forum!
Those are great questions! I am happy to see that you got Monty to work on Omniglot (at least technically, even if not with great performance yet).
The poor performance is to be expected at the moment. As you say, we expect to improve a lot on this dataset once we introduce hierarchy into Monty. These handwritten characters are fundamentally compositional objects (letters composed of strokes at relative locations and orientations), and modeling compositional objects requires at least two LMs stacked on top of each other. This is our current research focus. We have already implemented the basic infrastructure in Monty for stacking LMs, but there are still several outstanding items (see our project planning sheet, particularly 47F for preliminary work and 50F). We hope to have much better performance on the Omniglot dataset by this summer.
As to your more general questions:
Monty’s generalization capabilities
Monty is able to generalize to objects with a similar shape to the ones it has learned about, as well as to objects with a similar shape but different features (for some basic examples, see https://www.youtube.com/watch?v=lqFZKlsb8Dc&t=2807s ). It can also recognize an object it has learned about in orientations and locations in which it has never seen it before.
The degree to which Monty generalizes to different shapes can be tweaked using the tolerance parameters. However, if you increase the tolerances too much, you may get false positives and “recognize” objects that are nothing like the learned models and should instead be learned as new, distinct objects.
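To make the trade-off concrete, here is a minimal sketch of tolerance-based feature matching. The function name and distance metric are hypothetical illustrations of the general idea, not Monty's actual matching logic:

```python
import numpy as np

def feature_match(observed, stored, tolerance):
    """Return True if an observed feature vector lies within `tolerance`
    (Euclidean distance) of a stored model feature.

    Hypothetical sketch; Monty's real evidence-based matching differs.
    """
    return float(np.linalg.norm(observed - stored)) <= tolerance

stored = np.array([1.0, 0.0, 0.0])   # e.g. a learned surface-normal feature
similar = np.array([0.9, 0.1, 0.0])  # slight variation of the learned shape
different = np.array([0.0, 1.0, 0.0])

# A moderate tolerance generalizes to similar inputs...
print(feature_match(similar, stored, tolerance=0.5))    # True
# ...while still rejecting dissimilar ones.
print(feature_match(different, stored, tolerance=0.5))  # False
# An overly large tolerance produces false positives:
print(feature_match(different, stored, tolerance=2.0))  # True
```

The last call shows the failure mode described above: with the tolerance set too high, a clearly different feature is still "recognized".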
The issue with using this approach to generalize over compositional objects like those in the Omniglot dataset is that the low-level morphology of the letters can vary widely (which would require large tolerance parameters), while the high-level arrangement of strokes should remain constrained. For instance, an H can have its two vertical strokes at varying distances from each other, but they should always be roughly parallel to each other, roughly orthogonal to the connecting stroke, and made of solid strokes without gaps. That is naturally expressed as a compositional object.
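The H example can be sketched as a set of high-level constraints that are strict about relative orientation but loose about distance. Everything here (function name, angle representation, tolerance value) is a hypothetical illustration, not Monty's representation:

```python
def is_h_like(stroke_angles_deg, crossbar_angle_deg, stroke_gap,
              angle_tol_deg=10.0):
    """Check the high-level arrangement of an 'H'-like shape:
    two strokes that are roughly parallel to each other, a crossbar
    roughly orthogonal to them, and *any* positive gap between the
    two strokes (the distance is deliberately unconstrained).

    Hypothetical sketch of a compositional constraint.
    """
    a, b = stroke_angles_deg
    parallel = abs(a - b) <= angle_tol_deg
    orthogonal = abs(abs(crossbar_angle_deg - a) - 90.0) <= angle_tol_deg
    return parallel and orthogonal and stroke_gap > 0

# Slightly wobbly verticals and a wide gap still count as an H...
print(is_h_like((88.0, 92.0), 0.0, stroke_gap=3.5))  # True
# ...but a stroke at 45 degrees breaks the arrangement.
print(is_h_like((90.0, 45.0), 0.0, stroke_gap=1.0))  # False
```

Note how the low-level variation (exact angles, gap size) is tolerated while the high-level structure (parallel verticals, orthogonal crossbar) is enforced, which is exactly what a single flat tolerance parameter cannot express.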
We have a meeting recording where I talk about this in more depth. I will see if we can move this forward in our video release queue and post it here.
Monty on datasets with different backgrounds
While most of our YCB benchmark experiments currently test object and pose recognition with single objects in a void, this was just a starting point for us to prove out and put together the basic algorithm. We are slowly transitioning to more complex environments and certainly want Monty to be able to perform well in settings with all kinds of backgrounds.
Scenarios we have already looked at include:
- Recognizing objects from images taken with an iPad camera. Here, we test several scenarios, including hand intrusion (hand covering part of the object) and multiple objects touching. Also, since those are images taken in the real world, there is naturally always a background as well. Results are reported in the Monty meets world benchmarks, and a demo is shown here: Project Showcase
Our performance in these more complex environments is not as good as we would like it to be, and we are actively working on this at the moment as prerequisite work for compositional objects.
To your last question: I would say that it is likely that Monty will do much better on tasks such as Omniglot character recognition after our work on compositional objects is completed.
However, I would like to highlight that Monty is not designed for learning from large, static image datasets. We looked at Omniglot because it is designed to test learning from a small number of examples, and because there was a straightforward way of defining movement (i.e., following the strokes). You can use Monty on 2D image datasets by moving a small patch over the images. However, if the dataset doesn’t contain depth information and you are trying to recognize 3D objects, Monty will not be able to learn good representations of them. Monty is a sensorimotor framework designed to learn the way humans do: by actively exploring, moving its sensors, testing hypotheses, and rapidly building up structured models of the world. It is not a drop-in replacement for ANNs; it is a fundamentally different approach with a different range of applications (Application Criteria). That is not to say that it won’t replace ANNs in many applications. In many cases, applications have been artificially forced to fit the mold of what ANNs can do (collecting large static datasets, no option for continual learning, doing anything to make the data i.i.d., …), and in those cases, our approach will be a much more natural fit to solve the task in an elegant and efficient way.
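For what "moving a small patch over the images" means in practice, here is a minimal sketch that turns a static 2D image into a stream of (location, patch) observations. The function and parameters are hypothetical, not Monty's actual dataloader:

```python
import numpy as np

def patch_observations(image, patch_size=5, stride=5):
    """Yield (location, patch) pairs by moving a small square patch
    across a 2D image, turning static pixels into a sensorimotor-style
    stream of local observations plus movement information.

    Hypothetical sketch of the general idea.
    """
    h, w = image.shape
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            yield (y, x), image[y:y + patch_size, x:x + patch_size]

image = np.zeros((20, 20))  # placeholder for a grayscale character image
observations = list(patch_observations(image))
print(len(observations))  # 16 views: a 4x4 grid of 5x5 patches
```

Each observation carries both the local feature (the patch) and where the "sensor" was when it saw it, which is the key ingredient a static dataset lacks on its own; for Omniglot, following the strokes provides a more natural movement policy than this raster scan.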
Hope this helps! Let me know if you have more questions.