Welcome to the Forum, @vamsi, and thanks for sharing this! HRM is certainly an intriguing architecture.
From the Thousand Brains perspective, we agree that hierarchical organization and temporal abstraction are important for modeling sequences. And while HRM is closer to our notion of hierarchy than, say, a stack of CNN layers or transformer blocks, there are still fundamental differences.
In TBT, the different levels of the hierarchy represent compositional objects, where higher-level objects are composed of reusable lower-level ones (e.g., a logo on a cup). This is different from a recurrent deep-learning architecture whose modules simply operate at different fixed timescales.
The following is from our “Hierarchy or Heterarchy” paper:
> Columns in each region learn structured models, up to, and including, complete objects. We propose that the role of the hierarchical connections between columns is to learn compositional models, that is, objects that are composed of other objects. Most of the world is structured this way. For example, a bicycle is composed of a set of other objects, such as wheels, frame, pedals, and seat, arranged relative to each other. Each of these objects, such as a wheel, is itself composed of other objects such as tire, rim, valve, and spokes. In another example, words are composed of syllables, which are themselves composed of letters. And finally, an example that we often use in our research, a coffee mug may have a logo printed on its side. The logo is an object that was previously learned, but in this example the logo is a component of the coffee mug. Learning compositional objects can occur rapidly, with just a few visual fixations. This tells us that the neocortex does not have to relearn the component objects; the neocortex only has to form links between two existing models.
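To make that concrete, here is a minimal Python sketch of the idea. To be clear, this is not Monty's actual API; every class and method name here is hypothetical, made up just to illustrate the point that a compositional object is a set of links to previously learned models, each stored with a relative pose:

```python
# Sketch of compositional modeling as links between existing models.
# All names here are hypothetical, not Monty's actual API.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectModel:
    """A previously learned object: features stored at locations in the
    object's own reference frame."""
    name: str
    features_at_locations: dict = field(default_factory=dict)


@dataclass
class CompositionalModel:
    """A higher-level object defined as links to existing child models,
    each placed at a pose (location + rotation) relative to the parent."""
    name: str
    children: list = field(default_factory=list)  # (child, relative_pose)

    def add_child(self, child: ObjectModel, position, rotation):
        # Learning "logo on cup" is just storing this link; the child
        # model itself is reused, not relearned.
        self.children.append((child, (np.asarray(position), rotation)))


cup = ObjectModel("cup")
logo = ObjectModel("logo")  # learned earlier, in some other context

cup_with_logo = CompositionalModel("cup_with_logo")
cup_with_logo.add_child(cup, position=[0, 0, 0], rotation=np.eye(3))
cup_with_logo.add_child(logo, position=[0, 0.04, 0.05], rotation=np.eye(3))
```

Note that learning the composite object took two link-forming steps, not a full retraining pass, which is why it can happen in just a few fixations.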
Beyond hierarchical processing, one key concept that's still missing in many architectures, including HRM, is that of reference frames. In TBT, reference frames are at the core of intelligence and are used within every cortical column. They allow for structured learning of environments and the objects within them. An agent uses these reference frames to represent features in space, learn the spatial relations of features and objects to each other, plan and apply movements, and make predictions. Without reference frames, models often rely on statistical correlations rather than structured models of the world to complete their tasks.
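As a toy illustration of what a reference frame buys you (again, hypothetical code, not Monty's implementation): features are stored at locations in the object's own frame, and a prediction falls out of applying a planned movement to the current location, before the next observation ever arrives.

```python
# Toy reference frame: features stored at locations, movements update the
# sensed location, predictions follow from the stored structure.
# Hypothetical sketch, not Monty's implementation.
import numpy as np


class ReferenceFrame:
    def __init__(self):
        self.features_at_locations = {}  # location (tuple) -> feature

    def learn(self, location, feature):
        """Associate a sensed feature with a location in this frame."""
        self.features_at_locations[tuple(np.round(location, 3))] = feature

    def predict(self, location):
        """Predict the feature at a location. The same model answers
        regardless of the movement sequence that got us there."""
        return self.features_at_locations.get(tuple(np.round(location, 3)))


frame = ReferenceFrame()
frame.learn([0.0, 0.0, 0.0], "handle")
frame.learn([0.0, 0.1, 0.0], "rim")

# Sensing at the handle, then planning a move of +0.1 along y, we can
# predict what we will sense next *before* observing it.
current = np.array([0.0, 0.0, 0.0])
movement = np.array([0.0, 0.1, 0.0])
print(frame.predict(current + movement))  # -> "rim"
```

A purely correlational model would have to see the handle-then-rim sequence many times to make that prediction; with a reference frame it follows directly from the learned structure.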
We also take a different stance on deep learning itself. Monty doesn’t use deep learning at all, not only for performance reasons, but because we believe it’s not how the brain works. The learning mechanisms are fundamentally different. We discuss this more here.
Here are more thoughts on this: