The following text is an excerpt from Ashutosh Shrivastava’s post:
Sapient released their Hierarchical Reasoning Model (HRM) and the results are pretty interesting. This is a 27M parameter model that outperforms Claude 3.5 and o3-mini on reasoning benchmarks like ARC-AGI-2, complex Sudoku puzzles, and pathfinding in large mazes.
What makes this notable:
The efficiency aspect is striking. HRM was trained on roughly 1000 examples with no pretraining or Chain-of-Thought prompting, yet it handles complex reasoning tasks that typically require much larger models. This makes it practical for deployment on edge devices and accessible for teams without massive compute budgets.
The brain-inspired architecture is more than just terminology. HRM uses a dual-system design with two modules: one for high-level abstract planning and another for rapid detailed execution, operating at different time scales. This mirrors how human cognition works with both fast intuitive processing and slower deliberate reasoning.
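To make the dual-timescale idea concrete, here is a minimal numpy sketch of the general pattern, not Sapient's actual implementation; all sizes and weights are arbitrary. A fast low-level module takes several steps per cycle, conditioned on a slow high-level state that updates once per cycle:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32   # hidden size (arbitrary)
T = 4    # fast steps per slow step

# Random fixed weights purely for illustration; the real model learns these.
W_h  = rng.normal(0, 0.1, (D, D))  # slow (high-level) recurrence
W_l  = rng.normal(0, 0.1, (D, D))  # fast (low-level) recurrence
W_hl = rng.normal(0, 0.1, (D, D))  # slow state -> fast module (the "plan")
W_lh = rng.normal(0, 0.1, (D, D))  # fast state -> slow module (the summary)

def slow_cycle(z_h, z_l, x):
    """One high-level cycle: T fast updates, then one slow update."""
    for _ in range(T):
        # Fast module: detailed computation conditioned on the slow plan.
        z_l = np.tanh(W_l @ z_l + W_hl @ z_h + x)
    # Slow module: abstract update driven by the fast module's result.
    z_h = np.tanh(W_h @ z_h + W_lh @ z_l)
    return z_h, z_l

z_h, z_l = np.zeros(D), np.zeros(D)
for x in rng.normal(0, 1, (8, D)):  # 8 slow cycles of dummy input
    z_h, z_l = slow_cycle(z_h, z_l, x)
```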
The low-resource requirement changes the accessibility equation. While most advanced AI requires significant infrastructure, HRM can run on regular hardware, opening up sophisticated reasoning capabilities to startups and researchers who can’t afford large-scale compute.
You can read the paper here: [2506.21734] Hierarchical Reasoning Model
This is so interesting. Have you figured out how to do something with it? If this model scales, it will change the world. A 27-million-parameter model that trains on roughly 1,000 examples.
Reminds me of some older work by Numenta: Hierarchical temporal memory - Wikipedia
Welcome to the Forum, @vamsi and thanks for sharing this! HRM is certainly an intriguing architecture.
From the Thousand Brains perspective, we agree that hierarchical organization and temporal abstraction are important for modeling sequences. And while HRM is closer to our notion of hierarchy than, say, a stack of CNN layers or transformer blocks, there are still fundamental differences.
In TBT, the different levels of the hierarchy represent compositional objects, where higher-level objects are composed of reusable lower-level ones (e.g. logo on cup). This is different from a recurrent deep learning architecture operating at different fixed timescales.
The following is from our “Hierarchy or Heterarchy” paper:
Columns in each region learn structured models, up to, and including, complete objects. We propose that the role of the hierarchical connections between columns is to learn compositional models, that is objects that are composed of other objects. Most of the world is structured this way. For example, a bicycle is composed of a set of other objects, such as wheels, frame, pedals, and seat, arranged relative to each other. Each of these objects, such as a wheel, is itself composed of other objects such as tire, rim, valve, and spokes. In another example, words are composed of syllables, which are themselves composed of letters. And finally, an example that we often use in our research, a coffee mug may have a logo printed on its side. The logo is an object that was previously learned, but in this example the logo is a component of the coffee mug. Learning compositional objects can occur rapidly, with just a few visual fixations. This tells us that the neocortex does not have to relearn the component objects; the neocortex only has to form links between two existing models.
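To make that last point concrete, here is a toy illustration (invented names, not Monty's actual data structures): a composite object only stores links to already-learned component models plus their relative poses, so learning "logo on mug" means adding one link, not relearning the logo.

```python
# Toy compositional model; all names and poses are made up.
# A composite object links existing child models with relative poses;
# the child models themselves are never relearned.
mug = {
    "components": {
        "body":   {"model": "cylinder_model", "pose": (0.00, 0.00, 0.00)},
        "handle": {"model": "handle_model",   "pose": (0.05, 0.00, 0.00)},
    }
}

# Learning "logo on mug" = one new link to an already-learned logo model.
mug["components"]["logo"] = {"model": "logo_model", "pose": (0.00, 0.03, 0.02)}
```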
Beyond the hierarchical processing, one key concept that’s still missing in many architectures, including HRM, is that of reference frames. In TBT, reference frames are at the core of intelligence and are used within every cortical column. They allow for structured learning of environments and the objects within. An agent uses these reference frames to represent features in space, learn spatial relations of features and objects to each other, plan and apply movements, and make predictions. Without these reference frames, models often rely on statistical correlation rather than structured models of the world to complete their tasks.
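A minimal sketch of the reference-frame idea, with hypothetical names and a drastically simplified object model: features are stored at locations in an object-centric frame, and a pose hypothesis lets the agent predict which feature it should sense at a given location.

```python
import numpy as np

# Features stored at 2D locations in an object-centric reference frame
# (a toy stand-in for a learned structured object model).
object_model = {
    (0.00,  0.00): "rim_edge",
    (0.00, -0.10): "handle_curve",
    (0.05, -0.05): "logo_corner",
}

def predict_feature(sensed_loc, pose):
    """Map a sensed location into the object's frame via the pose
    hypothesis (rotation R, translation t), then look up the feature
    the model expects there (None = no prediction)."""
    R, t = pose
    loc_in_object = R.T @ (np.asarray(sensed_loc) - t)
    return object_model.get(tuple(np.round(loc_in_object, 2)))

pose = (np.eye(2), np.array([1.0, 2.0]))    # object at (1, 2), unrotated
print(predict_feature((1.05, 1.95), pose))  # -> "logo_corner"
```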
We also take a different stance on deep learning itself. Monty doesn’t use deep learning at all, not only for performance reasons, but because we believe it’s not how the brain works. The learning mechanisms are fundamentally different. We discuss this more here.
Here are more thoughts on this:
ARC-AGI provides both an informal definition:
AGI is a system that can efficiently acquire new skills outside of its training data.
and a formal definition:
The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.
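As a loose schematic only (Chollet's actual formalism is grounded in algorithmic information theory, and this is not his equation), that sentence has roughly the shape:

```latex
% Loose schematic of the verbal definition above, not Chollet's formalism:
% intelligence rises with the skill attained on new tasks, weighted by how
% hard they are to generalize to, and falls with the priors and experience
% the system consumed to get there.
I \;\propto\; \sum_{\text{tasks}} \frac{\text{skill} \times \text{generalization difficulty}}{\text{priors} + \text{experience}}
```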
Does TBP provide something similar? It would make elevator pitches much easier.
I agree with Chollet that intelligence shouldn't be measured by specific tasks. Saying it is the ability to learn, or learn new things, or apply previous knowledge to new tasks, is too fuzzy. Many different types of systems could meet these requirements, but inconsistently. In my book A Thousand Brains, I argued that the presence of intelligence should be measured by how a system works internally. This is how we determine whether something is a computer or not. My toaster and my laptop both have computers in them even though they have different capabilities. They have read/write memory, a CPU, etc., and are both universal Turing machines.
In the future, we will adopt similar terminology for AI systems. Biological brains, and Thousand Brains AI, both have the ability to directly sense the world, the ability to move their sensors relative to the world for learning and inference, and to manipulate the world to achieve goals. Internally, knowledge is represented using reference frames that capture the structure of the world.
With this definition, you can imagine a range of intelligent machines, from small and limited to super-human. I haven't tried to reduce this to an elevator pitch, but I believe this is how we will define intelligence in the future.
The TBT model is indeed unique in key areas, like the use of reference frames in each column and building a compositional hierarchy across groups of columns. My curiosity goes to the question of how attention may be triggered and steered within such hierarchies, both bottom-up and top-down. And where are the results of such attention temporarily stored? The answer to the second question is probably that everything in the object remains distributed across the set of active columns, but where are attentional decisions made? The voting mechanism must include some form of attentional control, because some object components may be shared by competing compositional hierarchies. A logo, for example, can be part of a mug, but also part of a bottle or a pen (like a Montblanc). Attention control is needed in each of these three competing models before some higher instance of voting decides which one wins. I can envision this compositional attention being both distributed and, from a temporal point of view, running in parallel. Do we have a concept for this?
Here are a few options:
Broad Classes of Methods for Aggregation
There are multiple conceptually distinct methods by which the system could integrate features into a higher-level object representation. These methods often overlap in real cortex, but they can be separated for clarity:
1. Summation / Linear Voting
- Each column’s activity is a “vote” for features.
- Higher-level columns simply sum or average the incoming activity.
- Recognition occurs when the summed evidence exceeds a threshold.
- Analogy: “bag-of-features” models in vision.
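A minimal numpy sketch of this variant (toy numbers, purely illustrative):

```python
import numpy as np

# Votes from 4 columns over 3 candidate objects (toy numbers).
votes = np.array([
    [0.9, 0.1, 0.3],   # column 1
    [0.8, 0.2, 0.4],   # column 2
    [0.7, 0.3, 0.2],   # column 3
    [0.9, 0.1, 0.1],   # column 4
])

# The higher level simply averages the votes and thresholds the result.
evidence = votes.mean(axis=0)                         # [0.825 0.175 0.25]
print("recognized:", np.flatnonzero(evidence > 0.6))  # object 0
```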
2. Multiplicative Conjunction / Coincidence Detection
- Higher-level columns fire only when specific sets of lower-level features co-occur.
- This requires coincidence detection (e.g., dendritic nonlinearities detecting simultaneous input patterns).
- Analogy: AND-like gating, or conjunctive feature binding.
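The same idea as a toy AND-gate over binary feature inputs (illustrative names only):

```python
import numpy as np

# Binary activity of lower-level feature detectors.
features = np.array([1, 1, 0, 1, 0])

# A higher-level unit fires only if ALL features in its conjunction are
# simultaneously active (a stand-in for dendritic coincidence detection).
conjunctions = {"mug": [0, 1, 3], "pen": [2, 4]}

for obj, idx in conjunctions.items():
    print(obj, "unit fires:", bool(features[idx].all()))  # mug: True, pen: False
```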
3. Pattern Completion in Attractor Networks
- The higher-level representation exists as an attractor state in a recurrent network.
- Partial feature activation from lower-level columns “pulls” the network into the full object representation.
- Analogy: Hopfield networks, hippocampal pattern completion.
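A tiny Hopfield-style example: store one ±1 pattern via a Hebbian outer product, then recover it from a corrupted cue.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
pattern = rng.choice([-1, 1], size=N)  # the stored "full object" attractor

# Hebbian storage of a single pattern (zero self-connections).
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0)

# Partial cue: a quarter of the bits flipped.
cue = pattern.copy()
cue[rng.choice(N, size=N // 4, replace=False)] *= -1

# Recurrent updates pull the state back into the stored attractor.
state = cue
for _ in range(5):
    state = np.sign(W @ state)

print("overlap with stored pattern:", int(state @ pattern), "/", N)  # 64 / 64
```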
4. Predictive Coding / Error Minimization
- Higher-level columns send predictions downward.
- Lower-level columns compare observed features against predictions.
- When prediction errors are minimized across a set of columns, the higher-level object is recognized.
- Analogy: predictive coding theories of cortex.
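A one-layer toy version: score object hypotheses by how small the error is between their top-down predictions and the observed features (all numbers invented):

```python
import numpy as np

# Top-down predictions each object hypothesis makes about low-level features.
predictions = {
    "mug":    np.array([1.0, 0.8, 0.1, 0.9]),
    "bottle": np.array([0.2, 0.9, 0.8, 0.1]),
}
observed = np.array([0.9, 0.7, 0.2, 1.0])

# Lower levels compute prediction errors; the hypothesis minimizing total
# error across columns wins.
errors = {o: float(np.sum((p - observed) ** 2)) for o, p in predictions.items()}
print(errors, "->", min(errors, key=errors.get))  # "mug"
```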
5. Temporal Sequence Aggregation
- Features are integrated across time rather than space alone.
- The higher-level object is recognized not by a static configuration but by the unfolding sequence of features (e.g., the parts of an object revealed as you move your eyes).
- Analogy: sequence memory in HTM.
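A toy recognizer that accumulates evidence as a feature sequence unfolds, e.g. across eye movements (object names and sequences invented):

```python
# Stored feature sequences per object; order matters.
sequences = {
    "mug":  ["rim", "handle", "logo"],
    "bowl": ["rim", "rim", "base"],
}

def recognize(observed):
    """Keep only objects whose stored sequence matches the prefix seen so far."""
    return {o for o, seq in sequences.items()
            if seq[:len(observed)] == observed}

print(recognize(["rim"]))            # {'mug', 'bowl'}: still ambiguous
print(recognize(["rim", "handle"]))  # {'mug'}: the sequence disambiguates
```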
6. Synchrony / Oscillatory Binding
- Features belonging to the same object are grouped by synchronous firing or phase-locking.
- Higher-level recognition emerges when a consistent oscillatory group is detected.
- Analogy: binding-by-synchrony hypothesis in neuroscience.
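A toy grouping of feature units by firing phase: units phase-locked within a tolerance are bound into the same object (all numbers invented):

```python
# Firing phase (radians) of each feature unit within one oscillation cycle.
phases = {"rim": 0.10, "handle": 0.15, "logo": 0.12, "cap": 2.00, "barrel": 2.10}

def group_by_phase(phases, tol=0.3):
    """Bind units whose phases differ by less than tol into one group."""
    groups = []
    for unit, ph in sorted(phases.items(), key=lambda kv: kv[1]):
        if groups and abs(ph - groups[-1][-1][1]) < tol:
            groups[-1].append((unit, ph))
        else:
            groups.append([(unit, ph)])
    return [[u for u, _ in g] for g in groups]

print(group_by_phase(phases))  # [['rim', 'logo', 'handle'], ['cap', 'barrel']]
```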
7. Bayesian / Probabilistic Inference
- Columns encode likelihoods of features.
- Higher-level representations emerge by Bayesian inference combining likelihoods and priors.
- Analogy: probabilistic graphical models, belief propagation.
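A minimal posterior computation combining per-column likelihoods with a prior, naive-Bayes style (toy numbers):

```python
import numpy as np

objects = ["mug", "bottle", "pen"]
prior = np.array([0.5, 0.3, 0.2])

# Each row: one column's likelihood P(observed feature | object).
likelihoods = np.array([
    [0.9, 0.2, 0.1],   # column 1
    [0.7, 0.4, 0.1],   # column 2
    [0.8, 0.3, 0.2],   # column 3
])

# Multiply likelihoods across columns, weight by the prior, normalize.
posterior = prior * likelihoods.prod(axis=0)
posterior /= posterior.sum()
print(dict(zip(objects, posterior.round(3))))  # "mug" dominates
```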
8. Sparse Distributed Representations (SDRs)
- Features map into a sparse code that is unique for each higher-level object.
- The overlap structure of the SDR ensures that higher-level units can robustly detect the object even from partial input.
- Analogy: HTM’s sparse coding or cortical population codes.
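A toy overlap match between a partial input SDR and stored object SDRs, represented here as sets of active bit indices:

```python
# Stored SDRs as sets of active bit indices (tiny toy sizes).
sdrs = {
    "mug":    {3, 17, 42, 58, 71, 90},
    "bottle": {5, 17, 33, 60, 71, 88},
}

# Partial input: only some of the mug's active bits were observed.
partial = {3, 42, 71, 90}

# Overlap score: recognition stays robust even from partial input.
scores = {obj: len(bits & partial) for obj, bits in sdrs.items()}
print(scores, "->", max(scores, key=scores.get))  # {'mug': 4, 'bottle': 1} -> mug
```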
These computational variants are all compatible with the papers on cortico-thalamic pathways segmenting the cortical hierarchical regions (Murray et al.).
These algorithmic alternatives just point out the options that exist at the lower levels of representational computation.
Another interesting approach:
Here is the link to the paper: “A cortical hierarchy of localized and distributed processes revealed via dissociation of task activations, connectivity changes, and intrinsic timescales” (ScienceDirect).
I definitely believe our brain circuitry operates on a multi-clocked set of timescales. (I use “clocked” loosely; I mean a set of different time domains.)
