2026/02 - A Thousand Brains on a Thousand Chips: Scaling Monty on CPUs, GPUs, and PiM chips

Xavier from ETH Zurich presents his master’s thesis, exploring how Thousand Brains systems could scale on modern hardware. His research examines how scaling the number of learning modules affects computing performance on GPUs, CPUs, and processing-in-memory (PiM) architectures. GPUs aren’t a great fit because auto-regressive algorithms like Monty have low operational intensity. CPUs scale reasonably well but require more computation time as the number of modules increases. PiM chips offer a promising alternative by placing computation near memory, enabling large-scale parallelism. Xavier shows results from scaling to 2,500 learning modules, representing millions of neurons and billions of synapses.

0:00 Introduction
0:49 An Overview of Monty’s Structure
2:12 Motivation: Rapid, Continuous, and Compute Efficient Learning
2:51 Scaling Cortical Columns
3:48 Scaling an Algorithm with an Auto-regressive Loop
4:39 Investigate the Scalability of Thousand Brains Systems
5:41 Montyll – A Novel Thousand Brains System
6:25 Why HTM Networks?
8:03 Scaling on GPUs
9:56 Scaling on GPUs: Operational Intensity
11:54 Scaling on CPUs
13:07 Can Multicore CPUs Handle the Amount of Data Movement?
15:03 Scaling on Processing-in-Memory (PiM) Chips
15:47 Scaling on PiMs: DRAM Banks
24:14 Scaling in Data Centers
25:13 The Montyll Implementation
30:20 Cat Cortex Scale System (2500 Learning Modules)
32:33 Why Logic Frequency Is Low?
33:58 Processing-in-Memory Chip Illustration
38:20 WRAM
39:26 MRAM
40:41 Connection Transfer
41:12 Tasklet Level Parallelism
42:08 Barriers and Synchronization
43:36 The Results
45:35 Results: Time per Step
56:40 Neurons and Synapses vs Devices

Hey, this is my work!

I am very excited that this video is coming out today because I’ve been meaning to share this work with the smart people in this forum for a while now. I hope you guys get some value out of it. Let me now add some context and key takeaways.

Goal. If one really believes that the learning modules are modeled after columns in the neocortex, it raises the question: what are the consequences of scaling Thousand Brains Systems to the scale of the human neocortex (200’000 learning modules)? Which computing platforms are best equipped to meet these scaling requirements?

Importance. A Thousand Brains System is basically a huge set of heterogeneous weights (different learning modules) operating on a set of heterogeneous inputs (different sensor patches). This implies a unique scaling profile that is fundamentally different from that of deep-learning-based AI systems. Understanding this scaling profile and its impact on different computing architectures felt important to me.
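To make the contrast concrete, here is a minimal NumPy sketch (with made-up sizes) of why heterogeneous weights kill reuse: a deep-learning layer reads one weight matrix and reuses it across the whole batch, while a Thousand Brains step has to stream every module’s private weights from memory.

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
n_modules = 100   # learning modules / batch items
d = 256           # feature dimension

rng = np.random.default_rng(0)
inputs = rng.standard_normal((n_modules, d))

# Deep-learning profile: ONE weight matrix shared across the whole batch.
# The d*d weights are read once and reused n_modules times -> high reuse.
W_shared = rng.standard_normal((d, d))
out_dl = inputs @ W_shared

# Thousand-Brains profile: each module owns its OWN weights.
# All n_modules * d * d weights must be streamed from memory every step,
# so there is (almost) no weight reuse to amortize the memory traffic.
W_per_module = rng.standard_normal((n_modules, d, d))
out_tbs = np.einsum('md,mde->me', inputs, W_per_module)

bytes_dl = W_shared.nbytes        # weight bytes touched per step (shared)
bytes_tbs = W_per_module.nbytes   # weight bytes touched per step (per-module)
print(bytes_tbs / bytes_dl)       # -> n_modules times more weight traffic
```

Same arithmetic per module, but the per-module case moves n_modules times more weight data per step, which is the scaling profile the thesis is about.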

Methods. We introduce a novel Thousand Brains System, called Montyll, which stands for “Monty low-level”, because it introduces elements of low-level cortical processing (HTM networks). This is for two reasons. The first is that I wanted to computationally capture the long-term goals of the Thousand Brains Project, which include incorporating elements of HTM networks into Thousand Brains Systems. The second is that I was really interested in evaluating a promising computing architecture called “Processing-in-Memory” for scaling Thousand Brains Systems, and HTM networks were a great match for its capabilities. We look at the potential of GPUs, CPUs and Processing-in-Memory (PiM) in scaling Montyll. I also talk about clusters and neuromorphic hardware in the thesis.

Key Takeaways. You’ll have to read the thesis or watch the talk for the full details, but here are some key takeaways.

  1. GPUs are not a great fit for scaling Thousand Brains Systems. It has to do with their need for high operational intensity (high reuse), which is incompatible with the auto-regressive loop and heterogeneous weights found in Thousand Brains Systems. This does not completely exclude GPUs from accelerating the workload, and we’ve already seen some impressive efforts from people in this forum, but I do not think they can revolutionize the applicability and scale of Thousand Brains Systems the way they did for deep learning.
  2. Short/Medium Term. CPUs are more than good enough. I do not see Thousand Brains researchers scrambling to find a better computing platform in the short to medium term. Single-node CPUs can already accommodate pretty big systems, of the size of the guinea pig cortex (400 LMs) and potentially up to the size of a cat cortex (2’500 LMs). Data centers and CPU clusters close the scale gap, with the promise of scaling to the human neocortex without needing an egregious number of machines.
  3. Medium Term. DRAM-based PiM represents a very capable computing platform. The promise of PiM lies in its ability to execute a massive number of learning modules in parallel. There are many drawbacks to using Processing-in-Memory chips today, but they are not fundamental or “first principles” drawbacks; they mostly stem from a lack of resources being put towards making the chips better and more capable. If Thousand Brains Systems become as big a deal as I think they will, I see DRAM-based PiM as having the potential to revolutionize the scalability of Thousand Brains Systems in the medium term, especially for embodied applications that require real-time, power-constrained operation. Only time will tell if I am right. It is quite possible that large investments are instead made into connecting robots to compute clusters, which would undercut the need for scale in a small form factor, but that approach has its own obvious problems.
  4. Long Term. If the idea is to eventually have a machine capable of running a Thousand Brains System in a small form factor (single node, low-ish power, for embodied intelligence), I am not sure any compute platform today can hit the device scaling requirements, let alone the power budget. You would need ~600 TB of memory footprint, plus copious compute parallelism at multiple levels of granularity. That’s outside the scope of the thesis, but if I had to make a bet, NVM-based PiM looks like the best fit from what I’ve seen, for its combination of footprint and weight-heterogeneous compute parallelism. I don’t see how Processing-in-Storage (PiS) could accommodate the workload, but I might be wrong.
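The operational-intensity point in takeaway 1 can be put into numbers with a back-of-envelope roofline check. The GPU figures below are illustrative assumptions, not measurements from the thesis.

```python
# Back-of-envelope roofline check, with made-up but plausible numbers.
# A step that reads each learning module's weights once and does one
# multiply-accumulate per weight has an operational intensity of about
# 2 FLOPs per 4-byte weight = 0.5 FLOP/byte.
flops_per_weight = 2          # one multiply + one add
bytes_per_weight = 4          # float32
intensity = flops_per_weight / bytes_per_weight   # FLOP/byte

# Hypothetical GPU: ~50 TFLOP/s peak, ~1 TB/s memory bandwidth.
peak_flops = 50e12
bandwidth = 1e12
machine_balance = peak_flops / bandwidth          # ~50 FLOP/byte

# Memory-bound whenever intensity < machine balance; here by ~100x,
# so the GPU's arithmetic units would sit mostly idle.
print(intensity, machine_balance, intensity < machine_balance)
```

Deep learning escapes this by reusing weights across a batch, which multiplies the intensity; the heterogeneous weights of a Thousand Brains System leave no such reuse to exploit.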

Links.

Acknowledgements. I want to specifically not thank Jeff Hawkins and the whole Thousand Brains Project team, including the great Viviane Clay. Getting headaches thinking about this whole thing is your guys’ fault. A direct consequence of the awfully good supervision I received from Doctor Clay and the disgustingly interesting ideas and theories developed by Jeff Hawkins and everyone else over the years at Numenta and the Thousand Brains Project.

I know you guys will probably have some interesting thoughts and questions about this, which I’ll be excitedly waiting for.

Hi @xavier

Thank you for a very interesting presentation.

I don’t know if you looked at this, but another option is arrays of FPGAs. I have designed several ASICs and many FPGAs. With an FPGA you can design your own logic, with whatever-precision arithmetic you need, and as much of it as you need. There is on-chip single-cycle memory, and typically there is a DRAM interface for bulk storage. You can also attach non-volatile memory, of course, and the logic can run faster than 400 MHz.

If, for example, you could implement 8 learning modules on a single FPGA + DRAM, then an array of 64 FPGAs on a PCB would give you 512 learning modules, all running in parallel. Placing 8 of these PCBs in a rack (along with some fans!) would give you 4096 learning modules, all in parallel.

FPGAs have many I/O pins including some very high speed (PCIe gen3) I/O with which to implement interconnect protocols both chip to chip and board to board.

And being FPGA it means that you can tinker with the learning module design and the interconnect protocols as ideas develop. When the design crystallizes you might consider going to ASIC to reduce system cost.

It would not be a cheap system to build, probably quarter of a million dollars for the hardware alone, but still much cheaper than custom silicon, and without the risk.

Alex

Well, that’s a very good point, Alex. I’m a bit ashamed to admit that I really haven’t given much thought to FPGAs, so I’m pretty happy about your comment because it gives me something new to think about.

My first thought is that an FPGA array suffers from the same memory-wall problem as a CPU-based system. The workload is memory-bound, so the limiting factor is not necessarily logic processing speed but rather accessing, at every step, the huge amount of learning-module data that is DRAM-resident. Parallelizing over more memory systems (each FPGA gets its own DRAM chip) will always yield a benefit because it increases the aggregate memory bandwidth, but that would also be true for CPU-based systems.
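The bandwidth argument reduces to a one-line model: a memory-bound step cannot finish faster than (total resident data) / (aggregate bandwidth), regardless of which platform owns the DRAM channels. All numbers below are illustrative assumptions, not measurements.

```python
# Crude model: step time is bounded below by the time to stream every
# learning module's DRAM-resident state through memory once.
def min_step_time(n_modules, bytes_per_module, n_channels, bw_per_channel):
    """Lower bound on step time (seconds) for a memory-bound workload."""
    total_bytes = n_modules * bytes_per_module
    aggregate_bw = n_channels * bw_per_channel
    return total_bytes / aggregate_bw

# Assumed: 2,500 modules x 100 MB each, one DDR channel at ~25 GB/s.
single = min_step_time(2500, 100e6, n_channels=1, bw_per_channel=25e9)

# The same data spread over 64 FPGAs, each with its own DRAM channel.
array = min_step_time(2500, 100e6, n_channels=64, bw_per_channel=25e9)

print(single, array)  # more memory systems help any platform equally
```

The FPGA array wins only by multiplying `n_channels`, which a multi-socket or multi-node CPU system can do too; the per-channel bandwidth wall is the same either way.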

Very cool that you have developed the expertise to design several ASICs and FPGAs. That’s above my pay grade, but I always looked up to people who could do that.

Yes, you could of course build an array of CPUs, but CPUs require a lot of support hardware and don’t lend themselves to interconnect communications very well.

For sure, FPGAs are losing prominence to CPUs and GPUs in the processing world, but I think for the requirements of this new kind of architecture they may have some advantages: distributed memory, parallel processing and optimised logic functions make them a good fit.

Another big issue for continuous learning systems is non-volatility. A sensorimotor intelligence is creating memories and learning from the moment it is switched on and its sensors start sending back information. If those memories are in DRAM and there is a power glitch then all of the learning is lost. So you either have to have a very reliable power source or write the memories to non-volatile storage which is typically much slower than DRAM. Biological brains suffer the same weakness sadly.

Like DRAM and many other semiconductors, FPGAs are experiencing shortages and price hikes, but you can experiment with a single FPGA on a development board, there are even development boards that plug into PC expansion slots.

I’ve been working on a (mostly) personal project for (at least) 15 years. It’s a computational psychohistory model aimed not at future prediction, but at archaeological research. The concept is to solve difficult anthro and archaeo problems by constructing truly vast sociological simulations. Linear A, anachronisms like Antikythera, and detailed explanations of the rise and fall of civilizations are examples (sort of Gibbon meets Asimov meets scidata :wink: )

It’s definitely neurological, not GenAI. It’s also very old-fashioned (GOFAI). I learn a lot from TBP and even from the old Palm Computing days. The agents are coded in PROLOGish FORTH and the simulation itself in big FORTH, so there’s no copy/pasting of anything from TBP (or anywhere else, other than my own old code). There’s no github team because I’m old, crusty, and hard to work with.

If there’s something helpful I can contribute to TBP, I do (mostly biological findings) just to earn my keep here and not be accused of lurking.

Assembling a diverse, distributed, and motivated research group is not easy, kudos to the TBP leadership. I ran a worldwide Citizen Science team for a decade that eventually got squeezed out by the hostile GPU agora, which pushed me to look at CPU, PIM, FPGA, and other technologies. I work on a small scale (just a few ‘cells’), and was going to scale up to a superchip (working name SELDON I) to be made at a foundry. It would contain a very large number of independent ‘PROLOG-IN-FORTH’ cores, enabling individual agents to run largely independent threads. This architecture isn’t conducive to GPU usage of course.

The recent talk of moving the big Samsung foundry (and others) to Ontario, Canada has rekindled that notion a wee bit. The pace of change these days is challenging.

As Lynn Margulis often said, we shouldn’t be in any great hurry to forget or dismiss old theory and practice. The mind is still largely unexplored and un-replicated computationally (which is why we’re here). The jury’s still out on whether token-munching, stochastic GenAI is The Great Revolution or simply a trillion dollar parlor trick.

A lot of TBP scaling discussions seem to assume that all of the modules need to be available at all times and that communication speed is a critical factor. However, I doubt that this is the case: apparently, most cortical columns are idle most of the time. So, here’s a different approach…

Set up a bunch of processors, each of which has a dispatching module (DM). When a message is received, the DM checks to see if the target is resident in memory. If so, the DM simply forwards the message. If not, the DM sends a message to the module loader, asking for the target to be swapped in, then forwards the message in due time. And, if a processor starts to get bogged down, we move some of its modules to other processors.

None of this is new technology; paging and swapping are common in OS design and process migration has been used in some systems. The key difference here is the duty cycle: we might be able to have 90% of the code and data swapped out at any given time. That wouldn’t work for most computing systems, but it might for us…
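A minimal sketch of that dispatch-and-swap scheme (all names are hypothetical, not Monty APIs): the DM forwards messages to resident modules and asks a loader to swap the target in on a miss, evicting the least-recently-used module when capacity is reached.

```python
from collections import OrderedDict

class DispatchingModule:
    """Sketch of the forwarding/swapping idea from the post above.
    `loader` stands in for the module loader; `capacity` caps how many
    learning modules are resident at once (the other ~90% stay out)."""

    def __init__(self, loader, capacity):
        self.loader = loader             # callable: module_id -> module
        self.capacity = capacity         # max modules resident at once
        self.resident = OrderedDict()    # module_id -> module (LRU order)

    def deliver(self, module_id, message):
        module = self.resident.get(module_id)
        if module is None:
            # Target is swapped out: evict the least-recently-used
            # module if full, then ask the loader to bring it in.
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)
            module = self.loader(module_id)
        # Mark as most recently used, then forward the message.
        self.resident[module_id] = module
        self.resident.move_to_end(module_id)
        return module(message)

# Toy usage: "modules" are closures that just tag messages with their id.
dm = DispatchingModule(loader=lambda i: (lambda msg: (i, msg)), capacity=2)
print(dm.deliver(7, "hello"))   # loads module 7 on a miss, then forwards
```

Swapping cost only pays off if the working set of active modules is as small as the duty-cycle argument suggests; otherwise the DM spends its time thrashing the loader.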

@Rich_Morin makes a very good point. As the system scales up we can improve efficiency by selectively processing only those parts that are experiencing change. We won’t know the trade-offs until we start to build big systems and hit hardware limitations.

While musing on the processing of an architected neural network for my plastic spider, I realised that if I have to process all neurons many times per second, I will soon run out of processing power, or memory bandwidth, or both.

Processing all neurons on every pass means that the order of processing is unimportant, as any sensory input changes will ripple through in a few passes, which lends itself to parallel processing very nicely. However, many of the neurons will be inactive. It would be more efficient to follow paths of activation and abandon inactive paths. But then processing order becomes important again, and the processing engine becomes more complicated.
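The trade-off can be shown on a toy feed-forward graph (no claim about Monty’s or the spider’s actual wiring): a dense pass touches every neuron and is order-independent, while an event-driven pass follows only the active frontier.

```python
# Toy network: neuron id -> downstream targets. Neurons 4 and 5 are on
# an inactive path that the event-driven pass never has to visit.
adjacency = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [5], 5: []}

def dense_step(active):
    """Touch every neuron; order doesn't matter, trivially parallel."""
    nxt = set()
    for neuron, targets in adjacency.items():   # visits ALL neurons
        if neuron in active:
            nxt.update(targets)
    return nxt

def event_driven_step(active):
    """Touch only active neurons; cheaper, but scheduling and
    processing order now matter in a real (stateful) implementation."""
    nxt = set()
    for neuron in active:                       # visits only the frontier
        nxt.update(adjacency[neuron])
    return nxt

print(dense_step({0}))         # same result after sweeping all 6 neurons
print(event_driven_step({0}))  # same result after touching just 1 neuron
```

Both compute the same next frontier; the dense version wastes work on inactive neurons but parallelizes naively, which is exactly the tension described above.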

I agree with Xavier that FPGAs would suffer the same memory bottleneck on the DRAM side. It’s a bit of a shame, because FPGA logic blocks are kinda organized in a way that’s reminiscent of cortical columns. Their distributed BRAM could technically alleviate the memory wall, but even high-end FPGAs only have a few hundred MBs of it, so not really viable.

The non-volatility part is another can of worms on its own. To save something, computers have to transfer stuff between DRAM and drives, unlike neurons. What happens while saving? Do we put the robot on standby? How often? The thesis mentions terabytes of data, what about SSD lifespan? Incremental backups!? etc. But maybe it’s a bit early to talk about that.
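One hedged answer to the incremental-backup question is dirty-module checkpointing: track which learning modules changed since the last flush and write only those, so the robot never has to stop and dump terabytes at once. A toy sketch with hypothetical names (nothing here is from Monty):

```python
# Sketch of incremental checkpointing: flush only the learning modules
# that changed since the last checkpoint, not the whole state.
class IncrementalCheckpointer:
    def __init__(self):
        self.storage = {}     # stands in for SSD/NVM-resident snapshots
        self.dirty = set()    # module ids modified since last flush

    def mark_dirty(self, module_id):
        self.dirty.add(module_id)

    def checkpoint(self, live_state):
        """Copy only dirty modules to storage; return bytes written."""
        written = 0
        for module_id in self.dirty:
            blob = live_state[module_id]
            self.storage[module_id] = blob
            written += len(blob)
        self.dirty.clear()
        return written

ckpt = IncrementalCheckpointer()
state = {0: b"x" * 100, 1: b"y" * 100, 2: b"z" * 100}
ckpt.mark_dirty(1)                  # only module 1 learned something
print(ckpt.checkpoint(state))       # writes 100 bytes, not 300
```

This also helps the SSD-lifespan worry: write volume scales with how much was learned since the last flush, not with total model size, though a real version would need crash-consistent (atomic) writes.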

On a side-note, here’s a pretty great introduction to FPGAs for those interested: The Most Versatile Chips Ever Built || FPGA Deep Dive and Use

Yeah, agreed with @Rich_Morin: CPU systems are pretty well suited, given that only a sparse set of columns is active at each step. Though we still suffer the cost of moving all of the activated columns’ memory from main memory to logic, which is going to be an important challenge in the medium/long term.

The non-volatility you guys are talking about (@Alex and @AgentRev) reminds me of the following. In DRAM, there is a constant power cost to pay for keeping the data uncorrupted: every DRAM cell needs to be “refreshed” multiple times a second. This is in part why Processing-in-Memory solutions cannot accommodate scaling to terabytes on a single compute node; the power demands would be too high.

On the other hand, non-volatile memory technologies have not been shown to be a close match to DRAM on multiple dimensions: density, endurance and latency. Choosing non-volatile technology over DRAM comes at a cost.

In any case, discussing these details does feel like getting ahead of myself. If anything, my work showed that typical CPU-based systems, with typical DRAM main memory and flash storage should be more than good enough for the short to medium term.

Before doing this work, I thought I would find that Thousand Brains Systems would need a very particular computing architecture to support the scaling. I still believe that this will be true in the long term, but what I found for the short/medium term is that we’ll be fine. I don’t see researchers scrambling for esoteric hardware to scale Thousand Brains.