"Introducing Nested Learning: A new ML paradigm for continual learning" by Google

Ali Behrouz et al., the authors of the Titans architecture that made waves earlier this year, published a new paper introducing an architecture called “Hope”.

Blog post: Introducing Nested Learning: A new ML paradigm for continual learning

Paper: https://abehrouz.github.io/files/NL.pdf

Abstract

Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance the capability of machine learning models. Despite recent progress, particularly in developing Language Models (LMs), there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find “effective solutions”. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a model with a set of nested, multi-level, and/or parallel optimization problems, each with its own “context flow”. NL reveals that existing deep learning methods learn from data by compressing their own context flow, and explains how in-context learning emerges in large models. NL suggests a path (a new dimension to deep learning) to design more expressive learning algorithms with more “levels”, resulting in higher-order in-context learning abilities. In addition to its neuroscientifically plausible and mathematically white-box nature, we advocate for its importance by presenting three core contributions: (1) Deep Optimizers: Based on NL, we show that well-known gradient-based optimizers (e.g., Adam, SGD with Momentum, etc.) are in fact associative memory modules that aim to compress the gradients with gradient descent. Building on this insight, we present a set of more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Titans: Taking advantage of NL’s insights on learning algorithms, we present a novel sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation of memory systems that generalizes the traditional viewpoint of “long-term/short-term memory”.
Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called Hope, showing promising results in language modeling, continual learning, and long-context reasoning tasks.
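To make the “optimizers are associative memory modules” claim concrete: the momentum buffer in plain SGD with momentum is an exponentially weighted summary of all past gradients, i.e. a lossy compression of the gradient history that the parameter update then reads from. Here is a minimal sketch of that standard formulation (this is textbook SGD with momentum, not the paper’s extended deep optimizers; the function name is mine):

```python
# SGD with momentum, written to highlight that the momentum buffer m is a
# lossy, exponentially weighted compression of the gradient history.
def sgd_momentum_step(w, m, grad, lr=0.01, beta=0.9):
    m = beta * m + grad  # m summarizes (compresses) all past gradients
    w = w - lr * m       # the parameter update reads from that summary
    return w, m

# Three identical gradients: m accumulates 1.0, 1.9, 2.71 (a compressed
# history), and w moves further each step as the summary grows.
w, m = 0.0, 0.0
for g in [1.0, 1.0, 1.0]:
    w, m = sgd_momentum_step(w, m, g)
```

The paper’s “deep optimizers” generalize this by replacing the linear buffer `m` with deeper memory modules and richer update rules, but the compression view is already visible in this two-line version.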

My honest take

The authors performed a mathematical reframing of optimizers to split them into layers, and introduced multi-rate parameter updates at different frequencies (e.g. every 1, 100, or 10,000 tokens) for these layers. Fast params adapt to new tasks quickly, while slow params preserve old knowledge, which the authors think might reduce forgetting both in-context and during re-training. The benchmark results show modest performance improvements in some cases, and no improvement in others.
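The multi-rate idea above can be sketched in a few lines. This is my own toy illustration of the general mechanism, not the paper’s actual algorithm: each parameter group accumulates gradients and only applies them every `period` steps, so the “slow” group barely moves while the “fast” group tracks the data closely (all names, periods, and the learning rate are made up for illustration):

```python
# Toy sketch of multi-rate ("nested") parameter updates: each group applies
# its averaged, accumulated gradient only every `period` steps.
def multi_rate_step(params, grads, step, periods):
    for name, p in params.items():
        p["accum"] += grads[name]
        if step % periods[name] == 0:
            p["value"] -= 0.1 * p["accum"] / periods[name]  # lr=0.1, averaged
            p["accum"] = 0.0
    return params

params = {
    "fast": {"value": 1.0, "accum": 0.0},  # updated every step
    "slow": {"value": 1.0, "accum": 0.0},  # updated every 100 steps
}
periods = {"fast": 1, "slow": 100}

for step in range(1, 201):
    grads = {"fast": 0.5, "slow": 0.5}  # dummy constant gradients
    multi_rate_step(params, grads, step, periods)
```

With identical constant gradients, the fast group drifts far from its starting value while the slow group moves only twice (at steps 100 and 200) and stays near 1.0, which is the “fast params adapt, slow params preserve” intuition in miniature.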

The authors then go on to plaster a bunch of neuroscience terms like “associative memory”, “neuroplasticity”, “human brain” and plenty of buzzwords like “continual learning”, “self-modifying”, “neuroscientifically plausible”, “self-improving AI”, and so on, despite these having little to do with their actual findings.

Long story short: a small incremental improvement, few results, and lots of sensationalism!

Edit: The authors mention that they provide more extensive results in the appendix, but it’s not published yet.


Interesting. I would have interpreted temporal windows more as a series of coordination/execution layers, rather than nested learning loops. The authors also seem to put a bit too much functional weight on the oscillations themselves. To my knowledge, brain waves don’t perform the learning in and of themselves, but rather schedule and coordinate it. I like the overall direction, but the paper feels a bit too abstracted from the neuroscience to justify borrowing so many of its terms. Either way, cool post. Thanks for sharing!


Thanks for sharing! It’s hard to keep up with everything that is happening in AI these days, so it’s super useful to read this summary and your take on its relationship to neuroscience and TBP :slight_smile:


There is another similar paper where Chris Royse proposes an approach to solving the “learning” and “remembering” parts of ML. Here is a link to the paper at ResearchGate, and a LinkedIn post from Chris that has a short YT video.


@duncwa That’s more of a high-level RAG pipeline, whereas Nested Learning is an ML architecture; apples to oranges :wink: