Hi @Aleksandar_Kamburov! This is Hojae, one of TBP’s Researchers. Welcome to the community!
Great question about the convergence between these approaches. We’re familiar with some of the work in the field of mechanistic interpretability and with Sparse Autoencoder (SAE) techniques for understanding LLMs (e.g., Anthropic’s Interpretability team has also done work in this area, such as Scaling Monosemanticity).
The SAE approach is fascinating because it attempts to identify the “concepts” learned by LLMs after they’ve been trained. While OpenAI’s work identified an impressive 16 million concepts from GPT-4, there are still important limitations. First, these concepts often require human labeling to be meaningful, and the discovered features can be difficult to interpret consistently; sometimes what appears to be a single concept shows unexpected activations alongside seemingly unrelated concepts.

Second, there’s certainly movement toward convergence with reverse-engineering approaches, but we’re not quite there yet. In the near future the DL community may adopt techniques from earlier “steerability” work on generative networks (e.g., semantic factorization of StyleGANs and GANspace) to causally influence model behavior by activating specific features (for instance, in the Anthropic paper linked above, they “steered” Claude into believing it was the Golden Gate Bridge by maximizing that feature’s activation), but we don’t yet have a clear path to fully reverse-engineer an LLM based on SAE-discovered concepts.
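For anyone curious, here’s a rough toy sketch of the idea (not TBP code, and not how Anthropic or OpenAI actually implement it; the dimensions, feature index, and steering value below are made-up placeholders): an SAE re-expresses a layer’s activations as a wider, sparse set of features, and “steering” amounts to clamping one of those features to a high value before decoding back into the model’s activation space.

```python
# Toy sparse-autoencoder sketch over LLM activations, plus feature "steering".
# All sizes and indices are illustrative assumptions, not real model values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> sparse features
        self.decoder = nn.Linear(n_features, d_model)  # sparse features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # non-negative, encouraged to be sparse
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(4, 768)                             # stand-in for a layer's activations
recon, feats = sae(acts)

# Training minimizes reconstruction error plus a sparsity penalty on the features.
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()

# "Steering" in the Golden Gate Claude sense: clamp one learned feature high,
# decode back, and substitute the edited activations into the model's forward pass.
steer_idx = 42                                         # hypothetical feature index
feats_steered = feats.clone()
feats_steered[:, steer_idx] = 10.0                     # force the chosen concept to be strongly active
acts_steered = sae.decoder(feats_steered)
```

In practice the SAE is trained on enormous numbers of activation vectors, and the resulting features still need human inspection to be labeled, which is exactly the limitation mentioned above.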
One interesting philosophical difference is our perspective on “polysemanticity” (the idea that individual neurons represent multiple concepts) and “monosemanticity” (the idea that neurons correspond one-to-one with features or concepts) from the field of mechanistic interpretability. In the human brain, there isn’t a specific neuron dedicated to a particular concept. Rather than attempting to extract concepts post hoc from a trained system, Monty learns structured representations through sensorimotor interaction with its environment. The concepts and objects that Monty learns are acquired directly through experience rather than emerging from subsequent analysis. Our approach is based on reference frames and embodied learning, which we believe more closely mirrors how biological intelligence develops.
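As a toy illustration of why polysemanticity shows up in these networks (an assumption-laden sketch of the superposition idea, not a model of Monty or of any real LLM): when there are more concepts than neurons, each neuron necessarily participates in representing many of them.

```python
# Superposition toy example: 32 "concept" directions squeezed into 8 neurons.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_concepts = 8, 32                        # more concepts than neurons
concept_dirs = rng.normal(size=(n_concepts, n_neurons))
concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)

# Inspect a single neuron: how strongly it participates in each concept direction.
neuron_0_loadings = concept_dirs[:, 0]
print((np.abs(neuron_0_loadings) > 0.2).sum())       # many concepts load on one neuron -> polysemantic
```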
That said, it’s great to see efforts to better understand the inner workings of neural networks.