Anthropic Microscope

Curious if anyone’s read through Anthropic’s latest papers analyzing the internals of their Claude model and how it works.

Observations I found interesting:

  • (Human) language-agnostic representations of concepts and their relations with each other. This seems to indicate Claude has some model of entities / objects / terms in the world, and isn’t just predicting the next token.

  • Multi-step planning in poems. Claude was observed holding multiple candidate words to end a line with a rhyme, and using those final words to influence the immediately following words. Again, more sophisticated behavior than how traditional LLMs are thought to work.

  • Long-range connections. This was only briefly mentioned, but they found that edges can sometimes skip layers and have a disproportionate effect on later-layer features or outputs. Feels like something much closer to a heterarchy than to a hierarchical deep learning network.
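On the last point, a toy sketch of why layer-skipping is natural in a transformer: every layer writes additively into a shared residual stream, so a feature written early stays in the stream and can be read directly much later. (Illustrative numpy, my own framing, not anything from the paper.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # residual stream width (toy size)
embedding = rng.normal(size=d)            # stream state after the embedding

early_feature = rng.normal(size=d)        # a feature an early layer writes
stream = embedding + early_feature

updates = []
for _ in range(10):                       # later layers add their own writes
    W = 0.1 * rng.normal(size=(d, d))
    u = np.tanh(W @ stream)
    updates.append(u)
    stream = stream + u

# The stream decomposes exactly into embedding + early write + later writes,
# so a late layer reading the stream still sees the early feature directly:
print(np.allclose(stream, embedding + early_feature + sum(updates)))  # True
```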

Resources

Would love to get all y’all’s takes!

1 Like

Hey @elpizo! This is Hojae, one of TBP’s researchers.

Thanks for linking the papers! Disclaimer: I have read Anthropic’s previous work on Scaling Monosemanticity in detail, but haven’t yet explored the detailed interactive graphs in their newer papers.

(Human) language-agnostic representations of concepts and their relations with each other. This seems to indicate Claude has some model of entities / objects / terms in the world, and isn’t just predicting the next token.

I agree this finding is pretty cool! The cross-lingual concept representation suggests some form of universal modeling. Though I wonder if this is less about Claude having an explicit “model of the world” and more about an efficient optimization that emerges during training. Since they found more features than neurons (superposition), this might be the model’s way of efficiently “compressing” information across languages rather than storing duplicate concepts for each language.
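To make the “more features than neurons” point concrete, here is a tiny numpy illustration of superposition (my own toy setup, not the paper’s): three feature directions packed into a two-dimensional space, each still approximately readable by projection.

```python
import numpy as np

# Three feature directions packed into 2 dimensions, 120 degrees apart.
angles = np.deg2rad([0, 120, 240])
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

# Activate only feature 0.
activation = dirs[0]

# Reading features back by projection: feature 0 reads ~1.0, the others
# read -0.5 -- interference, the price of packing 3 features into 2 dims.
print(np.round(dirs @ activation, 2))    # [ 1.  -0.5 -0.5]
```

With mostly-sparse features this trick scales up, which fits the reading of one shared, compressed representation across languages rather than per-language duplicates.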

Multi-step planning in poems. Claude was observed holding multiple candidate words to end a line with a rhyme, and using those final words to influence the immediately following words. Again, more sophisticated behavior than how traditional LLMs are thought to work.

I liked their example of suppressing the “rabbit” concept in a poem and Claude responding with “habit” (both rhyme with “grab it”). While this could suggest planning, I’m curious if it might also be explained by how the attention mechanism works. The model can attend to all previous tokens during next-token prediction, and suppressing “rabbit” naturally elevates other rhyming options like “habit.” It also somewhat reminded me of beam search in NLP.
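For reference, here is roughly what beam search does: keep the k highest-scoring partial sequences at each step. This is a generic sketch; `next_token_logprobs` is a hypothetical stand-in for a model’s next-token distribution, not anything from the paper.

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    next_token_logprobs: Callable[[List[str]], List[Tuple[str, float]]],
    start: List[str],
    beam_width: int = 3,
    steps: int = 5,
) -> List[Tuple[float, List[str]]]:
    # Each beam entry is (cumulative log-prob, token sequence).
    beams = [(0.0, start)]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_token_logprobs(seq):
                candidates.append((score + lp, seq + [tok]))
        # Keep only the top-k partial sequences.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams
```

One difference worth noting: beam search holds alternatives at decoding time, outside the model, whereas the paper’s claim is about candidates represented inside a single forward pass.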

Long-range connections. This was only briefly mentioned, but they found that edges can sometimes skip layers and have a disproportionate effect on later-layer features or outputs. Feels like something much closer to a heterarchy than to a hierarchical deep learning network.

Regarding the long-range connections, I wasn’t sure whether this was partly an artifact of the tool they used (i.e., cross-layer transcoders). It would be interesting to see whether these observations hold up as more tools for interpreting LLMs become available.

The most interesting example to me was how they probed “addition” in Claude. Finding that Claude internally uses a heuristic to approximate the sum, plus a specific circuit to get the last digit correct (presumably because we would most likely notice an error in the last digit of a sum), was both fascinating and concerning. What’s concerning is that while Claude used the heuristic, it claimed to have gotten the answer by adding the ones and carrying the 1, then adding the tens (they have more examples like this in the “Chain-of-Thought Faithfulness” section). I think aligning Claude’s verbal explanation with what is actually happening inside will be crucial for developing safe and trustworthy AI.
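As a toy paraphrase of that two-path story (my framing and numbers, not the paper’s actual circuits): imagine a blurry magnitude estimate combined with an exact ones-digit computation, with the final answer snapped to the nearest value that has the correct last digit.

```python
def fuzzy_magnitude(a: int, b: int) -> int:
    # Stand-in for the heuristic path: right to within a few units,
    # but not digit-exact (here: rounded to a multiple of 8).
    return round((a + b) / 8) * 8

def exact_ones_digit(a: int, b: int) -> int:
    # Stand-in for the precise modular path for the last digit.
    return (a + b) % 10

def add(a: int, b: int) -> int:
    approx = fuzzy_magnitude(a, b)
    base = approx - approx % 10 + exact_ones_digit(a, b)
    # Snap to the value with the correct ones digit nearest the estimate.
    return min((base - 10, base, base + 10), key=lambda c: abs(c - approx))

print(add(36, 59))  # 95 -- correct, despite the fuzzy first path
```

The point of the toy is only that the path producing the confident last digit is not the path the verbal “carry the 1” explanation describes.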

Btw, a similar question was also posted here, to which I’ve also given a response, if you’re interested.

4 Likes

It’s a great paper - and fascinating. Some great comments too.
On this one, I thought the key point was that it picked a set of rhyming words - which AFAIK output sampling methods (such as beam search) won’t do. This suggests that it was already using an activation for rhyming and had a candidate stored in the residual stream (at an EOL position!) before it needed to generate any output tokens. For a ‘next token only’ prediction method this suggests it is looking way ahead (remember, the rest of the second line is still hidden by the causal attention mask at this point in the computation).
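To illustrate the masking point (generic numpy, not any particular model’s code): at the EOL position the model can only attend to earlier positions, so any influence on the coming line has to already be sitting in the stream there.

```python
import numpy as np

T = 6
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))                  # raw attention scores
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # hide the future

# Softmax over each row; masked positions get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # row t puts zero weight on positions > t
```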

It’s hard to say if this is or isn’t planning - but it does kinda feel like it. I’d probably be more comfortable calling it parallel paths (is this different? :thinking: )

I do like your suggestion about language-agnosticism being for compression.

The ‘English default’ comments in that section also make me wonder, since grammar will invoke the concepts at different sentence positions (and hence different computation-graph orders for next-token prediction). I probably haven’t thought it through well enough :wink:

All fascinating, but very different from what you are trying here.

3 Likes

@DanML oh, good point! Wish we could somehow know more about how “planning” is happening. And yep, agree that these are all fascinating but pretty different from what we are implementing in Monty. :slight_smile:

2 Likes