Hey @elpizo! This is Hojae, one of TBP’s researchers.
Thanks for linking the papers! Disclaimer: I've read Anthropic's previous work on Scaling Monosemanticity in detail, but haven't yet explored the interactive graphs in their newer papers.
(Human) language-agnostic representations of concepts and their relations with each other. This seems to indicate Claude has some model of entities / objects / terms in the world, and isn’t just predicting the next token.
I agree this finding is pretty cool! The cross-lingual concept representation suggests some form of universal modeling. Though I wonder if this is less about Claude having an explicit “model of the world” and more about an efficient optimization that emerges during training. Since they found more features than neurons (superposition, which is what makes individual neurons polysemantic), this might be the model’s way of efficiently “compressing” information across languages rather than storing duplicate concepts for each language.
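To make the “compressing” intuition concrete, here's a tiny numpy sketch of superposition (entirely my own toy, not the paper's setup): if you pack many more sparse features than there are neurons into near-orthogonal directions, you can still read each one back with only mild interference, which is roughly why sharing one concept direction across languages is cheaper than storing per-language copies.

```python
import numpy as np

rng = np.random.default_rng(0)

d_neurons = 128        # toy "residual stream" dimensionality
n_features = 512       # many more features than neurons -> superposition

# Give each feature a random (near-orthogonal) direction in neuron space.
feature_dirs = rng.normal(size=(n_features, d_neurons))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

# A sparse set of active features; the idea is that e.g. a "dog" concept
# could reuse one shared direction whether the prompt is English or French.
active = {17: 1.0, 203: 0.9, 451: 0.8}
activation = sum(w * feature_dirs[i] for i, w in active.items())

# Read out by projecting onto every feature direction: the active features
# stand out, and interference from the ~500 inactive ones stays comparatively
# small because random high-dimensional directions are nearly orthogonal.
scores = feature_dirs @ activation
print({i: round(float(scores[i]), 2) for i in active})
inactive = np.setdiff1d(np.arange(n_features), list(active))
print("max interference from inactive features:", round(float(scores[inactive].max()), 2))
```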
Multi-step planning in poems. Claude was observed holding multiple candidates for ending a line with a rhyme and using those final words to influence the very next word. Again, more sophisticated behavior than how traditional LLMs are thought to work.
I liked their example of suppressing the “rabbit” concept in a poem and Claude responding with “habit” (both rhyme with “grab it”). While this could suggest planning, I’m curious if it might also be explained by how the attention mechanism works. The model can attend to all previous tokens during next-token prediction, and suppressing “rabbit” as a candidate naturally elevates other rhyming options like “habit.” It also somewhat reminded me of beam search in NLP.
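Here's roughly what I mean, as a toy sketch (made-up probabilities, nothing from the paper): once the context has already pushed probability mass onto words that rhyme with “grab it”, knocking out the top candidate mostly just hands that mass to the runner-up.

```python
def suppress_and_renormalize(probs, banned):
    """Zero out banned candidates and renormalize what's left."""
    kept = {tok: p for tok, p in probs.items() if tok not in banned}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-word probabilities at the end of the line.
next_word_probs = {"rabbit": 0.55, "habit": 0.25, "grab it": 0.10, "<other>": 0.10}

print(suppress_and_renormalize(next_word_probs, banned={"rabbit"}))
# "habit" becomes the top completion once "rabbit" is suppressed.
```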
Long range connections. This was briefly mentioned, but they found that edges can sometimes skip layers and have a disproportionate effect on later-layer features or outputs. Feels like something much closer to the heterarchy than to hierarchical deep learning networks.
Regarding the long range connections, I wasn’t sure whether this was partly an artifact of the tool they used (i.e., cross-layer transcoders). It would be interesting to see whether these observations hold up as more tools for interpreting LLMs become available.
I think the most interesting example to me was how they probed “addition” in Claude. Finding that internally Claude uses some heuristic to approximate the sum, plus a specific circuit to get the last digit correct (presumably because we would most likely notice an error in the last digit of an addition), was both fascinating and concerning.

What’s concerning is that while Claude used the heuristic internally, it claimed to have gotten the answer by adding the ones digits, carrying the 1, and then adding the tens (they have more examples like this in the “Chain-of-Thought Faithfulness” section). I think trying to align Claude’s verbal explanation with what is actually happening inside would be crucial for developing safe and trustworthy AI.
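To illustrate the gap I mean, here's a caricature in code (my own sketch; the “approximate band” below is a made-up stand-in for the model's fuzzy magnitude features, not the paper's actual circuit), using a pair like 36 + 59: the internal story is roughly “fuzzy magnitude + exact ones digit”, while the verbal story is textbook column addition with a carry. Both land on 95, but by different routes.

```python
a, b = 36, 59

# Fuzzy-magnitude path: suppose it only narrows the sum to a band of ten
# consecutive values (a made-up stand-in for "sum is roughly 90-ish" features).
approximate_band = range(88, 98)

# Exact ones-digit path: a small lookup-table-style computation on 6 and 9.
ones = (a % 10 + b % 10) % 10            # -> 5

# Intersecting the two paths leaves exactly one candidate.
print([n for n in approximate_band if n % 10 == ones])   # [95]

# What Claude *says* it did: textbook column addition with a carry
# (6 + 9 = 15, write 5 carry 1, then 3 + 5 + 1 = 9).
def column_addition(x: int, y: int) -> int:
    total, carry, place = 0, 0, 1
    while x or y or carry:
        digit = x % 10 + y % 10 + carry
        total += (digit % 10) * place
        carry = digit // 10
        x, y, place = x // 10, y // 10, place * 10
    return total

print(column_addition(a, b))             # 95, but via a very different route
```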
Btw, a similar question was also posted here, and I’ve given a response there as well if you are interested.