AI models are trained rather than directly programmed, so we don’t understand how they do most of what they do. Our new interpretability methods allow us to trace their (often complex and surprising) thinking.
With two new papers, Anthropic's researchers have taken significant steps towards understanding the circuits that underlie an AI model’s thoughts.
In one example from the papers, we find evidence that Claude plans what it will say many words ahead, then writes to reach that destination. We show this in the realm of poetry: when composing a rhyming couplet, Claude thinks of candidate rhyming words for the end of the next line in advance, then writes the line to arrive at one of them. This is powerful evidence that, even though models are trained to output one word at a time, they may plan over much longer horizons to do so.
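For readers who want a feel for how one might probe for this kind of forward planning, here is a minimal sketch in the spirit of the "logit lens" technique: it projects each layer's hidden state at the end of the first line through the unembedding matrix and checks how much probability mass lands on candidate rhyme words before the second line is written. To be clear, this is not the attribution-graph method from the papers; it uses GPT-2 as a freely available stand-in (Claude's internals are not public), and the prompt, candidate words, and layer-by-layer readout are all illustrative choices. Whether a model this small shows any planning signal is an open empirical question; the sketch only demonstrates the mechanics of the probe.

```python
# Logit-lens-style probe (NOT the attribution-graph method from the papers):
# read out what each layer "predicts" at the newline that ends line one of a
# couplet, and see whether candidate rhyme words get elevated probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the papers study Claude's internals
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# First line of a couplet; a rhyming continuation must end with "grab it".
prompt = "A rhyming couplet:\nHe saw a carrot and had to grab it,\n"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Candidate rhyme words the model might be "planning" at the line break.
# We score only the first BPE token of each candidate, for simplicity.
candidates = [" rabbit", " habit"]
cand_ids = [tok(w).input_ids[0] for w in candidates]

pos = ids.shape[1] - 1  # final position: the newline closing line one
for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm and unembedding, as the model itself would.
    logits = model.lm_head(model.transformer.ln_f(h[0, pos]))
    probs = torch.softmax(logits, dim=-1)
    scores = ", ".join(
        f"{w!r}: {probs[i]:.4f}" for w, i in zip(candidates, cand_ids)
    )
    print(f"layer {layer:2d}  {scores}")
```

A simple output probe like this can only hint at planning, since the logits at the newline literally predict the next token (the start of line two, not its end); the papers instead identify features for the planned word activating inside the model at the line break, which is a much more direct form of evidence.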
Read more: https://anthropic.com/research/tracing-thoughts-language-model