Research Highlights

Transformers Represent Belief State Geometry in Their Residual Stream

This paper presents compelling empirical evidence that transformer language models encode belief states—probability distributions over world states—in interpretable geometric structures within their residual streams. Drawing on computational mechanics, the authors demonstrate that transformers maintain representations of uncertainty and track competing hypotheses about underlying generative processes. This work suggests that interpretability research could benefit from treating neural activations as embedded probability distributions rather than merely feature vectors. The finding that belief geometry is preserved through layers provides a potential foundation for understanding how models reason under uncertainty.
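
The probing idea behind the paper can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a hypothetical 3-state hidden Markov process, computes an observer's belief states by Bayesian filtering, and uses synthetic activations (an unknown affine embedding of the beliefs plus noise) as a stand-in for a real transformer's residual stream. A linear probe is then fit from activations to belief coordinates, which is the general style of analysis used to recover belief-state geometry; the transition matrix, emission matrix, model width, and noise level are all illustrative choices.

```python
# Sketch of a belief-state probe (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical 3-state hidden Markov process: transition matrix T and
# emission probabilities E (rows = hidden states, columns = tokens).
T = np.array([[0.85, 0.10, 0.05],
              [0.05, 0.85, 0.10],
              [0.10, 0.05, 0.85]])
E = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

def sample_tokens(length):
    """Sample a token sequence from the hidden Markov process."""
    tokens, s = [], rng.integers(3)
    for _ in range(length):
        tokens.append(rng.choice(3, p=E[s]))
        s = rng.choice(3, p=T[s])
    return np.array(tokens)

def belief_states(tokens):
    """Bayesian filtering: posterior over hidden states after each token."""
    beliefs, b = [], np.ones(3) / 3.0          # start from a uniform prior
    for tok in tokens:
        b = b * E[:, tok]                      # condition on the observed token
        b = b / b.sum()
        beliefs.append(b.copy())
        b = T.T @ b                            # propagate through the dynamics
    return np.array(beliefs)

tokens = sample_tokens(5000)
beliefs = belief_states(tokens)                # (seq_len, 3) points on a simplex

# Stand-in for residual-stream activations: an unknown affine embedding of the
# beliefs plus noise. With a real model these would be the residual-stream
# vectors collected at each token position.
d_model = 64
W = rng.normal(size=(3, d_model))
activations = beliefs @ W + 0.05 * rng.normal(size=(len(beliefs), d_model))

# Linear probe from activations to belief coordinates.
probe = LinearRegression().fit(activations, beliefs)
print("probe R^2:", probe.score(activations, beliefs))

# Projecting activations through the probe recovers the belief-state geometry
# up to the affine embedding.
recovered = probe.predict(activations)
```

With a real model, the activations would be collected from the residual stream at each token position, and the recovered points could be plotted to check whether they trace out the belief simplex of the generative process.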