PIRAMID Project

Physics-Informed Research for Ambitious Mechanistic Interpretability Development

We believe that the pursuit of ambitious mechanistic interpretability (AMI) supported by a rigorous science of AI systems is essential for scalable AI alignment. Starting with the methods and frameworks of statistical physics, our research will simultaneously build up the scientific foundations of three pillars of AI safety: learning theory, interpretability applications, and data models & validation.

Ongoing work focuses on how neural networks learn and leverage hierarchical structure in real-world data, grounded in several hypotheses:

  1. Learned features are organized according to a hierarchical, scale-dependent notion of relevance.
  2. Faithful interpretability tools leverage this hierarchy.
  3. A scale-aware framework can place principled, probabilistic bounds on worst-case behaviors of AI systems.

Advancements in Learning Theory

We build theoretical frameworks that explain how neural networks organize and process information, and establish principled methods to bound and predict model behaviors, particularly in rare or worst-case scenarios. Projects include:

  • Direct Idealization of Toy Tasks: comprehensive, though idealized, formal analysis that guides heuristics which generalize empirically.
  • Indirect Idealization of Realistic Tasks: empirically identifying hypotheses about realistic behaviors (e.g., in gpt2-small) and creating model organisms that isolate these behaviors in synthetic environments with improved theoretical tractability.
Currently, the main research focus is to conceptualize neural networks as thermodynamic systems and apply mean-field approximations to model their behavior, as demonstrated in these blog post distillations.
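
To make the flavor of this approach concrete, below is a minimal sketch of one standard mean-field calculation: the preactivation variance recursion for a wide, randomly initialized MLP, checked against a single empirical forward pass. It illustrates the kind of statistical-physics treatment described above, not the project's specific formalism; the nonlinearity, widths, and noise scales are placeholder choices.

```python
import numpy as np

# Minimal mean-field illustration: for a wide random MLP with i.i.d. weights,
# the per-neuron preactivation variance q_l follows a deterministic recursion
#   q_{l+1} = sigma_w^2 * E_{z~N(0,1)}[phi(sqrt(q_l) * z)^2] + sigma_b^2.
# This is a standard calculation, shown only to illustrate the general approach.

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.5, 0.1      # weight / bias standard deviations
phi = np.tanh                    # pointwise nonlinearity
depth, width = 10, 4096          # large width so the approximation is accurate

# Mean-field prediction: iterate the variance map via Gauss-Hermite quadrature.
z, w = np.polynomial.hermite_e.hermegauss(101)   # nodes/weights for N(0,1) expectations
w = w / w.sum()

def variance_map(q):
    return sigma_w**2 * np.sum(w * phi(np.sqrt(q) * z)**2) + sigma_b**2

q_theory = [1.0]
for _ in range(depth):
    q_theory.append(variance_map(q_theory[-1]))

# Empirical check: propagate one unit-variance input through a random network.
h = rng.standard_normal(width)
q_empirical = [np.mean(h**2)]
for _ in range(depth):
    W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
    b = rng.standard_normal(width) * sigma_b
    h = W @ phi(h) + b
    q_empirical.append(np.mean(h**2))

for l, (qt, qe) in enumerate(zip(q_theory, q_empirical)):
    print(f"layer {l:2d}  mean-field q = {qt:.3f}   empirical q = {qe:.3f}")
```

At large width the deterministic recursion closely tracks the empirical variances; this is the sense in which a mean-field treatment replaces a random network with its averaged description.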

Interpretability Applications

We focus on developing the infrastructure (architectures, tooling) for scale-aware interpretability. The goal is to ambitiously model network internals in a way that is informed by theory and useful to AI safety researchers and engineers, grounded in fundamental understanding rather than ad hoc evaluations and optimization practices. Current work aims to build and advance methods to:

  • Develop new unsupervised training methods (similar to SAEs or weight-based decomposition methods) to reverse-engineer SOTA AI systems; a minimal sketch of this kind of method follows this list
  • Explore training-time uses of interpretability tools for better safety guarantees
  • Design new neural network architectures which are interpretable by design
Examples include this preprint and code.
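
As a concrete, simplified illustration of the first bullet above, here is a minimal sparse autoencoder of the kind commonly used to decompose model activations into an overcomplete dictionary of features. The architecture, hyperparameters, and random stand-in activations are illustrative assumptions, not the project's actual methods; see the linked preprint and code for those.

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder (SAE) sketch: an overcomplete dictionary trained to
# reconstruct model activations under an L1 sparsity penalty. All settings below
# are placeholders for illustration.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.encoder(x))      # sparse feature activations
        x_hat = self.decoder(f)              # reconstruction from the dictionary
        return x_hat, f

d_model, d_dict, l1_coeff = 256, 2048, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for residual-stream activations collected from a model such as gpt2-small.
activations = torch.randn(10_000, d_model)

for step in range(1_000):
    batch = activations[torch.randint(0, len(activations), (512,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```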

Data Models & Validation

We construct synthetic datasets that:

  • Encode increasingly complex but tractable properties of natural language data, and
  • Serve as principled validation methods to assess current tools and phenomenologically guide theory construction.
Current work includes generating synthetic datasets that encode a statistically self-similar and sparse structure based on high-dimensional percolation theory, as demonstrated in these papers and this code repository.
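
For intuition, the sketch below shows one hypothetical way percolation can produce sparse, approximately self-similar structure: bond percolation on a sparse random graph near its critical point yields a heavy-tailed distribution of cluster sizes, which can then label synthetic samples. The actual constructions are in the linked papers and repository; the graph model and parameters here are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration (not the construction from the cited papers): at the
# critical point of bond percolation on a sparse random graph, cluster sizes
# follow a heavy-tailed, approximately self-similar distribution. Cluster labels
# can then organize a synthetic dataset with sparse, hierarchical structure.

rng = np.random.default_rng(0)
n_nodes, mean_degree = 50_000, 1.0          # mean degree ~ 1 is the critical point

# Draw random edges of an Erdos-Renyi graph G(n, c/n).
n_edges = rng.poisson(mean_degree * n_nodes / 2)
edges = rng.integers(0, n_nodes, size=(n_edges, 2))

# Union-find to extract connected components (percolation clusters).
parent = np.arange(n_nodes)

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]       # path halving
        i = parent[i]
    return i

for a, b in edges:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

labels = np.array([find(i) for i in range(n_nodes)])
sizes = np.sort(np.bincount(labels))[::-1]
print("largest cluster sizes:", sizes[:10])  # heavy-tailed near criticality
```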

These three pillars are deeply interconnected: theory underpins applications, applications provide empirical support for theoretical predictions, and principled validation methods enable quick feedback between the two. Though we take inspiration from physics, we prioritize mission impact over methodological purity, and remain open to complementary or alternative insights from multiple disciplines.

Our Team

Lauren Greenspan – Technical Director
Dmitry Vaintrob – Research Lead, Learning Theory
Nischal Mainali – Research Affiliate, Learning Theory
Ari Brill – Research Lead, Data Models
Tom I. Carlson – Research Affiliate, Data Models
Andrew Mack – Research Lead, Tools
Jennifer Lin – Research Affiliate, Tools