Physics Inspired Ambitious Mech Interp (PI-AMI) Roadmap
We outline a roadmap for how physics inspired ambitious mech interp (PI-AMI) could plausibly save the world.
We begin by outlining what we see as three classes of hurdles that PI-AMI should aim to overcome: technical, methodological, and sociological.
Section One
Section 1 begins with an introduction and sociological overview of the different research cultures, which we label pragmatic, engineering, phenomenological, and theoretical. We claim that, to deliver on its promise, PI-AMI will have to meaningfully engage with each of these research cultures and their differing standards of evidence (pragmatism & reconstructive benchmarks). We present these sociological hurdles.
Section Two
Section 2 continues by reconstructing what we label the bearish position on ambitious mechanistic interpretability, which has recently been adopted by many working in the pragmatist research culture and is most aptly summarized in [Nanda 25]. The main theses of this position are Methodological Difficulties (progress is hard to measure; scalability) and two technical issues, Auditability of the Long Tail of Learned Heuristics and Difficulty of Establishing Completeness Arguments. We also give an overview of the body of evidence that currently supports these points. In this section, we further argue that a physics-inspired research culture and its methods are well positioned to address the hurdles presented in the bearish position, and we offer a positive vision of what successful industry and government adoption looks like.
Section Three
Section 3 outlines a roadmap of open technical problems, along with the partial solutions currently available. We break it down into the following sections: Theories of Data, of Learning, and of Learned Representations.
Section Four
Section 4 outlines potential applications of the foundational science laid out above. In particular, we focus on Interpretable by Design, Reverse Engineering Models, and Utilizing Interpretability Tools within the Training Process.