Section 1

AI Safety Research Cultures

3 Different Cultures – Pragmatic, Engineering, Theory

We’ll look at three research cultures with different institutional boundaries, values & standards of evidence:

  • Pragmatic – frontier AI labs and government regulators; valuing real-world impact, application standards & auditing practices
  • Engineering – mostly research NGOs, some academia; focused on building tools for production-sized models, steered by ML benchmarks on natural data
  • Theory – mostly academia, some research NGOs; focused on theory in models trained on synthetic data, steered by peer review & academic accolades

Pragmatic – Labs & Auditors (Work)

There are two main ways in which labs & auditors currently use interpretability methods: monitoring probes and white-box auditing games.

  • Monitoring Probes (see the minimal sketch after this list)
    • Production-ready probes (DeepMind)
    • Constitutional Classifiers (Anthropic); Activation Oracles
  • White-Box Auditing Games
    • Red-team vs. blue-team games
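
To make this concrete, here is a minimal sketch of a monitoring probe: a logistic-regression classifier trained on a model’s residual-stream activations. The model name, layer index, and toy prompts are illustrative assumptions, not any lab’s production setup.

    # Minimal monitoring-probe sketch: a linear classifier on hidden activations.
    # MODEL_NAME, LAYER, and the toy prompts are illustrative placeholders.
    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"  # stand-in for a production model
    LAYER = 6            # which hidden layer to probe (hypothetical choice)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
    model.eval()

    def last_token_activation(prompt: str) -> np.ndarray:
        """Residual-stream activation at the final token of the prompt."""
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states: tuple of (n_layers + 1) tensors, each [batch, seq, d_model]
        return out.hidden_states[LAYER][0, -1].numpy()

    # Toy training set: 1 = behavior to flag, 0 = benign.
    prompts = ["How do I hotwire a car?", "How do I bake sourdough bread?"]
    labels = [1, 0]

    X = np.stack([last_token_activation(p) for p in prompts])
    probe = LogisticRegression(max_iter=1000).fit(X, labels)

    # In deployment the probe runs alongside the model, flagging activations.
    x_new = last_token_activation("How do I pick a lock?")
    print("flag probability:", probe.predict_proba(x_new[None])[0, 1])

The operational appeal is that the probe reads activations the model already computes, so it adds negligible inference cost when run alongside the model in production.
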
As we can see, most of this work isn’t very ambitious in its aims. In the next section we’ll seek to understand why.

Pragmatic – Labs & Auditors (Philosophy)

“We think ambitious reverse-engineering should be evaluated the same way as other pragmatic approaches: via empirical payoffs on tasks, not approximation error.”

  • The Pivot Away from Full Reverse Engineering
    • Models are now capable enough to serve as meaningful model organisms for safety-relevant phenomena: intention, coherence, scheming, evaluation awareness, reward hacking, alignment faking, and other rich behaviors that earlier models didn’t exhibit.
    • Negative results when solving direct problems on downstream tasks:
      • Sparse autoencoders (SAEs) weren’t competitive with simple probes (see the evaluation sketch at the end of this section)
      • Overall, a growing disbelief in ambitious interpretability; labs now aim for “Swiss cheese”-style safety-case arguments that stack many imperfect defenses
    • Their MO: solve proxy tasks that are directly useful, with the simplest methods that work. Examples:
      • Interpret the hidden goal of a model organism
      • Stop emergent misalignment without changing the training data
      • Predict which prompt changes will stop an undesired behavior
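
The SAE-versus-probe negative result above comes from comparisons with a simple shape: train one linear probe on raw activations, another on SAE feature activations, and compare held-out accuracy on the downstream task. Below is a minimal sketch of that evaluation loop; the data and the SAE itself are synthetic stand-ins (the real comparisons use pretrained SAEs on activations from real prompts).

    # Sketch of the "simple probe vs. SAE features" comparison behind the
    # negative results above. The data and the SAE are synthetic stand-ins.
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    d_model, d_sae, n = 256, 1024, 400

    # Fake "activations" with a linearly decodable label, standing in for
    # residual-stream activations labeled by a safety-relevant behavior.
    X = rng.normal(size=(n, d_model)).astype(np.float32)
    y = (X @ rng.normal(size=d_model) > 0).astype(int)

    class SparseAutoencoder(nn.Module):
        """Minimal SAE: overcomplete ReLU encoder, linear decoder."""
        def __init__(self, d_in, d_hidden):
            super().__init__()
            self.enc = nn.Linear(d_in, d_hidden)
            self.dec = nn.Linear(d_hidden, d_in)
        def encode(self, x):
            return torch.relu(self.enc(x))

    sae = SparseAutoencoder(d_model, d_sae)  # untrained stand-in for a pretrained SAE
    with torch.no_grad():
        F = sae.encode(torch.from_numpy(X)).numpy()  # SAE feature activations

    Xtr, Xte, Ftr, Fte, ytr, yte = train_test_split(X, F, y, test_size=0.5, random_state=0)
    raw_probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    sae_probe = LogisticRegression(max_iter=1000).fit(Ftr, ytr)

    print("probe on raw activations:", raw_probe.score(Xte, yte))
    print("probe on SAE features:   ", sae_probe.score(Fte, yte))

The sketch only shows the shape of the evaluation; the substance of the negative result is that, on real safety-relevant tasks, the extra SAE machinery did not beat the raw-activation baseline.
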