AI Safety Research Cultures
3 Different Cultures – Pragmatic, Engineering, Theory
We’ll look at three different research cultures, with different institutional boundaries, values & standards of evidence:
- Pragmatic – Frontier AI labs and government regulators; values impact, application standards & auditing practices
- Engineering – Mostly research NGOs, some academia; focused on building tools for production-sized models, steered by ML benchmarks on natural data
- Theory – Mostly academia, some research NGOs; focused on theory for models trained on synthetic data, with peer review & academic accolades as the standard of evidence
Pragmatic – Labs & Auditors (Work)
There are two ways in which labs & auditors are currently using interpretability methods: Monitoring Probes & White Box Auditing Games.
- Monitoring Probes (a minimal probe sketch follows this list)
- Production Ready Probes (DeepMind)
- Constitutional Classifiers (Anthropic) (Activation Oracles)
- White Box Auditing Games
- Red and Blue Team games
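To make the probe-based monitoring idea concrete, here is a minimal sketch of how such a probe might work: a linear classifier trained on stored model activations and applied at inference time to flag generations for review. Everything below is a placeholder (random arrays standing in for real activations and labels, an assumed hidden size and threshold), not any lab's actual setup.

```python
# Minimal sketch of a monitoring probe (assumed setup, not any lab's actual one):
# a linear classifier trained on hidden activations to flag a safety-relevant
# property at inference time. Random arrays stand in for real activations/labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 1024        # hidden size of the monitored model (placeholder)
n_train = 2000        # number of labelled prompts (placeholder)

# Placeholder "activations": one vector per prompt (e.g. the residual stream at
# the final token of some layer), with binary labels for the target property.
acts = rng.normal(size=(n_train, d_model))
labels = rng.integers(0, 2, size=n_train)

probe = LogisticRegression(max_iter=1000)
probe.fit(acts, labels)

def monitor(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Flag an activation vector for human review if the probe fires."""
    p = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return p >= threshold

print(monitor(rng.normal(size=d_model)))
```

In a real deployment the probe would be trained on activations from labelled transcripts and run alongside the model in production, which is what makes this the least ambitious but most immediately usable form of interpretability.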
As we can see, most of this work isn’t very ambitious in its aim. In the next section we’ll seek to understand why.
Pragmatic – Labs & Auditors (Philosophy)
“We think ambitious reverse-engineering should be evaluated the same way as other pragmatic approaches: via empirical payoffs on tasks, not approximation error.”
- Models are more capable nowadays and can be used as meaningful model organisms for safety-relevant phenomena:
- intention, coherence, scheming, evaluation awareness, reward hacking, alignment faking, and other rich, safety-relevant behaviors that earlier models didn’t exhibit.
- Negative results on direct downstream tasks:
- SAEs weren’t competitive with simple probes (see the comparison sketch at the end of this section)
- Overall, there is disbelief in ambitious interp; labs are now aiming for “Swiss cheese”-style safety-case arguments
- Examples
- Interpret the hidden goal of a model organism
- Stop emergent misalignment without changing training data
- Predict which prompt changes will stop an undesired behaviour
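To make the “SAEs weren’t competitive with simple probes” point above concrete, here is a sketch of the kind of comparison that claim refers to: train the same linear probe on raw activations and on SAE latents, then compare held-out AUROC. Everything below is a placeholder (random data, a random encoder standing in for a trained SAE); it shows only the shape of the evaluation, not any real result.

```python
# Sketch of a probe-vs-SAE comparison (placeholder data and encoder throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 512, 2048, 2000

acts = rng.normal(size=(n, d_model))         # stand-in for model activations
labels = rng.integers(0, 2, size=n)          # stand-in for behaviour labels
W_enc = rng.normal(size=(d_model, d_sae))    # stand-in for a trained SAE encoder
sae_latents = np.maximum(acts @ W_enc, 0.0)  # ReLU encoding (biases omitted)

def probe_auroc(features: np.ndarray) -> float:
    """Train a logistic-regression probe and return held-out AUROC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("raw-activation probe AUROC:", probe_auroc(acts))
print("SAE-latent probe AUROC:    ", probe_auroc(sae_latents))
```

The pragmatic reading of such comparisons is the one in the quote above: if the interpretability-heavy pipeline doesn’t beat the simple baseline on the downstream task, the extra machinery isn’t earning its keep.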