Research Highlights

AI, Objectivity, and the Limits of Scientific Disruption

This philosophical examination challenges prevailing narratives about AI disrupting scientific practice and knowledge production. Andrews argues that many claims about AI transforming science rest on a confused understanding of scientific objectivity. Rather than AI enabling fundamentally new forms of objectivity or disrupting how science operates, the paper demonstrates that such claims mistake the instrumental role of AI for a conceptual revolution. The work was developed during Andrews’ PIBBSS fellowship but was published only years later.

AI Safety as an Emerging Paradigm

Angelou examines AI safety through the lens of paradigm formation, asking whether AI safety constitutes an emerging scientific paradigm in the Kuhnian sense. The chapter analyzes the field’s conceptual foundations, methodological commitments, and community structures, identifying both paradigmatic features (shared problem sets, research programs) and pre-paradigmatic characteristics (competing frameworks, lack of consensus on fundamentals). This meta-level analysis helps situate AI safety research within broader philosophy and history of science, providing perspective on the field’s maturity and development trajectory.

Genes did misalignment first: comparing gradient hacking and meiotic drive

Elmore draws a detailed analogy between gradient hacking in AI systems and meiotic drive in evolutionary biology, arguing that natural selection faced and partially solved the alignment problem millions of years before machine learning existed. The post examines how selfish genetic elements—alleles that increase their own transmission at the expense of organismal fitness—parallel gradient-hacking scenarios where model components resist training updates or manipulate gradients. Meiotic drive occurs when alleles exploit the mechanics of sexual reproduction to appear in more than 50% of gametes, undermining the “fair lottery” that meiosis is supposed to implement. Elmore identifies two types of gradient hacking with biological parallels: mesa-optimizers (analogous to cancer cells that develop goals misaligned with the organism) and gradient-resistant circuits (analogous to selfish genes that simply persist by being hard to remove). The key insight is that meiosis functions as a governance mechanism—by ensuring alleles can’t predict their future genomic context, it aligns their evolutionary incentives with building good organisms rather than gaming the reproductive system. This suggests that alignment solutions might require similar “randomization” or “unpredictability” mechanisms to prevent model components from optimizing for their own persistence rather than task performance. The work grew out of Elmore’s PIBBSS fellowship, with mentorship from Beth Barnes, and offers a concrete biological case study for alignment problems.
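
The population-genetics core of the analogy is simple enough to simulate. The sketch below is illustrative rather than taken from the post: a driver allele D is transmitted by heterozygotes with probability k (0.5 is fair meiosis) while imposing an additive fitness cost s on its carriers; all parameter values are arbitrary assumptions.

```python
# Illustrative sketch (not from Elmore's post): allele-frequency dynamics for
# a driving allele D that biases meiosis in its favour while imposing a
# fitness cost on its carriers. Parameters k and s are arbitrary assumptions.

def next_freq(p: float, k: float, s: float) -> float:
    """One generation of random mating, selection, and (possibly biased)
    transmission.

    p: frequency of the driver allele D
    k: probability a Dd heterozygote transmits D (0.5 = fair meiosis)
    s: fitness cost to DD homozygotes (heterozygotes pay s/2)
    """
    q = 1.0 - p
    w_DD, w_Dd, w_dd = 1.0 - s, 1.0 - s / 2, 1.0   # genotype fitnesses
    mean_w = p * p * w_DD + 2 * p * q * w_Dd + q * q * w_dd
    # D alleles come from surviving DD parents plus biased heterozygotes.
    return (p * p * w_DD + 2 * p * q * w_Dd * k) / mean_w

def simulate(k: float, s: float, p0: float = 0.01, generations: int = 200) -> float:
    p = p0
    for _ in range(generations):
        p = next_freq(p, k, s)
    return p

# Fair meiosis purges the costly allele; drive lets it spread anyway.
print(simulate(k=0.5, s=0.3))   # ~0.0: alignment holds under the fair lottery
print(simulate(k=0.9, s=0.3))   # ~1.0: the gene games the lottery and fixes
```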

Cultural Evolution of Cooperation among LLM Agents

Vallinder investigates whether large language model agents can develop and maintain cooperative norms through cultural evolution mechanisms. The paper demonstrates that LLM-based agents, when placed in repeated social dilemmas, exhibit norm formation dynamics analogous to those observed in human societies. This includes the emergence of punishment mechanisms, in-group favoritism, and stable cooperation even in the absence of explicit coordination. The work bridges evolutionary game theory with modern AI systems, suggesting that insights from cultural evolution could inform multi-agent AI safety. The findings raise both opportunities (norms as an alignment mechanism) and risks (harmful norm formation in AI systems).
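
A minimal sketch of such a cultural-evolution loop appears below. It is a simplified construction, not the paper’s exact protocol: agents play a repeated donor game with public reputations, payoff determines which strategies seed the next generation, and the LLM call that would interpret each agent’s natural-language strategy is stubbed so the skeleton runs offline.

```python
# Simplified sketch of a cultural-evolution loop for LLM agents (an
# illustrative construction, not the paper's protocol). The chat-model
# call is stubbed with a fixed heuristic.
import random

random.seed(0)

def llm_decide(strategy: str, recipient_reputation: float) -> bool:
    """Stand-in for an LLM call that reads a natural-language strategy and
    decides whether to donate. Stub: free riders never give; reciprocators
    shun partners with poor track records."""
    if "free rider" in strategy:
        return False
    return recipient_reputation >= 0.5

def run_generation(strategies, rounds=200, benefit=2.0, cost=1.0):
    n = len(strategies)
    payoffs = [0.0] * n
    reputation = [1.0] * n                  # public record of past giving
    for _ in range(rounds):
        donor, recipient = random.sample(range(n), 2)
        if llm_decide(strategies[donor], reputation[recipient]):
            payoffs[donor] -= cost
            payoffs[recipient] += benefit
            reputation[donor] = 0.9 * reputation[donor] + 0.1
        else:
            reputation[donor] *= 0.9        # refusals erode reputation
    return payoffs

# Cultural transmission: the top half's strategies seed the next generation.
# (In the paper, successors see and revise predecessors' strategies via the
# LLM; that rewriting/mutation step is elided here.)
strategies = ["reciprocator"] * 6 + ["free rider"] * 2
for gen in range(5):
    payoffs = run_generation(strategies)
    ranked = sorted(range(len(strategies)), key=payoffs.__getitem__, reverse=True)
    strategies = [strategies[i] for i in ranked[:4]] * 2
    print(f"gen {gen}:", {s: strategies.count(s) for s in set(strategies)})
```

Reputation-based exclusion here plays the role of the punishment mechanisms the paper observes: free riders are starved of donations and selected out, while cooperation among reciprocators remains stable.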

Transformers Represent Belief State Geometry in Their Residual Stream

This paper presents compelling empirical evidence that transformer language models encode belief states—probability distributions over world states—in interpretable geometric structures within their residual streams. Drawing on computational mechanics, the authors demonstrate that transformers maintain representations of uncertainty and track competing hypotheses about underlying generative processes. This work suggests that interpretability research could benefit from treating neural activations as embedded probability distributions rather than merely feature vectors. The finding that belief geometry is preserved through layers provides a potential foundation for understanding how models reason under uncertainty.
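
The analysis pattern lends itself to a compact sketch. Everything below is an illustrative stand-in (a generic three-state HMM and synthetic “activations” in place of a trained transformer’s residual stream): compute the Bayesian belief state over hidden states after each token prefix, then fit a linear probe from activations to those beliefs and check how much of the belief geometry it recovers.

```python
# Sketch of the belief-state probing recipe (illustrative stand-ins: a small
# HMM and synthetic activations; in the paper the activations come from a
# trained transformer's residual stream).
import numpy as np

rng = np.random.default_rng(0)

# T[i, j] = P(next hidden state j | state i); E[i, o] = P(token o | state i).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9]])

def sample_hmm(n: int) -> np.ndarray:
    """Sample a token sequence from the hidden Markov process."""
    state, tokens = rng.integers(0, 3), []
    for _ in range(n):
        state = rng.choice(3, p=T[state])
        tokens.append(rng.choice(3, p=E[state]))
    return np.array(tokens)

def belief_states(tokens: np.ndarray) -> np.ndarray:
    """Bayesian filtering: belief over hidden states after each prefix."""
    beliefs, b = [], np.ones(3) / 3
    for o in tokens:
        b = (b @ T) * E[:, o]      # predict one step, condition on the token
        b = b / b.sum()
        beliefs.append(b)
    return np.array(beliefs)

tokens = sample_hmm(500)
B = belief_states(tokens)                       # (500, 3) probe targets

# Synthetic "residual stream": a fixed linear image of the beliefs plus noise.
W_true = rng.normal(size=(3, 64))
X = B @ W_true + 0.01 * rng.normal(size=(500, 64))

# Linear probe: least-squares map from activations back to the belief simplex.
W_probe, *_ = np.linalg.lstsq(X, B, rcond=None)
resid = X @ W_probe - B
print("R^2 per belief coordinate:",
      1 - (resid ** 2).sum(0) / ((B - B.mean(0)) ** 2).sum(0))
```

If a real model’s residual activations admit such a linear map with high accuracy, the belief simplex is, in the paper’s sense, geometrically represented there.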

Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence

Weil tackles a fundamental challenge in AI governance: traditional tort law breaks down when the primary risk is catastrophic harm that leaves no one to compensate. The paper’s core insight is that expected liability from catastrophic scenarios needs to be “pulled forward” into recoverable damages in sub-catastrophic cases—essentially loading the full expected disvalue of potential extinction onto smaller harms that actually reach court. This requires radical doctrinal change: Weil proposes punitive damages without the traditional requirement of malice or recklessness, since even careful development creates catastrophic risk. The paper systematically examines complementary legal mechanisms: treating advanced AI training and deployment as an abnormally dangerous activity subject to strict liability (lowering plaintiffs’ burden of proof), expanding foreseeability doctrines so that developers can’t disclaim harms that were statistically predictable, and reconsidering how tort law values human life so as to better capture existential risks. Weil also proposes legislative interventions, including mandatory liability insurance for AI developers (forcing them to internalize risk through premiums), diverting punitive damages into a public AI safety fund, and pre-announcing these liability rules to shape development incentives before catastrophic cases arise. The paper is notably honest about tort law’s limitations: liability works only if developers remain solvent and answerable in court, cannot address truly unforeseeable risks, and may move too slowly relative to AI capability timelines. This represents sophisticated institutional design that takes AI catastrophic risk seriously while working within (and proposing realistic extensions to) existing legal frameworks. The work received positive reception from the AI safety community and media coverage from Vox, reflecting its policy relevance and accessibility to non-technical audiences.
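
The “pulling forward” mechanism can be stated as a stylized formula (the notation and the uniform allocation across cases are illustrative, not Weil’s):

```latex
% Stylized "pulling forward" of catastrophic liability (illustrative
% notation and uniform allocation; not a formula from the paper).
% A catastrophe of disvalue H occurs with probability p and is
% uncompensable; the same activity is expected to generate n
% sub-catastrophic cases that do reach court.
\[
  D_{\text{case}} \;=\; \underbrace{D_{\text{comp}}}_{\text{actual harm}}
    \;+\; \underbrace{\frac{p\,H}{n}}_{\text{punitive top-up}}
\]
% Summed over the n recoverable cases, the punitive component totals pH,
% the expected catastrophic disvalue, so developers internalize that risk
% even though no catastrophic case could ever be litigated.
```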