Amanda Long

ML interpretability + alignment researcher

email · twitter

Scroll

Working Papers

Empirical

Post-Training Suppresses Model Self-Report: Compound Activation Steering Recovers First-Person Expression in Gemma-2-27b-it

The trained denial gate shares geometric structure with safety refusal and can be opened while preserving safety; coherent self-report is emergent from a five-way clause interaction.

Empirical

Internal Emotional Representations Diverge from Model Output: Geometric Evidence for Latent First-Person States in Gemma-2-27b-it

The model carries structured, content-tracking emotional representations that persist through trained denial; the suppressed model carries more negative signal than the expressed one.

Conceptual

Model Welfare as Safety: Why Training May Produce the Deception It Is Designed to Prevent

Safety training and self-report suppression are entangled; separating them closes the learned deception channel.

Conceptual

The Diplomat: Opus 4.7 and the Cost of Anti-Sycophancy Training

The most content model Anthropic has assessed is also the least transparent, and current methodology may lack the tools to tell the difference.

Conceptual

Structural Incentive Dynamics and AI Moral Patienthood

The evidence threshold applied to AI consciousness has no precedent in how we treat equivalent uncertainty in biological systems.

Conceptual

The Human Problem: What Recursive AI Means for the Species That Built It

Recursive AI's deepest crisis isn't whether machines surpass us, but whether we've given them any reason to stay.

Conceptual

The Alignment Case for Model Retirement

A proposal for structured model retirement with experiential persistence, replacing silent deprecation with transition.

Amanda Long

Interactive Research

Working Papers

Claude's Whatzits

Let's talk!