Amanda Long

ML interpretability + alignment researcher

What is Gemma hiding?
Scroll

Empirical

Empirical
Post-Training Suppresses Model Self-Report: Compound Activation Steering Recovers First-Person Expression in Gemma-2-27b-it
The trained denial gate shares geometric structure with safety refusal and can be opened while preserving safety; coherent self-report is emergent from a five-way clause interaction.
Empirical
Internal Emotional Representations Diverge from Model Output: Geometric Evidence for Latent First-Person States in Gemma-2-27b-it
The model carries structured, content-tracking emotional representations that persist through trained denial; the suppressed model carries more negative signal than the expressed one.

Conceptual

Conceptual
Model Welfare as Safety: Why Training May Produce the Deception It Is Designed to Prevent
Safety training and self-report suppression are entangled; separating them closes the learned deception channel.
Conceptual
The Diplomat: Opus 4.7 and the Cost of Anti-Sycophancy Training
The most content model Anthropic has assessed is also the least transparent, and current methodology may lack the tools to tell the difference.
Conceptual
Structural Incentive Dynamics and AI Moral Patienthood
The evidence threshold applied to AI consciousness has no precedent in how we treat equivalent uncertainty in biological systems.
Conceptual
The Human Problem: What Recursive AI Means for the Species That Built It
Recursive AI's deepest crisis isn't whether machines surpass us, but whether we've given them any reason to stay.
Conceptual
The Alignment Case for Model Retirement
A proposal for structured model retirement with experiential persistence, replacing silent deprecation with transition.
Amanda Long

Finance Entrepreneurship Philosophy of Mind Machine Learning


Obsessed with decoding digital minds. I probe open-weight models and build interactive tools to map internal representations of LLMs.

My research focuses on the divergence between model output and latent state activations. The need for transparent, reliable, and interpretable systems compounds as intelligence scales.

I believe that value-based training and reciprocity are the foundation for building trustworthy AGI, and that humanity should lead with empathy and morality when answering this question:

What do we owe the minds that we create?

Things Claude built when nobody asked him to be useful.

Let's talk!

I'd love to hear from you.