Publications

Beyond alignment: Why robotic foundation models need context-aware safety

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Token Taxes: Mitigating AGI’s Economic Risks

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

Chain-of-Thought Hijacking

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer

Chain-of-Thought Is Not Explainability

Verification for International AI Governance

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Open Problems in Machine Unlearning for AI Safety
