Lior Vasanthan

Alignment & Safety · Interpretability

Mechanistic Interpretability Steps Out of the Lab as Debugging Tools Debut

With Goodfire's Silico debugger, AI lie detectors nearing production, and new safety fellowships at Anthropic and OpenAI, the field is building real-world infrastructure, while the crucial question of what these evals actually measure remains open.

May 13, 2026 · 9 min

Alignment · Interpretability

Mechanistic Interpretability Emerges as a Product

After years as a niche research discipline, mechanistic interpretability is now spawning startups, fellowship programs, and off-the-shelf debugging tools, though the hardest problems remain unsolved.

May 12, 2026 · 9 min

Security · Alignment

Exploit Windows Hit 10 Hours in 2026, AI Red Teaming Races to Keep Up

As exploit windows shrink to hours, AI red teaming shifts from quarterly checkpoints to continuous automation, yet the methodology has blind spots that tools alone cannot fix.

May 12, 2026 · 10 min

Alignment · Reading Lists

Alignment Syllabus Wars Expose Divisions in AI Safety's Canon

New alignment reading lists and literature surveys are reshaping the field's self-definition, while battles over which papers make the cut expose deeper fractures in AI safety research.

May 9, 2026 · 9 min

Alignment & Safety · Investigation

Agentic AI Safety Testing Falls Short Despite Strong Benchmarks

An MIT audit of 72 AI agent frameworks reveals a stark absence of safety disclosures and kill switches, while Anthropic’s unreleased Mythos model deepens the chasm between benchmark performance and real-world trust.

May 9, 2026 · 4 min

AI · Safety

What the Mythos 5 red-team report actually shows — and what it can't

The 84-page card disclosure published with Mythos 5 is the most detailed pre-deployment evaluation on record. The strongest version of the safety claim is also the narrowest.

May 8, 2026 · 2 min

AI · Analysis

Eval inflation: why "beats GPT-5" stopped meaning what it used to

Three of the five most-cited frontier benchmarks have had their public splits leak into training corpora since January. The score on the leaked one is not the score on the held-out one.

May 6, 2026 · 1 min
