Alignment & Safety · Interpretability
With Goodfire's Silico debugger shipping, AI lie detectors nearing production, and new safety fellowships launching at Anthropic and OpenAI, the field is building real-world infrastructure while the crucial question of what these evals actually measure persists.
May 13, 2026
·
9 min
Alignment · Interpretability
After years as a niche research discipline, mechanistic interpretability is now spawning startups, fellowship programs, and off-the-shelf debugging tools, though the hardest problems remain unsolved.
May 12, 2026
·
9 min
Security · Alignment
As exploit windows shrink to hours, AI red teaming is shifting from quarterly checkpoints to continuous automation, yet the methodology retains blind spots that tools alone cannot fix.
May 12, 2026
·
10 min
Alignment · Reading Lists
New alignment reading lists and literature surveys are reshaping the field's self-definition, while battles over which papers make the cut expose deeper fractures in AI safety research.
May 9, 2026
·
9 min
Alignment & Safety · Investigation
An MIT audit of 72 AI agent frameworks reveals a stark absence of safety disclosures and kill switches, while Anthropic’s unreleased Mythos model deepens the chasm between benchmark performance and real-world trust.
May 9, 2026
·
4 min
AI · Safety
The 84-page model card published with Mythos 5 is the most detailed pre-deployment evaluation on record. The strongest version of the safety claim is also the narrowest.
May 8, 2026
·
2 min
AI · Analysis
Three of the five most-cited frontier benchmarks have had their public splits leak into training corpora since January. The score on a leaked split is not the score on a held-out split.
May 6, 2026
·
1 min