Sample Pages (Top 50 by confidence)
Sabotage evaluations for frontier models \ Anthropic
https://www.anthropic.com/research/sabotage-evaluations
1 user
Last: Jan 07, 2026
100% confidence
What's worse, spies or schemers?
https://redwoodresearch.substack.com/p/whats-worse-spies-or-schemers
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control?
https://blog.redwoodresearch.org/p/how-can-we-solve-diffuse-threats
1 user
Last: Jan 07, 2026
100% confidence
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
https://blog.redwoodresearch.org/p/misalignment-and-strategic-underperformance
1 user
Last: Jan 07, 2026
100% confidence
Building Black-box Scheming Monitors — LessWrong
https://www.lesswrong.com/posts/sb8WmKNgwzefa6oaJ/building-black-box-scheming-mo...
1 user
Last: Jan 07, 2026
100% confidence
Win/continue/lose scenarios and execute/replace/audit protocols
https://blog.redwoodresearch.org/p/wincontinuelose-scenarios-and-executereplacea...
1 user
Last: Jan 07, 2026
100% confidence
Why imperfect adversarial robustness doesn't doom AI control
https://blog.redwoodresearch.org/p/why-imperfect-adversarial-robustness
1 user
Last: Jan 07, 2026
100% confidence
Reducing risk from scheming by studying trained-in scheming behavior — LessWrong
https://www.lesswrong.com/posts/v6K3hnq5c9roa5MbS/reducing-risk-from-scheming-by...
1 user
Last: Jan 07, 2026
100% confidence
When does training a model change its goals? — LessWrong
https://www.lesswrong.com/posts/yvuXPi5m4vCvSGTjo/when-does-training-a-model-cha...
1 user
Last: Jan 07, 2026
100% confidence
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy — LessWrong
https://www.lesswrong.com/posts/MbWWKbyD5gLhJgfwn/meta-level-adversarial-evaluat...
1 user
Last: Jan 07, 2026
100% confidence
Concept Poisoning: Probing LLMs without probes | TruthfulAI
https://truthfulai.org/blog/concept-poisoning
1 user
Last: Jan 07, 2026
100% confidence
Detecting Strategic Deception Using Linear Probes — Apollo Research
https://www.apolloresearch.ai/research/deception-probes
1 user
Last: Jan 07, 2026
100% confidence
Bloom: an open source tool for automated behavioral evaluations
https://alignment.anthropic.com/2025/bloom-auto-evals
1 user
Last: Jan 07, 2026
100% confidence
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — LessWrong
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-decept...
1 user
Last: Jan 07, 2026
100% confidence
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
https://arxiv.org/html/2508.03153v1
1 user
Last: Jan 07, 2026
100% confidence
Training fails to elicit subtle reasoning in current language models
https://alignment.anthropic.com/2025/subtle-reasoning
1 user
Last: Jan 07, 2026
100% confidence
Claude Sonnet 4.5: System Card and Alignment — LessWrong
https://www.lesswrong.com/posts/4yn8B8p2YiouxLABy/claude-sonnet-4-5-system-card-...
1 user
Last: Jan 07, 2026
100% confidence
How hard is it to inoculate against misalignment generalization? — LessWrong
https://www.lesswrong.com/posts/G4YXXbKt5cNSQbjXM/how-hard-is-it-to-inoculate-ag...
2 users
Last: Jan 07, 2026
100% confidence
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs — LessWrong
https://www.lesswrong.com/posts/rKC4xJFkxm6cNq4i9/reward-hacking-is-becoming-mor...
1 user
Last: Jan 07, 2026
100% confidence
AI Control: Improving Safety Despite Intentional Subversion
https://arxiv.org/html/2312.06942v5
1 user
Last: Jan 07, 2026
100% confidence
Training on Documents About Reward Hacking Induces Reward Hacking — LessWrong
https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-re...
1 user
Last: Jan 07, 2026
100% confidence
AIs Will Increasingly Fake Alignment - by Zvi Mowshowitz
https://thezvi.substack.com/p/ais-will-increasingly-fake-alignment
1 user
Last: Jan 07, 2026
100% confidence
[2312.06942] AI Control: Improving Safety Despite Intentional Subversion
https://ar5iv.labs.arxiv.org/html/2312.06942
1 user
Last: Jan 07, 2026
100% confidence
AIs Will Increasingly Attempt Shenanigans — LessWrong
https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-...
1 user
Last: Jan 07, 2026
100% confidence
Reducing LLM deception at scale with self-other overlap fine-tuning — LessWrong
https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scal...
1 user
Last: Jan 07, 2026
100% confidence
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations — Apollo Research
https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alig...
1 user
Last: Jan 07, 2026
100% confidence
[Linkpost] A General Language Assistant as a Laboratory for Alignment — LessWrong
https://www.lesswrong.com/posts/dktT3BiinsBZLw96h/linkpost-a-general-language-as...
1 user
Last: Jan 07, 2026
100% confidence
If you can generate obfuscated chain-of-thought, can you monitor it? — LessWrong
https://www.lesswrong.com/posts/ZEdP6rYirxPxRSfTb/if-you-can-generate-obfuscated...
1 user
Last: Jan 07, 2026
100% confidence
Sonnet 4.5's eval gaming seriously undermines alignment evals
https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously
1 user
Last: Jan 07, 2026
100% confidence
Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI
https://bounded-regret.ghost.io/intrinsic-drives-and-extrinsic-misuse-two-intert...
1 user
Last: Jan 07, 2026
100% confidence
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
https://redwoodresearch.substack.com/p/behavioral-red-teaming-is-unlikely
1 user
Last: Jan 07, 2026
100% confidence
AI safety techniques leveraging distillation
https://redwoodresearch.substack.com/p/ai-safety-techniques-leveraging-distillat...
1 user
Last: Jan 07, 2026
100% confidence
On Anthropic's Sleeper Agents Paper - by Zvi Mowshowitz
https://thezvi.substack.com/p/on-anthropics-sleeper-agents-paper
1 user
Last: Jan 07, 2026
100% confidence
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" - Joe Carlsmith
https://joecarlsmith.com/2023/11/15/new-report-scheming-ais-will-ais-fake-alignm...
1 user
Last: Jan 07, 2026
100% confidence
Coup probes: Catching catastrophes with probes trained off-policy — LessWrong
https://www.lesswrong.com/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrop...
1 user
Last: Jan 07, 2026
100% confidence
Latent Adversarial Training — LessWrong
https://www.lesswrong.com/posts/atBQ3NHyqnBadrsGP/latent-adversarial-training
1 user
Last: Jan 07, 2026
100% confidence
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
https://redwoodresearch.substack.com/p/misalignment-and-strategic-underperforman...
1 user
Last: Jan 07, 2026
100% confidence
When does training a model change its goals?
https://redwoodresearch.substack.com/p/when-does-training-a-model-change
1 user
Last: Jan 07, 2026
100% confidence
Training a Reward Hacker Despite Perfect Labels — LessWrong
https://www.lesswrong.com/posts/dbYEoG7jNZbeWX39o/training-a-reward-hacker-despi...
2 users
Last: Jan 07, 2026
100% confidence
Ctrl-Z: Controlling AI Agents via Resampling — LessWrong
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-v...
1 user
Last: Jan 07, 2026
100% confidence
7+ tractable directions in AI control — LessWrong
https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-c...
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control? — LessWrong
https://www.lesswrong.com/posts/Mf5Hnpi2KcqZdmFDq/how-can-we-solve-diffuse-threa...
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control?
https://redwoodresearch.substack.com/p/how-can-we-solve-diffuse-threats
1 user
Last: Jan 07, 2026
100% confidence
Recent Redwood Research project proposals
https://redwoodresearch.substack.com/p/recent-redwood-research-project-proposals
1 user
Last: Jan 07, 2026
100% confidence
Notes on handling non-concentrated failures with AI control: high level methods and different regimes — LessWrong
https://www.lesswrong.com/posts/D5H5vcnhBz8G4dh6v/notes-on-handling-non-concentr...
1 user
Last: Jan 07, 2026
100% confidence
Claude, GPT, and Gemini All Struggle to Evade Monitors — LessWrong
https://www.lesswrong.com/posts/dwEgSEPxpKjz3Fw5k/claude-gpt-and-gemini-all-stru...
1 user
Last: Jan 07, 2026
100% confidence
Anti-Scheming
https://www.antischeming.ai
2 users
Last: Jan 07, 2026
100% confidence
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior — LessWrong
https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instruct...
3 users
Last: Jan 07, 2026
100% confidence
Agentic Monitoring for AI Control — LessWrong
https://www.lesswrong.com/posts/ptSXTkjnyj7KxNfMz/agentic-monitoring-for-ai-cont...
1 user
Last: Jan 07, 2026
100% confidence