Sample Pages (Top 50 by confidence)
Sabotage evaluations for frontier models \ Anthropic
https://www.anthropic.com/research/sabotage-evaluations
1 user
Last: Jan 07, 2026
100% confidence
What's worse, spies or schemers?
https://redwoodresearch.substack.com/p/whats-worse-spies-or-schemers
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control?
https://blog.redwoodresearch.org/p/how-can-we-solve-diffuse-threats
1 user
Last: Jan 07, 2026
100% confidence
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
https://blog.redwoodresearch.org/p/misalignment-and-strategic-underperformance
1 user
Last: Jan 07, 2026
100% confidence
Building Black-box Scheming Monitors — LessWrong
https://www.lesswrong.com/posts/sb8WmKNgwzefa6oaJ/building-black-box-scheming-mo...
1 user
Last: Jan 07, 2026
100% confidence
Win/continue/lose scenarios and execute/replace/audit protocols
https://blog.redwoodresearch.org/p/wincontinuelose-scenarios-and-executereplacea...
1 user
Last: Jan 07, 2026
100% confidence
Why imperfect adversarial robustness doesn't doom AI control
https://blog.redwoodresearch.org/p/why-imperfect-adversarial-robustness
1 user
Last: Jan 07, 2026
100% confidence
Reducing risk from scheming by studying trained-in scheming behavior — LessWrong
https://www.lesswrong.com/posts/v6K3hnq5c9roa5MbS/reducing-risk-from-scheming-by...
1 user
Last: Jan 07, 2026
100% confidence
When does training a model change its goals? — LessWrong
https://www.lesswrong.com/posts/yvuXPi5m4vCvSGTjo/when-does-training-a-model-cha...
1 user
Last: Jan 07, 2026
100% confidence
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy — LessWrong
https://www.lesswrong.com/posts/MbWWKbyD5gLhJgfwn/meta-level-adversarial-evaluat...
1 user
Last: Jan 07, 2026
100% confidence
Concept Poisoning: Probing LLMs without probes | TruthfulAI
https://truthfulai.org/blog/concept-poisoning
1 user
Last: Jan 07, 2026
100% confidence
Detecting Strategic Deception Using Linear Probes — Apollo Research
https://www.apolloresearch.ai/research/deception-probes
1 user
Last: Jan 07, 2026
100% confidence
Bloom: an open source tool for automated behavioral evaluations
https://alignment.anthropic.com/2025/bloom-auto-evals
1 user
Last: Jan 07, 2026
100% confidence
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — LessWrong
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-decept...
1 user
Last: Jan 07, 2026
100% confidence
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
https://arxiv.org/html/2508.03153v1
1 user
Last: Jan 07, 2026
100% confidence
Training fails to elicit subtle reasoning in current language models
https://alignment.anthropic.com/2025/subtle-reasoning
1 user
Last: Jan 07, 2026
100% confidence
Claude Sonnet 4.5: System Card and Alignment — LessWrong
https://www.lesswrong.com/posts/4yn8B8p2YiouxLABy/claude-sonnet-4-5-system-card-...
1 user
Last: Jan 07, 2026
100% confidence
How hard is it to inoculate against misalignment generalization? — LessWrong
https://www.lesswrong.com/posts/G4YXXbKt5cNSQbjXM/how-hard-is-it-to-inoculate-ag...
2 users
Last: Jan 07, 2026
100% confidence
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs — LessWrong
https://www.lesswrong.com/posts/rKC4xJFkxm6cNq4i9/reward-hacking-is-becoming-mor...
1 user
Last: Jan 07, 2026
100% confidence
AI Control: Improving Safety Despite Intentional Subversion
https://arxiv.org/html/2312.06942v5
1 user
Last: Jan 07, 2026
100% confidence
Training on Documents About Reward Hacking Induces Reward Hacking — LessWrong
https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-re...
1 user
Last: Jan 07, 2026
100% confidence
AIs Will Increasingly Fake Alignment - by Zvi Mowshowitz
https://thezvi.substack.com/p/ais-will-increasingly-fake-alignment
1 user
Last: Jan 07, 2026
100% confidence
[2312.06942] AI Control: Improving Safety Despite Intentional Subversion
https://ar5iv.labs.arxiv.org/html/2312.06942
1 user
Last: Jan 07, 2026
100% confidence
AIs Will Increasingly Attempt Shenanigans — LessWrong
https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-...
1 user
Last: Jan 07, 2026
100% confidence
Reducing LLM deception at scale with self-other overlap fine-tuning — LessWrong
https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scal...
1 user
Last: Jan 07, 2026
100% confidence
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations — Apollo Research
https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alig...
1 user
Last: Jan 07, 2026
100% confidence
[Linkpost] A General Language Assistant as a Laboratory for Alignment — LessWrong
https://www.lesswrong.com/posts/dktT3BiinsBZLw96h/linkpost-a-general-language-as...
1 user
Last: Jan 07, 2026
100% confidence
If you can generate obfuscated chain-of-thought, can you monitor it? — LessWrong
https://www.lesswrong.com/posts/ZEdP6rYirxPxRSfTb/if-you-can-generate-obfuscated...
1 user
Last: Jan 07, 2026
100% confidence
Sonnet 4.5's eval gaming seriously undermines alignment evals
https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously
1 user
Last: Jan 07, 2026
100% confidence
Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI
https://bounded-regret.ghost.io/intrinsic-drives-and-extrinsic-misuse-two-intert...
1 user
Last: Jan 07, 2026
100% confidence
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
https://redwoodresearch.substack.com/p/behavioral-red-teaming-is-unlikely
1 user
Last: Jan 07, 2026
100% confidence
AI safety techniques leveraging distillation
https://redwoodresearch.substack.com/p/ai-safety-techniques-leveraging-distillat...
1 user
Last: Jan 07, 2026
100% confidence
On Anthropic's Sleeper Agents Paper - by Zvi Mowshowitz
https://thezvi.substack.com/p/on-anthropics-sleeper-agents-paper
1 user
Last: Jan 07, 2026
100% confidence
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" - Joe Carlsmith
https://joecarlsmith.com/2023/11/15/new-report-scheming-ais-will-ais-fake-alignm...
1 user
Last: Jan 07, 2026
100% confidence
Coup probes: Catching catastrophes with probes trained off-policy — LessWrong
https://www.lesswrong.com/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrop...
1 user
Last: Jan 07, 2026
100% confidence
Latent Adversarial Training — LessWrong
https://www.lesswrong.com/posts/atBQ3NHyqnBadrsGP/latent-adversarial-training
1 user
Last: Jan 07, 2026
100% confidence
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
https://redwoodresearch.substack.com/p/misalignment-and-strategic-underperforman...
1 user
Last: Jan 07, 2026
100% confidence
When does training a model change its goals?
https://redwoodresearch.substack.com/p/when-does-training-a-model-change
1 user
Last: Jan 07, 2026
100% confidence
Training a Reward Hacker Despite Perfect Labels — LessWrong
https://www.lesswrong.com/posts/dbYEoG7jNZbeWX39o/training-a-reward-hacker-despi...
2 users
Last: Jan 07, 2026
100% confidence
Ctrl-Z: Controlling AI Agents via Resampling — LessWrong
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-v...
1 user
Last: Jan 07, 2026
100% confidence
7+ tractable directions in AI control — LessWrong
https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-c...
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control? — LessWrong
https://www.lesswrong.com/posts/Mf5Hnpi2KcqZdmFDq/how-can-we-solve-diffuse-threa...
1 user
Last: Jan 07, 2026
100% confidence
How can we solve diffuse threats like research sabotage with AI control?
https://redwoodresearch.substack.com/p/how-can-we-solve-diffuse-threats
1 user
Last: Jan 07, 2026
100% confidence
Recent Redwood Research project proposals
https://redwoodresearch.substack.com/p/recent-redwood-research-project-proposals
1 user
Last: Jan 07, 2026
100% confidence
Notes on handling non-concentrated failures with AI control: high level methods and different regimes — LessWrong
https://www.lesswrong.com/posts/D5H5vcnhBz8G4dh6v/notes-on-handling-non-concentr...
1 user
Last: Jan 07, 2026
100% confidence
Claude, GPT, and Gemini All Struggle to Evade Monitors — LessWrong
https://www.lesswrong.com/posts/dwEgSEPxpKjz3Fw5k/claude-gpt-and-gemini-all-stru...
1 user
Last: Jan 07, 2026
100% confidence
Anti-Scheming
https://www.antischeming.ai
2 users
Last: Jan 07, 2026
100% confidence
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior — LessWrong
https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instruct...
3 users
Last: Jan 07, 2026
100% confidence
Agentic Monitoring for AI Control — LessWrong
https://www.lesswrong.com/posts/ptSXTkjnyj7KxNfMz/agentic-monitoring-for-ai-cont...
1 user
Last: Jan 07, 2026
100% confidence