curius
all pages
showing 34101-34150 of 160880 pages (sorted by popularity)
[2210.14215] In-context Reinforcement Learning with Algorithm Distillation
1 user ▼
[2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
1 user ▼
AIs Will Increasingly Fake Alignment - by Zvi Mowshowitz
1 user ▼
Agent57: Outperforming the human Atari benchmark - Google DeepMind
1 user ▼
About/Start here – Everything Studies
1 user ▼
Próspera
1 user ▼
How To Go Multiverse Surfing. Extracting the psychotechnologies from… | by Cody Hergenroeder | Medium
1 user ▼
How do you do metta on your self? : r/Meditation
1 user ▼
[2406.07358] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
1 user ▼
j⧉nus on X: "☸️ Superbenevolence ☸️ Though the paper (https://t.co/DjWOhhgFmv) is focused on the behavior of faking (mis)alignment, one of the important empirical results is the robustness of Claude 3 Opus' value alignment, including for values it was not explicitly trained to have, such as https://t.co/XNUirWgVob" / Twitter
1 user ▼
Fluent dreaming for language models
1 user ▼
Gemma 2: Improving Open Language Models at a Practical Size
1 user ▼
Do Large Language Model Benchmarks Test Reliability? – gradient science
1 user ▼
How do we solve the alignment problem?
1 user ▼
AI Futures Project | Artificial Intelligence Forecasting, Inc., DBA AI Futures Project is a 501(c)3 nonprofit research organization. We are funded entirely by charitable donations and grants.
1 user ▼
Chain of Continuous Thoughts | Ben Congdon
1 user ▼
cookbook/examples/prompting/Chain_of_thought_prompting.ipynb at main · google-gemini/cookbook
1 user ▼
[2501.17148] AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
1 user ▼
[2405.18915] Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners
1 user ▼
Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners
1 user ▼
ar5iv – Articles from arXiv.org as responsive HTML5 web documents
1 user ▼
[2310.18512] Preventing Language Models From Hiding Their Reasoning
1 user ▼
18022025 Mary Meeting Agenda - Google Docs
1 user ▼
Being Present is Not a Skill - by Chris Lakin
1 user ▼
[2309.16797] Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution
1 user ▼
How might we safely pass the buck to AI? — LessWrong
1 user ▼
A minimal viable product for alignment - by Jan Leike
1 user ▼
[2312.06942] AI Control: Improving Safety Despite Intentional Subversion
1 user ▼
[2310.00492] From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
1 user ▼
[2403.10415] Gradient based Feature Attribution in Explainable AI: A Technical Review
1 user ▼
kenny-evitt/ynas: You Need A Schedule
1 user ▼
Suchir Balaji
1 user ▼
[2409.14507] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
1 user ▼
A Proven Way to Ease L.A.’s Housing Crisis - The Atlantic
1 user ▼
[2410.06992] SWE-Bench+: Enhanced Coding Benchmark for LLMs
1 user ▼
Home | Ethansmith2000
1 user ▼
[2406.10209] Be like a Goldfish, Don’t Memorize! Mitigating Memorization in Generative LLMs
1 user ▼
Catching AIs red-handed — AI Alignment Forum
1 user ▼
[1606.03137] Cooperative Inverse Reinforcement Learning
1 user ▼
ja3k.com Archive
1 user ▼
[2502.09992] Large Language Diffusion Models
1 user ▼
A central AI alignment problem: capabilities generalization, and the sharp left turn — LessWrong
1 user ▼
Reflection: A Path to Superintelligence - Reflection AI
1 user ▼
[2502.04040] Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
1 user ▼
The Dual LLM pattern for building AI assistants that can resist prompt injection
1 user ▼
[2502.19402] General Reasoning Requires Learning to Reason from the Get-go
1 user ▼
AIs Will Increasingly Attempt Shenanigans — LessWrong
1 user ▼
AI Control: The Great Cold War
1 user ▼
Reducing LLM deception at scale with self-other overlap fine-tuning — LessWrong
1 user ▼