47 - David Rein on METR Time Horizons

Description

When METR says something like "Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes", what does that mean? In this episode David Rein, METR researcher and co-author of the paper "Measuring AI ability to complete long tasks", talks about METR's work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2026/01/03/episode-47-david-rein-metr-time-horizons.html

Topics we discuss, and timestamps:
0:00:32 Measuring AI Ability to Complete Long Tasks
0:10:54 The meaning of "task length"
0:19:27 Examples of intermediate and hard tasks
0:25:12 Why the software engineering focus
0:32:17 Why task length as difficulty measure
0:46:32 Is AI progress going superexponential?
0:50:58 Is AI progress due to increased cost to run models?
0:54:45 Why METR measures model capabilities
1:04:10 How time horizons relate to recursive self-improvement
1:12:58 Cost of estimating time horizons
1:16:23 Task realism vs mimicking important task features
1:19:50 Excursus on "Inventing Temperature"
1:25:46 Return to task realism discussion
1:33:53 Open questions on time horizons

Links for METR:
Main website: https://metr.org/
X/Twitter account: https://x.com/METR_Evals/

Research we discuss:
Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts: https://arxiv.org/abs/2411.15114
HCAST: Human-Calibrated Autonomy Software Tasks: https://arxiv.org/abs/2503.17354
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://arxiv.org/abs/2507.09089
Anthropic Economic Index: Tracking AI's role in the US and global economy: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report
Bridging RL Theory and Practice with the Effective Horizon (i.e. the Cassidy Laidlaw paper): https://arxiv.org/abs/2304.09853
How Does Time Horizon Vary Across Domains?: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/
Inventing Temperature: https://global.oup.com/academic/product/inventing-temperature-9780195337389
Is there a Half-Life for the Success Rates of AI Agents? (by Toby Ord): https://www.tobyord.com/writing/half-life
Lawrence Chan's response to the above: https://nitter.net/justanotherlaw/status/1920254586771710009
AI Task Length Horizons in Offensive Cybersecurity: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html

Episode art by Hamish Doodles: hamishdoodles.com
