The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper • 2509.09677 • Published Sep 11, 2025 • 34 • 4
Answer Matching Outperforms Multiple Choice for Language Model Evaluation Paper • 2507.02856 • Published Jul 3, 2025 • 8 • 2
Pitfalls in Evaluating Language Model Forecasters Paper • 2506.00723 • Published May 31, 2025 • 3 • 2
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Paper • 2502.19414 • Published Feb 26, 2025 • 20 • 2
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 Paper • 2502.03544 • Published Feb 5, 2025 • 44 • 5
Great Models Think Alike and this Undermines AI Oversight Paper • 2502.04313 • Published Feb 6, 2025 • 33 • 2