WideSearch: Benchmarking Agentic Broad Info-Seeking • Paper • 2508.07999 • Published Aug 11, 2025 • 110
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts • Paper • 2508.09848 • Published Aug 13, 2025 • 71
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds • Paper • 2508.12782 • Published Aug 18, 2025 • 25
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers • Paper • 2508.14704 • Published Aug 20, 2025 • 43
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction • Paper • 2508.11987 • Published Aug 16, 2025 • 71