On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4, 2025 • 4