Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
Large language models (LLMs) perform well on coding benchmarks such as LiveCodeBench but struggle with real-world software engineering (SWE) tasks (Jimenez et al. 2024). Even large models such as Claude reach only around 60% accuracy on SWE-bench, despite relying on carefully engineered prompting pipelines (Xia...