SWE-Universe Notes
Scalable, verifiable, multilingual SWE environments from GitHub PRs for training
Addresses 3 challenges in generating SWE environments: low production yield, weak verifiers, and high cost
Existing datasets suffer from Python-only focus, manual setup, limited scale, and vague/undisclosed methods
Researchers trained an 80B model as an autonomous building agent; the custom-trained MoE model makes large-scale generation economically viable while slightly outperforming models like Claude Opus on build success rate
The agent performs three functions: PR patch-splitting, iterative self-verification, and hacking detection.
PR patch-splitting: Agent analyzes code changes and partitions into test patch (test-related changes) and fix patch (source code changes)
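The patch-splitting step can be sketched as a simple partition of a unified diff by file path. This is only a toy heuristic (the paper's agent does this analysis with a model, and the "test-in-path" rule is my assumption):

```python
import re

def split_patch(diff_text):
    """Partition a unified diff into (test_patch, fix_patch) by file path."""
    # A unified diff is a sequence of per-file chunks, each starting with "diff --git".
    chunks = re.split(r"(?m)^(?=diff --git )", diff_text)
    test_parts, fix_parts = [], []
    for chunk in chunks:
        if not chunk.strip():
            continue
        m = re.search(r"^diff --git a/(\S+)", chunk)
        path = m.group(1) if m else ""
        # Heuristic (assumption): files with "test" in the path are test changes.
        (test_parts if "test" in path.lower() else fix_parts).append(chunk)
    return "".join(test_parts), "".join(fix_parts)
```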
For the verifier, the agent can either use unit tests in test patch or generate its own custom test from scratch
Iterative self-verification: Agent tests the generated verifier against both the buggy and fixed states. The verifier is only considered correct if it fails in the buggy state and succeeds in the fixed state. If either condition fails, feedback is given to the building agent and it tries again.
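The acceptance rule above reduces to a two-sided check. A minimal sketch, where `run_verifier` is a hypothetical hook that runs the verifier in a given repo state and returns True on a passing test run:

```python
def verifier_is_valid(run_verifier, buggy_state, fixed_state):
    """Accept a verifier only if it fails pre-fix and passes post-fix."""
    fails_when_buggy = not run_verifier(buggy_state)   # must catch the bug
    passes_when_fixed = run_verifier(fixed_state)      # must accept the fix
    return fails_when_buggy and passes_when_fixed
```

A verifier that passes in both states (too weak) or fails in both (broken) is rejected and the feedback loops back to the agent.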
Hacking detection: Agent inspects generated testing script to ensure it is actually running the code instead of just checking files
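In the same spirit, a crude static version of hacking detection could flag evaluation scripts that never invoke a test runner or interpreter and therefore can only be inspecting files. The marker list is purely illustrative; the paper's agent does this inspection with a model:

```python
# Commands that actually execute code (assumed, illustrative list).
EXEC_MARKERS = ("pytest", "python ", "go test", "npm test", "cargo test", "mvn test")

def looks_like_hack(script_text):
    """Flag a testing script that never appears to run any code."""
    lower = script_text.lower()
    return not any(cmd in lower for cmd in EXEC_MARKERS)
```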
Created dataset of 807,693 instances from 52,960 repos across multiple languages
Mid-training with rejection sampling on 500k successful trajectories improves SWE-Bench Verified from 50.3% → 61%
RL training using the evaluation.sh testing script as the reward signal improves a 30B model from 32% → 42% on SWE-Bench Multilingual
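Using evaluation.sh as the reward signal presumably means mapping the script's exit status to a binary reward. A minimal sketch (script name from the notes; the timeout and binary 0/1 shaping are assumptions):

```python
import subprocess

def compute_reward(workdir, timeout=600):
    """Return 1.0 if the environment's evaluation.sh exits 0, else 0.0."""
    try:
        result = subprocess.run(
            ["bash", "evaluation.sh"],
            cwd=workdir,
            capture_output=True,
            timeout=timeout,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        # A hung test run counts as failure.
        return 0.0
```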
Training Qwen3-Max-Thinking using the framework achieves a SOTA 75.3% result on SWE-Bench Verified