m8ngotree - ML Blog

SWE-Universe Notes

Scalable, verifiable, multilingual SWE environments from GitHub PRs for training

Addresses 3 challenges in generating SWE environments: low production yield, weak verifiers, and high cost

Existing datasets suffer from Python-only focus, manual setup, limited scale, and vague/undisclosed methods

Researchers trained an 80B MoE model as an autonomous building agent - the custom-trained model makes large-scale generation economically viable while also slightly outperforming models like Claude Opus on build success rate

The agent performs three tasks: PR patch-splitting, iterative self-verification, and hacking detection.

PR patch-splitting: Agent analyzes code changes and partitions into test patch (test-related changes) and fix patch (source code changes)
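As a rough illustration only: the paper's agent does this split by reasoning over the code, but a path-based heuristic over a unified diff gives the flavor. The `split_patch` function and the "test in path" rule below are my own sketch, not the paper's method:

```python
def split_patch(diff_text: str) -> tuple[str, str]:
    """Partition a unified diff into (test_patch, fix_patch).

    Hypothetical heuristic: route each per-file hunk by whether the
    file path looks test-related. The real agent reasons over the code.
    """
    test_parts, fix_parts = [], []
    current, is_test = [], False
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git"):
            # Flush the previous file's hunk to the appropriate patch
            if current:
                (test_parts if is_test else fix_parts).extend(current)
            current = [line]
            is_test = "test" in line.lower()
        else:
            current.append(line)
    if current:
        (test_parts if is_test else fix_parts).extend(current)
    return "".join(test_parts), "".join(fix_parts)
```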

For the verifier, the agent can either use unit tests in test patch or generate its own custom test from scratch

Iterative self-verification: Agent tests the generated verifier against both the buggy and fixed states. The verifier is only considered correct if it fails in the buggy state and succeeds in the fixed state. If either condition fails, feedback is given to the building agent and it tries again.
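A minimal sketch of that accept/retry loop, assuming a verifier is a callable `verifier(state) -> bool` and `generate_verifier` stands in for the building agent (both names are hypothetical):

```python
def verifier_is_valid(verifier, buggy_state, fixed_state):
    """Accept only if the verifier fails on buggy code AND passes on fixed code."""
    return (not verifier(buggy_state)) and verifier(fixed_state)

def build_verifier(generate_verifier, buggy_state, fixed_state, max_rounds=5):
    """Regenerate with feedback until a discriminating verifier is found."""
    feedback = None
    for _ in range(max_rounds):
        verifier = generate_verifier(feedback)
        if verifier_is_valid(verifier, buggy_state, fixed_state):
            return verifier
        # Feedback string is illustrative; the paper feeds back run results
        feedback = "verifier did not fail on buggy state or did not pass on fixed state"
    return None  # give up; the instance would be discarded
```

Note the trivial cases this screens out: a verifier that always passes never fails on the buggy state, and a broken one that always fails never passes on the fixed state.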

Hacking detection: Agent inspects generated testing script to ensure it is actually running the code instead of just checking files
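The agent's inspection is semantic, but a crude static heuristic conveys the idea: flag scripts that only probe the filesystem or grep for strings and never execute anything. The patterns below are illustrative assumptions, not the paper's detector:

```python
import re

# Signs the script actually runs code (illustrative, not exhaustive)
EXECUTION_PATTERNS = [r"\bpytest\b", r"\bpython\b", r"\bgo test\b",
                      r"\bnpm test\b", r"\bcargo test\b", r"\bmake\b"]
# Signs the script merely checks that files or strings exist
FILE_CHECK_PATTERNS = [r"\btest -f\b", r"\[ -e ", r"os\.path\.exists", r"\bgrep\b"]

def looks_like_reward_hack(script_text: str) -> bool:
    """Flag a testing script that checks files but never runs the code."""
    runs_code = any(re.search(p, script_text) for p in EXECUTION_PATTERNS)
    checks_files = any(re.search(p, script_text) for p in FILE_CHECK_PATTERNS)
    return checks_files and not runs_code
```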

Created dataset of 807,693 instances from 52,960 repos across multiple languages

Mid-training with rejection sampling on 500k successful trajectories improves performance from 50.3% → 61% on SWE-Bench Verified
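Rejection sampling here means rolling out the policy multiple times per environment instance and keeping only trajectories whose final patch passes the verifier. A sketch with hypothetical `rollout` and `passes_verifier` hooks (the sample count is an assumption):

```python
def collect_sft_data(instances, rollout, passes_verifier, samples_per_instance=4):
    """Rejection sampling for mid-training data (sketch).

    Roll out the policy several times per instance and keep only the
    trajectories that the environment's verifier accepts.
    """
    kept = []
    for inst in instances:
        for _ in range(samples_per_instance):
            traj = rollout(inst)
            if passes_verifier(inst, traj):
                kept.append(traj)
    return kept
```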

RL training using the evaluation.sh testing script as the reward signal improves a 30B model from 32% → 42% on SWE-Bench Multilingual
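That reward signal can be read as: run the environment's `evaluation.sh` and score on its exit status. A minimal sketch, where the binary reward, timeout value, and error handling are all assumptions:

```python
import subprocess

def evaluation_reward(workdir: str, timeout_s: int = 600) -> float:
    """Binary RL reward: 1.0 if the environment's evaluation.sh exits 0."""
    try:
        result = subprocess.run(
            ["bash", "evaluation.sh"],
            cwd=workdir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung evaluations count as failures
    return 1.0 if result.returncode == 0 else 0.0
```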

Training Qwen3-Max-Thinking with this framework achieves a SOTA result of 75.3% on SWE-Bench Verified