m8ngotree - ML Blog

SWE-Synth Notes

Bug-fix pairs from open-source repos are noisy, lack intermediate reasoning traces, & rarely include test-based verifiability

SWE-Bench contains gold patches but no training environments

SWE-GYM has some training environments but relies on human-curated test cases, limiting scalability

Mutation-based synthetic code data is often trivial/unrealistic

LLM-based synthetic data generation is not advanced enough to simulate complex SWE tasks

Synthetic data frequently lacks context/verification

The paper proposes that synthetic code data should have 4 properties: human-like bugs, scalability, automated verification, & repair traces / intermediate steps

Many Automated Program Repair (APR) approaches focus directly on generating the fix patch without capturing the intermediate repair trajectory. The intermediate steps help capture reasoning & handle more complex fixes the way a skilled dev would.

The proposed framework is intended to generate variants of programs that fail at least one of their existing test cases. It does this by selecting a component (function/class) from the original program and masking out the implementation. An LLM is used to re-implement the masked component. Only variants that fail at least one test case are kept.
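The loop above can be sketched as follows. This is a minimal illustration, not the paper's code: `rewrite` (standing in for the mask-and-LLM-reimplement step) and `run_tests` (standing in for the repo's test harness) are hypothetical callables.

```python
import random

def generate_variants(original, components, rewrite, run_tests, n_samples, seed=0):
    """Sketch of the variant-generation loop: pick a component, have the
    LLM re-implement its masked body via `rewrite`, then keep the variant
    only if at least one existing test case fails on it."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        comp = rng.choice(components)      # naive uniform component selection
        variant = rewrite(original, comp)  # LLM re-implements the masked component
        failing = run_tests(variant)       # names of failing test cases
        if failing:                        # discard variants that still pass everything
            kept.append((variant, failing))
    return kept
```

The rejection step is what gives the dataset its verifiability: every surviving variant comes with at least one concrete failing test.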

Component selection - there are 2 approaches: naive uniform selection & test coverage-based selection. The first samples at random from all components, and the second weights components according to their test coverage. The second approach is done by using cached coverage information & modeling the relationship between test cases & components as a bipartite graph. The degree of a component in the graph is the number of test cases that cover it.
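The coverage-based strategy can be sketched as below. The `coverage` mapping (test case → components it covers) is an assumed input format; the paper builds it from cached coverage data, and a component's sampling weight is its degree in the bipartite graph.

```python
import random
from collections import defaultdict

def coverage_weighted_choice(coverage, rng=random):
    """Sketch of test coverage-based component selection. `coverage` maps
    each test case to the set of components it covers (the bipartite graph).
    A component's degree = number of tests covering it = its sampling weight."""
    degree = defaultdict(int)
    for test, comps in coverage.items():
        for comp in comps:
            degree[comp] += 1
    components = list(degree)
    weights = [degree[c] for c in components]
    return rng.choices(components, weights=weights, k=1)[0]
```

Weighting by degree biases selection toward well-tested components, which makes it more likely that a faulty re-implementation will actually trip a test and survive the rejection filter.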

For function masking, the original body is removed but the signature & docstring are kept. For class masking, the bodies of all methods are cleared while the class attributes and documentation are kept.
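For the function-level case, masking can be sketched with Python's `ast` module (the paper's actual masking code is not shown, so this is illustrative; the class-level case would analogously clear each method body while keeping class attributes and docstrings):

```python
import ast

def mask_function(source: str, name: str) -> str:
    """Sketch of function-level masking: keep the signature and docstring,
    replace the implementation with a `...` placeholder."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            new_body = []
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])          # keep the docstring
            new_body.append(ast.Expr(ast.Constant(...)))  # placeholder body
            node.body = new_body
    return ast.unparse(tree)
```

Keeping the signature and docstring gives the LLM enough specification to attempt a plausible re-implementation while hiding the original logic.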

For extracting the ground truth fix, the naive approach is to take the reverse patch diff, but this doesn't capture the iterative repair process. The paper found that training on these doesn't give optimal results. The better approach was determined to be training on successful rollout trajectories of repair actions (codebase exploration or modification).

One future research direction that is suggested is to devise a procedure for guiding intermediate steps toward the reverse patch diff by providing hints during the repair process while also avoiding leakage about the ground truth / location of the bug.

The repository set for SWE-Synth is based on SWE-GYM, which has 11 Python repos with 2,438 tasks & corresponding test suites.

SWE-Synth is generalizable to other codebases despite the paper focusing on Python.

For test coverage-based selection, components are mapped to test cases using line coverage metrics via coveragepy.

Qwen 2.5 Coder Instruct 32B is used to generate variants - temperature is set to 0.7.

MoatlessTool is used as the agent scaffold - it integrates code search with LLM-based retrieval.

Each variant's failed-test log was truncated to 16k tokens and included in the input prompt.
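The truncation step is straightforward; a sketch is below. A real pipeline would count tokens with the model's tokenizer, so the whitespace split here is just a stand-in.

```python
def truncate_log(log: str, max_tokens: int = 16_000) -> str:
    """Sketch of truncating a failed-test log before prompting.
    Whitespace tokens stand in for real model tokens."""
    tokens = log.split()
    if len(tokens) <= max_tokens:
        return log
    return " ".join(tokens[:max_tokens]) + " ...[truncated]"
```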

SWE-Synth generated 21,804 program variants from 35 original program versions across seven repositories. After filtering invalid or unverifiable variants, SWE-Synth contains 9,459 high-quality, test-case-verifiable program variants. Of the 9,459 variants, 7,798 remain buggy.

Valid patches were shown to modify multiple lines, hunks, & files. The bugs spanned diverse code locations, each buggy variant had 254.9 test cases on average, the median length of test failure logs was 18,842 tokens, & each trajectory had an average of 8.68 intermediate repair steps.

Qwen 2.5 Coder Instruct was trained at 3 sizes (7B, 14B, 32B) on SWE-Synth & manual datasets. SWE-Bench Verified & BugsInPy were used for evaluation.

Training experiments were conducted using LoRA with a rank of 8, alpha of 16, & learning rate of 2 x 10^-5.

3 metrics were measured - resolve rate, correct patch rate, & empty patch rate. Resolve rate measured whether the patch passed all test cases, correct patch rate measured whether the AST matched the reference patch, and empty patch rate measured the share of suggested fixes that were empty, which sometimes occurs with LLMs.
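The three metrics can be sketched as simple ratios over per-instance results. The dict keys (`passes_all_tests`, `ast_matches_ref`, `patch`) are hypothetical names for illustration:

```python
def repair_metrics(results):
    """Sketch of the three evaluation metrics over a list of per-instance
    result dicts with hypothetical keys: 'passes_all_tests' (bool),
    'ast_matches_ref' (bool), and 'patch' (the suggested fix text)."""
    n = len(results)
    return {
        "resolve_rate": sum(r["passes_all_tests"] for r in results) / n,
        "correct_patch_rate": sum(r["ast_matches_ref"] for r in results) / n,
        "empty_patch_rate": sum(not r["patch"].strip() for r in results) / n,
    }
```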

Per-bug instance capping - this was ablated because rejection (repeated) sampling tends to favor easier bugs, as these yield more successful trajectories. A low cap would restrict the data while a high cap would skew towards easy tasks.
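Capping can be sketched as a single pass that keeps at most `cap` trajectories per bug (the `bug_id` key is an assumed field name):

```python
from collections import defaultdict

def cap_trajectories(trajectories, cap):
    """Sketch of per-bug instance capping: keep at most `cap` successful
    trajectories per bug so that easy bugs, which yield many successes
    under rejection sampling, do not dominate the training set."""
    kept, counts = [], defaultdict(int)
    for traj in trajectories:
        if counts[traj["bug_id"]] < cap:
            kept.append(traj)
            counts[traj["bug_id"]] += 1
    return kept
```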

When ablating synthetic vs. manual data on the same number of variants, performance was fairly similar - SWE-Synth only outperformed in some cases. When using the same total number of trajectories, performance was similar as well.

When using a 685B param DeepSeek model to sample 2,136 trajectories for each dataset, SWE-Synth clearly outperformed the manual data. This suggests that the advantage of synthetic data's larger set of bug instances and trajectory diversity grows with larger fine-tuning data & stronger teacher models.

Research Question 2 - How does increasing the number of synthetic training instances affect model performance? Scaling the number of synthetic training instances was shown to improve model performance as SWE-Synth (the largest dataset) had the highest resolve rate on SWE-Bench Lite. SWE-Synth showed improvement when the cap on number of per-bug instances was increased from 1 to 3. From the paper - "A plausible explanation is that with large caps, SWE-Synth's larger and more diverse bug distribution reduces overfitting on simpler tasks, enabling APR model to effectively leverage additional success trajectories without overshadowing complex or less frequent bugs".

RQ3: How well can humans distinguish SWE-Synth's results from real-world bugs? Overall accuracy was 55.17%, real bugs were correctly identified 64.77% of the time, synthetic bugs were correctly identified 45.11% of the time. This showed it was fairly challenging to differentiate real from synthetic.

RQ4: How does model performance change across different model sizes when fine-tuned on our synthetic training data? Out of the 7B, 14B, and 32B models trained on success trajectories, the 7B model benefited most. This aligns with the Weak-to-Strong effect. From the paper - "Addressing this may require self-improvement methods, such as Reinforcement Learning with Verifiable Rewards (Guo et al., 2025), which SWE-Synth can support through its automated verifiable bugs and intermediate steps."

RQ5: How do different component granularities affect the trained models' performance? Function-level granularity had the highest resolve rate, but class-level granularity had the lowest empty patch rate. The paper suggests that larger granularity doesn't necessarily lead to improved performance.

RQ6: How do different component selection strategies affect the trained models' performance? Coverage-based selection outperformed uniform sampling.

RQ7: How does the size of the model used for component rewriting affect APR performance? Smaller models are less efficient in generating valid variants due to syntax & formatting errors, but the resolve, correct, & empty patch rates across data generated by all sizes were shown to be fairly similar.

RQ8: How well does the model perform when trained on reverse patch diffs compared to when trained on SWE-Synth with rollout? Rollout rejection sampling was clearly the best performer compared to reverse patch diff & RAG rejection sampling.

Potential flaws mentioned in the paper: the bugs and intermediate steps do not encompass all bugs or all possible alternative solutions, using SWE-GYM as a basis for LLM reimplementation might not be optimal/comprehensive, & SWE-Bench & BugsInPy might not be comprehensive evals.

Ideas for extensions from the paper: "First, by enabling scalable and diverse bug generation, SWE-Synth addresses the longstanding data scarcity challenge in APR, paving the way for more robust and generalizable repair models via intermediate steps. Second, our method is generic for any programming language due to the inherent capability in LLMs. Third, it paves ways for research on exploring adaptive bug synthesis tailored to different contexts including specific programming paradigms, security vulnerabilities, or domain-specific software, etc. Fourth, this method can be extended to generate adversarial bugs, facilitating the evaluation of APR models under realistic conditions. Finally, the integration of synthetic bugs can enhance automated testing and vulnerability detection by refining models to handle faults in different domains".