1st Place Solution Write Up - Eduardo Rocha de Andrade

Title: 1st Place Solution Write Up
Author: Eduardo Rocha de Andrade (Grandmaster)
Published: 2025-03-18
Updated: 2025-07-28
Competition rank: 1

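First of all, I would like to thank the Kaggle team and the hosts of this competition. Not only was this a very interesting problem, but I can also imagine how hard it was to prepare a competition this complex on the infrastructure/engineering side.

I joined this competition quite late (about 1 month before the end), so I knew I was better off finding an open-source solution to adapt than writing code from scratch. After some research, I decided to build my solution on top of Agentless 1.5.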

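TLDR:

Agentless 1.5, heavily modified to support local models, optimize runtime, and improve the quality of a "weaker" 32B model.

Main key points:

Improved context retrieval for generating F2P (fail-to-pass) tests, i.e. tests that reproduce the issue
Patch rejection/filtering using F2P and P2P (unit) tests
Qwen2.5 Coder 32B model
Search/Replace diff format for patch generation (much better than generating the diff directly)
Retry mechanism when no F2P test reproduces the GitHub issue
Concurrent package installation and test execution
Global and local time management system to control runtime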

(For a better view, right-click the image and open it in a new tab.)

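Solution details

Patch Rejection

The first stage of my pipeline generates tests that reproduce the GitHub issue. These tests are later used to evaluate whether candidate patches actually fix it. I will share the prompt I used, since it helps explain the whole process: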

generate_tests_prompt_template_with_related_content = """
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{problem_statement}
--- END ISSUE ---

Here is the related content from other tests in the repository, which you can use as example on how to import modules, instantiate classes, and use functions:
--- BEGIN RELATED CONTENT ---
{file_contents}
--- END RELATED CONTENT ---

Please generate a complete test that can be used to reproduce the issue.

The complete test should contain the following:
1. Necessary imports, this include any native python imports as well as imports from the repository. Also pay attention to custom Exceptions and Classes.
2. If the test script requires any setup or initialization, make sure to include it in the test.
3. Code to reproduce the issue described in the issue text
4. Print "Issue reproduced" if the outcome indicates that the issue is reproduced
5. Print "Issue resolved" if the outcome indicates that the issue has been successfully resolved. This should test both if the issue was resolved and if it presents the expected behavior. *Treat this similarly to an unit test that someone would add to the code base to assess if the issue is resolved or not*. The only difference is that this is a standalone script instead of using frameworks like pytest or unittest.
6. Print "Other issues" if the outcome indicates there are other issues with the source code
7. If the repo is django make sure to add `import django` and `django.setup()` at the beginning of the test.

Here is an example:

```python
from sqlfluff import lint

def test__rules__std_L060_raised() -> None:
    try:
        sql = "SELECT   IFNULL(NULL, 100),
            NVL(NULL,100);"
        result = lint(sql, rules=["L060"])
        assert len(result) == 2
    except:
        print("Other issues")
        return

    try:
        assert result[0]["description"] == "Use 'COALESCE' instead of 'IFNULL'."
        assert result[1]["description"] == "Use 'COALESCE' instead of 'NVL'."
        print("Issue resolved")
    except AssertionError:
        print("Issue reproduced")
        return

    return

test__rules__std_L060_raised()
```

Please ensure the generated test reflects the issue described in the provided issue text.
The generated test should be able to be used to both reproduce the issue as well as to verify the issue has been fixed.
Note that we won't have internet access when running the tests, so avoid using any code that requires internet access like downloading files, making API calls or using datasets. Instead you could try to mock the data or use a small toy example.
Wrap the complete test in ```python...```.
"""

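Using the prompt above, I generated 5 F2P candidate tests per issue. I then ran all of them and kept only those that actually reproduced the issue with no other problems. For example, if a test printed both "Issue reproduced" and "Other issues", I removed it because that looks suspicious.

If none of the 5 F2P candidates reproduced the issue, I triggered a second batch of 5 tests with a higher temperature. If none of the 10 tests reproduced the issue, I skipped the sample entirely, since I would have no way to evaluate my candidate patches.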
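The filtering and retry logic can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the `generate` and `run` callables and the two temperature values are assumptions made for the example.

```python
from typing import Callable, List, Sequence

REPRODUCED = "Issue reproduced"
OTHER = "Other issues"


def reproduces_issue(stdout: str) -> bool:
    """Keep a test only if it reports the issue and no other problems."""
    return REPRODUCED in stdout and OTHER not in stdout


def collect_f2p_tests(
    generate: Callable[[int, float], List[str]],  # hypothetical: (n, temperature) -> test scripts
    run: Callable[[str], str],                    # hypothetical: test script -> captured stdout
    batch_size: int = 5,
    temperatures: Sequence[float] = (0.8, 1.0),   # assumed values; the second batch is hotter
) -> List[str]:
    """Return the reproducing tests from the first batch that yields any."""
    for temperature in temperatures:
        candidates = generate(batch_size, temperature)
        kept = [test for test in candidates if reproduces_issue(run(test))]
        if kept:
            return kept
    # Nothing reproduced the issue after all batches: the caller skips the sample.
    return []
```

An empty return value corresponds to skipping the sample, since there would be no F2P test to judge patches with.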

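Main differences from Agentless

I believe this is where my solution improves most over Agentless. The original codebase gives the model no context for test generation: it only provides the issue description and asks the model to generate a reproduction test. After looking at some samples, I quickly noticed that this strategy works very well for very large models like Claude/GPT, which have amazing memorization and know by heart how to import modules, instantiate classes, and use functions and methods correctly.

On the other hand, a "small" 32B model like Qwen2.5 Coder has the right idea for the test most of the time but fails when importing modules or using classes. These are essentially silly mistakes that can be fixed simply by giving the model context.

To find context, I used a two-stage approach: first, the model picks the unit test files that are relevant to the issue; then I give it a skeleton of all functions and classes in those files and ask which classes/functions/methods it wants to inspect. I also extracted all import statements from those files so the model can see how to import things.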

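To increase diversity in test generation, I also used different levels of context. Of the 5 generation requests, 1 gets the example import statements plus all relevant classes/functions/methods. Another gets only the import statements as context. The remaining 3 are like the original Agentless, with no context. In addition, one sample uses greedy decoding, while the others use temperature/top_p/min_p sampling to maximize diversity.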
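As an illustration, the five requests could be described by a small plan like the one below. The post does not say which request is the greedy one or which sampling values were used, so pairing greedy decoding with the richest context and the specific `temperature`, `top_p`, and `min_p` numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    context: str          # "imports+symbols", "imports", or "none"
    temperature: float    # 0.0 stands for greedy decoding
    top_p: float = 1.0
    min_p: float = 0.0


def build_plan(sampling_temperature: float = 0.8) -> list[GenerationRequest]:
    """Five requests: one with full context, one with imports only, three with none."""
    contexts = ["imports+symbols", "imports", "none", "none", "none"]
    plan = []
    for index, context in enumerate(contexts):
        if index == 0:
            # Assumption: the single greedy sample is the one with the most context.
            plan.append(GenerationRequest(context, temperature=0.0))
        else:
            # Assumed sampling values, chosen only for illustration.
            plan.append(
                GenerationRequest(context, sampling_temperature, top_p=0.95, min_p=0.05)
            )
    return plan
```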

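I also made a few other small changes to the original prompt that help guide a small model when generating tests, for example explicitly asking it to include imports, initialization (if needed), and so on. I guess you don't need to be this explicit with SOTA models like Claude, but I observed that it does help for a 32B model.

Context localization

If at least 1 F2P test reproduces the issue, I move on to generating repair patches. The first stage localizes the relevant files, and the second finds the relevant classes/methods/functions within those files. Finally, the third stage produces the exact locations, such as the line numbers to edit.

The first and second stages use greedy decoding. For fine-grained localization (the third stage), I use sampling to generate 2 different sets of edit locations.

All of these steps are very similar to the original Agentless. I only made small improvements to the prompts and made the heuristics that parse the model output and look up files in the repo more general and more robust to small, silly mistakes.

Regarding runtime, I refactored the code to build the repo structure (a nested dictionary with all files, functions, classes and their methods, start and end lines, etc.) only once at the very beginning, because the original code rebuilt it at every step, costing 5-20 seconds each time.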
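A build-once repo structure can be sketched with `ast` and `functools.lru_cache`. The dictionary layout and the function name here are assumptions for illustration; the real structure in Agentless is richer.

```python
import ast
import os
from functools import lru_cache


def _span(node: ast.AST) -> dict:
    """Name plus first and last line of a function or method."""
    return {"name": node.name, "start": node.lineno, "end": node.end_lineno}


@lru_cache(maxsize=None)
def repo_structure(root: str) -> dict:
    """Parse every .py file under root once; repeated calls reuse the cached dict."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    structure = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    tree = ast.parse(handle.read())
            except (SyntaxError, UnicodeDecodeError):
                continue  # skip files that cannot be parsed
            structure[os.path.relpath(path, root)] = {
                "functions": [_span(n) for n in tree.body if isinstance(n, function_types)],
                "classes": [
                    {
                        **_span(n),
                        "methods": [_span(m) for m in n.body if isinstance(m, function_types)],
                    }
                    for n in tree.body
                    if isinstance(n, ast.ClassDef)
                ],
            }
    return structure
```

Because the cached dictionary is shared between callers, it should be treated as read-only by every later pipeline step.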

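Repair patch generation

For each set of edit locations, I generated 4 repair samples (the first with greedy decoding, the others with temperature sampling), for a total of 8 candidate patches. The prompt I used is below: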

repair_prompt_combine_topn_cot_diff = """
We are currently solving the following issue within our repository. Here is the issue text:
--- BEGIN ISSUE ---
{problem_statement}
--- END ISSUE ---

Below are some code segments, from a file. Here is the issue text:
--- BEGIN FILE ---
```
{content}
```
--- END FILE ---

Please first localize the bug (or bugs) based on the issue statement, and then generate *SEARCH/REPLACE* edits to fix the issue.

Every *SEARCH/REPLACE* edit must use this format:
1. The file path
2. The start of search block: <<<<<<< SEARCH
3. A contiguous chunk of lines to search for in the existing source code
4. The dividing line: =======
5. The lines to replace into the source code
6. The end of the replace block: >>>>>>> REPLACE

Here is an example:

```python
### mathweb/flask/app.py
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
```

Please note that the *SEARCH/REPLACE* edit REQUIRES PROPER INDENTATION. If you would like to add the line '        print(x)', you must fully write that out, with all those spaces before the code!
Wrap the *SEARCH/REPLACE* edit in blocks ```python...```.

Note that some issues (but not all) may require multiple *SEARCH/REPLACE* edits in potentially different files in order to completely fix the issue.
You are expected to provide all the edits needed to fix the issue but ONLY suggest edits that are NECESSARY to fix the issue.
"""

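The output is then post-processed into a Git diff. I tried generating the git diff directly and noticed that in many cases the model failed to produce a valid patch, because it messed up the line counts or made some other silly mistake. After switching to the SEARCH/REPLACE scheme (similar to what VS Code does), the number of valid patches increased substantially.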
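To make the post-processing step concrete, here is a minimal sketch that parses SEARCH/REPLACE edits in the format from the prompt and turns them into a unified diff with the standard `difflib` module. It assumes every edit carries its own `### path` header and that the search text must match exactly once; both are simplifications for illustration, not the behavior of the actual code.

```python
import difflib
import re

EDIT_PATTERN = re.compile(
    r"### (?P<path>[^\n]+)\n"
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n?>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_edits(files: dict[str, str], model_output: str) -> dict[str, str]:
    """Apply each SEARCH/REPLACE edit to an in-memory copy of the files."""
    patched = dict(files)
    for match in EDIT_PATTERN.finditer(model_output):
        path = match["path"].strip()
        original = patched.get(path)
        # Assumption: reject edits whose search text is missing or ambiguous.
        if original is None or original.count(match["search"]) != 1:
            raise ValueError(f"cannot apply edit to {path}")
        patched[path] = original.replace(match["search"], match["replace"], 1)
    return patched


def to_unified_diff(files: dict[str, str], patched: dict[str, str]) -> str:
    """Render the changed files as one unified diff."""
    chunks = []
    for path in sorted(files):
        if files[path] != patched[path]:
            chunks.extend(
                difflib.unified_diff(
                    files[path].splitlines(keepends=True),
                    patched[path].splitlines(keepends=True),
                    fromfile=f"a/{path}",
                    tofile=f"b/{path}",
                )
            )
    return "".join(chunks)
```

The model only has to copy existing lines and write new ones, while the line numbers and hunk headers are computed by the diff tool, which is where direct diff generation tended to go wrong.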

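Evaluating candidate patches

With the 8 candidate patches in hand, I first validate them with a git apply dry run to make sure they are valid. Then I run each F2P test against them and keep only the patches that fix at least one test. Finally, I run the P2P unit tests that the model selected in the initial localization step and compare the results against a reference (the unit test results without the patch applied). If the results are similar to or better than the reference, I submit that sample.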
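The three filters can be sketched as below. `git apply --check` is the standard dry-run flag; everything else is an assumption for illustration: the callables are hypothetical, "similar or better" is interpreted as "no P2P test fails that did not already fail without the patch", and the first patch that passes is returned, since the post does not say how ties are broken.

```python
import subprocess
from typing import Callable, Iterable, Optional, Set


def applies_cleanly(repo_dir: str, patch_path: str) -> bool:
    """Dry run: check that the patch applies without touching the working tree.

    A relative patch_path is resolved against repo_dir.
    """
    result = subprocess.run(
        ["git", "apply", "--check", patch_path], cwd=repo_dir, capture_output=True
    )
    return result.returncode == 0


def select_patch(
    candidates: Iterable[str],
    is_valid: Callable[[str], bool],            # e.g. a wrapper around applies_cleanly
    f2p_fixed_count: Callable[[str], int],      # hypothetical: number of F2P tests now resolved
    p2p_failures: Callable[[str], Set[str]],    # hypothetical: failing P2P tests with the patch
    baseline_failures: Set[str],                # failing P2P tests without any patch
) -> Optional[str]:
    """Return the first candidate that passes all three filters, else None."""
    for patch in candidates:
        if not is_valid(patch):
            continue
        if f2p_fixed_count(patch) < 1:
            continue
        # Subset check: the patch introduces no new P2P failures.
        if p2p_failures(patch) <= baseline_failures:
            return patch
    return None
```

A `None` result means no candidate survived, in which case the sample is skipped rather than risking a wrong submission.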

All tests (F2P and P2P) have timeouts and run in parallel to save time.
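A simple way to get both properties with the standard library is `subprocess.run` with a `timeout` inside a thread pool. The function names, the default timeout, and the worker count below are assumptions; the actual solution may organize this differently.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence


def run_script(path: str, timeout_s: float) -> Optional[str]:
    """Run one standalone test script; return its stdout, or None on timeout."""
    try:
        completed = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run kills the child before raising
    return completed.stdout


def run_all(
    paths: Sequence[str], timeout_s: float = 60.0, workers: int = 4
) -> List[Optional[str]]:
    """Run the scripts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda path: run_script(path, timeout_s), paths))
```

Threads are sufficient here because each worker just waits on a child process, so the total wall time is bounded by roughly the timeout times the number of scripts divided by the number of workers.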

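Time management

I spent a lot of time optimizing the runtime of the solution. In the end, my best submission ran in about 5-7 hours (roughly 3 to 10 minutes per sample). Even so, I added two time guards: if the global runtime exceeds 20 hours, everything remaining is skipped, and if a single sample exceeds 12 minutes, that sample is skipped.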
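The two guards amount to a pair of stopwatches. The class below is a minimal sketch with the 20-hour and 12-minute limits from the text as defaults; the class name and the injectable `clock` parameter are assumptions added to keep the example testable.

```python
import time
from typing import Callable


class TimeBudget:
    """Tracks a global run limit and a per-sample limit, both in seconds."""

    def __init__(
        self,
        global_limit_s: float = 20 * 3600,  # skip everything after 20 hours
        sample_limit_s: float = 12 * 60,    # skip a sample after 12 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._global_limit_s = global_limit_s
        self._sample_limit_s = sample_limit_s
        self._run_start = clock()
        self._sample_start = self._run_start

    def start_sample(self) -> None:
        """Reset the per-sample stopwatch."""
        self._sample_start = self._clock()

    def global_exceeded(self) -> bool:
        return self._clock() - self._run_start > self._global_limit_s

    def sample_exceeded(self) -> bool:
        return self._clock() - self._sample_start > self._sample_limit_s
```

In a pipeline, `global_exceeded()` would be checked before starting each sample and `sample_exceeded()` between the stages of a sample, so that a slow stage ends in a skip instead of a timeout of the whole submission.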

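Other improvements

Above, I tried to list all the most important details of my submission. However, I made countless other minor changes to Agentless to make it run faster and be more robust/better for "small" local models. I won't list them extensively here, since they are too many for a single post and, honestly, I did not properly ablate many of them, so I cannot say with confidence how much they actually affected the solution.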

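Thoughts on score and robustness

I submitted my best solution several times with different seeds or small hyperparameter changes, and it always scored between 0.056 LB and 0.098 LB, so I think it is somewhat robust.

That said, it failed once and I don't know why. It also executes LLM-generated code, which is always a potential risk.

Given how hard the dataset is and how heavily mistakes are penalized, I think luck will end up playing a big role. My solution also has a lot of moving parts, each a potential point of failure (even with try/except around everything). So, to be completely honest, I am not very confident about ending up with a good result, but at least I had a lot of fun 🤣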

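Things I tried that did not work

Agent mode, i.e. generate a test, run it, then feed the errors back to the model to fix it (I only tried this very late, so I still think it could work with more time)
Reasoning models (lots of thinking and "but wait...", but the solutions turned out not to be much better than Qwen Coder)
Local validation
- I spent almost a week trying to set up local validation with SWE-bench Verified, but in the end I only managed to generate a few samples, mostly Django (for some reason the other repos would not export their packages), and that is what I used locally.
- I also only have a single RTX 4090, and 24GB of VRAM did not let me test much locally, since the context was cut down to about 8K tokens. So I gave up on getting a local score and used the local setup mainly for coding and debugging. This is certainly not ideal, but it is what I found feasible with the time left and the hardware available.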

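Final thoughts

I really enjoyed this competition and learned a lot. If there is a second round in the future, I definitely want to take part. I have only two pieces of constructive feedback for the hosts:

Share more "training" samples, or share the exact code used to create new samples. I felt there was a lot of (engineering) friction in building a local validation set. Most Kagglers would benefit from having, say, 25 training samples and 51 public LB samples.
A dataset with more easy samples. I understand the goal is to be "faithful" to what we see in the real world, but from a competition standpoint, too many hard samples make the LB too discrete and greatly increase the randomness of the evaluation.

Good luck, everyone 😸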

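Private leaderboard results disclosure (edit: July 2025)

As @huikang pointed out, the private results for my submission were 9 correct, 2 wrong, and 109 skipped samples. Thinking outside the competition constraints, I believe we could improve the results further by using frontier models such as Claude, Gemini, and GPT, as public benchmarks suggest. However, even with those, and given the negative penalty for wrong predictions, I think we are still quite far from the 90% score milestone! I am really looking forward to seeing how this task evolves.

Finally, thanks again to Kaggle and the host Andy for this great competition!

Code link: https://www.kaggle.com/code/arc144/custom-agentless-fork?scriptVersionId=226880215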
