解决方案总结

我要感谢竞赛主持人和 Kaggle 组织了这次有趣的挑战。
首次在 NLP 竞赛中获得第一名感觉很棒！下面是我的解决方案总结。

1. 数据

我去除了重复项，最终剩下 35,960 个样本。创建了按 "Category" 分层的 5 折交叉验证拆分。

2. 建模

我将问题建模为后缀分类任务。

给定相同的上下文（前缀），模型被训练从一组候选项中预测正确的后缀。

上下文格式如下：

<| ( performance|>user
**Question:** {QuestionText}
**Choices:** {MC_Choices}
**Correct Answer:** {Answer}
**Common Misconceptions:** {MisconceptionCandidates}
**Student Answer:** {MC_Answer}
**Student Explanation:** {StudentExplanation}
<| ( eyes|>
<| ( performance|>assistant

后缀格式化为提交格式。

False_Correct:NA<| ( eyes|>

每个 QuestionId 基于问题的可能误解有 8、10 或 12 个可能的后缀候选项。

提取 [prefix ++ suffix0, prefix ++ suffix1, ...] 的最后一个 token 特征，送入 nn.Linear(hidden_size, 1) 获得 logits，然后计算交叉熵损失。
在实践中，我将输入组织为前缀共享格式：prefix ++ suffix0 ++ suffix1 ++ ...，使用带有 FlexAttention 的自定义注意力掩码。

def custom_mask(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx
    is_prefix = suffix_ids[kv_idx] == -1
    same_suffix = (suffix_ids[q_idx] == suffix_ids[kv_idx])
    same_doc = doc_ids[q_idx] == doc_ids[kv_idx]
    return causal & (same_suffix | is_prefix) & same_doc

3. 训练

在今年早些时候参加 WSDM Cup 竞赛时，我实现了 offload_adam，以便在单个 A100 80G GPU 上高效训练高达 32B 的模型的全参数。因此在这次竞赛中，我可以在单个 A100 80G 或 RTX Pro 6000 Blackwell 上运行所有实验。大多数训练运行使用相同的超参数：epoch=1, batch_size=32, learning_rate=1e-5。

可能是由于标签噪声（主要是 Neither），使用不同种子时验证分数波动很大。我不得不使用不同种子的多次运行集成来获得稳定的验证分数。

我使用 Qwen3-8B 进行了 5 折 x 5 种子运行。3 种子集成似乎比较稳定，所以我使用了最困难的折数和 3 次运行的集成进行进一步实验。这种方法并不新鲜——"Feedback Prize - Predicting Effective Arguments" 的第一名解决方案也使用了 3 种子集成进行验证。

主要验证结果（种子集成 [seed1, seed2, seed3]）

模型	Loss	MAP@3
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	0.2716 [0.2809, 0.2871, 0.2835]	0.9444 [0.9426, 0.9428, 0.9421]
Qwen/Qwen3-8B	0.2677 [0.2813, 0.2777, 0.2756]	0.9455 [0.9433, 0.9450, 0.9437]
zai-org/GLM-Z1-9B-0414	0.2627 [0.2783, 0.2762, 0.2761]	0.9469 [0.9455, 0.9433, 0.9452]
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	0.2621 [0.2782, 0.2698, 0.2738]	0.9464 [0.9421, 0.9443, 0.9450]
Qwen/Qwen3-14B	0.2614 [0.2707, 0.2744, 0.2695]	0.9477 [0.9442, 0.9451, 0.9461]
Qwen/Qwen3-32B	0.2589 [0.2687, 0.2718, 0.2700]	0.9484 [0.9465, 0.9451, 0.9454]
zai-org/GLM-Z1-32B-0414	0.2560 [0.2713, 0.2681, 0.2682]	0.9480 [0.9450, 0.9463, 0.9462]
Qwen/Qwen3-32B+zai-org/GLM-Z1-32B-0414	0.2530	0.9496

实验结果清楚地表明：

我不应该信任单种子验证分数
模型越大，性能越好
信任 Loss 可能比 MAP@3 更好

我还比较了多折集成与多种子集成。多种子集成要好得多。因此，对于最终提交，我在完整数据集上训练了 Qwen/Qwen3-32B 和 zai-org/GLM-Z1-32B-0414，使用了 3 个不同的种子和略微不同的数据格式。

辅助 SFT Loss 实验

在竞赛的最后几天，我尝试使用 Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 为标签生成简短的理由，并使用辅助 SFT loss 在生成的内容上训练模型。

生成理由的示例：

True_Correct:NA
The categorization **True_Correct** with **NA** misconception is justified because:  
- **Answer correctness**: The student\'s answer \\( \\frac{1}{3} \\) is mathematically correct. With 9 total triangles and 6 shaded, 3 are unshaded. Simplifying \\( \\frac{3}{9} \\) yields \\( \\frac{1}{3} \\), matching the required simplest form.  
- **Explanation quality**: The explanation ("There is 9 triangles and 3 aren\'t shaded") correctly identifies the total parts (9) and unshaded parts (3), demonstrating valid reasoning for the fraction \\( \\frac{3}{9} = \\frac{1}{3} \\). It is concise but clear and mathematically sound.  
- **Misconception**: No error exists in the reasoning (e.g., miscounting shaded/unshaded parts or incorrect simplification), so "NA" applies. The explanation omits explicit simplification but implies it by providing the correct simplified answer.

这种方法显示对 deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 有一些改进，但对 Qwen/Qwen3-8B 影响很小。我将其包含在集成中，尽管我估计它的贡献是边际的。无论如何，这是一个有趣的尝试。

4. 推理

推理主要有两个挑战：计算和内存。

计算

对于短序列预填充工作负载，主要瓶颈是 nn.Linear 层中的矩阵乘法。T4 虽然在白皮书中声称 FP16 可达 65TFLOPS，但在实践中只能达到不稳定的 20TFLOPS。

LMDeploy 中的 W8A8 INT8 内核可实现稳定的 40+TFLOPS。为了启用 W8A8 INT8 推理，我使用 SmoothQuant(alpha=0.75) 对模型进行了量化。集成性能在验证中与未量化模型几乎相同。

内存

我使用逐层推理在单个 T4 GPU 上启用 32B 模型推理：

初始化并仅在 GPU 上保留 2 个 Transformer 层
重叠 当前层的执行 与 从磁盘加载下一层的状态

我在每次前向运行中使用 640 个样本（40 个微批次，每批 16 个样本）以保持 GPU 忙碌。

使用 T4×2 完成 16,000 个样本的推理大约需要 65 分钟。实现问题上下文的前缀缓存/共享可以进一步减少推理时间（超过一半的 token 可以被缓存），但我没有实现它。

Kaggle Notebook 环境注意事项

Kaggle notebook 环境的存储给我带来了一些麻烦。

/kaggle/input - 非本地存储，访问非常慢
/tmp/ - 本地写时复制存储，容量有限，文件无法真正删除
/kaggle/working - 常规本地存储，可以删除文件以释放空间

我最初将层检查点存储在 /tmp 中，导致因容量溢出而崩溃。调试后，我只来得及提交 4 个模型的集成（上传了 6 个）。无论如何，结果仍然相当不错。

5. 总结

对我来说，这次竞赛最关键方面是找到获得可靠验证分数的正确方法。单种子验证分数高度不稳定且具有误导性。多种子集成提供了稳定且可信的验证指标，从而实现了有效的模型调整和选择。

提交 Notebook https://www.kaggle.com/code/tascj0/map-submit/notebook 训练代码 https://github.com/tascj/kaggle-map-charting-student-math-misunderstandings

1st Place Solution

第一名解决方案