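9th Place Solution - Kaggle competition writeup

9th Place Solution

Author: Rick (rikitomo0526)

Rank: 9th place

Published: 2025-10-16

Team members: Rick, neilus, wakama1994

Many thanks to the organizers and the Kaggle staff, and congratulations to all the winners.
I learned a lot from this competition, and I am very happy to have earned my first gold medal!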

Overview

Approach
- Inference with 32B causal language models (Causal LM) served by vLLM: 5 folds × 2 models = 10 models in total.
Scores
- Public LB: 0.950, Private LB: 0.948
Models
- Qwen/Qwen2.5-32B-Instruct
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Inference time
- 460 minutes

Dataset

Competition training data
- Split into 5 folds with StratifiedKFold, stratified on QuestionId, Category, and Misconception.
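
The split above can be sketched as follows. This is a minimal illustration, not the team's actual code: the way the three columns are joined into one stratification key, the "NA" fill value, the seed, and the toy data are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def add_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with a 'fold' column in [0, n_splits)."""
    df = df.copy()
    # Assumption: join the three columns into a single stratification key;
    # missing Misconception values are filled with the string "NA".
    key = (
        df["QuestionId"].astype(str)
        + "|" + df["Category"].astype(str)
        + "|" + df["Misconception"].fillna("NA").astype(str)
    )
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    df["fold"] = -1
    fold_col = df.columns.get_loc("fold")
    for fold, (_, valid_idx) in enumerate(skf.split(df, key)):
        # Each row is assigned the fold in which it serves as validation data.
        df.iloc[valid_idx, fold_col] = fold
    return df


# Toy data (made up): two strata with 10 rows each.
toy = pd.DataFrame({
    "QuestionId": [1] * 10 + [2] * 10,
    "Category": ["True_Correct"] * 10 + ["False_Misconception"] * 10,
    "Misconception": [None] * 10 + ["Wrong_fraction"] * 10,
})
folded = add_folds(toy)
```

With rare strata (fewer members than folds), scikit-learn emits a warning, so very small groups may need to be merged before splitting.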

Training

Reference implementation:
https://www.kaggle.com/code/sinchir0/lb-0-942-train-fullft-qwen3-8b-by-sfttrainer
The 65 labels were converted into special tokens.
Fine-tuning used SFTTrainer + AutoModelForCausalLM with QLoRA.
Hyperparameters
- Qwen/Qwen2.5-32B-Instruct
  - epoch=2, lr=1e-4, per_device_batch_size=8, gradient_accumulation_steps=2
  - r=32, lora_alpha=64, lora_dropout=0.01, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"], modules_to_save=["lm_head"], trainable_token_indices={"embed_tokens": target_label_ids}
- DeepSeek-R1-Distill-Qwen-32B
  - epoch=2, lr=8e-5, per_device_batch_size=8, gradient_accumulation_steps=2
  - r=64, lora_alpha=128, lora_dropout=0.01, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"], modules_to_save=["lm_head"], trainable_token_indices={"embed_tokens": target_label_ids}
Environment
- Google Colab (A100 80GB)
Training time
- 220 minutes per fold
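
The two LoRA settings differ only in rank and alpha, which the sketch below makes explicit by building both from one helper. The helper name `lora_kwargs` and the placeholder token ids are hypothetical; the real `target_label_ids` would come from the tokenizer after the 65 label tokens are added.

```python
# Modules adapted by LoRA, as listed in the hyperparameters above.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def lora_kwargs(r: int, lora_alpha: int, target_label_ids: list) -> dict:
    """Build keyword arguments for a LoRA config (hypothetical helper)."""
    return {
        "r": r,
        "lora_alpha": lora_alpha,
        "lora_dropout": 0.01,
        "target_modules": list(TARGET_MODULES),
        "modules_to_save": ["lm_head"],
        # Only the embedding rows of the added label tokens are trained.
        "trainable_token_indices": {"embed_tokens": list(target_label_ids)},
    }


# Placeholder ids for the 65 label tokens (not the real token ids).
target_label_ids = list(range(1000, 1065))

qwen_cfg = lora_kwargs(r=32, lora_alpha=64, target_label_ids=target_label_ids)
deepseek_cfg = lora_kwargs(r=64, lora_alpha=128, target_label_ids=target_label_ids)
```

Assuming a PEFT release that supports `trainable_token_indices`, these dictionaries could be passed on as `LoraConfig(**qwen_cfg)`. Both settings keep alpha at twice the rank.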

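Preprocessing

We corrected known annotation errors and unified a duplicated Misconception label.
In post-processing, the unified label is mapped back to the original label according to the corresponding QuestionId.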

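import numpy as np
import pandas as pd


def wrong_corrections(df: pd.DataFrame) -> pd.DataFrame:
    """Fix known annotation errors in MC_Answer for specific row_ids."""
    # Rows whose answer text should read \( 9 \) instead of \( 6 \).
    false_to_true_ids = [12878, 12901, 13876, 14089, 14159, 14185]
    df["MC_Answer"] = np.where(
        df["row_id"].isin(false_to_true_ids),
        # regex=False: the pattern is literal LaTeX text, not a regular expression.
        df["MC_Answer"].str.replace(r"\( 6 \)", r"\( 9 \)", regex=False),
        df["MC_Answer"],
    )

    # Rows whose answer text should read \( 6 \) instead of \( 9 \).
    true_to_false_ids = [14280, 14305, 14321, 14335, 14338, 14352, 14355, 14403, 14407, 14412, 14413, 14418]
    df["MC_Answer"] = np.where(
        df["row_id"].isin(true_to_false_ids),
        df["MC_Answer"].str.replace(r"\( 9 \)", r"\( 6 \)", regex=False),
        df["MC_Answer"],
    )
    return df


def replace_duplicate_misc(df: pd.DataFrame) -> pd.DataFrame:
    """Unify two spellings of the same Misconception label."""
    df["Misconception"] = df["Misconception"].replace({"Wrong_Fraction": "Wrong_fraction"})
    return df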
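
The post-processing step that maps the unified label back by QuestionId is not shown in the writeup. One way it might look is sketched below; the function name, the lookup table, and the QuestionId values are all hypothetical and only illustrate the idea.

```python
def restore_labels(question_id, predicted_labels, original_by_question):
    """Map unified labels back to the spelling used for a given question.

    original_by_question maps (question_id, unified_label) to the original
    label; predictions without an entry are returned unchanged.
    """
    return [
        original_by_question.get((question_id, label), label)
        for label in predicted_labels
    ]


# Hypothetical lookup: question 1001 originally used the "Wrong_Fraction" spelling.
original_by_question = {(1001, "Wrong_fraction"): "Wrong_Fraction"}

restored = restore_labels(1001, ["Wrong_fraction", "Some_other_label"], original_by_question)
untouched = restore_labels(2002, ["Wrong_fraction"], original_by_question)
```

In practice the lookup table could be built from the training data before the labels are unified, by recording which spelling each QuestionId used.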

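Prompt optimization (to improve accuracy)

Slightly simplifying the prompt gave a small improvement in the CV score.
It also reduced training and inference time.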

prompt_format = """Question: {QuestionText}
Answer: {MC_Answer}
Correct: {Correct}
Student Explanation: {StudentExplanation}
Label: """
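
As a usage illustration, the template can be filled from one row of the data. The example row is made up, and the "Yes" encoding of the Correct field is an assumption, since the writeup does not say how that field is rendered.

```python
prompt_format = """Question: {QuestionText}
Answer: {MC_Answer}
Correct: {Correct}
Student Explanation: {StudentExplanation}
Label: """


def build_prompt(row: dict) -> str:
    """Fill the prompt template from a row with the four expected fields."""
    return prompt_format.format(
        QuestionText=row["QuestionText"],
        MC_Answer=row["MC_Answer"],
        Correct=row["Correct"],
        StudentExplanation=row["StudentExplanation"],
    )


# Made-up example row; "Yes" for Correct is an assumed encoding.
example_row = {
    "QuestionText": "What is 1/2 + 1/4?",
    "MC_Answer": "3/4",
    "Correct": "Yes",
    "StudentExplanation": "I changed 1/2 into 2/4 and added the tops.",
}
prompt = build_prompt(example_row)
```

The prompt ends at "Label: ", so the next token the model predicts is the label's special token.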

Inference

Inference uses vLLM + AWQ.
We retrieve the logits of the added special tokens and average the scores across all models (simple arithmetic mean).
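
The averaging step can be sketched with NumPy as below. It assumes each model has already produced one score per label token for every sample, in the same label order; extracting those scores from vLLM is not shown, and the function name and toy numbers are illustrative only.

```python
import numpy as np


def ensemble_top3(scores_per_model, label_names):
    """Average per-label scores over models and return the top-3 labels per sample.

    scores_per_model: list of arrays, each shaped (n_samples, n_labels).
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in scores_per_model])
    mean_scores = stacked.mean(axis=0)               # simple arithmetic mean
    top3 = np.argsort(-mean_scores, axis=1)[:, :3]   # highest score first
    return [[label_names[j] for j in row] for row in top3]


# Toy scores (made up): 2 models, 1 sample, 4 labels.
labels = ["A", "B", "C", "D"]
model_a = [[2.0, 1.0, 0.0, -1.0]]
model_b = [[0.0, 3.0, 1.0, -1.0]]
predictions = ensemble_top3([model_a, model_b], labels)
```

Here the averaged scores are [1.0, 2.0, 0.5, -1.0], so the ranking is B, A, C. Whether the scores were normalized (for example with a softmax) before averaging is not stated in the writeup.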

Results (MAP@3)

Model / setting	Valid	Public LB	Private LB
Qwen2.5-32B-Instruct (fold0)	0.950182	0.946	0.944
Qwen2.5-32B-Instruct (5-fold)		0.950	0.946
DeepSeek-R1-Distill-Qwen-32B (fold0)	0.947956	0.948	0.945
DeepSeek-R1-Distill-Qwen-32B (5-fold)		0.950	0.946
5-fold × 2 ensemble		0.950	0.948
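
For reference, MAP@3 with one true label per sample gives a prediction credit of 1, 1/2, or 1/3 depending on whether the true label is ranked first, second, or third, and 0 otherwise. A minimal sketch (the function name is ours, not from the competition code):

```python
def map_at_3(y_true, y_pred):
    """Mean average precision at 3 for single-label targets.

    y_true: list of true labels.
    y_pred: list of ranked prediction lists (best first).
    """
    total = 0.0
    for truth, preds in zip(y_true, y_pred):
        for rank, pred in enumerate(preds[:3]):
            if pred == truth:
                total += 1.0 / (rank + 1)
                break  # only the first hit counts
    return total / len(y_true)


score = map_at_3(
    ["a", "b", "c"],
    [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]],
)
```

The three toy samples score 1, 0.5, and 0, so the mean is 0.5.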

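What did not work

Reranker (pairwise / pointwise)
Implemented to reorder the top-1/top-2 candidates, but it did not beat the base results.
70B / 72B models
CV was better than with the 32B models, but the inference time (~150 minutes) was too long for the final pipeline.
Non-Qwen2.5 models
We tried Llama 3.3-70B, gpt-oss-20b, Mistral, Hermes, Gemma 3, Qwen 3, and others.
In our setup, none of them outperformed the Qwen 2.5 models.
Soft labels
We retrained Qwen-32B on logits produced by another Qwen-32B model trained on the same dataset.
There was no improvement on the public LB, but the private LB score increased slightly (+0.001).
LoRA merging
Merging the LoRA adapters trained on the five folds performed worse than a single model.
Training only on [Category: Misconception], excluding True/False samples
Excluding the True/False samples and training only on the Misconception category did not outperform the standard category classification model.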

References

https://huggingface.co/Qwen/Qwen2.5-32B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

https://www.kaggle.com/code/sinchir0/lb-0-942-train-fullft-qwen3-8b-by-sfttrainer
https://huggingface.co/docs/peft/main/en/developer_guides/lora#efficiently-train-tokens-alongside-lora
