第五名解决方案 - Kaggle 竞赛

第五名解决方案

作者: Naoism (Master)
竞赛排名: 第 5 名
发布时间: 2025-10-17

引言

首先，我想向那些提供了精彩公共 Notebook 的作者们表示诚挚的感谢，它们在这次竞赛中提供了巨大的帮助。如果没有你们分享的见解，获得金牌是不可能的。特别是，我发现以下五个 Notebook 尤其有帮助。非常感谢。

[1] Gemma2-9B-it - CV 0.945
[2] DeepSeekMath-7B LB:0.944
[3] [LB 0.942][Train] FullFT Qwen3-8B by SFTTrainer
[4] [LB 0.942][Infer] FullFT Qwen3-8B by SFTTrainer
[5] Some considerations. Fine tuning Gemma2-2b-it.

策略概述：强调多样性

我在这次竞赛中的主要策略是确保训练方法的多样性。

基线模型已经能够正确预测整个数据集的约 94%。此外，由于许多人的分数都 clustered 在基线水平附近，为了进一步提高分数，我认为有必要专注于那一小部分难以预测的数据。具体来说，我假设以下三种类型的数据由于数量少且缺乏变化，容易导致模型分类错误：

学生答案正确，但解释错误的数据 (True_Misconception:XXX)
- 模型倾向于将这些预测为 Correct 或 Neither。
学生答案错误，但解释正确的数据 (False_Correct:NA)
- 模型倾向于错误地分配某些 Misconception 标签。
解释模糊的数据
- 这些很难区分是 Correct / Neither 还是 Neither / Misconception。

由于本次竞赛提供了 Question，答案的正确性可以事先确定。这意味着预测的 True/False 部分可以近乎 100% 准确地回答。因此，问题本质上是一个 37 类分类任务（Correct，Neither 和 35 种 Misconceptions）。

在这些类别中，Correct 和 Neither 可以是任何问题的标签，所以我推测模型更有可能将它们包含在预测候选项中。因此，我得出结论，我不需要 intensively 专注于改进上述第二个和第三个假设（Correct / Neither 之间的区分）。

相反，关键挑战是确保模型将正确的 Misconception 包含在解释模糊数据的前三个候选项中。

对于这种难以分类的数据，单个模型或训练方法可能会导致预测偏向于该模型特定的“怪癖”或“视角”。为了应对这一挑战，我决定采用包含多种方法的模型集成是至关重要的。我不确定这个假设是否正确，但我很欣慰它最终带来了金牌。

集成策略与 CV/LB 相关性

对于我的交叉验证 (CV) 策略，我最初使用了基于 Misconception 标签的 StratifiedKFold。然而，我注意到有些 StudentExplanation 是相同的，这可能导致评估过于乐观。随后我 switched 到 StratifiedGroupKFold，使用基于去除空格和句号后的 StudentExplanation 分组。

一旦单个模型的 Public LB 分数达到 0.947 左右，我观察到 CV 分数与 LB 分数之间的相关性几乎完全消失。模型集成也观察到了类似的现象。我推测这是因为之前确定的难以预测的数据很罕见，而且测试集可能包含训练数据中不存在的表达方式。

基于这个假设，我使用以下策略选择模型：

单个模型：基于 CV 和 LB 分数的组合进行选择。
集成：忽略 CV 分数，仅基于 Public LB 分数的改进做出采用决定。

我决定不使用单个模型，即使它达到了 0.949 这样的高 LB 分数，如果它不能有助于提高集成的 LB 分数。为了防止过拟合 LB 分数，我们最小化了对集成权重的调整。
这有点冒险，但最终得到了回报。

为了确保多样性并纳入我自己可能没想到方法，我直接将两个优秀公共 Notebook 中的模型包含在我的集成中。这也 frees up 我的时间来专注于训练其他模型。对于集成本身，我使用了每个模型预测概率的简单加权平均。再次感谢分享 Notebook 和模型权重的作者。

分类类别数量

训练数据包含 65 个唯一标签。然而，由于答案的 True/False 部分可以近乎完美地确定，问题可以减少为 37 类分类任务。

为了最大化模型多样性，我构建并集成了 65 类和 37 类模型。

数据增强

我进行了数据增强，以解决我假设的第一类困难数据：即“学生答案正确，但解释错误 (True_Misconception:XXX)"的情况。这仅应用于集成中的一个模型。

虽然准确性的提高微乎其微，但我相信它有助于整体的集成多样性。

过程：

对于每个 Question，提取“答案错误且包含特定 Misconception (False_Misconception:XXX)"的数据。
随机采样此提取数据的 10%。
仅重写采样数据的学生答案部分使其正确，并将标签更改为 True_Misconception:XXX。

后处理

通过分析训练数据，我发现每个 QuestionId 可能的 Misconceptions 类型是有限的。为了利用这一领域知识，我应用了以下后处理步骤：

从训练数据中，创建对应每个 QuestionId 的可能 Misconceptions 集合。
在推理期间，对于给定的 QuestionId，强制将在训练数据中未出现在该问题中的任何 Misconceptions 的预测概率设置为零。

创建 Misconception 映射的代码：

def create_question_misconception_map(self, train: pd.DataFrame) -> Dict[int, set]:
    question_misconception_map = {}
    
    train["Category_del_prefix"] = train.Category.apply(
        lambda x: x.split("_", 1)[-1] if "_" in x else x
    )
    train["target"] = train.Category_del_prefix + ":" + train.Misconception
    
    unique_question_ids = train["QuestionId"].unique()
    for question_id in unique_question_ids:
        question_data = train[train["QuestionId"] == question_id]
        possible_targets = set(question_data["target"].unique())
        question_misconception_map[question_id] = possible_targets
        
    return question_misconception_map

使用的模型列表

#	模型	类别数	训练方法	数据增强	CV (*1)	Public LB	Private LB	权重	备注
1	Gemma2-9b-it	37	LoRA	-	0.946	0.946	0.943	0.25	-
2	Qwen2.5-14B-Instruct	37	QLoRA	☑️	0.945	0.946	0.943	0.25	-
3	Gemma2-9b-it	65	QLoRA	-	-	0.947	0.942	0.25	-
4	Qwen2.5-14B-Instruct	37	QLoRA	-	0.938	-	-	0.1	-
5	Qwen3-8B	65	Full FT	-	-	0.942	0.938	0.1	公共 Notebook [4]
6	Deepseek-math-7b-instruct	65	QLoRA	-	-	0.944	0.940	0.25	公共 Notebook [2]

(*1) 模型 #1 使用 StratifiedKFold 评估；所有其他模型使用 StratifiedGroupKFold 评估。

使用的提示词 (Prompts)

我为每个模型使用了不同的提示词。（变量省略。）

模型 # 1

<start_of_turn>user
You are an expert math educator analyzing student responses for mathematical misconceptions.

### TASK
Your goal is to analyze the student's explanation for a given math problem and classify it into a specific category of understanding or misconception.

### INPUT
- Question:{row['QuestionText']}
- Answer Choices:{choices}
- Student's Answer:{row['MC_Answer']}
- Correct?:{correctness}
- Student's Explanation:{row['StudentExplanation']}
- Classification:<end_of_turn>
<start_of_turn>model

模型 #2

<|im_start|>user
Question: {row["QuestionText"]}
Answer: {row["MC_Answer"]}
{correctness}
Student Explanation: {row["StudentExplanation"]}<|im_end|>
<|im_start|>assistant

模型 #3

<start_of_turn>user
You are an expert math educator analyzing student responses for mathematical misconceptions.

Question: {question}
Correct Answer: {answer}
Student's Explanation: {explanation}

CLASSIFICATION GUIDELINES:
• True_Correct:NA = Student demonstrates correct understanding
• False_Correct:NA = Student gives correct answer but for wrong reasons
• True_Neither:NA = Correct answer but unclear/incomplete reasoning
• False_Neither:NA = Incorrect answer but no specific misconception identified
• True_Misconception:[Type] = Correct answer but demonstrates specific misconception
• False_Misconception:[Type] = Incorrect answer with identifiable misconception

TASK: Classify this student's response using EXACTLY ONE of these {len(all_labels)} labels:

{labels_text}

Classification:<end_of_turn>
<start_of_turn>model

模型 #4

<|im_start|>user
Question: {row["QuestionText"]}
Ground Truth: {ground_truth}
Student Explanation: {row["StudentExplanation"]}<|im_end|>
<|im_start|>assistant

5th place solution