4th Place Solution (single model: private LB 0.948, public LB 0.951)

Author: THLUO (Grandmaster)
Published: 2025-10-17
Competition rank: 4th place

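First of all, I would like to sincerely thank the organizers and Kaggle for hosting such a great competition. I learned a great deal over the course of it. Below I share my solution and how the score improved step by step.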

1. Data Preprocessing

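First, I split the data into 5 folds with StratifiedKFold. As the validation set, I then picked a fold whose cross-validation score was slightly below the average, so as to avoid validating on a fold that scored unusually high or low.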
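As pointed out in a competition discussion thread, the training data contains 12 wrong category labels: 12 rows with QuestionId 31778 have MC_Answer 9 and Category True, although the correct answer should be 6. I therefore applied the fix described in that thread during both training and inference.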
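
The fold choice described at the start of this section can be sketched as follows. This is only an illustration: the function name `pick_validation_fold` and the example scores are hypothetical, and "slightly below the average" is interpreted here as "the fold closest to the mean among those not above it", which the post does not spell out.

```python
from statistics import mean


def pick_validation_fold(fold_scores):
    """Return the index of the fold whose score is closest to, but not above, the mean."""
    avg = mean(fold_scores)
    # The minimum score is never above the mean, so this list is never empty.
    below = [(avg - score, idx) for idx, score in enumerate(fold_scores) if score <= avg]
    return min(below)[1]


# Hypothetical per-fold CV scores, for illustration only (not from the post).
example_scores = [0.948, 0.945, 0.951, 0.947, 0.950]
validation_fold = pick_validation_fold(example_scores)
```

Picking a fold near the mean keeps the single-fold validation score roughly representative of the full 5-fold estimate.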

2. Modeling

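Because the training set and the online test set share the same 15 questions, whether the selected answer is correct does not need to be predicted. The task therefore reduces to predicting only the type of the explanation (Correct, Misconception, or Neither in Category), which cuts the number of labels to predict from 65 to 37.
Counting over the provided train.csv shows that each question has only 4 to 6 candidate targets, which shrinks the target space further.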

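# `train` is the train.csv DataFrame with a `target` column; collect every distinct target seen per QuestionId.
train['candidate_targets'] = train['QuestionId'].map(train.groupby("QuestionId")['target'].apply(set).apply(list))
# Negative candidates: the question's candidate targets minus the row's own target.
train['neg_candidate_targets'] = train.apply(lambda x: [y for y in x["candidate_targets"] if x["target"] != y], axis=1)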
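
The label-space reduction from 65 to 37 can be pictured as dropping the answer-correctness prefix from each label. The sketch below assumes the full labels look like `True_Correct:NA` or `False_Misconception:WNB`; that format, and the helper names `collapse_label` and `restore_label`, are assumptions made for illustration rather than something the post states.

```python
def collapse_label(full_label):
    """Drop the True_/False_ prefix and the empty ':NA' misconception suffix."""
    prefix, _, rest = full_label.partition("_")
    if prefix not in ("True", "False"):
        raise ValueError(f"unexpected label format: {full_label!r}")
    if rest.endswith(":NA"):
        rest = rest[: -len(":NA")]
    return rest


def restore_label(collapsed, answer_is_correct):
    """Re-attach the prefix, which is known once the question and answer are known."""
    prefix = "True" if answer_is_correct else "False"
    suffix = "" if ":" in collapsed else ":NA"
    return f"{prefix}_{collapsed}{suffix}"
```

Because the correctness of each answer is fixed per question, the prefix carries no information the model has to learn and can be re-attached after prediction.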

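The online test set may behave differently, for example a question may come with misconception labels not seen in training, but I observed that samples labeled "Correct" and "Neither" make up the majority. I therefore had the model focus mainly on the patterns observed in the currently available samples.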

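My ranking model is trained with a pointwise approach: it sees one label at a time in its context window, as discussed in the 1st place solution of "Eedi - Mining Misconceptions in Mathematics". I configured training so that each step processes exactly one row_id. Since each row_id has 4 to 6 distinct candidate labels and all of them must be in the same batch, the effective batch size is 4 to 6.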
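
A minimal sketch of how one row_id could expand into such a batch is shown below. The question wording for "Correct" and for misconceptions follows the prompts shown later in this post; the wording for "Neither", the dictionary layout, and the function name `build_pointwise_batch` are assumptions for illustration.

```python
def build_pointwise_batch(row, candidate_targets):
    """Expand one row into one Yes/No sample per candidate target."""
    batch = []
    for candidate in candidate_targets:
        if candidate.startswith("Misconception:"):
            name = candidate.split(":", 1)[1]
            claim = f"exhibited the misconception of {name}"
        else:
            # Covers "Correct"; the same wording is assumed for "Neither".
            claim = f"is {candidate}"
        question = (
            f"Now we judge the student's explanation {claim}, "
            "Does this diagnosis is correct? (Yes/No)"
        )
        batch.append({
            "row_id": row["row_id"],
            "candidate": candidate,
            "question": question,
            "label": "Yes" if candidate == row["target"] else "No",
        })
    return batch


# Toy row with four candidate targets: one positive and three negatives.
example_row = {"row_id": 0, "target": "Correct"}
example_candidates = ["Correct", "Neither", "Misconception:Incomplete", "Misconception:WNB"]
example_batch = build_pointwise_batch(example_row, example_candidates)
```

Each batch then holds exactly one "Yes" sample and 3 to 5 "No" samples for the same student explanation.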

3. Backbone Models

The models I experimented with are mainly from the Qwen family. Their results, based on multiple folds, rank as follows:

Qwen3-32B >= Qwen3-Reranker-8B >= QWQ-32B >= Qwen3-14B >= Qwen2.5-Math-7B-Instruct > Qwen2.5-14B-Instruct > Qwen3-8B > Qwen2.5-7B-Instruct

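Since Qwen3-Reranker-8B and Qwen2.5-Math-7B-Instruct both performed well, I chose Qwen3-Reranker-8B for iterative optimization and used the larger Qwen2.5-Math-72B-Instruct as the teacher model for knowledge distillation. I did not try many other large models as teachers because my GPU resources were limited.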
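
The post does not describe the distillation loss, so the following is only a generic sketch of one common formulation for a Yes/No ranker: mix the cross-entropy against the hard label with the cross-entropy against the teacher's soft "Yes" probability. The names `soft_cross_entropy`, `distill_loss` and the weight `alpha` are hypothetical.

```python
import math


def soft_cross_entropy(pred, target, eps=1e-7):
    """Binary cross-entropy of a predicted probability against a hard or soft target."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def distill_loss(student_p, teacher_p, hard_label, alpha=0.5):
    """Blend the hard-label loss with the loss against the teacher's probability."""
    hard_term = soft_cross_entropy(student_p, hard_label)
    soft_term = soft_cross_entropy(student_p, teacher_p)
    return alpha * hard_term + (1.0 - alpha) * soft_term
```

With `alpha=1.0` this reduces to ordinary supervised training; lower values pull the student toward the teacher's graded confidence.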

4. What Worked: The Optimization Process

1. Prompt_v1 example (base prompt)

<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation is Correct, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant

Prompt_v1 + Qwen3-Reranker-8B, 1 epoch: CV 0.9473, public LB 0.944, private LB 0.942
Prompt_v1 + Qwen3-Reranker-8B + distillation, 1 epoch: CV 0.9480, public LB 0.946, private LB 0.945
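
At inference, a pointwise Yes/No ranker of this kind is typically scored by comparing the logits of the "Yes" and "No" tokens for each candidate prompt and sorting the candidates. The sketch below assumes those two logits are already available per candidate; the function names are hypothetical and the post does not show its scoring code.

```python
import math


def yes_probability(yes_logit, no_logit):
    """Softmax over the two answer logits, returning P(Yes)."""
    top = max(yes_logit, no_logit)  # subtract the max for numerical stability
    exp_yes = math.exp(yes_logit - top)
    exp_no = math.exp(no_logit - top)
    return exp_yes / (exp_yes + exp_no)


def rank_candidates(logits_by_candidate):
    """Sort candidate targets from most to least likely to be the right diagnosis."""
    scores = {cand: yes_probability(y, n) for cand, (y, n) in logits_by_candidate.items()}
    return sorted(scores, key=scores.get, reverse=True)
```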

2. Prompt_v2 example (base prompt + few-shot examples per target)

<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Correct's Explanation Samples:
- Because there are 9 triangles and 3 arent shaded it is 3/9 and they are both multiples of 3 so it's 1/3
- The answer is 3/9 which can be simplified to 1/3
- I think this because there are 9 triangles and 3 are not shaded and 3/9 is 1/3.
- there are 9 triangles shaded and 3 are not shaded=3/9=1/3
- I think this because the answer is three over nine then they can be simplified to one over three.
- There are 9 triangles and 3 is not shaded. 3/9 is a possibility but the question says to put it in its simplest form. 9 divided by 3 is 3 and 3 divided by 3 is one. You cannot simplify it anymore so the answer is 1/3.
--
Neither's Explanation Samples:
- i counted the in-shaded parts then the shaded part then i made them a fraction.
- there are 9 triangles in total and 3 are not shaded. this means that 3 times 9 is 3 and 1 is 9.
- I think this because it is hard of the number
- 1/3. 3/9 shaded, and to find simplest from we divide the top and the bottom by 3.
- Because I counted them and seen the awnser
- I think this is because the fraction of the shape shaded is 3/9 but simplified it is 1/3.
--
Misconception:Incomplete's Explanation Samples:
- there are 9 boxes altogether and 3 of them are not shaded.
- there are 9 triangles andd 3 are not shaded
- Because there is nine triangular spaces and three are unshaded
- There are 9 squares but as you can see theree are 3 not shaded.
- I think this because there are nine triangles and thee are not shaded so it's three over nine.
- i think this because 3 out of nine triangles are not shaded
--
Misconception:WNB's Explanation Samples:
- 6 is the total and 3 is blank
- because 3 arr white and 6 are blue.
- 6 triangles are shaded so 3 aren't shaded
- 3 of the triangles are not shaded and 6 are shaded so therefore it is 3/6
- i counted the white ones and the blue ones.
- because 3 are shaded and 6 are not
--
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation exhibited the misconception of WNB, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant
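
The per-target sample sections in the prompt above could be assembled roughly as follows, using the counts reported in this post (1 to 3 random samples per target during training, 6 at inference). Excluding the explanation currently being judged is an assumption added here to avoid leakage, and the function name `build_few_shot_block` is hypothetical.

```python
import random


def build_few_shot_block(examples_by_target, current_explanation, training, rng=random):
    """Format sampled explanations per target in the layout used by prompt_v2."""
    sections = []
    for target, pool in examples_by_target.items():
        # Assumption: never show the explanation that is being judged.
        pool = [text for text in pool if text != current_explanation]
        wanted = rng.randint(1, 3) if training else 6
        picked = rng.sample(pool, min(wanted, len(pool)))
        lines = [f"{target}'s Explanation Samples:"]
        lines += [f"- {text}" for text in picked]
        lines.append("--")
        sections.append("\n".join(lines))
    return "\n".join(sections)
```

Varying the number of samples during training presumably makes the model less sensitive to how many examples it sees at inference.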

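Prompt_v2 + Qwen3-Reranker-8B + distillation, 1 epoch: CV 0.9493, public LB 0.948, private LB 0.946. As for the number of few-shot examples in prompt_v2, I randomly sampled 1 to 3 examples per target during training and used 6 examples per target at online inference.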
Prompt_v2 + Qwen3-Reranker-8B + distillation + full data, 1 epoch: public LB 0.950, private LB 0.947.
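Prompt_v2 + Qwen3-Reranker-8B + distillation + full data, 2 epochs: public LB 0.951, private LB 0.948 (best single model). There is an interesting side story here: the model trained for 2 epochs on the full data ran into trouble at online inference. Because vllm does not support bfloat16 on T4 GPUs, I converted the model to float16, which overflowed and produced a large number of NaN values. The deadline was close by then, leaving little time to adjust. As a workaround, I took the 2-epoch model from the local validation stage as the starting point, added the validation data to its training set, and trained for 2 more epochs to approximate 2 epochs of training on the full data.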
Prompt_v1 + Qwen3-32B + distillation, 1 epoch: CV 0.9502; after training on the full data, public LB 0.949, private LB 0.946.
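I ensembled my best single model (Prompt_v2 + Qwen3-Reranker-8B, 2 epochs on the full data) with the Qwen3-32B model. The ensemble reached public LB 0.952 and private LB 0.948, and it was one of my final selected submissions. However, the Qwen3-32B model scored relatively low on the private leaderboard despite its high CV and public LB scores, so the ensemble did not end up beating my best single model.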
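
The float16 overflow mentioned above comes down to range: bfloat16 keeps float32's 8-bit exponent, while the largest finite float16 value is 65504. The standard-library sketch below imitates a half-precision cast to show how out-of-range values become inf and then NaN; it illustrates the failure mode only and is not the author's debugging code. The helper name `to_float16` is hypothetical.

```python
import math
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE half precision, saturating to inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except (OverflowError, struct.error):
        # Python refuses to pack out-of-range values; a tensor cast would yield inf.
        return math.copysign(math.inf, value)


largest_ok = to_float16(65504.0)   # largest finite float16 value survives
overflowed = to_float16(70000.0)   # out of range, becomes inf
# Arithmetic between infinities (as in softmax or normalization) produces NaN.
poisoned = overflowed - to_float16(80000.0)
```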

3. Conclusion

Model	Task	CV	Public LB	Private LB	Full data	Inference time
prompt_v1 + Qwen3-Reranker-8B + 1 epoch	CausalLM	0.9473	0.94419	0.94270		40 min
prompt_v1 + Qwen3-Reranker-8B + distillation + 1 epoch	CausalLM	0.9480	0.94674	0.94502		40 min
prompt_v2 + Qwen3-Reranker-8B + distillation + 1 epoch	CausalLM	0.9493	0.94875	0.94617		4 h
prompt_v2 + Qwen3-Reranker-8B + distillation + 1 epoch	CausalLM	0.9493	0.95014	0.94737	✅	4 h
prompt_v2 + Qwen3-Reranker-8B + distillation + 2 epochs	CausalLM	0.9499	0.95120	0.94873	✅	4 h
prompt_v1 + Qwen3-32B-AWQ + distillation + 1 epoch	CausalLM	0.9502	0.94955	0.94655	✅	2.5 h
Final selected model (best Qwen3-Reranker-8B + Qwen3-32B-AWQ)	CausalLM	0.9518	0.95211	0.94835	✅	6.5 h
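
The post does not say how the two models in the final row were combined. One simple possibility, shown here purely as an assumption, is a weighted average of each model's per-candidate "Yes" probability followed by re-ranking; `ensemble_rank`, `weight_a` and `top_k` are hypothetical names and defaults.

```python
def ensemble_rank(probs_a, probs_b, weight_a=0.5, top_k=3):
    """Blend two models' per-candidate probabilities and return the top candidates."""
    blended = {
        cand: weight_a * probs_a[cand] + (1.0 - weight_a) * probs_b[cand]
        for cand in probs_a
    }
    return sorted(blended, key=blended.get, reverse=True)[:top_k]


# Toy probabilities for one row, for illustration only (not from the post).
reranker_probs = {"Correct": 0.7, "Neither": 0.2, "Misconception:WNB": 0.1}
qwen32b_probs = {"Correct": 0.4, "Neither": 0.5, "Misconception:WNB": 0.1}
final_ranking = ensemble_rank(reranker_probs, qwen32b_probs)
```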

5. What Did Not Work for Me

listwise ranker
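Chain of Thought (CoT): this is where I spent the most time, but in the end it brought no improvement.
Adding the most relevant explanations to the prompt: I fine-tuned a retrieval model (Qwen3-Embedding-8B) to find the N explanations most relevant to the current student explanation and included them in the prompt. This reached a decent CV score of 0.9493, but scored worse on both the public and private leaderboards.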

<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Top relevant explanations with their tags:
- Because there are 9 triangles and 3 of them were not shaded (Correct)
- because there is 9 triangles and 3 of them were not shaded (Correct)
- there aree 9 triangles in total and 3 of them are not shaded. this gives you 3 / 9. (Correct)
--
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation is Correct, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant
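
The retrieval step behind the "Top relevant explanations" section can be sketched as a cosine-similarity search over explanation embeddings. The sketch assumes the embeddings have already been computed (for example by the fine-tuned embedding model) and are passed in as plain lists of floats; the function names and the tuple layout of `corpus` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_relevant(query_vec, corpus, n=3):
    """corpus holds (explanation, tag, embedding) tuples; returns formatted prompt lines."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [f"- {text} ({tag})" for text, tag, _ in ranked[:n]]
```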

6. What I Did Not Try

Generating more explanations for data augmentation

7. References

kaggle_eedi: https://www.kaggle.com/competitions/eedi-mining-misconceptions-in-mathematics/overview
offline-install-vllm-0.10.0: https://www.kaggle.com/code/hiranorm/offline-install-vllm-0-10-0-i-qwenemdding-llama
Qwen: https://huggingface.co/Qwen
vllm: https://docs.vllm.ai/en/latest/
