4th Place Solution (single model pb 0.948 lb 0.951)

673. MAP - Charting Student Math Misunderstandings | map-charting-student-math-misunderstandings

Start: 2025-07-10, End: 2025-10-15, Personalized learning, Data algorithm competition

Author: THLUO (Grandmaster)
Published: 2025-10-17
Competition rank: 4th place

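First, I would like to sincerely thank the organizers and Kaggle for hosting such a great competition. I learned a lot throughout it. Below I share my solution and how the score improved step by step.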

1. Data Preprocessing

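  • First, I used StratifiedKFold to split the data into 5 folds. I then picked a fold whose cross-validation score was slightly below the average as the validation set, to avoid validating on a fold that scores unusually high or low.
  • As noted in a discussion thread, the training data contains 12 incorrect category labels: 12 rows with QuestionId 31778 have MC_Answer 9 and Category True, but the correct answer should be 6. I therefore applied the same handling described in that thread during both training and inference.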
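
The fold choice can be sketched as follows. This is only an illustration in plain Python: `stratified_folds` is a simplified, unshuffled stand-in for scikit-learn's StratifiedKFold, the per-fold scores are made-up placeholders, and "closest to the mean from below" is one possible reading of "slightly below the average".

```python
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, n_splits=5):
    """Assign a fold id to every row so each class is spread evenly over folds.

    Simplified stand-in for StratifiedKFold (no shuffling).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [0] * len(labels)
    offset = 0  # keeps overall fold sizes balanced across classes
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[idx] = (pos + offset) % n_splits
        offset += len(by_class[label])
    return folds


def pick_validation_fold(fold_scores):
    """Return the index of the fold scoring just below the mean CV score."""
    avg = mean(fold_scores)
    below = [(avg - score, i) for i, score in enumerate(fold_scores) if score < avg]
    if not below:  # every fold has the same score
        return 0
    return min(below)[1]


# Hypothetical per-fold CV scores, for illustration only.
fold_scores = [0.9461, 0.9480, 0.9452, 0.9491, 0.9470]
validation_fold = pick_validation_fold(fold_scores)
```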

2. Modeling

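  • Because the training set and the online test set share the same 15 questions, whether the selected answer is correct does not need to be predicted. The task therefore reduces to predicting only the explanation type (the Correct, Misconception, or Neither part of Category), which cuts the number of labels to predict from 65 to 37.
  • From the provided train.csv, one can count that each question has 4 to 6 candidate targets, which shrinks the target space further.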
train['candidate_targets'] = train['QuestionId'].map(train.groupby("QuestionId")['target'].apply(set).apply(list))
train['neg_candidate_targets'] = train.apply(lambda x : [y for y in x["candidate_targets"] if x["target"] != y], axis=1)

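Although the online test set could behave differently (for example, new misconception labels could appear for a question), I observed that samples labeled "Correct" and "Neither" make up the majority. I therefore chose to have the model focus mainly on the patterns observed in the currently available samples.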
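
A minimal sketch of the label-space reduction, in plain Python. The target string format (`True_Correct`, `False_Misconception:WNB`) and the helper names are assumptions made for illustration; the idea is only that the True/False prefix can be dropped for training and re-attached at inference from a lookup of each question's correct answer.

```python
def strip_correctness(target):
    """Drop the True_/False_ prefix: 'True_Misconception:Incomplete' -> 'Misconception:Incomplete'."""
    prefix, _, rest = target.partition("_")
    return rest if prefix in ("True", "False") else target


def correct_answer_lookup(rows):
    """Map QuestionId -> the MC_Answer that appears with a 'True_' target in train."""
    lookup = {}
    for row in rows:
        if row["target"].startswith("True_"):
            lookup[row["QuestionId"]] = row["MC_Answer"]
    return lookup


def restore_prefix(question_id, mc_answer, reduced_label, lookup):
    """Re-attach True_/False_ without any model prediction."""
    prefix = "True" if lookup.get(question_id) == mc_answer else "False"
    return f"{prefix}_{reduced_label}"


# Toy rows (hypothetical values, not taken from train.csv).
rows = [
    {"QuestionId": 1, "MC_Answer": "1/3", "target": "True_Correct"},
    {"QuestionId": 1, "MC_Answer": "3/6", "target": "False_Misconception:WNB"},
    {"QuestionId": 1, "MC_Answer": "1/3", "target": "True_Neither"},
]
lookup = correct_answer_lookup(rows)
reduced_labels = sorted({strip_correctness(r["target"]) for r in rows})
```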

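  • My ranking model is trained pointwise: it sees one label at a time in the context window, as discussed in the 1st place solution of "Eedi - Mining Misconceptions in Mathematics". I configured training so that each step processes exactly one row_id. Since each row_id has 4 to 6 candidate labels and all of them must be in the same batch, the effective batch size is 4 to 6.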
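
The pointwise setup can be sketched like this: one query per candidate label, each scored by how strongly the model prefers "Yes" over "No", then sorted. The helper names, the wording used for "Neither", the toy logits, and `top_k=3` are assumptions; only the "Correct" and "misconception of ..." phrasings appear in the prompts in this write-up.

```python
import math


def diagnosis_statement(label):
    """Turn a candidate label into the clause inserted into the final question."""
    kind, _, name = label.partition(":")
    if kind == "Misconception":
        return f"exhibited the misconception of {name}"
    return f"is {kind}"  # 'is Neither' is an assumed wording


def build_queries(context, candidates):
    """One prompt per candidate label; together they form one training batch."""
    tail = "Does this diagnosis is correct? (Yes/No)"  # kept verbatim from the prompt
    return [
        f"{context}\n\nNow we judge the student's explanation "
        f"{diagnosis_statement(label)}, {tail}"
        for label in candidates
    ]


def yes_probability(yes_logit, no_logit):
    """Softmax over the two answer tokens, computed stably."""
    top = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - top)
    e_no = math.exp(no_logit - top)
    return e_yes / (e_yes + e_no)


def rank_candidates(candidates, logits, top_k=3):
    """Sort candidate labels by P(Yes) and keep the best top_k."""
    scored = [(yes_probability(y, n), label) for label, (y, n) in zip(candidates, logits)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:top_k]]


candidates = ["Correct", "Neither", "Misconception:Incomplete", "Misconception:WNB"]
queries = build_queries("Question: ...", candidates)
# Hypothetical (yes, no) logits for the four queries above.
toy_logits = [(2.0, -1.0), (0.5, 0.0), (-1.0, 1.0), (-3.0, 2.0)]
ranking = rank_candidates(candidates, toy_logits)
```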

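3. Backbone Models

The models I experimented with were mainly from the Qwen series. The results for each model are summarized below (based on multiple folds):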

Qwen3-32B >= Qwen3-Reranker-8B >= QWQ-32B >= Qwen3-14B >= Qwen2.5-Math-7B-Instruct > Qwen2.5-14B-Instruct > Qwen3-8B > Qwen2.5-7B-Instruct

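Because Qwen3-Reranker-8B and Qwen2.5-Math-7B-Instruct performed well, I decided to iterate on Qwen3-Reranker-8B and to use the larger Qwen2.5-Math-72B-Instruct as the teacher model for knowledge distillation. As for the choice of teacher, my GPU resources were limited, so I did not experiment with many other large models.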
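
The write-up does not state the distillation loss, so the following is only one common formulation, shown under assumptions: the student's P(Yes) is pulled toward both the hard label and the teacher's P(Yes), blended by a hypothetical weight `alpha`.

```python
import math


def bce(p, target, eps=1e-7):
    """Binary cross-entropy of a predicted probability against a (soft) target."""
    p = min(max(p, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def distillation_loss(student_p_yes, teacher_p_yes, hard_label, alpha=0.5):
    """Blend the hard-label loss with a soft-label loss from the teacher.

    alpha is an assumed hyperparameter, not a value from the write-up.
    """
    hard = bce(student_p_yes, hard_label)
    soft = bce(student_p_yes, teacher_p_yes)
    return alpha * hard + (1 - alpha) * soft


# A student close to the teacher and the label gets a lower loss than one far away.
close = distillation_loss(student_p_yes=0.9, teacher_p_yes=0.95, hard_label=1.0)
far = distillation_loss(student_p_yes=0.2, teacher_p_yes=0.95, hard_label=1.0)
```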

4. What Worked: Optimization Steps

1. Prompt_v1 example (base prompt)

<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation is Correct, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant

  • Prompt_v1 + Qwen3-Reranker-8B, 1 epoch: CV 0.9473, public LB 0.944, private LB 0.942
  • Prompt_v1 + Qwen3-Reranker-8B + distillation, 1 epoch: CV 0.9480, public LB 0.946, private LB 0.945

2. Prompt_v2 example (base prompt + few-shot examples per target)

<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Correct's Explanation Samples:
- Because there are 9 triangles and 3 arent shaded it is 3/9 and they are both multiples of 3 so it's 1/3
- The answer is 3/9 which can be simplified to 1/3
- I think this because there are 9 triangles and 3 are not shaded and 3/9 is 1/3.
- there are 9 triangles shaded and 3 are not shaded=3/9=1/3
- I think this because the answer is three over nine then they can be simplified to one over three.
- There are 9 triangles and 3 is not shaded. 3/9 is a possibility but the question says to put it in its simplest form. 9 divided by 3 is 3 and 3 divided by 3 is one. You cannot simplify it anymore so the answer is 1/3.
--
Neither's Explanation Samples:
- i counted the in-shaded parts then the shaded part then i made them a fraction.
- there are 9 triangles in total and 3 are not shaded. this means that 3 times 9 is 3 and 1 is 9.
- I think this because it is hard of the number
- 1/3. 3/9 shaded, and to find simplest from we divide the top and the bottom by 3.
- Because I counted them and seen the awnser
- I think this is because the fraction of the shape shaded is 3/9 but simplified it is 1/3.
--
Misconception:Incomplete's Explanation Samples:
- there are 9 boxes altogether and 3 of them are not shaded.
- there are 9 triangles andd 3 are not shaded
- Because there is nine triangular spaces and three are unshaded
- There are 9 squares but as you can see theree are 3 not shaded.
- I think this because there are nine triangles and thee are not shaded so it's three over nine.
- i think this because 3 out of nine triangles are not shaded
--
Misconception:WNB's Explanation Samples:
- 6 is the total and 3 is blank
- because 3 arr white and 6 are blue.
- 6 triangles are shaded so 3 aren't shaded
- 3 of the triangles are not shaded and 6 are shaded so therefore it is 3/6
- i counted the white ones and the blue ones.
- because 3 are shaded and 6 are not
--
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation exhibited the misconception of WNB, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant
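
  1. Prompt_v2 + Qwen3-Reranker-8B + distillation, 1 epoch: CV 0.9493, public LB 0.948, private LB 0.946. For the number of few-shot examples in prompt_v2, I randomly sampled 1 to 3 examples per target during training and used 6 examples per target during online inference.
  2. Prompt_v2 + Qwen3-Reranker-8B + distillation + full data, 1 epoch: public LB 0.950, private LB 0.947.
  3. Prompt_v2 + Qwen3-Reranker-8B + distillation + full data, 2 epochs: public LB 0.951, private LB 0.948 (best single model). An interesting side note: the model trained for 2 epochs on the full data ran into a problem during online inference. Because the T4 GPU does not support bfloat16 in vLLM, converting the model to float16 caused overflow and produced many NaN values. The deadline was close and there was little time left to adjust. As a workaround, I took the 2-epoch model from the local validation stage as the backbone, added the validation data, and trained for 2 more epochs to approximate 2 epochs of training on the full data.
  4. Prompt_v1 + Qwen3-32B + distillation, 1 epoch: CV 0.9502; after training on the full data, public LB 0.949, private LB 0.946.
  5. I ensembled model 3 (my best single model) and model 4. The ensemble achieved public LB 0.952 and private LB 0.948, and it was one of my final selected submissions. However, model 4 scored relatively low on the private leaderboard despite its high CV and public LB scores, so the ensemble did not beat my best single model.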
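
The few-shot sampling used for prompt_v2 (1 to 3 examples per target in training, 6 at inference) could look roughly like the sketch below. The function names and the `exclude` argument, which keeps the current student's own explanation out of its few-shot block, are assumptions added for illustration.

```python
import random


def sample_few_shots(pool, k_range, rng, exclude=None):
    """pool maps target -> list of explanations; returns target -> sampled examples."""
    shots = {}
    for target, explanations in pool.items():
        available = [text for text in explanations if text != exclude]
        k = min(rng.randint(*k_range), len(available))  # randint is inclusive
        shots[target] = rng.sample(available, k)
    return shots


def render_few_shots(shots):
    """Format the sampled examples in the layout of the prompt_v2 example."""
    lines = []
    for target, examples in shots.items():
        lines.append(f"{target}'s Explanation Samples:")
        lines.extend(f"- {text}" for text in examples)
        lines.append("--")
    return "\n".join(lines)


# Toy pool with placeholder explanations.
pool = {
    "Correct": [f"x{i}" for i in range(7)],
    "Misconception:WNB": ["y0", "y1"],
}
train_shots = sample_few_shots(pool, (1, 3), random.Random(0), exclude="x1")
infer_shots = sample_few_shots(pool, (6, 6), random.Random(0))
prompt_block = render_few_shots(infer_shots)
```

Varying the count during training while fixing it at inference is the behavior described in the text; capping `k` at the number of available examples is a defensive choice for rare targets.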
[Image: Score Progress Chart]

3. Summary

| Model | Task | CV | Public (lb) | Private (pb) | Full data | Inference time |
|---|---|---|---|---|---|---|
| prompt_v1 + Qwen3-Reranker-8B + 1 epoch | CausalLM | 0.9473 | 0.94419 | 0.94270 | No | 40 min |
| prompt_v1 + Qwen3-Reranker-8B + distillation + 1 epoch | CausalLM | 0.9480 | 0.94674 | 0.94502 | No | 40 min |
| prompt_v2 + Qwen3-Reranker-8B + distillation + 1 epoch | CausalLM | 0.9493 | 0.94875 | 0.94617 | No | 4 h |
| prompt_v2 + Qwen3-Reranker-8B + distillation + 1 epoch | CausalLM | 0.9493 | 0.95014 | 0.94737 | Yes | 4 h |
| prompt_v2 + Qwen3-Reranker-8B + distillation + 2 epochs | CausalLM | 0.9499 | 0.95120 | 0.94873 | Yes | 4 h |
| prompt_v1 + Qwen3-32B-AWQ + distillation + 1 epoch | CausalLM | 0.9502 | 0.94955 | 0.94655 | Yes | 2.5 h |
| Final selected model (best Qwen3-Reranker-8B + Qwen3-32B-AWQ) | CausalLM | 0.9518 | 0.95211 | 0.94835 | Yes | 6.5 h |
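
How the two models were combined is not described, so this sketch assumes the simplest option: an equal-weight average of each model's per-label P(Yes), followed by taking the top labels. The weight, the helper names, and the toy scores are all hypothetical.

```python
def ensemble_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two models' per-label scores for one row."""
    labels = scores_a.keys() | scores_b.keys()
    return {
        label: weight_a * scores_a.get(label, 0.0)
        + (1 - weight_a) * scores_b.get(label, 0.0)
        for label in labels
    }


def top_k(scores, k=3):
    """Labels sorted by descending score, ties broken alphabetically."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered[:k]]


# Toy per-label scores from two models for a single row.
reranker_8b = {"Correct": 0.6, "Neither": 0.3, "Misconception:WNB": 0.1}
qwen3_32b = {"Correct": 0.2, "Neither": 0.7, "Misconception:WNB": 0.1}
blended = ensemble_scores(reranker_8b, qwen3_32b)
prediction = top_k(blended)
```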

5. What Did Not Work for Me

  • listwise ranker
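  • Chain of Thought (CoT): I personally spent the most time on CoT, but in the end it brought no improvement.
  • Adding the most relevant explanations to the prompt: I fine-tuned a retrieval model (Qwen3-Embedding-8B) to find the N explanations most relevant to the current student's explanation and included them in the prompt. This reached a decent CV score of 0.9493, but it scored worse on both the public and private leaderboards.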
<|im_start|>system
You are an expert in detecting grade-school level math misconceptions. The task performs 2 steps: 
1.Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither)
2.Identifies the specific misconception present, if any.<|im_end|>
<|im_start|>user
Question: What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
Student's Answer: \( \frac{1}{3} \)
This answer is correct.
Top relevant explanations with their tags:
- Because there are 9 triangles and 3 of them were not shaded (Correct)
- because there is 9 triangles and 3 of them were not shaded (Correct)
- there aree 9 triangles in total and 3 of them are not shaded. this gives you 3 / 9. (Correct)
--
Student's Explanation: I think that 1/3 is the answer, as it's the simplest form of 3/9.

Now we judge the student's explanation is Correct, Does this diagnosis is correct? (Yes/No)<|im_end|>
<|im_start|>assistant

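The retrieval step behind the "top relevant explanations" prompt can be sketched as a cosine-similarity search over embeddings. The 2-dimensional vectors below are toy stand-ins for the output of the fine-tuned embedding model, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_relevant(query_vec, corpus, n=3):
    """corpus: list of (explanation, tag, embedding); returns prompt-ready lines."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [f"- {text} ({tag})" for text, tag, _ in ranked[:n]]


# Toy corpus with placeholder texts and made-up embeddings.
corpus = [
    ("a", "Correct", [1.0, 0.1]),
    ("b", "Neither", [0.0, 1.0]),
    ("c", "Correct", [0.9, 0.5]),
]
relevant_lines = top_n_relevant([1.0, 0.0], corpus, n=2)
```
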
6. What I Did Not Try

Generating more explanations for data augmentation
