Eedi - Mining Misconceptions in Mathematics
First of all, I would like to thank the organizers for hosting this competition, the Kaggle team, and my teammates from AIRI, Skoltech, and VeinCV:
rolf110
danyaivanov
Special thanks to pipmos!
Our solution:
Qwen2.5-14B as the retriever, fetching the top 25 misconceptions. Qwen2.5-32B as the reranker, scoring every misconception received from the retriever. We take an iterative approach, passing one misconception at a time to obtain the probability that it leads to the incorrect answer. The whole notebook takes about 7 hours to run. We did not use any advanced frameworks (such as vLLM) for the inference pipeline.
All models were fine-tuned on the competition data: 4,370 (question, construct, correct answer, incorrect answer) tuples, plus 13,921 synthetic (question, construct, correct answer, incorrect answer) samples generated with gpt-4o.
For each MisconceptionId, we generated new (QuestionText, ConstructName, Correct Answer, Incorrect Answer) samples, making sure that every MisconceptionId had at least 7 examples in the final real + synthetic dataset. We then filtered out examples whose QuestionText exactly matched an existing one (roughly 500 samples removed). To pick the reference example for a misconception, we used SFR-Embedding-2_R embeddings of the MisconceptionName to fetch the "closest" misconception, and we made sure that across different generations for a given MisconceptionId, every reference was different. The prompt template is as follows: '''Please act as a professional math tutor.
Your goal is to create high quality math problems to help students learn math.
You will be given a misconception. Please create a multiple choice math question with two options: one correct answer and one distractor (incorrect answer). The distractor should be based on the Misconception.
Follow the instructions below.
To achieve the goal, you have four tasks:
1. Please generate a brief and formal description of the concept or skill being tested in the question. DO NOT mention or refer to the Misconception in the description.
2. Create a realistic and contextually appropriate math question that aligns with the description. Provide two options: the correct answer and a distractor (incorrect answer). The distractor should be a plausible answer that a student might choose if they hold the Misconception.
3. Check the question by solving it step-by-step to find out if it adheres to all principles.
4. Modify the created question and options according to your checking comment to ensure it is of high quality.
You have the following principles to guide you:
1. Ensure the question is realistic, natural, and contextually appropriate, adhering to common sense and fundamental mathematical principles.
2. Ensure the question asks for only one specific thing and is clearly stated.
3. Ensure your student can answer the question correctly by using only the information provided in the question. If visual data is needed (e.g. plot, histogram, bar, etc.), describe its content explicitly in text.
4. Ensure the distractor is directly related to the question, stems from the Misconception, and is a plausible answer that a student might select if they have that Misconception.
5. If the created question already follows these principles upon verification, keep it without modification.
Your output should be in the following format:
DESCRIPTION: <Brief and formal description of the concept or skill being tested>
QUESTION: <Your created question>
CORRECT ANSWER: <Correct answer to the question>
DISTRACTOR: <Distractor (incorrect answer) to the question based on the provided Misconception>
VERIFICATION AND MODIFICATION: <Solve the question step-by-step and modify it to follow all principles>
FINAL QUESTION: <Your final created question>
FINAL CORRECT ANSWER: <Your final correct answer>
FINAL DISTRACTOR: <Your final distractor (incorrect answer)>
Here is an example of the final result corresponding to Misconception "{misconception}". Use it for your reference, but ensure the student can answer your created question without the given example:
DESCRIPTION: {description}
FINAL QUESTION: {question}
FINAL CORRECT ANSWER: {correct_answer}
FINAL INCORRECT ANSWER: {incorrect_answer}
'''
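Since the template forces a tagged output format, the FINAL fields are easy to parse from a gpt-4o completion. A minimal sketch (the team's actual generation code was not shared; `parse_generation` is a hypothetical helper):

```python
import re

def parse_generation(text):
    """Extract the FINAL fields from a completion that follows the
    tagged output format of the prompt template above."""
    fields = {}
    for key in ("FINAL QUESTION", "FINAL CORRECT ANSWER", "FINAL DISTRACTOR"):
        # Each tag captures the rest of its line.
        m = re.search(rf"{key}:\s*(.*)", text)
        if m:
            fields[key] = m.group(1).strip()
    return fields

sample = ("FINAL QUESTION: What is 2 + 3?\n"
          "FINAL CORRECT ANSWER: 5\n"
          "FINAL DISTRACTOR: 6")
parsed = parse_generation(sample)
```

Generations missing any FINAL tag can simply be dropped, which also acts as a cheap quality filter.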
We manually inspected some of the generated examples; they were quite good and very close to the real data.
Other things we found important for generation:
The generated description (ConstructName) field sometimes contained fairly strong hints of the correct misconception; however, we found the same pattern in the competition data, so we decided to keep it as is. We used 20% of the competition data for validation and split it further into 2 subsets:
- OOD test: no overlap in MisconceptionId with the training split, stratified by the number of misconceptions per question.
- ID test: no overlap in QuestionId with the training split, stratified by the number of misconceptions per question.

We used the ID test to evaluate the model on misconceptions it has already seen, and the OOD test to evaluate how well it generalizes to unseen misconceptions. We mainly tracked the OOD data, however, because it correlates well with the leaderboard (LB) and represents the hardest scenario.
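A minimal sketch of the group-disjoint OOD split, assuming a pandas frame with `QuestionId` and `MisconceptionId` columns (the per-question stratification is omitted for brevity):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: each row is one (question, misconception) pair.
df = pd.DataFrame({
    "QuestionId":      [1, 1, 2, 3, 3, 4, 5, 6],
    "MisconceptionId": [10, 11, 10, 12, 13, 14, 12, 15],
})

# OOD split: MisconceptionIds in validation never appear in train.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(df, groups=df["MisconceptionId"]))
train, val = df.iloc[train_idx], df.iloc[val_idx]
```

An ID split works the same way with `groups=df["QuestionId"]` instead.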
To obtain fine-grained hard negatives, we first fine-tuned Qwen2.5-14B on the real + synthetic data. We then repeated training with exactly the same setup, but with hard negatives mined by our trained Qwen2.5-14B. After every epoch, we collected fresh hard negatives with the latest checkpoint.
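The per-epoch mining step amounts to: score all candidate misconceptions with the latest checkpoint and keep the highest-scoring non-gold ones as hard negatives. A toy sketch with hypothetical names:

```python
def mine_hard_negatives(similarities, gold_id, k):
    """similarities: {misconception_id: score} produced by the latest
    retriever checkpoint for one query. Return the k highest-scoring
    misconceptions, excluding the gold one, to use as hard negatives."""
    ranked = sorted(
        (m for m in similarities if m != gold_id),
        key=lambda m: similarities[m],
        reverse=True,
    )
    return ranked[:k]

sims = {"A": 0.9, "B": 0.7, "gold": 0.95, "C": 0.4}
hard_negs = mine_hard_negatives(sims, "gold", 2)
```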
We also used in-batch negative sampling during training. To avoid treating misconceptions that look identical to the positive as negatives, we set the similarity of such misconceptions to -inf via a dynamic mask before computing the loss for each sample in the batch.
| Data | Recall@25 | MAP@25 |
|---|---|---|
| CV OOD | 0.88 | 0.42 |
| Public LB | — | 0.438 |
| Private LB | — | 0.434 |
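For reference, MAP@25 with a single relevant misconception per (question, incorrect answer) row reduces to the mean reciprocal rank of the gold label within the top 25:

```python
def map_at_25(predictions, labels):
    """predictions: list of ranked misconception-id lists (top 25 each);
    labels: the single gold misconception id per row."""
    total = 0.0
    for ranked, gold in zip(predictions, labels):
        for rank, m in enumerate(ranked[:25], start=1):
            if m == gold:
                total += 1.0 / rank
                break   # rows whose gold label is missing contribute 0
    return total / len(labels)

score = map_at_25([["a", "b"], ["x", "y"]], ["b", "x"])
```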
Based on the experiments in the paper Novice Learner and Expert Tutor, as well as research done by other participants during the competition, we came up with the idea of passing the list of misconceptions as an additional input to better guide the reranker model.
We did not want the model to output the misconception prediction as text, since we considered that unreliable, and a multi-class formulation does not fit here either. In the end, we decided to solve a separate binary classification problem for each misconception:
The shared part of the prompt (the question, the answers, and the list of possible misconceptions) is encoded once and reused via past_key_values. This saves a lot of runtime, because for a given sample it only has to be computed once; each per-misconception query then reuses the past_key_values as a cache. We use a batch size of 5, so scoring the whole misconception list takes 5 iterations. Given a (question, incorrect answer) pair, we create 24 samples with label = 0 and 1 sample with label = 1 for training. The main difference between them is how we build the "list of possible misconceptions":
For the label = 0 samples, the list is built from the retrieved negatives; for the positive sample (label = 1), we include 29 negatives randomly selected from the top 50 to create the set of 30 misconceptions that we pass to the model. To mitigate the positional bias induced by the retriever, we randomly shuffle the list for every input sample.
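A minimal sketch of building the positive sample's list of 30 possible misconceptions (`build_positive_list` is a hypothetical helper; `top50` stands for the retriever's top-50 output):

```python
import random

def build_positive_list(gold, top50, seed=0):
    """30-item 'list of possible misconceptions' for the label = 1 sample:
    the gold misconception plus 29 negatives drawn at random from the
    retriever's top 50, then shuffled to counter positional bias."""
    rng = random.Random(seed)
    negatives = [m for m in top50 if m != gold]
    lst = [gold] + rng.sample(negatives, 29)
    rng.shuffle(lst)
    return lst

top50 = [f"M{i}" for i in range(50)]
lst = build_positive_list("gold", top50)
```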
Example input prompt:
Question: {Question}
Brief description of the concept or skill being tested in the Question: {ConstructName}
Correct Answer: {Correct Answer}
Incorrect Answer: {Incorrect Answer}
List of possible misconceptions:
1. {Misconception_1}
...
30. {Misconception_30}
Misconception: {n}. {Misconception_n}
Does this Misconception lead to the Incorrect Answer?
To deal with the heavy imbalance between classes 0 and 1, we oversampled the positive class 4x.
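With 24 negatives per positive, 4x oversampling of the positive class can be sketched as:

```python
# One positive per 24 negatives for each (question, incorrect answer) pair.
train = [{"label": 1}] + [{"label": 0}] * 24

# Oversample the positive class 4x: append 3 extra copies of each positive.
positives = [ex for ex in train if ex["label"] == 1]
train = train + positives * 3
```

This brings the ratio from 1:24 down to roughly 1:6.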
We used the h2o-studio framework to train our models.
| Data | MAP@25 |
|---|---|
| CV OOD | 0.571 |
| Public LB | 0.589 |
| Private LB | 0.558 |