14th Place Solution

Author: Andrey Galichin (AIRI, Skoltech, VeinCV)
Published: 2024-12-13
Competition rank: 14th place

First, I would like to thank the organizers for hosting this competition, the Kaggle team, and my teammates from AIRI, Skoltech, and VeinCV:
rolf110
danyaivanov
Special thanks to pipmos!

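Summary

Overall inference pipeline

Our solution:

Use Qwen2.5-14B as the retriever to fetch the top 25 misconceptions.
Use Qwen2.5-32B as the reranker to score each misconception returned by the retriever. We do this iteratively, passing one misconception at a time, to obtain the probability that it leads to the incorrect answer.

The whole notebook takes about 7 hours to run. We did not use any advanced inference framework (such as vllm) to run the pipeline.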
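The two-stage flow above can be sketched in a few lines. This is only an illustration, not the team's code: `retrieve_then_rerank` and `word_overlap` are hypothetical names, and the scoring callables stand in for the fine-tuned Qwen2.5 models.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    retriever_score: Callable[[str, str], float],
    reranker_score: Callable[[str, str], float],
    top_k: int = 25,
) -> List[Tuple[str, float]]:
    """Keep the top_k candidates by retriever score, then sort them by reranker score."""
    shortlist = sorted(candidates, key=lambda c: retriever_score(query, c), reverse=True)[:top_k]
    # The reranker scores one candidate at a time, as in the description above.
    scored = [(c, reranker_score(query, c)) for c in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def word_overlap(query: str, candidate: str) -> float:
    """Toy stand-in for a model score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(candidate.lower().split())))
```

In a real run, `retriever_score` would be an embedding similarity and `reranker_score` the classification-head output; the control flow stays the same.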

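All models were fine-tuned on the competition data, consisting of 4370 (question, concept, correct answer, incorrect answer) quadruples, plus 13921 synthetic (question, concept, correct answer, incorrect answer) samples generated with gpt-4o.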

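Description

Synthetic data

For each MisconceptionId in the provided database, we generated new (QuestionText, ConstructName, Correct Answer, Incorrect Answer) samples, so that every MisconceptionId has at least 7 examples in the final real + synthetic dataset. We then filtered out examples whose QuestionText matched exactly (this removed about 500 samples).
To give the LLM some guidance, we passed a reference sample from the training dataset that belongs to the misconception "closest" to the one we were generating for. We found the "closest" misconception using SFR-Embedding-2_R embeddings of MisconceptionName. We made sure that each reference was different across generations for a given MisconceptionId.
To build the prompt, we drew on ideas from this paper. The prompt template we passed to gpt-4o is as follows: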

'''Please act as a professional math tutor.
Your goal is to create high quality math problems to help students learn math.
You will be given a misconception. Please create a multiple choice math question with two options: one correct answer and one distractor (incorrect answer). The distractor should be based on the Misconception. 
Follow the instructions below.
To achieve the goal, you have four tasks:
1. Please generate a brief and formal description of the concept or skill being tested in the question. DO NOT mention or refer to the Misconception in the description.
2. Create a realistic and contextually appropriate math question that aligns with the description. Provide two options: the correct answer and a distractor (incorrect answer). The distractor should be a plausible answer that a student might choose if they hold the Misconception.
3. Check the question by solving it step-by-step to find out if it adheres to all principles.
4. Modify the created question and options according to your checking comment to ensure it is of high quality.
You have the following principles to guide you:
1. Ensure the question is realistic, natural, and contextually appropriate, adhering to common sense and fundamental mathematical principles.
2. Ensure the question asks for only one specific thing and is clearly stated.
3. Ensure your student can answer the question correctly by using only the information provided in the question. If visual data is needed (e.g. plot, histogram, bar, etc.), describe it's content explicitly in text.
4. Ensure the distractor is directly related to the question, stems from the Misconception, and is a plausible answer that a student might select if they have that Misconception.
5. If the created question already follows these principles upon verification, keep it without modification.

Your output should be in the following format:
DESCRIPTION: <Brief and formal description of the concept or skill being tested>
QUESTION: <Your created question>
CORRECT ANSWER: <Correct answer to the question>
DISTRACTOR: <Distractor (incorrect answer) to the question based on the provided Misconception>
VERIFICATION AND MODIFICATION: <Solve the question step-by-step and modify it to follow all principles>
FINAL QUESTION: <Your final created question>
FINAL CORRECT ANSWER: <Your final correct answer>
FINAL DISTRACTOR: <Your final distractor (incorrect answer)>

Here is an example of the final result corresponding to Misconception "{misconception}". Use it for your reference, but ensure the student can answer your created question without the given example:
DESCRIPTION: {description}
FINAL QUESTION: {question}
FINAL CORRECT ANSWER: {correct_answer}
FINAL INCORRECT ANSWER: {incorrect_answer}
'''

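We manually checked some of the generated examples; they were quite good and very close to the real data.
Other things we found important for generation:

The verification and modification stage in the prompt helps fix some rare errors and yields good questions with valid answers.
When specifying a reference, it is important to add "ensure the student can answer your created question without the given example" to the prompt. Otherwise, in a few cases the model leaves important information out of the generated question because that information is present in the reference, and the model may assume it will also be shown to the student.
The one thing we could not fully fix is that in a few cases (about 3-4% of the data) the LLM generated a fairly strong hint at the correct misconception in the description (ConstructName) field. We found that this also occurs in the competition data, so we left it as is.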
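The reference-selection step described before the prompt (nearest misconception by embedding similarity, with a different reference for each generation) might look roughly like the sketch below. It assumes embeddings are already computed as plain float vectors and that "different reference" means walking down the nearest-neighbour ranking; the write-up does not spell out either detail, and `pick_references` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_references(
    target: Sequence[float],
    train_embeddings: Dict[int, Sequence[float]],
    num_generations: int,
) -> List[int]:
    """Return ids of the nearest training misconceptions, one distinct id per generation."""
    ranked = sorted(
        train_embeddings,
        key=lambda mid: cosine(target, train_embeddings[mid]),
        reverse=True,
    )
    # Assumption: the k-th generation uses the k-th nearest misconception as reference.
    return ranked[:num_generations]
```

A training sample belonging to each returned id would then fill the `{description}`, `{question}`, `{correct_answer}` and `{incorrect_answer}` slots of the template.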

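Validation

We used 20% of the competition data for validation and further split it into 2 subsets:

Out-of-distribution (OOD): no MisconceptionId overlap between train and test, stratified by the number of misconceptions per question.
In-distribution (ID): no QuestionId overlap between train and test, stratified by the number of misconceptions per question.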
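Both subsets are group-disjoint splits that differ only in the grouping key. A minimal sketch, assuming rows are dicts with `QuestionId` and `MisconceptionId` fields; the stratification by number of misconceptions per question is omitted for brevity, and `group_split` is a hypothetical helper.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, int]


def group_split(
    rows: List[Row], key: str, test_fraction: float, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that no value of row[key] appears on both sides."""
    groups = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    held_out = set(groups[:n_test])
    train = [r for r in rows if r[key] not in held_out]
    test = [r for r in rows if r[key] in held_out]
    return train, test
```

Calling it with `key="MisconceptionId"` gives an OOD-style split, and with `key="QuestionId"` an ID-style split.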

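We used the ID test to evaluate how the model performs on misconceptions it has already seen, and the OOD test to evaluate how well it generalizes to unseen misconceptions. We mainly tracked the OOD data, because it matched the leaderboard (LB) closely and represents the hardest scenario.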

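Retriever

To obtain fine-grained hard negatives, we first fine-tuned Qwen2.5-14B on the real + synthetic data. We then repeated training with exactly the same setup, but using hard negatives mined by this trained Qwen2.5-14B. After each epoch, we collected new hard negatives using the latest checkpoint.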

Training

negatives_range & num_negatives: 100 & 8
qlora config: rank 32, alpha 64
loss: InfoNCE Loss
mask_token_probability: 0.1 (applied to queries only)
lr: 5e-5
epochs: 3
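As an illustration of the `mask_token_probability` setting, the sketch below replaces each query token id with a mask id independently with probability 0.1. Which token serves as the mask, and whether tokens are replaced rather than dropped, is not stated in the write-up, so `mask_id` and the replacement behaviour are assumptions.

```python
import random
from typing import List, Optional


def mask_tokens(
    token_ids: List[int],
    mask_id: int,
    probability: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Replace each token id with mask_id independently with the given probability."""
    rng = rng or random.Random()
    return [mask_id if rng.random() < probability else tok for tok in token_ids]
```

Only the query side would pass through this function; the misconception texts stay untouched.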

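We also used in-batch negative sampling during training. To avoid treating misconceptions that look identical to the positive as negatives, we applied a dynamic mask that sets the similarity of those misconceptions to -inf before computing the loss for each sample in the batch.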
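The masking idea can be shown with a small plain-Python InfoNCE. This is a sketch under assumptions: candidate i is taken to be the positive of query i, duplicates are detected by equal misconception ids, the mined hard negatives (which would add extra columns) are left out, and `temperature` is a hypothetical parameter not mentioned in the write-up.

```python
import math
from typing import List, Sequence


def info_nce_with_mask(
    sims: Sequence[Sequence[float]],
    positive_ids: Sequence[int],
    candidate_ids: Sequence[int],
    temperature: float = 1.0,
) -> float:
    """Mean InfoNCE loss; sims[i][j] is the similarity of query i and candidate j.

    Any column other than i that holds the same misconception id as query i's
    positive is set to -inf, so it cannot act as a false negative.
    """
    losses: List[float] = []
    for i, row in enumerate(sims):
        logits = []
        for j, s in enumerate(row):
            is_false_negative = j != i and candidate_ids[j] == positive_ids[i]
            logits.append(-math.inf if is_false_negative else s / temperature)
        peak = max(logits)  # finite, because the positive column is never masked
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Since exp(-inf) is 0, masked columns simply drop out of the softmax denominator.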

Scores

Data	Recall@25	MAP@25
CV OOD	0.88	0.42
Public LB	None	0.438
Private LB	None	0.434
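For reference, both metrics in the table are easy to compute when each (question, incorrect answer) pair has exactly one true misconception, which is assumed here. In that case the average precision reduces to the reciprocal rank of the true id within the top k. The function names are illustrative.

```python
from typing import Sequence, Tuple


def average_precision_at_k(ranked: Sequence[int], true_id: int, k: int = 25) -> float:
    """With one relevant item, AP@k is 1/rank if it appears in the top k, else 0."""
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate == true_id:
            return 1.0 / rank
    return 0.0


def map_and_recall_at_k(
    rankings: Sequence[Sequence[int]], true_ids: Sequence[int], k: int = 25
) -> Tuple[float, float]:
    """Return (MAP@k, Recall@k) averaged over all queries."""
    ap = [average_precision_at_k(r, t, k) for r, t in zip(rankings, true_ids)]
    recall = [1.0 if score > 0 else 0.0 for score in ap]
    return sum(ap) / len(ap), sum(recall) / len(recall)
```

Recall@25 bounds what the reranker can achieve: a true misconception missing from the retrieved top 25 cannot be recovered in the second stage.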

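Reranker

Based on the experiments in the paper Novice Learner and Expert Tutor and on research done by other participants during the competition, we came up with the idea of passing the list of misconceptions as additional input to better guide the reranker model.

We did not want the model to output its misconception prediction as text, since we considered that less reliable, and a multi-class formulation does not fit here either. In the end, we decided to solve a separate binary classification problem for each misconception:

First, we pass the question, concept, correct answer, incorrect answer, and the list of top 25 misconceptions obtained from the retriever to the backbone and store past_key_values. This saves a lot of runtime, because it only has to be computed once per sample.
We then form a batch of misconceptions from the list and pass it to the backbone, where every element receives the same past_key_values from the previous step as its cache. We use a batch size of 5, so the full list of 25 misconceptions takes 5 iterations.
We take the final logits and feed them into a classification head trained from scratch. The head maps the logit space to a single score indicating whether the misconception leads to the incorrect answer.
We sort the misconceptions by predicted score.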
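The control flow of these four steps can be sketched without any model code. Here `encode_prefix` stands in for the forward pass that produces past_key_values and `score_batch` for the backbone plus classification head; both are hypothetical callables, not functions from any library.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Cache = TypeVar("Cache")


def rerank_with_shared_prefix(
    prefix: str,
    misconceptions: Sequence[str],
    encode_prefix: Callable[[str], Cache],
    score_batch: Callable[[Cache, Sequence[str]], List[float]],
    batch_size: int = 5,
) -> List[Tuple[str, float]]:
    """Encode the shared prefix once, score candidates in batches, sort by score."""
    cache = encode_prefix(prefix)  # computed a single time per sample
    scores: List[float] = []
    for start in range(0, len(misconceptions), batch_size):
        batch = misconceptions[start:start + batch_size]
        scores.extend(score_batch(cache, batch))  # every item reuses the same cache
    return sorted(zip(misconceptions, scores), key=lambda pair: pair[1], reverse=True)
```

With 25 candidates and a batch size of 5, `score_batch` is called 5 times while `encode_prefix` is called once, which is where the runtime saving comes from.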

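Training dataset

Given a (question, incorrect answer) pair, we create 24 samples with label = 0 and 1 sample with label = 1 for training. The main difference between them is how we build the "list of possible misconceptions":

1. First, we use the trained retriever to identify the top 50 negative misconceptions.
2. From these top 50, we randomly select 24 negatives as the subset for further sampling.
3. For each negative sampled in step 2, we add 28 random negatives, the specific negative being evaluated, and the true positive misconception tied to the (question, incorrect answer) pair. This yields 24 label = 0 samples per pair.
4. When the model is queried about the positive misconception (label = 1), we include 29 negatives randomly selected from the top 50 to form the set of 30 misconceptions that we pass to the model.

To mitigate the position bias introduced by the retriever, we randomly shuffle the list for every input sample.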
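The sampling scheme above could be implemented roughly as follows. The write-up does not say where the 28 filler negatives come from, so this sketch assumes they are drawn from the same top-50 pool; `build_reranker_samples` and the dict layout are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def build_reranker_samples(
    positive: int, top50_negatives: Sequence[int], rng: random.Random
) -> List[Dict]:
    """Return 24 label-0 samples and 1 label-1 sample, each with a shuffled list of 30 ids."""
    samples: List[Dict] = []
    for negative in rng.sample(list(top50_negatives), 24):
        pool = [n for n in top50_negatives if n != negative]
        # Assumption: the 28 filler negatives come from the same top-50 pool.
        listing = rng.sample(pool, 28) + [negative, positive]
        rng.shuffle(listing)  # removes the retriever's ordering
        samples.append({"list": listing, "query": negative, "label": 0})
    listing = rng.sample(list(top50_negatives), 29) + [positive]
    rng.shuffle(listing)
    samples.append({"list": listing, "query": positive, "label": 1})
    return samples
```

Every list contains the true misconception, so the model cannot tell the label from whether the positive is present; it has to judge the queried item itself.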

Example input prompt:

Question: {Question}
Brief description of the concept or skill being tested in the Question: {ConstructName}
Correct Answer: {Correct Answer}
Incorrect Answer: {Incorrect Answer}
List of possible misconceptions:
1. {Misconception_1}
...
30. {Misconception_30}
Misconception: {n}. {Misconception_n}
Does this Misconception lead to the Incorrect Answer?
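Filling this template programmatically is straightforward. The sketch below assumes one field per line, exactly as printed above; the real whitespace, any chat-template wrapping, and the `build_prompt` helper itself are not taken from the write-up.

```python
from typing import Sequence

TEMPLATE = (
    "Question: {question}\n"
    "Brief description of the concept or skill being tested in the Question: {construct}\n"
    "Correct Answer: {correct}\n"
    "Incorrect Answer: {incorrect}\n"
    "List of possible misconceptions:\n"
    "{listing}\n"
    "Misconception: {n}. {queried}\n"
    "Does this Misconception lead to the Incorrect Answer?"
)


def build_prompt(
    question: str,
    construct: str,
    correct: str,
    incorrect: str,
    misconceptions: Sequence[str],
    n: int,
) -> str:
    """Fill the template; n is the 1-based position of the queried misconception."""
    listing = "\n".join(f"{i}. {text}" for i, text in enumerate(misconceptions, start=1))
    return TEMPLATE.format(
        question=question,
        construct=construct,
        correct=correct,
        incorrect=incorrect,
        listing=listing,
        n=n,
        queried=misconceptions[n - 1],
    )
```

Everything up to and including the numbered list is identical for all candidates of a sample, which is the part whose past_key_values can be cached at inference time.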

To counter the strong imbalance between classes 0 and 1, we oversampled the positive class by a factor of 4.

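Training

We used the h2o-studio framework to train our models.

loss: cross-entropy loss with label smoothing of 0.05, to account for possible errors introduced by the synthetic data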
qlora config: rank 16, alpha 32
lr & head lr: 2e-5 & 1e-5
epoch: 1
mask token probability: 0.1
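To make the label-smoothing setting concrete, here is one common formulation for a single-logit binary head: the hard target y becomes y(1 - eps) + eps/2. Whether the training framework uses exactly this convention is an assumption, and `smoothed_bce` is a hypothetical helper.

```python
import math


def smoothed_bce(logit: float, label: int, smoothing: float = 0.05) -> float:
    """Binary cross-entropy on a single logit with a smoothed target."""
    # With the default smoothing of 0.05: label 1 -> 0.975, label 0 -> 0.025.
    target = label * (1.0 - smoothing) + 0.5 * smoothing
    # log(sigmoid(x)), written in a numerically stable form for both signs of x.
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    log_not_p = log_p - logit  # log(1 - sigmoid(x))
    return -(target * log_p + (1.0 - target) * log_not_p)
```

With smoothing, an extremely confident prediction is penalized even when it matches the label, which limits the damage from a mislabeled synthetic sample.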

Scores

Data	MAP@25
CV OOD	0.571
Public LB	0.589
Private LB	0.558

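What did not work

Rewriting positive misconceptions with gpt4o-mini to improve the oversampling process
Adding the other incorrect answer options to the prompt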

Public data

Reranker adapter and classification head weights: https://www.kaggle.com/models/andreygalichin/qwen2.5-32b-r32-a16-synth-merge-smooth-full/
