4th Place Solution - Eedi: Mining Misconceptions in Mathematics


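Authors: Takoi, charmq, kami

Published: 2024-12-13

Competition rank: 4th

Competition: Eedi - Mining Misconceptions in Mathematics

First of all, I would like to thank the organizers for hosting such a great competition. I would also like to thank my teammates charmq and kami for their collaboration.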

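Summary

Approach to the competition
- Since most misconceptions in the test set were expected not to appear in the training set, we focused on improving accuracy on these unseen misconceptions.
Solution overview
- Data generation
  - Generated data with Qwen2.5-72B-Instruct-AWQ.
  - This step specifically targeted misconceptions that do not appear in the training dataset.
- Misconception generation
  - Used Qwen2.5-32B-instruct-AWQ to generate a misconception from the question, answer, and related fields.
- Retriever
  - Training
    - Built two models and fine-tuned them with LoRA.
      - Qwen2.5-14B-instruct
      - Qwen2.5-32B-instruct
    - During training, the misconception produced in the misconception generation stage was appended to the text, alongside provided fields such as QuestionText and AnswerText.
  - Inference & retrieval:
    - Retrieval was performed on the concatenated outputs of Qwen2.5-14B-instruct and Qwen2.5-32B-instruct. Outputs were generated for multiple folds.
    - For each question-answer pair, the following numbers of misconceptions were retrieved:
      - The 25 most similar misconceptions among all misconceptions.
      - The 15 most similar misconceptions among only those not present in the training set.
      - Duplicates were removed.
- Reranker
  - Training
    - Fine-tuned several Qwen2.5-32B-instruct models with LoRA, varying the number of negatives and adding the generated data to the training dataset.
  - Inference
    - Ensembled the LoRA components of the trained models to create three models.
    - Finally, ensembled the outputs of these three models.
- Post-processing
  - Adjusted predictions for misconceptions that are present in the training set by lowering their scores:
    - prediction score * 0.40
- vLLM was used to speed up data generation and inference (e.g., retrieval and reranking).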

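Solution details

Validation strategy

Group KFold (group = QuestionId)
5 folds
In addition to the overall fold score, we also evaluated the score on the subset of validation data whose misconceptions are not included in the training data.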
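The validation setup above can be sketched as follows. This is a minimal, dependency-free illustration rather than the team's code: in practice scikit-learn's `GroupKFold` would normally do the split, and the helper names `group_kfold` and `unseen_mask` as well as the `MisconceptionId` key are assumptions made for the example.

```python
import random


def group_kfold(groups, n_splits=5, seed=0):
    """Return one fold id per sample; samples sharing a group share a fold.

    Simplified stand-in for a grouped K-fold split (hypothetical helper).
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique)}
    return [fold_of_group[g] for g in groups]


def unseen_mask(rows, folds, val_fold):
    """Flag validation rows whose misconception never occurs in the training folds."""
    train_misconceptions = {
        r["MisconceptionId"] for r, f in zip(rows, folds) if f != val_fold
    }
    return [
        f == val_fold and r["MisconceptionId"] not in train_misconceptions
        for r, f in zip(rows, folds)
    ]


# Toy data: QuestionId is the group, MisconceptionId is the label.
rows = [
    {"QuestionId": 0, "MisconceptionId": 10},
    {"QuestionId": 0, "MisconceptionId": 11},
    {"QuestionId": 1, "MisconceptionId": 10},
    {"QuestionId": 2, "MisconceptionId": 12},
    {"QuestionId": 3, "MisconceptionId": 11},
    {"QuestionId": 4, "MisconceptionId": 13},
]
folds = group_kfold([r["QuestionId"] for r in rows], n_splits=5)
```

Scoring a fold twice, once on all validation rows and once on the rows flagged by `unseen_mask`, gives the two numbers described above.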

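Data generation

Used Qwen2.5-72B-Instruct-AWQ to generate a new question, answer options, and an incorrect answer for each misconception that does not appear in the training dataset.
About 8,000 questions were created, of which 2,500 were randomly sampled to train the reranker.
For each misconception, 100 questions randomly sampled from the training dataset were added to the prompt as input examples for generation.

The data generation prompt is shown below; its maximum token count was about 20,000.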

"""You are an expert in mathematics. 
Refer to the examples below to create new problem with given misconception. 

Misconception: {MisconceptionText}

The output format shoud be below.

ConstructName:  
SubjectName: 
Math problem: 
Answer A text: 
Answer B text: 
Answer C text: 
Answer D text: 
Answer: 
Incorrect answer: 

The examples are below

Example 1: 

ConstructName: {ConstructName_1}
SubjectName: {SubjectName_1}
Math problem: {QuestionText_1}
Answer A text: {AnswerAText_1}
Answer B text: {AnswerBText_1}
Answer C text: {AnswerCText_1}
Answer D text: {AnswerDText_1}
Answer: {CorrectAnswer_1}
Incorrect answer: {IncorrectAnswer_1}
Misconception: {MisconceptionText_1} 

...

Example 100: 

ConstructName: {ConstructName_100}
SubjectName: {SubjectName_100}
Math problem: {QuestionText_100}
Answer A text: {AnswerAText_100}
Answer B text: {AnswerBText_100}
Answer C text: {AnswerCText_100}
Answer D text: {AnswerDText_100}
Answer: {CorrectAnswer_100}
Incorrect answer: {IncorrectAnswer_100}
Misconception: {MisconceptionText_100} 

"""
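As a rough sketch of how such a prompt could be assembled, the snippet below samples examples from the training rows and fills a template whose wording is adapted from the one above. The function names and the assumption that each training row is a dict keyed by the placeholder names (`ConstructName`, `QuestionText`, and so on) are illustrative, not taken from the team's code.

```python
import random

# (label shown in the prompt, key in the training row); keys are assumed names.
FIELDS = [
    ("ConstructName", "ConstructName"),
    ("SubjectName", "SubjectName"),
    ("Math problem", "QuestionText"),
    ("Answer A text", "AnswerAText"),
    ("Answer B text", "AnswerBText"),
    ("Answer C text", "AnswerCText"),
    ("Answer D text", "AnswerDText"),
    ("Answer", "CorrectAnswer"),
    ("Incorrect answer", "IncorrectAnswer"),
    ("Misconception", "MisconceptionText"),
]


def format_example(index, row):
    """Render one few-shot example in the layout used by the prompt."""
    lines = [f"Example {index}: ", ""]
    lines += [f"{label}: {row[key]}" for label, key in FIELDS]
    return "\n".join(lines)


def build_generation_prompt(misconception, train_rows, n_examples=100, seed=0):
    """Build a generation prompt with randomly sampled training examples."""
    rng = random.Random(seed)
    sampled = rng.sample(train_rows, min(n_examples, len(train_rows)))
    # The output format lists every field except the misconception itself.
    output_format = "\n".join(f"{label}: " for label, _ in FIELDS[:-1])
    header = (
        "You are an expert in mathematics. \n"
        "Refer to the examples below to create new problem with given misconception. \n\n"
        f"Misconception: {misconception}\n\n"
        "The output format should be below.\n\n"
        f"{output_format}\n\n"
        "The examples are below\n\n"
    )
    body = "\n\n".join(
        format_example(i, row) for i, row in enumerate(sampled, start=1)
    )
    return header + body + "\n"
```

With 100 examples per prompt the input becomes long, which is consistent with the roughly 20,000-token maximum mentioned above.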

Below are examples of generated questions. The 72B model appears to be highly capable at question generation.

ConstructName: Calculate the circumference of a circle given the radius
SubjectName: Circles
Math problem: If the radius of a circle is \( 7 \) cm, what is the circumference of the circle?
Answer A text: \( 22 \) cm
Answer B text: \( 44 \) cm
Answer C text: \( 14 \) cm
Answer D text: \( 154 \) cm
Answer: B
Incorrect answer: A
Misconception: Thinks circumference is radius x pi

ConstructName: Simplify algebraic fractions by identifying and cancelling common factors
SubjectName: Simplifying Algebraic Fractions
Math problem: Simplify the following algebraic fraction:
\[
\frac{6x^2y}{9xy^2}
\]
Answer A text: \( \frac{2x}{3y} \)
Answer B text: \( \frac{6x}{9y} \)
Answer C text: \( \frac{2xy}{3y^2} \)
Answer D text: \( \frac{6x^2}{9y^2} \)
Answer: A
Incorrect answer: B
Misconception: Cannot identify a common factor when simplifying algebraic fractions

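These examples are mathematically valid, and including a large number of examples (100 cases) in the prompt was essential for generating valid questions.

Misconception generation

Used Qwen2.5-32B-instruct to generate misconceptions. The following prompt was used: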

"""You are an expert in mathematics.
Refer to the examples below to identify and describe the misconception that led to the incorrect answer.
Example1
ConstructName: Recognise and use efficient methods for mental multiplication
SubjectName: Mental Multiplication and Division
Math problem: Tom and Katie are discussing ways to calculate\\( 21\\times 12\\) mentally. Tom does\\( 12\\times 7\\) and then multiplies his answer by\\( 3\\); Katie does\\( 21\\times 6\\) and then doubles her answer. Who would get the correct answer?
Incorrect answer: Only Katie
Misconception: Does not correctly apply the distributive property of multiplication

Example2
ConstructName: Multiply a decimal by an integer
SubjectName: Mental Multiplication and Division
Math problem:\\( 9.4\\times 50=\\)
Incorrect answer:\\( 4700\\)
Misconception: When multiplying a decimal by an integer, ignores decimal point and just multiplies the digits

ConstructName:{ConstructName}
SubjectName:{SubjectName}
Math problem:{QuestionText}
Incorrect answer:{AnswerText}
Misconception:
"""

Retriever

Retrieval models were evaluated by checking both MAP@25 and top-25 recall.
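For reference, the two metrics can be computed as in the sketch below. It assumes exactly one correct misconception per question-answer pair, in which case average precision reduces to the reciprocal rank of the label within the top 25; the function names are illustrative.

```python
def map_at_k(ranked_lists, labels, k=25):
    """Mean average precision at k with a single relevant item per query."""
    total = 0.0
    for ranked, label in zip(ranked_lists, labels):
        top = ranked[:k]
        if label in top:
            # With one relevant item, AP is 1 / (1-based rank of that item).
            total += 1.0 / (top.index(label) + 1)
    return total / len(labels)


def recall_at_k(ranked_lists, labels, k=25):
    """Fraction of queries whose label appears in the top k."""
    hits = sum(label in ranked[:k] for ranked, label in zip(ranked_lists, labels))
    return hits / len(labels)


# Toy check: labels found at rank 2, rank 1, and not at all.
ranked = [[3, 1, 2], [5, 4], [7, 8]]
labels = [1, 5, 9]
print(map_at_k(ranked, labels))     # 0.5
print(recall_at_k(ranked, labels))  # 0.666...
```

Recall matters for the retriever because a misconception that is missing from the candidate list cannot be recovered by the reranker.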
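Retrieval models used in the final submission
- An ensemble of the following models
  - Qwen2.5-14B-instruct
    - The misconception produced in the misconception generation step was also appended to the text.
    - Loss function: MultipleNegativesRankingLoss
    - Fine-tuned with LoRA:
      - LoRA_rank: 32
      - LoRA_alpha: 64
    - Each question-answer pair was trained as one batch containing 1 positive and 47 negatives.
      - Increasing the number of negatives was essential.
      - The negatives used for training were limited to misconceptions associated with a positive sample in the training data.
      - Negatives were chosen at random.
    - MAP@25: 0.511
    - Recall@25: 0.923
  - Qwen2.5-32B-instruct-GPTQ-Int4
    - Almost identical to the 14B setup.
    - Each question-answer pair was trained as one batch containing 1 positive and 255 negatives.
    - 10 epochs (10 hours per fold on an A100).
    - Because inference was slow (50 minutes per fold), only 2 folds were used in the final submission.
    - MAP@25: 0.554
    - Recall@25: 0.926
    - MAP@25 on generated data: ~0.7
    - Recall@25 on generated data: ~0.99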
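To make the training objective concrete, here is a small pure-Python sketch of a multiple-negatives ranking loss for one query with one positive and a list of explicit negatives. It assumes cosine similarity scaled by 20 (the default in the sentence-transformers implementation of this loss) followed by cross-entropy with the positive as the target; the actual training code works on batched tensors and may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(query, positive, negatives, scale=20.0):
    """Cross-entropy over scaled similarities, with the positive at index 0."""
    logits = [scale * cosine(query, c) for c in [positive] + negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


# The loss is near zero when the positive is much closer than the negative,
# and large when the ordering is reversed.
good = ranking_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = ranking_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Each extra negative adds a term to the softmax denominator. If all 47 negatives score the same as the positive, the loss is log(48), so the model has to separate the positive from every negative to drive it down.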
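Inference at submission time
- vLLM was used to speed up embedding computation. It could not be used as is, so its implementation was modified slightly to fit our use case. This gave efficient inference while preserving accuracy.
- The embeddings from the models above were concatenated to form the final representation.
- Number of retrieved items
  - 25 items from all misconceptions.
  - 15 items from misconceptions that are not present in the training data:
    - These were prioritized because most misconceptions in the test set were expected not to appear in the training data.
  - Duplicates between the two lists were removed.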
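The candidate selection described above can be sketched as follows. Several details are assumptions made for illustration: each model's embedding is L2-normalized before concatenation, similarity is a plain dot product, and the merged list keeps the overall top 25 first. The function names are hypothetical.

```python
import math


def concat_embeddings(parts):
    """L2-normalize each model's embedding, then concatenate them."""
    out = []
    for vec in parts:
        norm = math.sqrt(sum(x * x for x in vec))
        out.extend(x / norm for x in vec)
    return out


def top_k(query, candidates, k):
    """Return the ids of the k candidates with the highest dot product."""
    def score(item):
        return sum(q * c for q, c in zip(query, item[1]))

    ranked = sorted(candidates.items(), key=score, reverse=True)
    return [cid for cid, _ in ranked[:k]]


def retrieve_candidates(query, candidates, seen_ids, k_all=25, k_unseen=15):
    """Top k_all overall plus top k_unseen among unseen misconceptions, deduplicated."""
    overall = top_k(query, candidates, k_all)
    unseen_only = {cid: v for cid, v in candidates.items() if cid not in seen_ids}
    unseen = top_k(query, unseen_only, k_unseen)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(overall + unseen))


# Toy example: ids 0 and 1 occur in training, ids 2 and 3 do not.
cands = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.5, 0.5]}
merged = retrieve_candidates([1.0, 0.0], cands, seen_ids={0, 1}, k_all=2, k_unseen=1)
```

In the toy example the unseen misconception 3 enters the candidate list even though it would not make the overall top 2, which is the point of the separate quota.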
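Retrieval of reranking candidates
- An ensemble of the following models was used to produce the candidates for the reranker:
  - Salesforce/SFR-Embedding-2_R
  - BAAI/bge-large-en-v1.5
  - Alibaba-NLP/gte-large-en-v1.5
- Its top-25 recall was lower than that of the models used for submission, so it was not used at submission time.
- Reranker models trained on these candidates achieved a better public score.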

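Reranker

Three final models were used for inference.
Model 1 & Model 2
- Qwen2.5-32B-Instruct-GPTQ-Int4
  - Trained across 4 folds.
  - Fine-tuned with LoRA.
  - Trained with a ratio of 1 positive to 9 negatives.
  - Negatives were limited to misconceptions that also occur as positives.
    - This improved the cross-validation (CV) score.
- Fold 1 results:
  - MAP@25: 0.653
  - Evaluated only on misconceptions not present in the training data: MAP@25: 0.601
- Model 1:
  - Ensembles the LoRA components of fold 1 and fold 2.
- Model 2:
  - Ensembles the LoRA components of fold 3 and fold 4.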
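The sampling scheme for Models 1 and 2 could look roughly like the sketch below: one positive plus up to nine negatives drawn from the retrieved candidates, keeping only negatives that occur as a positive label elsewhere in the training data. The function name, the `(misconception_id, target)` output format, and the random draw from the candidate list are assumptions; the write-up does not specify how the reranker inputs were formatted.

```python
import random


def build_reranker_samples(label, candidates, train_label_set, n_neg=9, seed=0):
    """Return [(misconception_id, target)] with 1 positive and up to n_neg negatives."""
    # Keep only candidates that are not the label and occur as a positive in training.
    pool = [c for c in candidates if c != label and c in train_label_set]
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(n_neg, len(pool)))
    return [(label, 1)] + [(neg, 0) for neg in negatives]


# Toy example: 30 retrieved candidates, only even ids (and the label) occur as positives.
retrieved = list(range(30))
positives_in_train = set(range(0, 30, 2)) | {5}
samples = build_reranker_samples(5, retrieved, positives_in_train)
```

Changing `n_neg` to 19 gives the 1:19 ratio used for Model 3.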
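Model 3
- Qwen2.5-32B-Instruct-GPTQ-Int4
  - Trained on 2,500 generated samples in addition to the competition data.
  - Trained across 2 folds, with their LoRA components ensembled.
  - Trained with a ratio of 1 positive to 19 negatives.
  - 2 epochs (10 hours per fold on 4 x A100).
- Fold 1 results:
  - MAP@25: 0.664
  - Evaluated only on misconceptions not present in the training data: MAP@25: 0.605

Adding the reranker trained with data generated by the 72B model (Model 3) to the ensemble improved the private score by 0.01.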

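Post-processing (PostProcess)

Adjusted predictions for misconceptions that are present in the training set by lowering their scores:
- prediction score * 0.40
  - Other coefficients were tested, but 0.40 gave the highest public score.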
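A minimal sketch of this adjustment, assuming the reranker produces one score per candidate misconception and that the final ranking is obtained by sorting the adjusted scores (the function name and input format are illustrative):

```python
def downweight_seen(scores, seen_ids, factor=0.40, k=25):
    """Scale scores of misconceptions seen in training, then return the top-k ids."""
    adjusted = {
        mid: score * factor if mid in seen_ids else score
        for mid, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)[:k]


# Misconception 1 occurs in training, so its score drops from 0.9 to 0.36
# and the unseen misconception 2 (score 0.5) moves to the top.
ranking = downweight_seen({1: 0.9, 2: 0.5, 3: 0.3}, seen_ids={1})
print(ranking)  # [2, 1, 3]
```

The multiplier only changes the order when a seen and an unseen misconception have comparable scores, which shifts the ranking toward unseen misconceptions in line with the expectation that most test labels are unseen.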

Code

Takoi's code: https://github.com/TakoiHirokazu/kaggle-Eedi-4th-solution
charmq's code: https://github.com/charmq00/kaggle_eedi_public
Inference notebook: https://www.kaggle.com/code/charmq/eedi-pp040-ret15-exp345-341-348-multi-32b-ret-c015
