MAP2025 Private & Public Leaderboard 2nd Place Solution

Authors: HZM (leehann) & Baiph (pingfan)
Published: 2025-10-17
Competition rank: 2nd place (Grandmaster)

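Many thanks to the organizers; I was glad to take part in this competition. My teammate Baiph worked very hard, and we contributed equally to the final result. We each used a different training pipeline, which is what allowed us to take 2nd place on both the public and private leaderboards.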

P.S. I am looking for a remote job in NLP or LLMs, preferably in the UTC+8 time zone. Please email me if you have an opening.

Solution Overview

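The task in this competition is to predict misconceptions from students' explanations. It is a very simple classification problem, but the competition metric is MAP@3, which is normally used for ranking problems. So I think the labels are sometimes ambiguous, and a single student explanation could reasonably be labeled with more than one target.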
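For reference, MAP@3 with a single ground-truth label per sample reduces to the reciprocal rank of the correct label within the top 3 predictions, averaged over samples. The sketch below is a minimal illustration of that definition, not the organizers' scoring code; the function name `map_at_3` is hypothetical.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean Average Precision at 3, assuming exactly one correct label per sample."""
    total = 0.0
    for truth, preds in zip(true_labels, ranked_predictions):
        # Only the first 3 predictions count; the score is 1/rank of the first hit.
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)
```

A correct label in second place still earns half credit, which is why a well-calibrated ranking over plausible labels matters more here than in plain accuracy.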

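Based on this analysis, what we need to do is denoise the labels, or give the student explanations soft labels based on OOF (out-of-fold) predictions so that no labels leak.

As for the data, more than 30 labels have only limited data, so I used many commercial LLMs to generate external data.

As for model selection, we used LLMs as backbones; we tried qwen3, qwen2.5, mistral, and phi-4.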

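As for the training strategy, we both converted the classification problem into a multiple-choice generation problem; only then could we finish inference in time with vllm.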
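One way to picture the multiple-choice conversion: each class gets an option letter, and the model is trained to emit that letter. HZM's prompt further down asks for a letter "from A through BM" across 65 options, which matches spreadsheet-style lettering (A..Z, AA..AZ, BA..BM). The sketch below assumes that scheme; `option_letter` and `build_choice_block` are hypothetical helper names, not the authors' code.

```python
def option_letter(index):
    """Spreadsheet-style label for a 0-based index: 0 -> A, 25 -> Z, 26 -> AA."""
    letters = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def build_choice_block(labels):
    """Return the option text for the prompt and the letter -> label lookup."""
    letter_to_label = {option_letter(i): label for i, label in enumerate(labels)}
    block = "\n".join(f"{letter}: {label}" for letter, label in letter_to_label.items())
    return block, letter_to_label
```

With 65 labels, the last option comes out as `BM`. Whether a multi-letter option maps to a single token depends on the tokenizer, so that part would need checking for each backbone.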

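HZM Part

I will introduce my best private-leaderboard single model, qwen3-14b, since it is the easiest to understand.
The training pipeline can be divided into 4 steps.

Step 1: Generate external data (both Baiph and I used this data)

For each question, generate student outputs for the rare candidate labels.
Combine the labels for contrastive learning.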

For example:
If I want to generate a student explanation labeled True_Misconception:Additive (used as the hard label in step 4 training),
I put
False_Misconception:Additive_same_math_problem and False_Misconception:Additive_different_math_problem
+
True_Misconception:Additive_same_math_problem and True_Misconception:Additive_different_math_problem
into the prompt as contrastive examples, so that ChatGPT or Claude generates better data.
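The example above can be sketched as a prompt builder that places examples of the target label next to examples of the contrasting label. This is only an illustration under assumptions: the wording of the instructions, the function names, and the (question, explanation) pair format are all hypothetical, since the actual generation prompt is not given in the post.

```python
def format_examples(label, examples):
    """Render (question, explanation) pairs under a heading naming their label."""
    lines = [f"Examples labeled {label}:"]
    for question, explanation in examples:
        lines.append(f"- Question: {question} | Explanation: {explanation}")
    return "\n".join(lines)


def build_generation_prompt(target_label, contrast_label, question,
                            target_examples, contrast_examples):
    """Assemble a contrastive data-generation prompt (hypothetical wording)."""
    parts = [
        f"Write a new student explanation for the question below that should be labeled {target_label}.",
        f"Question: {question}",
        format_examples(target_label, target_examples),
        format_examples(contrast_label, contrast_examples),
        f"The new explanation must fit {target_label} and must not fit {contrast_label}.",
    ]
    return "\n\n".join(parts)
```

Here `target_examples` would hold True_Misconception:Additive explanations from the same and from different math problems, and `contrast_examples` the False_Misconception:Additive counterparts.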

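I used gpt-4, gpt-5, claude-sonnet, gemini, seed, and doubao to generate 80K samples.

Step 2: Train LLMs to label the external data

Train 4 LLMs on all of train.csv, then use them to label the external data (used as soft labels in step 4 training). The models are phi_4_reasoning_14b, qwen3_32b, mistral_12b, and qwen2_5_72b.

The training prompt is as follows: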
"""Analyze the student's answer and explanation.            
Determine if the student's answer is correct (True) or incorrect (False)
Evaluate if the explanation shows correct reasoning, contains a misconception, or is neither
If a misconception is present, identify the specific type
Select exactly ONE label from the 65 options below \n"""
"{0}"
"Your Answer: [Select one letter option from A through BM]\n"
"Input is :\n"
"Question: {1}\n"
"Option: {2}\n"
"Correct Answer: {3}\n"
"Student Answer: {4}\n"
"Student Explanation: {5}\n"

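Step 3: 5-fold training of LLMs to generate soft labels

Train the 4 LLMs with 5-fold cross-validation and average their oof.csv predictions (used as soft labels in step 4 training).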
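A minimal sketch of this step, assuming each model produces one probability row per validation sample: out-of-fold rows are first stitched back into training order, then averaged across models. The function names and the list-based data layout are assumptions for illustration; the post does not describe the format of oof.csv.

```python
def assemble_oof(n_samples, folds):
    """folds: list of (validation_indices, probability_rows) pairs, one per fold."""
    oof = [None] * n_samples
    for val_idx, rows in folds:
        # Each sample is predicted only by the model that did not train on it.
        for i, row in zip(val_idx, rows):
            oof[i] = row
    return oof


def average_soft_labels(model_probs):
    """model_probs: one list of per-sample probability rows for each model."""
    n_models = len(model_probs)
    averaged = []
    for rows in zip(*model_probs):  # rows for the same sample, one per model
        mean_row = [sum(vals) / n_models for vals in zip(*rows)]
        total = sum(mean_row)
        averaged.append([p / total for p in mean_row])  # renormalize to sum to 1
    return averaged
```

Because every row comes from a model that never saw that sample, the averaged distribution can be used as a training target without leaking the original label.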

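Step 4: Multi-loss design for qwen3-14b

I modified SFTTrainer to combine a generation loss (hard labels) with a classification loss (soft labels).
target_token_id means that each option is converted to its token id in the vocab.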

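import numpy as np
import torch
import torch.nn.functional as F
from trl import SFTTrainer


class SFTChoiceTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None, return_choice_logit=False):
        # Token ids of the option letters, stored on the model config.
        target_token_id = self.model.config.target_token_id
        target_token_id = torch.tensor(target_token_id, device=model.device)

        # Keep only option tokens as generation targets; mask everything else with -100.
        labels = inputs['labels']
        mask = torch.isin(labels, target_token_id)
        labels[~mask] = -100
        inputs['labels'] = labels

        # Generation loss (hard label) computed by the parent trainer.
        _, outputs = super().compute_loss(model, inputs, return_outputs=True, num_items_in_batch=num_items_in_batch)
        logits = outputs.logits
        loss = outputs.loss

        # Shift so that the logits at position t predict the label at position t + 1.
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        logits_target = []
        for i in range(len(shift_labels)):
            lbl = shift_labels[i].cpu().numpy()
            # Position of the last unmasked label, i.e. the answer token.
            target_idx = np.where(lbl != -100)[0][-1]
            # Keep only the logits of the option tokens at that position.
            logits_target.append(shift_logits[i][target_idx][target_token_id])

        # (batch_size, num_choices)
        logits_target = torch.stack(logits_target, dim=0)
        # Soft labels from the batch, one row per sample.
        labels_target = inputs['soft_label'].to(outputs.logits.device)
        soft_loss = F.cross_entropy(logits_target, labels_target)

        # No extra weighting: the two losses are summed with equal weight.
        #weight = self._soft_weight()
        loss = loss + soft_loss
        return (loss, outputs) if return_outputs else loss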
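Written out, the objective in the trainer above is the sum of the two terms. The formula assumes that `soft_label` holds a probability distribution over the K option tokens for each of the B samples in the batch, in which case `F.cross_entropy` with its default mean reduction computes the second term; the symbols z, q, B, and K are notation introduced here.

```latex
\mathcal{L} = \mathcal{L}_{\text{gen}} + \mathcal{L}_{\text{soft}},
\qquad
\mathcal{L}_{\text{soft}} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K}
  q_{b,k} \, \log \frac{\exp(z_{b,k})}{\sum_{j=1}^{K} \exp(z_{b,j})}
```

Here z_{b,k} is the logit of option token k at the answer position of sample b, and q_{b,k} is the soft-label probability of option k for that sample.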

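After the steps above, fold-0 CV reaches 0.955, with 0.950 on the public leaderboard and 0.946 on the private leaderboard.

Baiph Part

1. Misconception expansion

Convert the 65 classes into 37 classes, and use DeepSeek to expand and explain the short misconception names.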

mis2reason = {
        "SwapDividend": "Incorrectly swapping the positions of dividend and divisor in division operations.",
        "Tacking": "Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.",
        "Additive": "Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)",
        "Wrong_term": "Incorrectly identifying or handling terms in algebraic expressions.",
        "Wrong_Fraction": "Completely misunderstanding fraction concepts or representation methods.",
        "Incomplete": "Providing incomplete solutions missing crucial steps or explanations.",
        "Unknowable": "Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.",
        "Not_variable": "Treating variables as specific numerical values, or vice versa.",
        "Firstterm": "Overemphasizing the first term in a sequence while ignoring the importance of other terms.",
        "Irrelevant": "Using information or criteria unrelated to the problem for reasoning.",
        "Inverse_operation": "Incorrectly applying inverse operations or confusing relationships between operations.",
        "Multiplying_by_4": "Specific error: Always multiplying by 4 without considering the specific context.",
        "Base_rate": "Ignoring base probabilities or benchmark values, focusing only on specific cases.",
        "Definition": "Misunderstanding mathematical concept definitions or terminology meanings.",
        "WNB": """Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships""",
        "Whole_numbers_larger": "Believing decimals with larger whole number parts are always larger, ignoring decimal parts",
        "Incorrect_equivalent_fraction_addition": "Incorrectly performing fraction addition operations",
        "Inversion": "Mistakenly reversing the order of numbers, fractions, or operations.",
        "Mult": "Mistakenly using multiplication to solve problems that require other operations.",
        "Adding_terms": "Incorrectly adding terms directly in algebraic expressions.",
        "FlipChange": "Incorrectly handling numerator-denominator conversions in fraction operations.",
        "Division": "Mistakenly using division to solve problems that require other operations.",
        "Duplication": "Incorrectly repeating numbers or operations.",
        "Interior": "Incorrectly handling interior angles or internal elements in geometric figures.",
        "Certainty": "Providing definite answers for uncertain problems, or vice versa.",
        "Shorter_is_bigger": "Believing numbers with fewer digits are larger.",
        "Wrong_fraction": "Misunderstanding fraction concepts, including numerator-denominator relationships.",
        "Adding_across": "Incorrectly adding across place values (e.g., adding tens to ones directly).",
        "Wrong_Operation": "Choosing completely wrong mathematical operations.",
        "Denominator-only_change": "Changing only the denominator while ignoring corresponding changes in the numerator.",
        "Scale": "Misunderstanding scale factors or proportional relationships.",
        "Longer_is_bigger": "Believing numbers with more digits are larger.",
        "Positive": "Mistakenly believing all mathematical results should be positive numbers.",
        "Ignores_zeroes": "Ignoring the place value or importance of zeros in numbers.",
        "Subtraction": "Mistakenly using subtraction to solve problems that require other operations.",
        "Correct": "Student Explanation is Correct",
        "Neither": "This explanation is confusing and it doesn't fall into any of the above categories"
    }
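The option list in Baiph's prompt (A to Z, then a to k, for 37 classes) can be derived from a dictionary like the one above. The sketch below shows one way to do that, assuming options follow the dictionary's insertion order; `render_options` is a hypothetical helper, not code from the post.

```python
import string


def render_options(mis2reason):
    """Assign A..Z then a..z to the classes and render one option line per class."""
    letters = string.ascii_uppercase + string.ascii_lowercase
    if len(mis2reason) > len(letters):
        raise ValueError("more classes than available single-letter labels")
    # dict preserves insertion order, so letters follow the order of the mapping.
    letter_to_class = dict(zip(letters, mis2reason))
    lines = [f"{letter}: {mis2reason[name]}" for letter, name in letter_to_class.items()]
    return "\n".join(lines), letter_to_class
```

Single-character labels keep the answer to one generated token for most tokenizers, and the returned lookup turns a predicted letter back into a class name.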

2. Prompt

<|im_start|>user
You are now tasked with analyzing math problems and classifying student responses. Given a math problem, the student's chosen answer, whether it's correct, and the student's explanation, you need to determine the appropriate Misconception classification.
(1) Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither in Category; e.g., True_Correct)
(2) Identifies the specific misconception present, if any.

Below are the available Misconception classifications you can choose from.
Always provide your response using only the specified format.

A: Incorrectly swapping the positions of dividend and divisor in division operations.
B: Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.
C: Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)
D: Incorrectly identifying or handling terms in algebraic expressions.
E: Completely misunderstanding fraction concepts or representation methods.
F: Providing incomplete solutions missing crucial steps or explanations.
G: Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.
H: Treating variables as specific numerical values, or vice versa.
I: Overemphasizing the first term in a sequence while ignoring the importance of other terms.
J: Using information or criteria unrelated to the problem for reasoning.
K: Incorrectly applying inverse operations or confusing relationships between operations.
L: Specific error: Always multiplying by 4 without considering the specific context.
M: Ignoring base probabilities or benchmark values, focusing only on specific cases.
N: Misunderstanding mathematical concept definitions or terminology meanings.
O: Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships
P: Believing decimals with larger whole number parts are always larger, ignoring decimal parts
Q: Incorrectly performing fraction addition operations
R: Mistakenly reversing the order of numbers, fractions, or operations.
S: Mistakenly using multiplication to solve problems that require other operations.
T: Incorrectly adding terms directly in algebraic expressions.
U: Incorrectly handling numerator-denominator conversions in fraction operations.
V: Mistakenly using division to solve problems that require other operations.
W: Incorrectly repeating numbers or operations.
X: Incorrectly handling interior angles or internal elements in geometric figures.
Y: Providing definite answers for uncertain problems, or vice versa.
Z: Believing numbers with fewer digits are larger.
a: Misunderstanding fraction concepts, including numerator-denominator relationships.
b: Incorrectly adding across place values (e.g., adding tens to ones directly).
c: Choosing completely wrong mathematical operations.
d: Changing only the denominator while ignoring corresponding changes in the numerator.
e: Misunderstanding scale factors or proportional relationships.
f: Believing numbers with more digits are larger.
g: Mistakenly believing all mathematical results should be positive numbers.
h: Ignoring the place value or importance of zeros in numbers.
i: Mistakenly using subtraction to solve problems that require other operations.
j: Student Explanation is Correct
k: This explanation is confusing and it doesn't fall into any of the above categories

Please analyze the given input and provide your classification.

### Question:
What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]

### Choices:
(A) \\( \\frac{1}{3} \\) (B) \\( \\frac{3}{9} \\) (C) \\( \\frac{3}{6} \\) (D) \\( \\frac{3}{8} \\)

### Selected Answer:
A. \\( \\frac{1}{3} \\)

### The selected answer is correct.

### Student Explanation:
I think that 1/3 is the answer, as it's the simplest form of 3/9.<|im_end|>
<|im_start|>assistant

3. Training strategy

LR = 2e-4, BS = 4*2, Epochs = 2

Extra data combined with distillation improves LB and PB, but brings no benefit after ensembling.

Model	Data	LB	PB	Used in final	Inference time
Qwen25-14B-AWQ	total data	0.948	0.946	submission 1 (best)	20min
Qwen3-14B-AWQ	fold1	0.948	0.943	submission 1 (best)	20min
Qwen25-32B-AWQ	fold1	0.948	0.944	submission 1 (best)	40min
Qwen25-32B-AWQ	total data	0.949	0.945	submission 1 (best)	40min
QWQ-32B-AWQ	ext data 80k + total data, distillation	0.949	0.948	submission 2	40min
Qwen25-32B-AWQ	ext data 80k + fold1, distillation	0.949	0.947	submission 2	40min
Qwen3-14B-AWQ	ext data 80k + fold1, distillation	0.950	0.945	submission 2	20min

4. Ensemble

Four 37-class models
Four 65-class models
Equal-weight average blending
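Averaging 37-class and 65-class models requires a shared label space. The sketch below is one possible way to do it, not the authors' confirmed method: it assumes the 65 labels follow the "True_Misconception:Additive" pattern shown earlier, that Correct and Neither use an ":NA" suffix, and that the True/False prefix can be taken from whether the selected answer is correct. Both function names are hypothetical.

```python
def expand_to_full_label(short_label, answer_is_correct):
    """Map a 37-class label to an assumed 65-class 'Category:Misconception' string."""
    prefix = "True" if answer_is_correct else "False"
    if short_label in ("Correct", "Neither"):
        return f"{prefix}_{short_label}:NA"  # ':NA' suffix is an assumption
    return f"{prefix}_Misconception:{short_label}"


def blend_top3(model_probs):
    """model_probs: list of {full_label: probability} dicts, one per model."""
    totals = {}
    for probs in model_probs:
        for label, p in probs.items():
            # Equal weights: every model contributes 1 / number_of_models.
            totals[label] = totals.get(label, 0.0) + p / len(model_probs)
    # Highest blended probability first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda label: (-totals[label], label))
    return ranked[:3]
```

The three labels returned by `blend_top3` correspond to the top-3 predictions scored by MAP@3.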

Author profile: HZM (leehann), Kaggle Grandmaster

Author profile: Baiph (pingfan), Kaggle Grandmaster
