
MAP2025_Private&Public 2nd

673. MAP - Charting Student Math Misunderstandings | map-charting-student-math-misunderstandings

Start: 2025-07-10 End: 2025-10-15 Personalized learning, data algorithm competition
MAP2025 Private & Public Leaderboard 2nd Place Solution
Authors: HZM (leehann) & Baiph (pingfan)
Published: 2025-10-17
Competition rank: 2nd place (Grandmaster)


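Many thanks to the organizers; I am glad to have taken part in this competition. My teammate Baiph worked very hard and we contributed equally to the final result. The two of us used different training pipelines, which is how we reached 2nd place on both the public and private leaderboards.

P.S. I am looking for a remote job in NLP or LLMs, preferably in the UTC+8 time zone. Please email me if you have an opening.

Solution overview

The task in this competition is to predict misconceptions (misunderstandings) from students' explanations. It is a very simple classification problem, but the competition metric is MAP@3, which is normally used for ranking problems. That suggests to me that the labels are sometimes ambiguous and that a student explanation could reasonably be tagged with more than one target.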
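
To make the metric concrete, here is a minimal sketch of MAP@3, assuming the usual setup in which each row has exactly one true label and up to three ranked predictions. The function name `map_at_3` and the sample rows are illustrative and not taken from the competition code.

```python
def map_at_3(predictions, truths):
    """Mean average precision at 3, assuming one true label per row."""
    total = 0.0
    for ranked, truth in zip(predictions, truths):
        # Only the first three ranked guesses count.
        for rank, label in enumerate(ranked[:3], start=1):
            if label == truth:
                total += 1.0 / rank  # 1, 1/2 or 1/3 depending on the position
                break
    return total / len(truths)


# Illustrative rows: the first is right at rank 1, the second at rank 2.
preds = [
    ["label_a", "label_b", "label_c"],
    ["label_c", "label_b", "label_a"],
]
truths = ["label_a", "label_b"]
score = map_at_3(preds, truths)
print(score)
```

A correct label in second or third place still earns partial credit, which is why a well-calibrated ranking over labels matters more here than in plain accuracy.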

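Given that analysis, what we need to do is denoise the labels, or assign soft labels to the student explanations from OOF (out-of-fold) predictions so that labels do not leak.

As for the data, more than 30 of the labels have only limited data, so I used a number of commercial LLMs to generate external data.

For model selection, we used LLMs as the backbone and tried qwen3, qwen2.5, mistral, and phi-4.

For the training strategy, we both converted the classification problem into a multiple-choice generation problem; only that way could we finish inference in time with vllm.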
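
Turning 65 labels into a multiple-choice question needs 65 short option names. The Step 2 prompt later in this post asks for a letter "from A through BM", which matches spreadsheet-style column names (A..Z, AA..AZ, BA..BM). The sketch below generates such names; the helper `option_letters` is hypothetical and the authors' actual mapping is not shown in the post.

```python
def option_letters(n):
    """Return n spreadsheet-style option names: A..Z, AA..AZ, BA.. and so on."""
    letters = []
    for i in range(n):
        label, k = "", i
        while True:
            # Peel off the least significant "digit" in bijective base 26.
            label = chr(ord("A") + k % 26) + label
            k = k // 26 - 1
            if k < 0:
                break
        letters.append(label)
    return letters


letters = option_letters(65)
print(letters[0], letters[25], letters[26], letters[-1])
```

With 65 options the last name comes out as BM, consistent with the range stated in the prompt.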

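HZM part

I will describe my best private single model, qwen3-14b, because it is the easiest to understand.
The training pipeline can be divided into 4 steps.

Step 1: Generate external data (both Baiph and I used this data)

For each question, generate student outputs for the rare candidate labels.
Combine the labels for contrastive learning.

For example:
If I want to generate a student explanation labeled True_Misconception:Additive (used as the hard label in Step 4 training),
I take
False_Misconception:Additive_same_math_problem and False_Misconception:Additive_different_math_problem
+
True_Misconception:Additive_same_math_problem and True_Misconception:Additive_different_math_problem
as contrastive examples in the prompt, so that ChatGPT or Claude generates better data.

I used gpt-4, gpt-5, claude-sonnet, gemini, seed, and doubao, and generated 80K samples.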
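
The contrastive setup in Step 1 can be sketched as a small prompt builder. Everything here is an assumption for illustration: the function `build_generation_prompt`, the wording of the instructions, and the placeholder explanations. The actual generation prompt is not given in the post.

```python
def build_generation_prompt(target_label, question, examples):
    """Assemble a few-shot prompt.

    `examples` maps (label, scope) to a list of reference explanations, where
    scope is "same_math_problem" or "different_math_problem".
    """
    lines = [
        f"Write one new student explanation labeled {target_label}.",
        f"Math problem: {question}",
        "Reference explanations to contrast against:",
    ]
    for (label, scope), texts in examples.items():
        for text in texts:
            lines.append(f"- [{label} | {scope}] {text}")
    lines.append("New explanation:")
    return "\n".join(lines)


# Placeholders stand in for real rows sampled from train.csv.
examples = {
    ("False_Misconception:Additive", "same_math_problem"): ["<explanation>"],
    ("False_Misconception:Additive", "different_math_problem"): ["<explanation>"],
    ("True_Misconception:Additive", "same_math_problem"): ["<explanation>"],
    ("True_Misconception:Additive", "different_math_problem"): ["<explanation>"],
}
prompt = build_generation_prompt("True_Misconception:Additive", "<question text>", examples)
print(prompt)
```

Showing the same misconception under both the True_ and False_ prefixes, on the same and on different problems, gives the generator a contrast to work from rather than a single positive example.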

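Step 2: Train LLMs to label the external data

Train 4 LLMs on all of train.csv, then use them to label the external data (used as soft labels in Step 4 training). The models are phi_4_reasoning_14b, qwen3_32b, mistral_12b, and qwen2_5_72b.

Here is the training prompt: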
"""Analyze the student's answer and explanation.            
Determine if the student's answer is correct (True) or incorrect (False)
Evaluate if the explanation shows correct reasoning, contains a misconception, or is neither
If a misconception is present, identify the specific type
Select exactly ONE label from the 65 options below \n"""
"{0}"
"Your Answer: [Select one letter option from A through BM]\n"
"Input is :\n"
"Question: {1}\n"
"Option: {2}\n"
"Correct Answer: {3}\n"
"Student Answer: {4}\n"
"Student Explanation: {5}\n"

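Step 3: 5-fold training of the LLMs to generate soft labels

Train the 4 LLMs with 5-fold cross-validation and average their oof.csv files (used as soft labels in Step 4 training).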
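
A minimal sketch of the averaging in Step 3, assuming each model's OOF file has already been loaded into a dict from row id to a per-class probability list. The loader, the data layout, and the name `average_soft_labels` are assumptions; the post does not describe the oof.csv schema.

```python
def average_soft_labels(model_probs):
    """Average per-row class probabilities across models.

    `model_probs` is a list with one dict per model: {row_id: [p_class0, ...]}.
    Each row's vector is assumed to come from the fold that did not train on it.
    """
    averaged = {}
    for row_id in model_probs[0]:
        vectors = [probs[row_id] for probs in model_probs]
        n_classes = len(vectors[0])
        averaged[row_id] = [
            sum(vec[c] for vec in vectors) / len(vectors) for c in range(n_classes)
        ]
    return averaged


# Two toy models and two classes, for illustration only.
model_a = {"row_0": [0.8, 0.2]}
model_b = {"row_0": [0.6, 0.4]}
averaged = average_soft_labels([model_a, model_b])
print(averaged["row_0"])
```

Because every prediction is out-of-fold, the averaged vector never includes a model's view of a row it was trained on, which is what keeps the soft labels from leaking the original label.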

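Step 4: Multi-loss design for qwen3-14b

I modified SFTTrainer to combine a generation loss (hard labels) with a classification loss (soft labels).
target_token_id holds the vocab token id that each option is converted to.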

import numpy as np
import torch
import torch.nn.functional as F
from trl import SFTTrainer


class SFTChoiceTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None, return_choice_logit=False):
        # Vocab ids of the option tokens, one per choice.
        target_token_id = self.model.config.target_token_id
        target_token_id = torch.tensor(target_token_id, device=model.device)

        # Keep only option tokens as generation targets; mask everything else.
        labels = inputs['labels']
        mask = torch.isin(labels, target_token_id)
        labels[~mask] = -100
        inputs['labels'] = labels
        # Generation loss on the hard label.
        _, outputs = super().compute_loss(model, inputs, return_outputs=True, num_items_in_batch=num_items_in_batch)
        logits = outputs.logits
        loss = outputs.loss
        # Shift so that the logits at position t predict the label at t + 1.
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        logits_target = []
        for i in range(len(shift_labels)):
            lbl = shift_labels[i].cpu().numpy()
            # Position of the last unmasked label, i.e. the answer token.
            target_idx = np.where(lbl != -100)[0][-1]
            logits_target.append(shift_logits[i][target_idx][target_token_id])

        # (batch_size, num_choices)
        logits_target = torch.stack(logits_target, dim=0)
        # Soft labels from the batch, moved to the same device as the logits.
        labels_target = inputs['soft_label'].to(outputs.logits.device)
        soft_loss = F.cross_entropy(logits_target, labels_target)

        # Unweighted sum of the generation loss and the soft-label loss.
        loss = loss + soft_loss
        return (loss, outputs) if return_outputs else loss
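
To show what the soft-label term measures, here is a dependency-free sketch of cross-entropy against a probability vector for a single example. It assumes soft_label is a per-class probability vector, in which case F.cross_entropy with floating-point class-probability targets (supported in recent PyTorch versions) averages this quantity over the batch. The helper names and numbers are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(p * lp for p, lp in zip(soft_targets, log_softmax(logits)))


# Logits restricted to the option tokens, and an averaged OOF distribution.
choice_logits = [2.0, 1.0, 0.0]
soft_label = [0.7, 0.2, 0.1]
soft_loss = soft_cross_entropy(choice_logits, soft_label)

# With a one-hot target the same formula reduces to ordinary hard-label loss.
hard_loss = soft_cross_entropy(choice_logits, [1.0, 0.0, 0.0])
print(soft_loss, hard_loss)
```

The soft target penalizes the model for putting all its mass on one option when the OOF ensemble considered several labels plausible, which fits the ambiguity argument made in the overview.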

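After the steps above, fold-0 CV reaches 0.955, with 0.950 on the public leaderboard and 0.946 on the private leaderboard.

Baiph part

1. Misconception expansion

Convert the 65 classes into 37 classes, and use DeepSeek to expand and explain the short misconception names.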

mis2reason = {
        "SwapDividend": "Incorrectly swapping the positions of dividend and divisor in division operations.",
        "Tacking": "Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.",
        "Additive": "Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)",
        "Wrong_term": "Incorrectly identifying or handling terms in algebraic expressions.",
        "Wrong_Fraction": "Completely misunderstanding fraction concepts or representation methods.",
        "Incomplete": "Providing incomplete solutions missing crucial steps or explanations.",
        "Unknowable": "Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.",
        "Not_variable": "Treating variables as specific numerical values, or vice versa.",
        "Firstterm": "Overemphasizing the first term in a sequence while ignoring the importance of other terms.",
        "Irrelevant": "Using information or criteria unrelated to the problem for reasoning.",
        "Inverse_operation": "Incorrectly applying inverse operations or confusing relationships between operations.",
        "Multiplying_by_4": "Specific error: Always multiplying by 4 without considering the specific context.",
        "Base_rate": "Ignoring base probabilities or benchmark values, focusing only on specific cases.",
        "Definition": "Misunderstanding mathematical concept definitions or terminology meanings.",
        "WNB": """Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships""",
        "Whole_numbers_larger": "Believing decimals with larger whole number parts are always larger, ignoring decimal parts",
        "Incorrect_equivalent_fraction_addition": "Incorrectly performing fraction addition operations",
        "Inversion": "Mistakenly reversing the order of numbers, fractions, or operations.",
        "Mult": "Mistakenly using multiplication to solve problems that require other operations.",
        "Adding_terms": "Incorrectly adding terms directly in algebraic expressions.",
        "FlipChange": "Incorrectly handling numerator-denominator conversions in fraction operations.",
        "Division": "Mistakenly using division to solve problems that require other operations.",
        "Duplication": "Incorrectly repeating numbers or operations.",
        "Interior": "Incorrectly handling interior angles or internal elements in geometric figures.",
        "Certainty": "Providing definite answers for uncertain problems, or vice versa.",
        "Shorter_is_bigger": "Believing numbers with fewer digits are larger.",
        "Wrong_fraction": "Misunderstanding fraction concepts, including numerator-denominator relationships.",
        "Adding_across": "Incorrectly adding across place values (e.g., adding tens to ones directly).",
        "Wrong_Operation": "Choosing completely wrong mathematical operations.",
        "Denominator-only_change": "Changing only the denominator while ignoring corresponding changes in the numerator.",
        "Scale": "Misunderstanding scale factors or proportional relationships.",
        "Longer_is_bigger": "Believing numbers with more digits are larger.",
        "Positive": "Mistakenly believing all mathematical results should be positive numbers.",
        "Ignores_zeroes": "Ignoring the place value or importance of zeros in numbers.",
        "Subtraction": "Mistakenly using subtraction to solve problems that require other operations.",
        "Correct": "Student Explanation is Correct",
        "Neither": "This explanation is confusing and it doesn't fall into any of the above categories"
    }
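
One way the 65-to-37 reduction could work is sketched below. It assumes labels have the form Category:Misconception (for example True_Misconception:Additive), that rows without a misconception carry a placeholder such as NA after the colon, and that the True/False prefix can be restored from whether the selected answer is correct. The helper names and the NA placeholder are assumptions, not details confirmed in the post.

```python
def to_37_class(label):
    """Drop the True/False prefix and keep the misconception, Correct, or Neither."""
    category, _, misconception = label.partition(":")
    kind = category.split("_", 1)[1]  # "Correct", "Neither" or "Misconception"
    return misconception if kind == "Misconception" else kind


def to_65_class(key, answer_is_correct):
    """Rebuild the full label from a 37-class key and the answer's correctness."""
    prefix = "True" if answer_is_correct else "False"
    if key in ("Correct", "Neither"):
        return f"{prefix}_{key}:NA"  # "NA" placeholder is an assumption
    return f"{prefix}_Misconception:{key}"


reduced = to_37_class("False_Misconception:Additive")
restored = to_65_class(reduced, answer_is_correct=False)
print(reduced, restored)
```

Since the prompt already states whether the selected answer is correct, the prefix carries no information the model has to predict, so dropping it shrinks the label space without losing anything.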

2. Prompt

<|im_start|>user
You are now tasked with analyzing math problems and classifying student responses. Given a math problem, the student's chosen answer, whether it's correct, and the student's explanation, you need to determine the appropriate Misconception classification.
(1) Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither in Category; e.g., True_Correct)
(2) Identifies the specific misconception present, if any.

Below are the available Misconception classifications you can choose from.
Always provide your response using only the specified format.

A: Incorrectly swapping the positions of dividend and divisor in division operations.
B: Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.
C: Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)
D: Incorrectly identifying or handling terms in algebraic expressions.
E: Completely misunderstanding fraction concepts or representation methods.
F: Providing incomplete solutions missing crucial steps or explanations.
G: Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.
H: Treating variables as specific numerical values, or vice versa.
I: Overemphasizing the first term in a sequence while ignoring the importance of other terms.
J: Using information or criteria unrelated to the problem for reasoning.
K: Incorrectly applying inverse operations or confusing relationships between operations.
L: Specific error: Always multiplying by 4 without considering the specific context.
M: Ignoring base probabilities or benchmark values, focusing only on specific cases.
N: Misunderstanding mathematical concept definitions or terminology meanings.
O: Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships
P: Believing decimals with larger whole number parts are always larger, ignoring decimal parts
Q: Incorrectly performing fraction addition operations
R: Mistakenly reversing the order of numbers, fractions, or operations.
S: Mistakenly using multiplication to solve problems that require other operations.
T: Incorrectly adding terms directly in algebraic expressions.
U: Incorrectly handling numerator-denominator conversions in fraction operations.
V: Mistakenly using division to solve problems that require other operations.
W: Incorrectly repeating numbers or operations.
X: Incorrectly handling interior angles or internal elements in geometric figures.
Y: Providing definite answers for uncertain problems, or vice versa.
Z: Believing numbers with fewer digits are larger.
a: Misunderstanding fraction concepts, including numerator-denominator relationships.
b: Incorrectly adding across place values (e.g., adding tens to ones directly).
c: Choosing completely wrong mathematical operations.
d: Changing only the denominator while ignoring corresponding changes in the numerator.
e: Misunderstanding scale factors or proportional relationships.
f: Believing numbers with more digits are larger.
g: Mistakenly believing all mathematical results should be positive numbers.
h: Ignoring the place value or importance of zeros in numbers.
i: Mistakenly using subtraction to solve problems that require other operations.
j: Student Explanation is Correct
k: This explanation is confusing and it doesn't fall into any of the above categories

Please analyze the given input and provide your classification.

### Question:
What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]

### Choices:
(A) \\( \\frac{1}{3} \\) (B) \\( \\frac{3}{9} \\) (C) \\( \\frac{3}{6} \\) (D) \\( \\frac{3}{8} \\)

### Selected Answer:
A. \\( \\frac{1}{3} \\)

### The selected answer is correct.

### Student Explanation:
I think that 1/3 is the answer, as it's the simplest form of 3/9.<|im_end|>
<|im_start|>assistant

3. Training strategy

LR=2e-4, BS=4*2, Epoch=2

Extra data combined with distillation improves both LB and PB, but brings no benefit after ensembling.

Model | Data | LB | PB | Used in final | Inference time
Qwen25-14B-AWQ | total data | 0.948 | 0.946 | submission 1 (best) | 20min
Qwen3-14B-AWQ | fold1 | 0.948 | 0.943 | submission 1 (best) | 20min
Qwen25-32B-AWQ | fold1 | 0.948 | 0.944 | submission 1 (best) | 40min
Qwen25-32B-AWQ | total data | 0.949 | 0.945 | submission 1 (best) | 40min
QWQ-32B-AWQ | ext data 80k, total data training, distill | 0.949 | 0.948 | submission 2 | 40min
Qwen25-32B-AWQ | ext data 80k, fold1, distill | 0.949 | 0.947 | submission 2 | 40min
Qwen3-14B-AWQ | ext data 80k, fold1, distill | 0.950 | 0.945 | submission 2 | 20min

4. Ensemble

  • Four 37-class models
  • Four 65-class models
  • Equal-weight average blending
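
Equal-weight blending followed by a top-3 cut can be sketched as below. The sketch assumes every model's output has already been mapped into one shared label space; how the 37-class and 65-class outputs were aligned is not described in the post, and `ensemble_top3` is a hypothetical name.

```python
def ensemble_top3(prob_dicts):
    """Average label probabilities with equal weights and return the top 3 labels."""
    totals = {}
    for probs in prob_dicts:
        for label, p in probs.items():
            # A label missing from a model's output contributes zero.
            totals[label] = totals.get(label, 0.0) + p / len(prob_dicts)
    # Sort by descending probability, breaking ties by label name.
    return sorted(totals, key=lambda label: (-totals[label], label))[:3]


# Toy outputs from two models over illustrative labels.
model_1 = {"label_a": 0.6, "label_b": 0.3, "label_c": 0.1}
model_2 = {"label_a": 0.4, "label_b": 0.5, "label_d": 0.1}
top3 = ensemble_top3([model_1, model_2])
print(top3)
```

The three labels returned are what a MAP@3 submission row would contain, in ranked order.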