673. MAP - Charting Student Math Misunderstandings | map-charting-student-math-misunderstandings
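Many thanks to the hosts; I'm very glad to have taken part in this competition. My teammate Baiph worked very hard, and we contributed equally to the final result. The two of us used different training pipelines, which is how we finished 2nd on both the public and private leaderboards.
P.S. I'm looking for remote work in NLP or LLMs, preferably in the UTC+8 time zone. Please email me if you have an opening.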
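The task in this competition is to predict misconceptions (misunderstandings) from students' explanations. It is a fairly simple classification problem, but the competition metric is MAP@3, which is normally used for ranking problems. That led me to believe the labels are sometimes ambiguous and that a student's explanation could reasonably be tagged with more than one target.
Given this, what we need to do is denoise the labels, or give each student explanation soft labels derived from OOF (out-of-fold) predictions so that no label leakage occurs.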
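For readers unfamiliar with the metric, here is a minimal sketch of MAP@3 under the assumption that each row has exactly one ground-truth label. The function name and the toy labels are illustrative, not taken from the competition code.

```python
def map_at_3(y_true, y_pred):
    """MAP@3 when every row has a single correct label.

    y_true: list of ground-truth labels.
    y_pred: list of ranked prediction lists (best guess first).
    """
    scores = []
    for truth, preds in zip(y_true, y_pred):
        score = 0.0
        # Only the first three predictions count; credit is 1/rank of the hit.
        for rank, label in enumerate(preds[:3], start=1):
            if label == truth:
                score = 1.0 / rank
                break
        scores.append(score)
    return sum(scores) / len(scores)


# Toy data (illustrative labels only).
truth = ["True_Correct:NA", "False_Misconception:Additive", "False_Neither:NA"]
preds = [
    ["True_Correct:NA", "True_Neither:NA", "True_Misconception:Additive"],   # hit at rank 1
    ["False_Neither:NA", "False_Misconception:Additive", "False_Correct:NA"],  # hit at rank 2
    ["False_Correct:NA", "False_Misconception:Mult", "False_Misconception:Additive"],  # miss
]
score = map_at_3(truth, preds)  # (1 + 1/2 + 0) / 3
```

Because a second or third guess still earns partial credit, a model that spreads probability sensibly over plausible labels is rewarded, which is why soft labels fit this metric.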
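On the data side, more than 30 labels have only limited data, so I used many commercial LLMs to generate external data.
For model selection, we used LLMs as the backbone and tried qwen3, qwen2.5, mistral, and phi-4.
For the training strategy, we both converted the classification problem into a multiple-choice generation problem. Only this way could we finish inference in time with vLLM.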
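As a rough illustration of the multiple-choice framing, the sketch below assigns each class a letter. It assumes spreadsheet-style letters, which is consistent with the training prompt's "A through BM" for 65 options; the helper names and the `letter. label` layout are my own assumptions.

```python
def option_letter(index):
    """Spreadsheet-style letter for a 0-based index: 0 -> 'A', 26 -> 'AA'."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def format_options(class_names):
    """Render the classes as one 'letter. label' line per option."""
    return "\n".join(
        f"{option_letter(i)}. {name}" for i, name in enumerate(class_names)
    )


# 65 placeholder class names stand in for the real label set.
class_names = [f"label_{i}" for i in range(65)]
options_block = format_options(class_names)
last_letter = option_letter(len(class_names) - 1)
```

With 65 classes the last letter comes out as `BM`, matching the range named in the prompt.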
I'll describe my best private single model, qwen3-14b, since it is the easier one to understand.
The training pipeline can be divided into 4 steps.
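Step 1: For each question, generate student outputs for the rare candidate labels.
Combine labels to build contrastive examples.
For example:
If I want to generate a student explanation labelled True_Misconception:Additive (used as a hard label in step 4 training),
I use
False_Misconception:Additive_same_math_problem and False_Misconception:Additive_different_math_problem
+
True_Misconception:Additive_same_math_problem and True_Misconception:Additive_different_math_problem
as contrastive examples in the prompt, so that ChatGPT or Claude generates better data.
I used gpt-4, gpt-5, claude-sonnet, gemini, seed, and doubao, and generated 80K samples.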
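The contrastive setup above might be assembled roughly as follows. This is only a sketch: the prompt wording, the function name, and the placeholder strings are hypothetical, and the actual generation prompt is not given in this post.

```python
def build_generation_prompt(target_label, question, contrast_examples):
    """Assemble a data-generation prompt from labelled contrast examples.

    contrast_examples: list of (label, question_text, explanation) tuples,
    mixing examples from the same question and from different questions.
    """
    lines = [
        f"Target label: {target_label}",
        f"Question: {question}",
        "Labelled reference explanations:",
    ]
    for label, example_question, explanation in contrast_examples:
        relation = "same question" if example_question == question else "different question"
        lines.append(f"- [{label}, {relation}] {explanation}")
    lines.append("Write one new student explanation that fits the target label.")
    return "\n".join(lines)


# Placeholder content only; real questions and explanations come from train.csv.
question = "<math problem A>"
other = "<math problem B>"
examples = [
    ("False_Misconception:Additive", question, "<explanation 1>"),
    ("False_Misconception:Additive", other, "<explanation 2>"),
    ("True_Misconception:Additive", question, "<explanation 3>"),
    ("True_Misconception:Additive", other, "<explanation 4>"),
]
prompt = build_generation_prompt("True_Misconception:Additive", question, examples)
```

Showing the same misconception under both True and False answers, and on both the same and a different question, gives the generator a clearer picture of what separates the target label from its neighbours.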
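Step 2: Train 4 LLMs on all of train.csv, then use them to label the external data (used as soft labels in step 4 training). The models were phi_4_reasoning_14b, qwen3_32b, mistral_12b, and qwen2_5_72b.
Here is the training prompt: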
"""Analyze the student's answer and explanation.
Determine if the student's answer is correct (True) or incorrect (False)
Evaluate if the explanation shows correct reasoning, contains a misconception, or is neither
If a misconception is present, identify the specific type
Select exactly ONE label from the 65 options below \n"""
"{0}"
"Your Answer: [Select one letter option from A through BM]\n"
"Input is :\n"
"Question: {1}\n"
"Option: {2}\n"
"Correct Answer: {3}\n"
"Student Answer: {4}\n"
"Student Explanation: {5}\n"
Step 3: Train the 4 LLMs with 5-fold cross-validation and average their oof.csv predictions (used as soft labels in step 4 training).
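A minimal sketch of the averaging in this step, assuming each model's OOF file has already been loaded as one probability vector per training sample, in the same sample order. The function name and toy numbers are illustrative.

```python
def average_oof(model_probs):
    """Average per-sample class probabilities across models.

    model_probs: one entry per model; each entry is a list of probability
    vectors, one per sample, aligned across models.
    """
    n_models = len(model_probs)
    averaged = []
    for per_sample in zip(*model_probs):
        mean = [sum(vals) / n_models for vals in zip(*per_sample)]
        total = sum(mean)
        # Renormalise so each soft label sums to 1 despite rounding.
        averaged.append([p / total for p in mean])
    return averaged


# Two toy models, two samples, three classes.
model_a = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]
soft_labels = average_oof([model_a, model_b])
```

Because every OOF prediction comes from a model that never saw that sample during training, the resulting soft labels do not leak the sample's own hard label.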
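Step 4: I modified SFTTrainer to combine a generation loss (hard labels) with a classification loss (soft labels).
target_token_id holds, for each option, the id of its token in the vocab.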
import numpy as np
import torch
import torch.nn.functional as F
from trl import SFTTrainer


class SFTChoiceTrainer(SFTTrainer):
    """SFTTrainer with a generation loss (hard label) plus a choice-level loss (soft label)."""

    def compute_loss(self, model, inputs, return_outputs=False,
                     num_items_in_batch=None, return_choice_logit=False):
        # Token ids of the option letters, stored on the model config.
        target_token_id = self.model.config.target_token_id
        target_token_id = torch.tensor(target_token_id, device=model.device)

        # Keep only the option tokens as generation targets; mask everything else.
        labels = inputs['labels']
        mask = torch.isin(labels, target_token_id)
        labels[~mask] = -100
        inputs['labels'] = labels

        _, outputs = super().compute_loss(
            model, inputs, return_outputs=True, num_items_in_batch=num_items_in_batch)
        logits = outputs.logits
        loss = outputs.loss  # generation loss on the hard label

        # Shift so that position t predicts token t + 1.
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        logits_target = []
        for i in range(len(shift_labels)):
            lbl = shift_labels[i].cpu().numpy()
            # The last unmasked position is where the answer token is predicted.
            target_idx = np.where(lbl != -100)[0][-1]
            logits_target.append(shift_logits[i][target_idx][target_token_id])

        # (batch_size, num_choices): logits restricted to the option tokens
        logits_target = torch.stack(logits_target, dim=0)
        labels_target = inputs['soft_label'].to(outputs.logits.device)
        soft_loss = F.cross_entropy(logits_target, labels_target)
        #weight = self._soft_weight()
        loss = loss + soft_loss
        return (loss, outputs) if return_outputs else loss
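Written out, the combined objective in the trainer above looks roughly like this. The notation is mine, not the author's: $B$ is the batch size, $K$ the number of options, $z_{b,k}$ the logit of the $k$-th option token at the answer position of sample $b$, and $q_{b,k}$ the soft label, assumed here to be a probability vector over the $K$ options.

```latex
\mathcal{L} = \mathcal{L}_{\text{gen}} + \mathcal{L}_{\text{soft}},
\qquad
\mathcal{L}_{\text{soft}}
  = -\frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K}
    q_{b,k} \, \log \frac{\exp(z_{b,k})}{\sum_{j=1}^{K} \exp(z_{b,j})}
```

Here $\mathcal{L}_{\text{gen}}$ is the usual next-token cross-entropy returned by the parent trainer, computed only at positions whose label is an option token. The softmax in $\mathcal{L}_{\text{soft}}$ runs over the $K$ option tokens only, not the full vocabulary, and the two terms are added with equal weight.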
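After the steps above, fold-0 CV reaches 0.955, with 0.950 on the public leaderboard and 0.946 on the private leaderboard.
Convert the 65 classes into 37 classes, and use DeepSeek to expand and explain the short misconception names.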
mis2reason = {
"SwapDividend": "Incorrectly swapping the positions of dividend and divisor in division operations.",
"Tacking": "Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.",
"Additive": "Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)",
"Wrong_term": "Incorrectly identifying or handling terms in algebraic expressions.",
"Wrong_Fraction": "Completely misunderstanding fraction concepts or representation methods.",
"Incomplete": "Providing incomplete solutions missing crucial steps or explanations.",
"Unknowable": "Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.",
"Not_variable": "Treating variables as specific numerical values, or vice versa.",
"Firstterm": "Overemphasizing the first term in a sequence while ignoring the importance of other terms.",
"Irrelevant": "Using information or criteria unrelated to the problem for reasoning.",
"Inverse_operation": "Incorrectly applying inverse operations or confusing relationships between operations.",
"Multiplying_by_4": "Specific error: Always multiplying by 4 without considering the specific context.",
"Base_rate": "Ignoring base probabilities or benchmark values, focusing only on specific cases.",
"Definition": "Misunderstanding mathematical concept definitions or terminology meanings.",
"WNB": """Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships""",
"Whole_numbers_larger": "Believing decimals with larger whole number parts are always larger, ignoring decimal parts",
"Incorrect_equivalent_fraction_addition": "Incorrectly performing fraction addition operations",
"Inversion": "Mistakenly reversing the order of numbers, fractions, or operations.",
"Mult": "Mistakenly using multiplication to solve problems that require other operations.",
"Adding_terms": "Incorrectly adding terms directly in algebraic expressions.",
"FlipChange": "Incorrectly handling numerator-denominator conversions in fraction operations.",
"Division": "Mistakenly using division to solve problems that require other operations.",
"Duplication": "Incorrectly repeating numbers or operations.",
"Interior": "Incorrectly handling interior angles or internal elements in geometric figures.",
"Certainty": "Providing definite answers for uncertain problems, or vice versa.",
"Shorter_is_bigger": "Believing numbers with fewer digits are larger.",
"Wrong_fraction": "Misunderstanding fraction concepts, including numerator-denominator relationships.",
"Adding_across": "Incorrectly adding across place values (e.g., adding tens to ones directly).",
"Wrong_Operation": "Choosing completely wrong mathematical operations.",
"Denominator-only_change": "Changing only the denominator while ignoring corresponding changes in the numerator.",
"Scale": "Misunderstanding scale factors or proportional relationships.",
"Longer_is_bigger": "Believing numbers with more digits are larger.",
"Positive": "Mistakenly believing all mathematical results should be positive numbers.",
"Ignores_zeroes": "Ignoring the place value or importance of zeros in numbers.",
"Subtraction": "Mistakenly using subtraction to solve problems that require other operations.",
"Correct": "Student Explanation is Correct",
"Neither": "This explanation is confusing and it doesn't fall into any of the above categories"
}
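One way the 65-to-37 reduction could work is to drop the True/False prefix, since answer correctness is already given in the prompt. The sketch below assumes labels follow the `Category:Misconception` pattern with `NA` for non-misconception rows; both helper names are hypothetical.

```python
def collapse_label(label):
    """Map a full label to its short class name.

    'False_Misconception:Additive' -> 'Additive'
    'True_Correct:NA'              -> 'Correct'
    """
    category, misconception = label.split(":", 1)
    kind = category.split("_", 1)[1]  # drop the True_/False_ prefix
    return misconception if kind == "Misconception" else kind


def expand_label(short_name, answer_is_correct):
    """Rebuild the full label from the short class and answer correctness."""
    prefix = "True" if answer_is_correct else "False"
    if short_name in ("Correct", "Neither"):
        return f"{prefix}_{short_name}:NA"
    return f"{prefix}_Misconception:{short_name}"


short = collapse_label("False_Misconception:Additive")
restored = expand_label(short, answer_is_correct=False)
```

The short names produced this way line up with the keys of `mis2reason`, so the model only has to choose among 37 descriptions and the full label is restored after prediction.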
<|im_start|>user
You are now tasked with analyzing math problems and classifying student responses. Given a math problem, the student's chosen answer, whether it's correct, and the student's explanation, you need to determine the appropriate Misconception classification.
(1) Assesses whether the explanation contains a misconception. (Correct, Misconception, or Neither in Category; e.g., True_Correct)
(2) Identifies the specific misconception present, if any.
Below are the available Misconception classifications you can choose from.
Always provide your response using only the specified format.
A: Incorrectly swapping the positions of dividend and divisor in division operations.
B: Arbitrarily adding zeros or decimal points to the end of numbers, believing the value remains unchanged or changes incorrectly.
C: Mistakenly using addition to solve problems that require other operations (multiplication, subtraction, etc.)
D: Incorrectly identifying or handling terms in algebraic expressions.
E: Completely misunderstanding fraction concepts or representation methods.
F: Providing incomplete solutions missing crucial steps or explanations.
G: Mistakenly believing a problem is unsolvable or lacks information when it is actually solvable.
H: Treating variables as specific numerical values, or vice versa.
I: Overemphasizing the first term in a sequence while ignoring the importance of other terms.
J: Using information or criteria unrelated to the problem for reasoning.
K: Incorrectly applying inverse operations or confusing relationships between operations.
L: Specific error: Always multiplying by 4 without considering the specific context.
M: Ignoring base probabilities or benchmark values, focusing only on specific cases.
N: Misunderstanding mathematical concept definitions or terminology meanings.
O: Mistakenly believing "the whole is not the sum of its parts" or similar part-whole relationships
P: Believing decimals with larger whole number parts are always larger, ignoring decimal parts
Q: Incorrectly performing fraction addition operations
R: Mistakenly reversing the order of numbers, fractions, or operations.
S: Mistakenly using multiplication to solve problems that require other operations.
T: Incorrectly adding terms directly in algebraic expressions.
U: Incorrectly handling numerator-denominator conversions in fraction operations.
V: Mistakenly using division to solve problems that require other operations.
W: Incorrectly repeating numbers or operations.
X: Incorrectly handling interior angles or internal elements in geometric figures.
Y: Providing definite answers for uncertain problems, or vice versa.
Z: Believing numbers with fewer digits are larger.
a: Misunderstanding fraction concepts, including numerator-denominator relationships.
b: Incorrectly adding across place values (e.g., adding tens to ones directly).
c: Choosing completely wrong mathematical operations.
d: Changing only the denominator while ignoring corresponding changes in the numerator.
e: Misunderstanding scale factors or proportional relationships.
f: Believing numbers with more digits are larger.
g: Mistakenly believing all mathematical results should be positive numbers.
h: Ignoring the place value or importance of zeros in numbers.
i: Mistakenly using subtraction to solve problems that require other operations.
j: Student Explanation is Correct
k: This explanation is confusing and it doesn't fall into any of the above categories
Please analyze the given input and provide your classification.
### Question:
What fraction of the shape is not shaded? Give your answer in its simplest form. [Image: A triangle split into 9 equal smaller triangles. 6 of them are shaded.]
### Choices:
(A) \\( \\frac{1}{3} \\) (B) \\( \\frac{3}{9} \\) (C) \\( \\frac{3}{6} \\) (D) \\( \\frac{3}{8} \\)
### Selected Answer:
A. \\( \\frac{1}{3} \\)
### The selected answer is correct.
### Student Explanation:
I think that 1/3 is the answer, as it's the simplest form of 3/9.<|im_end|>
<|im_start|>assistant
LR=2e-4, BS=4*2, Epoch=2
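At inference time the assistant turn is a single option letter, so the three MAP@3 predictions can be read off the first generated token. The sketch below covers only that post-processing step and assumes the per-letter log-probabilities have already been extracted from the inference engine's output; the function name and toy values are illustrative.

```python
import string

# 37 options as in the prompt above: 'A'..'Z' followed by 'a'..'k'.
CHOICE_LETTERS = (string.ascii_uppercase + string.ascii_lowercase)[:37]
CHOICE_SET = frozenset(CHOICE_LETTERS)


def top3_choices(letter_logprobs):
    """Return the three most likely valid option letters, best first.

    letter_logprobs: dict mapping a candidate token string to its log-probability.
    Tokens that are not one of the 37 option letters are ignored.
    """
    valid = {tok: lp for tok, lp in letter_logprobs.items() if tok in CHOICE_SET}
    ranked = sorted(valid, key=valid.get, reverse=True)
    return ranked[:3]


# Toy log-probabilities; 'z' is not a valid option and is dropped.
toy_logprobs = {"j": -0.1, "z": -1.0, "k": -2.5, "F": -4.0, "C": -6.0}
best_three = top3_choices(toy_logprobs)
```

Each returned letter indexes back into the ordered option list, which gives the short class name to expand into a full label.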
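Extra data combined with distillation improves both LB (public leaderboard) and PB (private leaderboard) scores, but brings no benefit after ensembling.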
| Model | Data | LB | PB | Used in final | Inference time |
|---|---|---|---|---|---|
| Qwen25-14B-AWQ | total data | 0.948 | 0.946 | submission 1 (best) | 20 min |
| Qwen3-14B-AWQ | fold1 | 0.948 | 0.943 | submission 1 (best) | 20 min |
| Qwen25-32B-AWQ | fold1 | 0.948 | 0.944 | submission 1 (best) | 40 min |
| Qwen25-32B-AWQ | total data | 0.949 | 0.945 | submission 1 (best) | 40 min |
| QWQ-32B-AWQ | ext data 80k + total data, distill | 0.949 | 0.948 | submission 2 | 40 min |
| Qwen25-32B-AWQ | ext data 80k + fold1, distill | 0.949 | 0.947 | submission 2 | 40 min |
| Qwen3-14B-AWQ | ext data 80k + fold1, distill | 0.950 | 0.945 | submission 2 | 20 min |