Public 8th / Private 24th Solution (Missed a Gold Submission)

Author: I2nfinit3y and team

Published: 2025-10-16

Competition: MAP - Charting Student Math Misunderstandings

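Thanks to the hosts and to everyone in this competition. Thanks also to each of my excellent teammates @l1ghtsource, @lechengyan, @chronoscop, and @danilamalinka! Unfortunately, we missed a submission that would have earned gold :(. Still, this was my first LLM competition, and I learned a lot.

Our final submission combines one causal language model of mine (LB 0.948) with four sequence classification models from @l1ghtsource (one at LB 0.949 and three at 0.948), plus some post-processing. I will explain my part first, and then my teammate will add more details about his part.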

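I2nfinit3y's part

Data

All of the training data, with 65 classes.

Model and prompt

My best causal language model was qwen3-reranker-8b. The prompt gives the model the context (question, answer, student explanation, whether the answer is correct, the common misconception, and the error rate) followed by the 65 options (one per class), formatted as Markdown, and asks it to output the most likely option.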

import torch

system_content = (
    "You are an expert AI assistant specializing in educational assessment. "
    "Your task is to analyze a student's reasoning and select the single most accurate "
    "classification from a list of options. The options may include specific mathematical "
    "misconceptions or broader categories indicating that the explanation is correct or irrelevant."
)

def choice_collate_fn(batch):
    prompts = []
    labels = []
    for example in batch:
        user_content = "Based on the student's reasoning provided in the context below, select the single best descriptive option.\n\n"

        user_content += "### Context and Student Data\n"
        user_content += f"- **Question**: {example['QuestionText']}\n"
        user_content += f"- **Student's Answer**: {example['MC_Answer']}\n"
        user_content += f"- **Is the student's answer correct?**: {'Yes' if example.get('is_correct') == 1 else 'No'}\n"

        # Optional per-question features, added only when present in the example.
        if 'question_difficulty' in example:
            user_content += f"- **Question Difficulty**: {example['question_difficulty']}\n"
        if 'common_misconception' in example:
            user_content += f"- **Common Misconception for this Question**: {example['common_misconception']}\n"

        user_content += "\n### Student Explanation to Analyze\n"
        user_content += f"```{example['StudentExplanation']}```\n\n"

        # One line per class, each prefixed with its single-character option marker.
        user_content += "### Options\n"
        for i, miscon in enumerate(target_classes):
            user_content += f"{choice_tokens[i]}. {miscon}\n"

        user_content += "\n### Your Choice:\nThe most accurate option is"

        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ]

        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        prompts.append(prompt)
        labels.append(example['label'])

    tokenized = tokenizer(prompts, padding="longest", truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
    tokenized['labels'] = torch.tensor(labels, dtype=torch.long)
    return tokenized

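During training, I take the logits at the last token position, keep only the entries for the 65 option token ids, and compute the cross-entropy loss over them.

import string

# One single-character option marker per class: a-z, A-Z, then punctuation (84 candidates in total).
choice_tokens = list(string.ascii_lowercase + string.ascii_uppercase + string.punctuation)[:n_classes]
choice_token_ids = []

for token in choice_tokens:
    encoded = tokenizer.encode(f"{token}", add_special_tokens=False)
    # Keep the last id in case the tokenizer splits the marker into several tokens.
    choice_token_ids.append(encoded[-1])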

......

outputs = model(**model_inputs)
logits = outputs.logits
last_token_logits = logits[:, -1, :]
choice_logits = last_token_logits[:, choice_token_ids]
ce_loss = loss_fct(choice_logits, batch['labels'])

Inference

I used vllm with a logits processor at inference time, so I only needed the logprobs of the 65 options. The best LB score of this model was 0.948.
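To illustrate what restricting the output to the option tokens means numerically, here is a minimal NumPy sketch. It is not the vllm logits-processor code used in the solution. It only shows the log-softmax taken over the option token ids, and the function name `option_logprobs` and the toy token ids are assumptions made for this example.

```python
import numpy as np

def option_logprobs(last_token_logits, choice_token_ids):
    """Log-softmax over the option tokens only; every other vocabulary entry is ignored."""
    logits = np.asarray(last_token_logits, dtype=np.float64)[..., choice_token_ids]
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Toy vocabulary of 10 tokens, 3 of which are option tokens (hypothetical ids).
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(2, 10))
toy_choice_ids = [1, 4, 7]
toy_logprobs = option_logprobs(toy_logits, toy_choice_ids)
toy_probs = np.exp(toy_logprobs)  # per-sample distribution over the options
```

The resulting per-sample distribution over the options is what gets blended with the other models' class probabilities later on.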

Training parameters

lora_rank = 16, lora_alpha=32, dropout=0.1, lr=5e-4, batch_size=128, lr_scheduler='linear'

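What didn't work

Full fine-tuning: overfit very easily.
Other 7b~9b models: qwen3-reranker-8b was the best model in my experiments.
Larger models: I tried several 14b models, but all of them hurt the results, which is strange.
Distillation: I tried self-distillation as well as using a better model as the teacher, with no improvement.
Ranking losses: pairwise loss and listwise loss.
Synthetic data: adding any synthetic data hurt my model's performance.
Chain of thought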

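l1ghtsource's part

My code: https://github.com/l1ghtsource/map-misunderstandings-2025

Interestingly, I already had a submission scoring 0.948 on the private LB more than a month ago. It was just a blend of a few weak 0.945 models and models from public notebooks. Apparently, strengthening my models further only made them worse on the private LB, which is strange.

Preprocessing

First, I deduplicated the data:

if USE_DEDUPLICATION:
    train = train.drop_duplicates(subset=['QuestionId', 'MC_Answer', 'StudentExplanation', 'Category', 'Misconception'])
    print(f'After deduplication {train.shape=}')

Then I experimented with different targets: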

if USE_FRACTION_COMBINE:
    # Merge the two spellings of the same misconception label.
    print('Using fraction combine')
    train.loc[train['Misconception'] == 'Wrong_fraction', 'Misconception'] = 'Wrong_Fraction'

if USE_CATEGORY_REDUCING:
    # Drop the True_/False_ prefix and keep only the last part of the category.
    print('Using category reducing')
    train['Category'] = train['Category'].apply(lambda cat: cat.split('_')[-1])

if TARGET_TYPE == 'default':
    train['target'] = train['Category'] + ':' + train['Misconception']
elif TARGET_TYPE == 'category':
    train['target'] = train['Category']
elif TARGET_TYPE == 'misconception':
    train['target'] = train['Misconception']
else:
    print('Target will be undefined! Check TARGET_TYPE')
print(f'Using {TARGET_TYPE=}')

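I tried removing the True/False prefix and merging identical categories into one, but this did not give good results on the LB. I also tried training a model with two heads that predict the category and the misconception separately. This makes sense in theory, because such a model can predict combinations that never appear in the training set. In the end, however, I found it best to use the original target (65 classes).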

idx = train.apply(lambda row: row['category_for_fe'].split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer'])['MC_Answer'].transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1

train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train['is_correct'] = train['is_correct'].fillna(0)

idx_explanation = train.apply(lambda row: row['category_for_fe'].split('_')[0], axis=1) == 'True'
correct_info_df = train.loc[idx_explanation].copy()
correct_info_df['c'] = correct_info_df.groupby(['QuestionId', 'MC_Answer'])['MC_Answer'].transform('count')
correct_info_df = correct_info_df.sort_values('c', ascending=False)
canonical_correct_info = correct_info_df.drop_duplicates(subset=['QuestionId'])
canonical_correct_info = canonical_correct_info[['QuestionId', 'MC_Answer', 'StudentExplanation']].rename(
    columns={'MC_Answer': 'Correct_Answer', 'StudentExplanation': 'Correct_Explanation'}
)
train = train.merge(canonical_correct_info, on='QuestionId', how='left')
train['Correct_Answer'] = train['Correct_Answer'].fillna('N/A')
train['Correct_Explanation'] = train['Correct_Explanation'].fillna('N/A')

possible_answers = train.groupby('QuestionId')['MC_Answer'].agg(set).to_dict()

qa2labels = train.groupby(['QuestionId', 'MC_Answer'])['label'].unique().to_dict()

In the end, my prompt looked like this:

from pylatexenc.latex2text import LatexNodes2Text

converter = LatexNodes2Text()

def delatex(text):
    # Optionally convert LaTeX markup to plain text.
    if DO_DELATEX:
        return converter.latex_to_text(text)
    return text

def format_input(row):
    x = 'Yes' if row['is_correct'] else 'No'

    # All answer choices observed for this question, in a stable order.
    variants_text = ''
    if ADD_VARIANTS_TO_PROMPT:
        answers = possible_answers.get(row['QuestionId'], set())
        sorted_answers = sorted(answers, key=lambda v: (str(v)))
        labels = ['A', 'B', 'C', 'D']
        variants_lines = [f'{label}) {ans}' for label, ans in zip(labels, sorted_answers)]
        variants_text = 'Choices:\n' + '\n'.join(variants_lines) + '\n'

    # Labels seen in training for this (question, answer) pair.
    possible_targets = ''
    if ADD_POSSIBLE_TARGETS:
        allowed = list(set(qa2labels.get((row['QuestionId'], row['MC_Answer']), [])))
        allowed = [str(x) for x in allowed]
        possible_targets = 'Possible targets: ' + ', '.join(allowed)

    if ADD_CORRECT_ANS_TO_PROMPT:
        return delatex(
            f'Question: {row["QuestionText"]}\n'
            f'{variants_text}'
            f'Answer: {row["MC_Answer"]}\n'
            f'Correct? {x}\n'
            f'Student Explanation: {row["StudentExplanation"]}\n'
            f'Correct Answer: {row["Correct_Answer"]}\n'
            f'{possible_targets}'
        )
    else:
        return delatex(
            f'Question: {row["QuestionText"]}\n'
            f'{variants_text}'
            f'Answer: {row["MC_Answer"]}\n'
            f'Correct? {x}\n'
            f'Student Explanation: {row["StudentExplanation"]}\n'
            f'{possible_targets}'
        )

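I used the following:

Answer choices: each question always has a fixed set of four answer choices
Possible targets: for each such pair, I built the mapping (question_id, answer) -> [cat:misc list]
Delatex (only helped the phi4 model)
The correct answer to the question

Modeling

Throughout the competition I used a random 90/10 split. In the final week, I switched to training my best models on the full dataset. As a result, I ended up with one 0.949 model and three 0.948 models.

These are the models I selected in the end:

[Figure: final models]

(Almost) all experiments can be found here: https://docs.google.com/spreadsheets/d/1yHdaMrEjK2xZncWzvF7Z0I0LkWl35BHMnv0dS3bd7So

I trained all models with Unsloth using 8-bit bnb QLoRA, which made experiments about 2x faster and let me test ideas quickly on a single A100.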

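Unsuccessful experiments

1) Adapters per QuestionId: since the list of questions is fixed, I trained 15 LoRA adapters and used a separate adapter for each group of questions at inference. This gave an overall score of 0.938 LB.

2) Reranker as a second stage: I mined hard examples for training (took the top 7 predictions for each sample, set the correct one aside as the positive, and used the remaining 6 as negatives). Then I trained a Qwen3-8B reranker with Open-Retrievals. More details here.

3) Using the EEDI winners' models as the initial model

4) Target='Q_ID:CAT:MISC'

5) Synthetic data generation

6) Augmentations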
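The hard-example mining from item 2 can be sketched as follows. This is only an illustration of the idea, not the code from the repository. The function name `mine_hard_negatives`, the returned dictionary layout, and the choice to always keep exactly `n_candidates - 1` negatives are assumptions.

```python
def mine_hard_negatives(ranked_labels, true_label, n_candidates=7):
    """Build one reranker training example from a first-stage ranking.

    ranked_labels: class labels sorted from most to least probable.
    true_label:    the ground-truth class, used as the positive.
    The highest-ranked wrong labels become the hard negatives.
    """
    negatives = [lbl for lbl in ranked_labels if lbl != true_label]
    return {"positive": true_label, "negatives": negatives[: n_candidates - 1]}

# Hypothetical first-stage ranking for one sample whose true class is "c".
example = mine_hard_negatives(["a", "b", "c", "d", "e", "f", "g", "h"], "c")
```

Taking the negatives from the top of the first-stage ranking means the reranker trains on the confusions the classifier actually makes, rather than on randomly sampled wrong classes.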

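Synthetic data generation

I used GPT-4.1-mini to generate synthetic data. The main idea was to create additional student explanations for every (QuestionId, MC_Answer, Category, Misconception) combination.

I wrote 3 prompt versions and alternated between them at random to increase variability. Each prompt included:

Task context: the full question text plus all answer choices
Target description: for every cat:misc combination, a detailed description written with Claude Sonnet 4 of exactly what the student explanation should contain
Few-shot examples: 5-10 real examples from the training set with the same (question, answer, target) combination
Style instructions: two sets of instructions for text diversity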
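A rough sketch of how the four prompt components could be assembled is shown below. The templates, the style instructions, and the function name `build_generation_prompt` are all invented for illustration and are not the prompts actually used. The call to the generation model is deliberately left out.

```python
import random

# Three hypothetical prompt versions, chosen at random per request.
TEMPLATES = [
    "Task: write one new student explanation.\n{context}\n{target}\n{shots}\n{style}",
    "{context}\n{target}\nReal examples:\n{shots}\n{style}\nNow write one more explanation.",
    "{style}\n{context}\n{shots}\n{target}\nWrite a new explanation in the same spirit.",
]
# Two hypothetical sets of text-diversity instructions; one line is drawn from each.
STYLE_SET_A = ["Use short, informal sentences.", "Write like a hurried student."]
STYLE_SET_B = ["Vary the vocabulary.", "Small spelling slips are fine."]

def build_generation_prompt(question, answers, target_description, real_examples, rng):
    """Combine task context, target description, few-shot examples and style notes."""
    n_shots = min(len(real_examples), rng.randint(5, 10))  # 5-10 real examples
    shots = rng.sample(real_examples, n_shots)
    context = f"Question: {question}\nAnswer choices: " + "; ".join(answers)
    return rng.choice(TEMPLATES).format(
        context=context,
        target=f"Target behaviour: {target_description}",
        shots="\n".join(f"- {s}" for s in shots),
        style=f"{rng.choice(STYLE_SET_A)} {rng.choice(STYLE_SET_B)}",
    )

demo_rng = random.Random(0)
demo_examples = [f"explanation {i}" for i in range(12)]
demo_prompt = build_generation_prompt(
    "What is 1/2 + 1/4?", ["3/4", "2/6", "1/6", "2/4"],
    "adds numerators and denominators", demo_examples, demo_rng,
)
```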

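Generation code: link

Unfortunately, I could not benefit from the synthetic data; it always made both my LB score and my cross-validation results worse.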

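Augmentations

For text augmentation, I implemented a class with 8 different augmentation techniques, applied in random order:

Numeric transformations:

Fractions to words: "1/2" → "one half", "3/4" → "three quarters" (p=0.7)
Numbers to words: "24" → "twenty-four" (p=0.7)
Spaces around operators: "2+3" → "2 + 3" or left unchanged (p=0.6)
Decimal format: "0.5" → ".5" or "0.50" (p=0.6)

Text variations:

Synonym replacement: "because" → "since/as/due to", "divide" → "divided by", "times" → "multiplied by/x" (p=1.0)
Contractions: "it is" → "it's", "cannot" → "can't" (p=0.8)
Filler phrase injection: prepend "I think", "maybe", "probably" (p=0.2)

Realistic errors:

Keyboard typos: replace a random character with a keyboard neighbor (a→q/w/s/z) (p=0.4)
Letter duplication: "good" → "goodd" (p=0.2)
Order shuffling: randomly reorder steps separated by "then", "next", "after that" (p=0.5)

The idea was to make the model more robust to variations in student writing and to typos. Unfortunately, the augmentations did not improve my score: they either had no effect or slightly worsened the cross-validation results.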
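As an illustration, three of the techniques listed above could look roughly like this. It is a simplified sketch and not the augmentation class from the repository. The function names, the tiny keyboard-neighbor table, and the decision to leave "/" alone so that fractions stay intact are assumptions.

```python
import random
import re

# Small hypothetical subset of a keyboard-neighbor table.
KEYBOARD_NEIGHBOURS = {"a": "qwsz", "e": "wrds", "o": "ipkl", "t": "rygf"}

def space_operators(text, rng, p=0.6):
    """'2+3' -> '2 + 3' with probability p ('/' is skipped to keep fractions intact)."""
    if rng.random() < p:
        return re.sub(r"(?<=\d)\s*([+\-*=])\s*(?=\d)", r" \1 ", text)
    return text

def keyboard_typo(text, rng, p=0.4):
    """Replace one random character with a keyboard neighbor with probability p."""
    positions = [i for i, ch in enumerate(text) if ch in KEYBOARD_NEIGHBOURS]
    if positions and rng.random() < p:
        i = rng.choice(positions)
        return text[:i] + rng.choice(KEYBOARD_NEIGHBOURS[text[i]]) + text[i + 1:]
    return text

def duplicate_letter(text, rng, p=0.2):
    """'good' -> 'goodd' style duplication of one random letter with probability p."""
    positions = [i for i, ch in enumerate(text) if ch.isalpha()]
    if positions and rng.random() < p:
        i = rng.choice(positions)
        return text[:i + 1] + text[i] + text[i + 1:]
    return text

def augment(text, rng):
    """Apply the augmentations in a random order, each with its own probability."""
    steps = [space_operators, keyboard_typo, duplicate_letter]
    rng.shuffle(steps)
    for step in steps:
        text = step(text, rng)
    return text

aug_rng = random.Random(0)
augmented = augment("because 2+3 is good", aug_rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which makes cross-validation comparisons easier to trust.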

Ensembling and post-processing

The final ensemble looked like this (1 model from I2nfinit3y and 4 of mine):

all_probs = [probs_infinity, probs1, probs2, probs3, probs4]
model_weights = [0.28, 0.29, 0.11, 0.25, 0.07]

n_samples, n_classes = probs_infinity.shape
probs = np.zeros_like(probs_infinity)

base_w = 0.6
agr_w = 0.3
conf_w = 0.1

for i in range(n_samples):
    row_probs = [all_probs[m][i] * model_weights[m] for m in range(len(all_probs))]
    base_score = np.sum(row_probs, axis=0)
    top_classes = [np.argmax(all_probs[m][i]) for m in range(len(all_probs))]
    agreement_bonus = np.zeros(n_classes)
    for cls in top_classes:
        agreement_bonus[cls] += 1
    agreement_bonus /= len(all_probs)
    confidence_bonus = np.max(np.stack(row_probs, axis=0), axis=0)
    probs[i] = base_score * base_w + agreement_bonus * agr_w + confidence_bonus * conf_w

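Before averaging the probabilities, I replaced softmax with entmax(alpha=1.05), which made the predictions "sharper". This worked well on both validation and the LB and gave a small boost.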
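For reference, alpha-entmax maps logits z to probabilities p_i = [(alpha - 1) z_i - tau]_+^(1 / (alpha - 1)), where the threshold tau is chosen so that the probabilities sum to 1. Below is a minimal NumPy sketch that finds tau by bisection. It illustrates the transformation and is not the implementation used in the solution; the function name `entmax_bisect_np` and the iteration count are assumptions.

```python
import numpy as np

def entmax_bisect_np(logits, alpha=1.05, n_iter=60):
    """Row-wise alpha-entmax (alpha > 1) computed by bisection on the threshold tau."""
    z = np.asarray(logits, dtype=np.float64) * (alpha - 1.0)
    d = z.shape[-1]
    power = 1.0 / (alpha - 1.0)
    z_max = z.max(axis=-1, keepdims=True)
    # At tau_lo the total mass is >= 1; at tau_hi it is <= 1.
    tau_lo = z_max - 1.0
    tau_hi = z_max - (1.0 / d) ** (alpha - 1.0)
    for _ in range(n_iter):
        tau = (tau_lo + tau_hi) / 2.0
        mass = (np.clip(z - tau, 0.0, None) ** power).sum(axis=-1, keepdims=True)
        too_much = mass >= 1.0
        tau_lo = np.where(too_much, tau, tau_lo)
        tau_hi = np.where(too_much, tau_hi, tau)
    p = np.clip(z - tau_lo, 0.0, None) ** power
    # Final renormalization removes the small residual bisection error.
    return p / p.sum(axis=-1, keepdims=True)

demo_logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
demo_entmax = entmax_bisect_np(demo_logits, alpha=1.05)
```

As alpha approaches 1 the mapping approaches softmax, and alpha=2 gives sparsemax, so alpha=1.05 is only a mild departure from softmax.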

Then I selected the rare classes and multiplied their probabilities by a coefficient:

DO_RARE_MULTIPLY = True
COEF = 3
TOPN = 10

RARE_CLASSES = {
    'True_Misconception:Wrong_term': 8,
    'True_Misconception:WNB': 8,
    'True_Misconception:Mult': 8,
    'True_Misconception:Incomplete': 8,
    'True_Misconception:SwapDividend': 8,
    'False_Misconception:Incorrect_equivalent_fraction_addition': 7,
    'True_Misconception:Duplication': 6,
    'True_Misconception:Wrong_fraction': 6,
    'False_Misconception:Shorter_is_bigger': 6,
    'False_Misconception:Wrong_Operation': 6,
    'True_Misconception:Division': 5,
    'True_Misconception:Inversion': 5,
    'True_Misconception:FlipChange': 4,
    'True_Misconception:Denominator-only_change': 4,
    'True_Misconception:Definition': 3,
    'True_Misconception:Multiplying_by_4': 3,
    'True_Misconception:Subtraction': 2,
    'True_Misconception:Positive': 2,
    'True_Misconception:Incorrect_equivalent_fraction_addition': 2,
    'True_Misconception:Adding_across': 1,
    'True_Misconception:Base_rate': 1,
    'True_Misconception:Longer_is_bigger': 1,
    'True_Misconception:Not_variable': 1,
    'True_Misconception:Whole_numbers_larger': 1
}

rare_idx = {le.transform([cls])[0] for cls in RARE_CLASSES.keys() if cls in le.classes_}

if DO_RARE_MULTIPLY:
    adjusted_probs = probs.copy()
    for i in range(adjusted_probs.shape[0]):
        topN = np.argsort(-adjusted_probs[i])[:TOPN]
        for idx in topN:
            if idx in rare_idx:
                adjusted_probs[i, idx] *= COEF
        adjusted_probs[i] /= adjusted_probs[i].sum()
    probs = adjusted_probs

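Unfortunately, I did not spend much time on this, but dynamic coefficients and an extended RARE_CLASSES list could probably have brought some additional gains.

Then came simpler post-processing: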

if DO_QUESTION_ID_POSTPROCESSING:
    topk = np.argsort(-probs, axis=1)
    final_top3 = []
    for i, (qid, ans) in enumerate(zip(test['QuestionId'].values, test['MC_Answer'].values)):
        allowed = set(qa2labels.get((qid, ans), []))
        chosen = []
        for lbl in topk[i]:
            if lbl in allowed:
                chosen.append(lbl)
            if len(chosen) == 3:
                break
        while len(chosen) < 3 and chosen:
            chosen.append(chosen[-1])
        if not chosen:
            chosen = list(topk[i][:3])
        final_top3.append(chosen)
    top3 = np.array(final_top3)
else:
    top3 = np.argsort(-probs, axis=1)[:, :3]

flat_top3 = top3.flatten()
decoded_labels = le.inverse_transform(flat_top3)
top3_labels = decoded_labels.reshape(top3.shape)

if DO_CATEGORY_TRUE_FALSE_POSTPROC:
    adjusted = []
    for labels, corr in zip(top3_labels, test.is_correct.values):
        new_labels = []
        for lab in labels:
            parts = lab.split('_', 1)
            _, rest = parts
            prefix = 'True' if corr == 1 else 'False'
            new_labels.append(f"{prefix}_{rest}")
        adjusted.append(new_labels)
    top3_labels = np.array(adjusted)

preds = [' '.join(row) for row in top3_labels]

sub = pd.DataFrame({
    'row_id': test.row_id.values,
    'Category:Misconception': preds
})

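The first part uses the mapping (question_id, answer) -> [possible targets] to rule out impossible combinations. The second part simply changes the True/False prefix according to whether the answer is correct.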
