8th place solution

Authors: heng (Grandmaster) and teammates @goldenlock, @syzong, @leolu1998
Competition: NBME - Score Clinical Patient Notes
Rank: 8th place (Private LB: 0.893)

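First, I would like to thank the hosts for organizing this competition, and our excellent teammates @goldenlock, @syzong, and @leolu1998 for their sustained effort over the past two months. (By the way, we were the champions in number of submissions, haha.)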

We finished in 8th place with a score of 0.893 on both the Private LB and the Public LB, so our results were fairly stable. Our ranking, however, improved a lot, from 16th on the Public LB to 8th on the Private LB.

Training and inference

We mainly referred to the following notebooks:

Many thanks to @yasufuminakama.

In-task pre-training (ITPT)

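We used the pn_history column of patient_notes.csv for in-task pre-training of deberta-v3-large and deberta-xlarge. Because the pre-training data treats each line as one sentence, we preprocessed the notes to remove line breaks. The code is as follows: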

import re

def clean_spaces(txt):
    # Replace newlines, tabs, and carriage returns with spaces
    # so that each note occupies exactly one line.
    return re.sub(r'[\n\t\r]', ' ', txt)

Parameter settings:

mlm_probability=0.15
num_train_epochs=30
per_device_train_batch_size=4
per_device_eval_batch_size=8
learning_rate=1.5e-5  # deberta-xlarge
learning_rate=3e-5    # deberta-v3-large
gradient_accumulation_steps=8

For mlm_probability, we also tested 0.1, 0.2, and 0.5, but none of them beat the default value of 0.15. ITPT gave a clear improvement: the single-model (5-fold) score rose from 0.883 to 0.886.
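
To make the role of mlm_probability concrete, here is a minimal, framework-free sketch of BERT-style masking. It assumes the standard scheme (each token is selected with probability mlm_probability; of the selected tokens, 80% become the mask token, 10% a random token, and 10% stay unchanged), which is also the default behavior of the Hugging Face DataCollatorForLanguageModeling. The write-up does not show the actual pre-training script, so the function name mask_tokens and its arguments are hypothetical, and the sketch does not exclude special tokens from masking.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling (illustrative only)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```

With a higher mlm_probability such as 0.5, half of each note is hidden from the model, which leaves much less context for predicting each masked token.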

Adversarial training (FGM)

Adding FGM when fine-tuning deberta-v3-large improved CV by about 0.0005, but it had no effect on deberta-xlarge.
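
The write-up does not include the FGM code. As a rough sketch of the idea: FGM (Fast Gradient Method) perturbs the embedding weights by epsilon * g / ||g||, where g is the gradient of the loss with respect to those weights, runs a second forward/backward pass on the perturbed weights, and then restores the originals. The version below uses plain Python lists in place of tensors so that it runs without a deep learning framework; the class name, the dict-based interface, and the epsilon value are all hypothetical, not the team's settings.

```python
import math

class FGM:
    """Fast Gradient Method on a dict of weight vectors (illustrative only)."""

    def __init__(self, epsilon=1.0):
        self.epsilon = epsilon
        self.backup = {}

    def attack(self, params, grads):
        # Move each weight vector by epsilon along its L2-normalized gradient.
        for name, weights in params.items():
            grad = grads[name]
            norm = math.sqrt(sum(g * g for g in grad))
            if norm == 0 or math.isnan(norm):
                continue  # no usable gradient direction
            self.backup[name] = list(weights)
            params[name] = [w + self.epsilon * g / norm
                            for w, g in zip(weights, grad)]

    def restore(self, params):
        # Put the original weights back after the adversarial pass.
        for name, weights in self.backup.items():
            params[name] = weights
        self.backup = {}
```

In a training loop, attack would be called after the normal backward pass, the loss would be computed and back-propagated a second time on the perturbed embeddings, and restore would be called before the optimizer step.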

Post-processing

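While analyzing the data, we found many cases where the first letter of a word was not predicted, as shown in the figure below:

[Example image]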

import itertools
import re

import numpy as np

# Characters that can directly precede a word with no space in between.
NO_SPACE_PREFIXES = ['\t', '\n', '\r', ',', '.', ':', ';', '-', '+', '"', '(', '/', '&', '*']

def get_results_pp2(char_probs, th=0.5, texts=None):
    results = []
    for idx, char_prob in enumerate(char_probs):
        text = texts[idx]  # reference text for this prediction
        result = np.where(char_prob >= th)[0] + 1
        # Group consecutive character indices into spans.
        result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
        temp = []
        for r in result:
            start = min(r)
            end = max(r)
            if start <= 1:  # fix: character 0 of the text was dropped
                start = 0
            elif start == end:
                start -= 1
            elif re.match(r'^[a-zA-Z0-9]$', text[start - 1]) and text[start - 2] in NO_SPACE_PREFIXES:
                # fix: first character of a word dropped because no space precedes it
                start -= 1
            temp.append(f"{start} {end}")
        results.append(";".join(temp))
    return results

This post-processing improved a single model by 0.002 and the ensemble by about 0.001.
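
To illustrate the rule on a single span, the sketch below applies the same branch logic as get_results_pp2 to one (start, end) pair. Here start is assumed to be the first flagged character index plus one, as in the function above; the helper name fix_span_start and the sample strings are hypothetical.

```python
import re

# Characters that can directly precede a word with no space in between.
NO_SPACE_PREFIXES = ['\t', '\n', '\r', ',', '.', ':', ';', '-', '+', '"', '(', '/', '&', '*']

def fix_span_start(text, start, end):
    """Return the corrected start offset for one predicted span."""
    if start <= 1:
        return 0  # span begins at the very start of the text
    if start == end:
        return start - 1  # single flagged character
    if re.match(r'^[a-zA-Z0-9]$', text[start - 1]) and text[start - 2] in NO_SPACE_PREFIXES:
        return start - 1  # word follows punctuation with no space
    return start

text = "cough,fever"
start, end = 7, 11  # characters 6..10 flagged, each index shifted by +1
print(repr(text[start:end]))                             # 'ever'
print(repr(text[fix_span_start(text, start, end):end]))  # 'fever'
```

When a space precedes the word, as in "has fever" with start=4, text[start - 1] is the space itself, so the offset is left unchanged.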

Pseudo-labeling

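We first ensembled the 5-fold deberta-xlarge and deberta-v3-large models with post-processing, which scored 0.891 on the Public LB. We then used this ensemble to pseudo-label unlabeled data from patient_notes.csv (sampling 2000 unlabeled pn_num values).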
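
A minimal sketch of the two steps described here, sampling unlabeled notes and blending the character-level probabilities of two models, might look as follows. The function names, the seed, and the equal blend weight are hypothetical; the write-up does not state the team's ensemble weights or sampling procedure.

```python
import random

def sample_unlabeled_notes(all_pn_nums, labeled_pn_nums, n_samples=2000, seed=42):
    """Pick pn_num values that do not appear in the labeled training set."""
    labeled = set(labeled_pn_nums)
    unlabeled = sorted(set(all_pn_nums) - labeled)
    rng = random.Random(seed)
    return rng.sample(unlabeled, min(n_samples, len(unlabeled)))

def blend_char_probs(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' per-character probabilities."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(probs_a, probs_b)]
```

The blended probabilities would then be thresholded and post-processed in the same way as for a submission, and the resulting spans used as labels for the sampled notes.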

Pseudo-labeling in this way, without separating folds, may cause the offline score to overfit (0.90+); see the discussion:

Related discussion 1
