5th place solution

Author: xia (teammates: @bestpredict, @lxf615712)
Competition: NBME - Score Clinical Patient Notes
Rank: 5th place

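First, I would like to thank the competition hosts for organizing such an interesting competition. Thanks to my excellent teammates @bestpredict and @lxf615712 for more than a month of hard work. Thanks also to the Kaggle community; I learned a lot from the notebooks and discussions, in particular the strong baselines and experimental results, and the discussion of case5.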

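Methods

MLM: pretrained with masked language modeling for 20 epochs.
Stratified sampling and learning rate: used stratified sampling with relatively large learning rates (2e-5 for large models, 1e-5 for xlarge, 5e-6 for xxlarge).
Pseudo-labels: used 2000-4000 samples with 5-fold cross-validation. Different models were given different samples, and case_num values with too few samples were resampled. Some models were trained without pseudo-labels; they scored better on LB but were less robust on PB (the private leaderboard).
Multidrop: used the Multidrop technique.
FGM: used FGM adversarial training.
Tokenizer: used the Deberta v2/v3 tokenizer.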
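
To make the FGM item concrete, here is a minimal sketch of the perturbation it relies on, r_adv = epsilon * g / ||g||_2, where g is the gradient of the loss with respect to the input embedding. This is not the team's training code: the tiny logistic model, the function names, and the epsilon and learning-rate values are all assumptions chosen so the example runs with the standard library only. In a transformer the perturbation would normally be applied to the word-embedding weights instead of a raw input vector.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grads(w, x, y):
    """Return (loss, dL/dw, dL/dx) for a logistic model p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad_w = [(p - y) * xi for xi in x]
    grad_x = [(p - y) * wi for wi in w]
    return loss, grad_w, grad_x

def fgm_perturbation(grad_x, epsilon=1.0):
    """FGM step: epsilon * g / ||g||_2, or zeros if the gradient vanishes."""
    norm = math.sqrt(sum(g * g for g in grad_x))
    if norm == 0.0:
        return [0.0] * len(grad_x)
    return [epsilon * g / norm for g in grad_x]

def fgm_train_step(w, x, y, lr=0.1, epsilon=1.0):
    """One SGD step on the sum of the clean and adversarial gradients."""
    _, gw_clean, gx = bce_grads(w, x, y)
    r_adv = fgm_perturbation(gx, epsilon)          # perturb along the gradient
    x_adv = [xi + ri for xi, ri in zip(x, r_adv)]  # adversarial "embedding"
    _, gw_adv, _ = bce_grads(w, x_adv, y)
    return [wi - lr * (gc + ga) for wi, gc, ga in zip(w, gw_clean, gw_adv)]

# Hypothetical toy example: x stands in for an embedding vector.
w = [0.5, -0.5]
x = [1.0, 2.0]
y = 1
w_new = fgm_train_step(w, x, y, lr=0.1, epsilon=0.5)
```

The perturbed input is used only to compute the second gradient; the clean input itself is left unchanged, which mirrors the usual attack, backward, restore pattern of FGM.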

Models

Deberta-v1-large
Deberta-v1-xlarge
Deberta-v2-xlarge
Deberta-v2-xxlarge
Deberta-v3-large

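Model ensemble

An ensemble of 5 models based on character-level probabilities. In the last few days we added the v2 xxlarge model, which had a CV score of 884 and an LB score of 883 (2 folds). That is not a strong result on its own, and although adding it to the ensemble improved LB, it hurt PB. We ended up not selecting any of the more than 30 submissions that scored better on PB. Trust your CV (cross-validation).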
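
Blending at the character level is what lets models with different tokenizers be combined, since their token boundaries do not line up. The sketch below shows one way this could look: each model's token probabilities are spread over the characters the token covers, and the per-character values are then averaged. The function names, the equal weights, and the toy offsets are assumptions; the write-up does not state how the five models were weighted.

```python
def token_probs_to_char_probs(text_len, offsets, token_probs):
    """Spread each token's probability over the characters it covers."""
    char_probs = [0.0] * text_len
    for (start, end), p in zip(offsets, token_probs):
        for i in range(start, min(end, text_len)):
            char_probs[i] = p
    return char_probs

def blend_char_probs(per_model_char_probs, weights=None):
    """Weighted average of per-character probabilities across models."""
    if weights is None:
        weights = [1.0] * len(per_model_char_probs)  # assumed equal weights
    total = sum(weights)
    length = len(per_model_char_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, per_model_char_probs)) / total
        for i in range(length)
    ]

text = "chest pain"
# Two hypothetical models that tokenize the same text differently.
model_a = token_probs_to_char_probs(len(text), [(0, 5), (5, 10)], [0.9, 0.7])
model_b = token_probs_to_char_probs(len(text), [(0, 2), (2, 5), (6, 10)], [0.8, 0.6, 0.5])
blended = blend_char_probs([model_a, model_b])
```

Characters that a model's tokens do not cover (here the space at index 5 for the second model) keep a probability of 0.0, which pulls the blended value down at that position.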

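Post-processing + thresholds

Space shift + OOF ensemble (based on word probs). We set specific thresholds for certain features. This gave a lower CV than using the best threshold for every case_num, but performed better on PB.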
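
As a rough illustration of the post-processing idea, the sketch below thresholds per-character probabilities into spans and then trims whitespace from the span edges. It assumes that "space shift" refers to the common fix for Deberta-style tokenizers, whose token offsets can include the leading space; the write-up does not spell this out, so treat that reading, the function name, and the toy probabilities as assumptions.

```python
def char_probs_to_spans(text, char_probs, threshold=0.5):
    """Return (start, end) spans of consecutive characters scoring >= threshold.

    Assumes len(char_probs) == len(text). Whitespace at span edges is trimmed
    (the assumed "space shift"); whitespace inside a span is kept.
    """
    spans = []
    start = None
    for i, p in enumerate(char_probs):
        if p >= threshold and start is None:
            start = i                      # a span opens here
        elif p < threshold and start is not None:
            spans.append((start, i))       # the span closes before i
            start = None
    if start is not None:
        spans.append((start, len(char_probs)))

    trimmed = []
    for s, e in spans:
        while s < e and text[s].isspace():
            s += 1
        while e > s and text[e - 1].isspace():
            e -= 1
        if s < e:
            trimmed.append((s, e))
    return trimmed

text = "has chest pain now"
# Hypothetical probabilities: the model also fires on the space before "chest".
probs = [0.1] * 3 + [0.8] * 11 + [0.2] * 4
spans = char_probs_to_spans(text, probs, threshold=0.49)
```

A per-feature threshold, like the ones in the team's function, would simply be passed in as `threshold` for the rows of that feature. The team's own thresholding function follows.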

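import numpy as np

# Note: `tokenizer` and `CFG` are assumed to be defined elsewhere in the notebook.
def convert_offsets_to_word_indices(preds_offsets, texts, case_nums, feature_nums, th=0.5):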
    predicts = []
    for text, preds, case_num, feature_num in zip(texts, preds_offsets, case_nums, feature_nums):
        encoded_text = tokenizer(text, add_special_tokens=True, max_length=CFG.max_len, padding="max_length", return_offsets_mapping=True)
        offset_mapping = encoded_text['offset_mapping']
        sep_index = encoded_text["input_ids"].index(tokenizer.sep_token_id)
        result = np.zeros(len(preds))
        
        results = np.zeros(sep_index)
        for idx, (offset, pred) in enumerate(zip(offset_mapping[:sep_index], preds)):
            start = offset[0]
            results[idx] = preds[start]
        sample_pred_scores = results
        
        # Use different thresholds for specific case_num and feature_num values
        if str(feature_num)[-1] == '3' and (str(case_num) == '0' or str(case_num) == '3'):
            result = [1 if s >= 0.54 else 0 for s in results]
        elif str(feature_num)[-1] == '3' and (str(case_num) == '1'):
            result = [1 if s >= 0.45 else 0 for s in results]
        elif str(feature_num)[-1] == '3' and str(case_num) == '6':
            result = [1 if s >= 0.52 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '503'):
            result = [1 if s >= 0.49 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '504'):
            result = [1 if s >= 0.4 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '508'):
            result = [1 if s >= 0.49 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '509'):
            result = [1 if s >= 0.4 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '510'):
            result = [1 if s >= 0.55 else 0 for s in results]
        elif str(case_num) == '5' and (str(feature_num) == '511'):
            result = [1 if s >= 0
