33rd Solution

Author: y.takayama
Competition rank: 33rd

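First of all, thank you to the competition hosts and to the other participants. I picked up a lot of NLP insight and knowledge during this competition.
This was my first NLP competition on Kaggle, and also my first time working on a token classification task.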

I started from the following baseline and modified it. The main differences and key points are:

Training

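Preprocessing: replace line feeds (LF) with the separator token (sep token).
Line breaks seemed important for identifying the discourse type in this competition, since a line break may mark the point where the discourse type changes.
Data augmentation: replace 10% of the tokens in each document, excluding special tokens, with the mask token.
https://www.kaggle.com/spidermandance/masking-feedback-prize
Multi-sample dropout: same as the baseline.
Model: BERT backbone + 15-class output, same as the baseline.
Optimizer: AdamW.
Weight decay: 0.01 (except for LayerNorm and bias).
Learning rate scheduler: linear warmup (warmup ratio: 0.05).
Epochs: 6.
Loss function: cross-entropy loss (label smoothing = 0.1). Label smoothing was effective.
Max length (at inference): 1536.
Cross-validation: KFold (n_splits=5).
Batch size * gradient accumulation steps: fixed at 4.
I trained all models on a single V100/A100 GPU (Google Colab Pro+) and used gradient accumulation to avoid OOM. I used the A100 only when training deberta-large. In most cases the batch size was 1 and the number of accumulation steps was 4.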
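
The preprocessing and data augmentation items in the list above can be sketched roughly as follows. This is a minimal illustration, not the code used in the competition: the token strings `[SEP]` and `[MASK]`, the special-token set, and the helper names `replace_linefeeds` and `mask_tokens` are assumptions, and whitespace splitting stands in for a real subword tokenizer.

```python
import random

# Assumed token strings; the real ones depend on the tokenizer of each model.
SEP_TOKEN = "[SEP]"
MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[MASK]"}


def replace_linefeeds(text, sep_token=SEP_TOKEN):
    """Replace every LF with the separator token, padded by spaces."""
    return text.replace("\n", f" {sep_token} ")


def mask_tokens(tokens, mask_ratio=0.10, seed=None):
    """Return a copy of `tokens` with mask_ratio of the non-special tokens masked."""
    rng = random.Random(seed)
    # Special tokens (including separators inserted for line breaks) are never masked.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    n_mask = int(len(candidates) * mask_ratio)
    masked = list(tokens)
    for i in rng.sample(candidates, n_mask):
        masked[i] = MASK_TOKEN
    return masked


if __name__ == "__main__":
    text = "I think school should start later.\nStudents need more sleep."
    tokens = ["[CLS]"] + replace_linefeeds(text).split() + ["[SEP]"]
    print(mask_tokens(tokens, mask_ratio=0.10, seed=0))
```

In a real pipeline the masking would be applied to token ids at training time only, so that each epoch sees a different masked version of the same document.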
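
To make the loss setting above concrete, here is a small pure-Python sketch of cross-entropy with label smoothing for a single token. It assumes the common convention in which the smoothing mass is spread uniformly over all classes; the write-up does not say which implementation was used, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against (1 - smoothing) * one_hot + smoothing / n_classes."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]                     # ordinary cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # cross-entropy against a uniform target
    return (1.0 - smoothing) * nll + smoothing * uniform


if __name__ == "__main__":
    logits = [4.0] + [0.0] * 14   # 15 classes, model is confident in class 0
    print(smoothed_cross_entropy(logits, target=0, smoothing=0.0))
    print(smoothed_cross_entropy(logits, target=0, smoothing=0.1))
```

With smoothing enabled, a very confident correct prediction is still penalized slightly, which discourages over-confident logits.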
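
The optimizer settings above (weight decay excluded for LayerNorm and bias, linear warmup with ratio 0.05) can be illustrated as below. Two things here are assumptions rather than facts from the write-up: that the learning rate decays linearly to zero after warmup, and that parameters follow the `bias` / `LayerNorm.weight` naming used by many BERT-style models. The helpers operate on parameter names only, so the sketch runs without a deep learning framework.

```python
# Assumed name fragments identifying parameters that should not be decayed.
NO_DECAY_KEYS = ("bias", "LayerNorm.weight")


def split_decay_groups(param_names, weight_decay=0.01):
    """Split parameter names into a decayed group and a non-decayed group."""
    no_decay = [n for n in param_names if any(k in n for k in NO_DECAY_KEYS)]
    decay = [n for n in param_names if n not in no_decay]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


def linear_warmup_lr(step, total_steps, max_lr, warmup_ratio=0.05):
    """Learning rate at `step`: linear ramp to max_lr, then linear decay to 0 (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    names = [
        "encoder.layer.0.attention.self.query.weight",
        "encoder.layer.0.attention.self.query.bias",
        "encoder.layer.0.attention.output.LayerNorm.weight",
        "classifier.weight",
    ]
    for group in split_decay_groups(names):
        print(group["weight_decay"], group["params"])
    for step in (0, 25, 50, 525, 1000):
        print(step, linear_warmup_lr(step, total_steps=1000, max_lr=1e-5))
```

In an actual training script the two groups would hold the parameter tensors themselves and be passed to the AdamW optimizer as parameter groups.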
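
The idea behind "batch size 1, 4 accumulation steps" can be shown with a toy example. The sketch below uses a one-parameter model `y = w * x` with mean squared error, which is purely illustrative and unrelated to the actual transformer models; it only demonstrates that averaging the gradients of 4 micro-batches of size 1 gives the same update as one batch of size 4.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(w, samples, lr, micro_batch_size=1, accum_steps=4):
    """Update w once per `accum_steps` micro-batches, averaging their gradients.

    Leftover micro-batches that do not fill a whole accumulation window are
    ignored in this toy version.
    """
    accumulated = 0.0
    n_seen = 0
    for start in range(0, len(samples), micro_batch_size):
        micro_batch = samples[start:start + micro_batch_size]
        # Dividing by accum_steps turns the running sum into an average.
        accumulated += grad_mse(w, micro_batch) / accum_steps
        n_seen += 1
        if n_seen % accum_steps == 0:
            w -= lr * accumulated   # one optimizer step per effective batch
            accumulated = 0.0
    return w


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    w_accum = train_with_accumulation(0.0, data, lr=0.01)
    w_full = 0.0 - 0.01 * grad_mse(0.0, data)   # single step on the full batch of 4
    print(w_accum, w_full)
```

Only one micro-batch has to fit in GPU memory at a time, which is why this helps against OOM with long sequences and large backbones.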

The results below include the post-processing step described in the next section.

No.	Model	Max length※	Max learning rate	CV	Public LB	Private LB
1	allenai/longformer-large-4096	1536	1e-5	0.6814	0.683	0.694
2	funnel-transformer/large	1536	8e-6	0.6926	0.698	0.709
3	microsoft/deberta-large	1536	1e-5	0.6956	0.700	0.712
4	microsoft/deberta-v3-large	1024	1e-5	0.699	0.703	0.715
5	1+2+3	-	-	0.7059	0.705	0.717
6	1+2+3+4 (simple average)	-	-
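
A simple average of several models, as in the ensemble rows of the table, might look like the sketch below. It assumes that every model outputs class probabilities for the same sequence of tokens, which in practice requires aligning the outputs of different tokenizers (for example at the character or word level). That alignment is not shown, and the function names are illustrative.

```python
def average_probabilities(model_probs):
    """Average class probabilities over models.

    model_probs[m][t][c] is the probability of class c for token t from model m.
    All models are assumed to be aligned on the same tokens.
    """
    n_models = len(model_probs)
    n_tokens = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(model_probs[m][t][c] for m in range(n_models)) / n_models
         for c in range(n_classes)]
        for t in range(n_tokens)
    ]


def predict_classes(avg_probs):
    """Pick the class with the highest averaged probability for each token."""
    return [max(range(len(p)), key=p.__getitem__) for p in avg_probs]


if __name__ == "__main__":
    model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
    model_b = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
    avg = average_probabilities([model_a, model_b])
    print(predict_classes(avg))
```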