1st place solution

Author: gezi (Grandmaster) | Competition rank: 1st place

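Thanks to the organizers for hosting such an interesting competition, and thanks to everyone who took part and shared their experience along the way. I learned a lot from the discussions.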

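Cross-validation strategy
Group by anchor and stratify by score. In addition, some words appear both as an anchor and as a target, so make sure those end up in the same fold.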
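The post does not show how the fold groups were built. As a minimal sketch of one reading of the rule above: whenever a phrase that is itself an anchor also appears as a target of another anchor, the two anchors are merged into one group. The function name `build_fold_groups` and the union-find approach are assumptions for illustration, not the author's code.

```python
def build_fold_groups(pairs):
    """Map each anchor to a group id.

    pairs: iterable of (anchor, target) strings.
    Two anchors share a group id when one of them appears as a target
    of the other, directly or through a chain of such links.
    """
    pairs = list(pairs)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    anchors = {a for a, _ in pairs}
    for anchor, target in pairs:
        find(anchor)
        if target in anchors:  # the phrase is both an anchor and a target
            union(anchor, target)

    # Relabel union-find roots as consecutive integers.
    roots = {}
    return {a: roots.setdefault(find(a), len(roots)) for a in sorted(anchors)}


pairs = [
    ("abatement", "pollution abatement"),
    ("pollution abatement", "abatement"),
    ("wood article", "wooden box"),
]
groups = build_fold_groups(pairs)
```

The resulting ids could then be passed as `groups` to a splitter such as scikit-learn's `StratifiedGroupKFold`, with the score mapped to class labels as `y`; whether the author used that class is not stated.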
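NN model details
a. Pearson loss worked best for me.
b. Train for 5 epochs, with AWP (adversarial weight perturbation) training starting from the 2nd epoch.
AWP has helped a lot in all of my recent NLP competitions.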
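c. Groupby['anchor', 'context']['target'] -> targets, appended to the input (anchor[SEP]target[SEP]CPC_TEXT[SEP]targets), produced the best model.
Groupby['anchor', 'context[0]']['target'] -> targets, appended to the input, helped the ensemble a lot. I define context[0] as the sector, e.g. F21 -> F.
Remember to exclude the current target from targets.
d. Randomly shuffle targets at each training step. (Not tested thoroughly, but I remember a large improvement on the LB.)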
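e. Freeze the BERT embedding layer (probably makes little difference, but I used it in the final models).
Freezing the embedding layer did not hurt performance, which suggests we do not need much fine-tuning, since the task is simple short-phrase similarity.
f. Use different learning rates for BERT (2e-5, 3e-5) and the other parts (1e-3). This is especially useful when adding an LSTM, which needs a larger lr.
g. Adding a BI-LSTM head helped a lot.
Deberta-v3-large CV 858 -> 861. "prompt is all you need" gave me the hint that we do not need to fine-tune or change the BERT model much, so I tried adding an LSTM on top of BERT and freezing the BERT embedding layer.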
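h. Use linear attention pooling on top of the BI-LSTM, before the fc layer.
i. The learning rate matters a lot for the best single model, deberta-v3-large: 2e-5 is much better than 3e-5.
Deberta-v3-large CV 861 -> 8627
j. Doubling the rnn output dim (from the BERT output dim, e.g. 1024, to 2048) helped a lot for some weak models such as bert-for-patents and simcse-bert-for-patent.
So for weak models we may need to make the model wider.
k. One possible approach is to use token classification to predict the scores of all targets in a single instance.
It seems a bit complicated to implement and I do not know whether it would improve the score. I have not tried it.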
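Items c and d can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `build_inputs`, the row layout, the `cpc_text` lookup, and the `"; "` separator between sibling targets are all hypothetical, since the post gives only the overall input template.

```python
import random
from collections import defaultdict


def build_inputs(rows, cpc_text, by_sector=False, shuffle=False, rng=None, sep="[SEP]"):
    """Build anchor[SEP]target[SEP]CPC_TEXT[SEP]targets strings.

    rows: list of dicts with 'anchor', 'target', 'context' keys.
    cpc_text: dict mapping a context code (e.g. 'F21') to its CPC description.
    by_sector: group siblings by context[0] (e.g. 'F') instead of the full context.
    shuffle: randomly reorder the sibling targets (item d).
    """
    rng = rng or random

    def key(r):
        ctx = r["context"][0] if by_sector else r["context"]
        return (r["anchor"], ctx)

    # Collect every target seen for each (anchor, context) group.
    siblings = defaultdict(list)
    for r in rows:
        siblings[key(r)].append(r["target"])

    texts = []
    for r in rows:
        # Exclude the current target from its own sibling list.
        others = [t for t in siblings[key(r)] if t != r["target"]]
        if shuffle:
            rng.shuffle(others)
        # The "; " joiner is an assumption; the post does not specify one.
        texts.append(sep.join([r["anchor"], r["target"], cpc_text[r["context"]], "; ".join(others)]))
    return texts


rows = [
    {"anchor": "abatement", "target": "abatement of pollution", "context": "A47"},
    {"anchor": "abatement", "target": "act of abating", "context": "A47"},
    {"anchor": "abatement", "target": "active catalyst", "context": "A47"},
]
cpc_text = {"A47": "FURNITURE"}
texts = build_inputs(rows, cpc_text)
```

To shuffle at every training step, the same logic would run inside the dataset's per-item lookup with `shuffle=True`, so that each epoch sees a different ordering.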
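Item a does not spell out the loss. One common formulation, assumed here rather than taken from the post, is one minus the Pearson correlation between the predictions $p_i$ and the labels $y_i$ within a batch of size $n$:

```latex
\mathcal{L}_{\text{pearson}}
  = 1 - \frac{\sum_{i=1}^{n} (p_i - \bar{p})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (p_i - \bar{p})^2}\,
              \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```

Because the competition metric is the Pearson correlation itself, this loss optimizes the metric directly; it is invariant to the scale and offset of the predictions.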
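Items e through h and j describe the head and optimizer setup in words only. The following PyTorch sketch shows one way they could fit together. The class and function names, the use of AdamW, and the assumption that the backbone exposes an `embeddings` submodule are illustrative choices, not details confirmed by the post; "linear attention pooling" is read here as a single linear layer that scores each token, followed by a softmax.

```python
import torch
import torch.nn as nn


class LSTMAttnHead(nn.Module):
    """BiLSTM over token states, attention pooling, then a linear output."""

    def __init__(self, hidden_size, rnn_dim=None):
        super().__init__()
        # rnn_dim is the total BiLSTM output width. Item j would set it to
        # 2 * hidden_size (e.g. 2048 for a 1024-wide backbone).
        rnn_dim = rnn_dim or hidden_size
        self.lstm = nn.LSTM(hidden_size, rnn_dim // 2, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(rnn_dim, 1)  # one attention score per token
        self.fc = nn.Linear(rnn_dim, 1)

    def forward(self, hidden_states, attention_mask):
        out, _ = self.lstm(hidden_states)                      # (B, T, rnn_dim)
        scores = self.attn(out).squeeze(-1)                    # (B, T)
        # Padding positions get zero weight; assumes each row has a real token.
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)   # (B, T, 1)
        pooled = (weights * out).sum(dim=1)                    # (B, rnn_dim)
        return self.fc(pooled).squeeze(-1)                     # (B,)


def freeze_embeddings(backbone):
    """Item e: stop gradient updates for the embedding layer.

    Assumes the backbone has an `embeddings` submodule.
    """
    for p in backbone.embeddings.parameters():
        p.requires_grad = False


def make_optimizer(backbone, head, backbone_lr=2e-5, head_lr=1e-3):
    """Item f: small lr for the backbone, larger lr for the new head."""
    return torch.optim.AdamW([
        {"params": [p for p in backbone.parameters() if p.requires_grad], "lr": backbone_lr},
        {"params": list(head.parameters()), "lr": head_lr},
    ])
```

The head takes the backbone's last hidden states of shape (batch, tokens, hidden) together with the attention mask, so it can be attached to any encoder that returns per-token states.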

Model	CV	backbone lr	base lr	scheduler	rnn dim * 2	weight	1 Fold LB	1 Fold PB	Full train LB	Full train PB	5 Folds LB	5 Folds PB
microsoft/deberta-v3-large	8627	2e-5	1e-3	linear	No	1	8599	8710	8604	8745	8604 (may fluctuate up to 8615)	8761
anferico/bert-for-patents	8451	3e-5	1e-3	cosine	Yes	0.4
ahotrod/electra_large_discriminator_squad2_512	8514	2e-5	1e-3	cosine	No	0.3
Yanhao
