27th place solution (public 72nd)

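Author: calpis10000
Posted: July 3, 2022

Thanks to USPTO and Kaggle for hosting this exciting competition, and to all the Kagglers for a great contest.

(Update 2022/10/24) Training and inference code have been added.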

Code

Training code: https://github.com/calpis10000/uspppm
Inference code: https://www.kaggle.com/code/calpis10000/pppm-ens-063/notebook

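Summary

I built 20 diverse models and ensembled them, using Optuna to tune the ensemble weights. The models differ in:

Task
Backbone
Pooling head
Preprocessing

No single experiment improved the CV (cross-validation) score significantly, but the ensemble did. Presumably, model diversity is what made the difference for me.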

CV	Public LB	Private LB
0.8535	0.8514	0.8655
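
The weight tuning mentioned above could be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names `blend`, `pearson`, and `tune_weights`, the 0 to 1 search range, and the number of trials are all assumptions. It assumes out-of-fold predictions are stacked as one row per model.

```python
import numpy as np


def blend(oof_preds, weights):
    """Weighted average of per-model predictions, shape (n_models, n_samples)."""
    weights = np.asarray(weights, dtype=float)
    return weights @ np.asarray(oof_preds, dtype=float) / weights.sum()


def pearson(pred, target):
    """Pearson correlation, the competition metric."""
    return np.corrcoef(pred, target)[0][1]


def tune_weights(oof_preds, target, n_trials=200, seed=0):
    """Search one weight per model that maximizes Pearson on OOF predictions."""
    # Imported lazily so blend() and pearson() stay usable without Optuna.
    import optuna

    n_models = len(oof_preds)

    def objective(trial):
        # Hypothetical search space: one weight in [0, 1] per model.
        weights = [trial.suggest_float(f"w{i}", 0.0, 1.0) for i in range(n_models)]
        if sum(weights) == 0.0:
            return -1.0  # degenerate trial: every model switched off
        return pearson(blend(oof_preds, weights), target)

    sampler = optuna.samplers.TPESampler(seed=seed)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=n_trials)
    return [study.best_params[f"w{i}"] for i in range(n_models)]
```

Because Pearson correlation is invariant to rescaling, only the ratios between weights matter, so the tuned weights do not need to sum to 1.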

What worked

Task

Regression - with MSELoss
Classification - with CrossEntropyLoss
- For the classification task, the Pearson coefficient is calculated as follows:

import numpy as np

def label_to_score(label):
    # Weight each of the 5 class columns by its score value and sum per row.
    return (label * np.array([0, 0.25, 0.5, 0.75, 1.0])).sum(axis=1)

def metric_pearson(predictions, labels):
    pred_score = label_to_score(predictions)  # "predictions" is the model output
    label_score = label_to_score(labels)  # "labels" is the ground truth
    pearson = np.corrcoef(pred_score, label_score)[0][1]
    return pearson
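
To make the inputs of that metric concrete, here is a small sketch of the classification setup under two assumptions that the post does not state: the model output is converted to class probabilities with a softmax, and the ground-truth scores are one-hot encoded over the five score values. The helper names and the toy logits are hypothetical.

```python
import numpy as np

SCORE_BINS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # one score value per class


def score_to_one_hot(scores):
    """Map scores in {0, 0.25, 0.5, 0.75, 1} to one-hot rows of width 5."""
    class_idx = np.rint(np.asarray(scores, dtype=float) * 4).astype(int)
    return np.eye(len(SCORE_BINS))[class_idx]


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def expected_score(probs):
    """Probability-weighted score, the same computation as label_to_score."""
    return (probs * SCORE_BINS).sum(axis=1)


# Toy data: four samples with made-up logits that roughly match the labels.
true_scores = np.array([0.0, 0.25, 0.5, 1.0])
labels = score_to_one_hot(true_scores)
logits = np.array([
    [3.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 3.0],
])
pred_scores = expected_score(softmax(logits))
pearson = np.corrcoef(pred_scores, expected_score(labels))[0][1]
print(pearson)
```

Taking the expectation rather than the argmax keeps the prediction continuous, which suits a correlation-based metric.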

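Backbone

deberta-v3-large, bert-for-patents, electra-large, etc.
deberta-v3-large was the best single model.

Pooling head

Attention or concatenated CLS-Token
I also tried Conv1D and LSTM heads, but dropped them because the models did not train well.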
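
As a rough illustration of the two head types named above, here is a NumPy sketch of the underlying computations. The exact heads are in the training repository and may differ: the tanh scoring layer, the parameter names `w_proj` and `v_score`, and the choice of concatenating the CLS vectors of the last four layers are assumptions made for this example.

```python
import numpy as np


def attention_pool(hidden, mask, w_proj, v_score):
    """Attention pooling over tokens.

    hidden: (batch, seq, dim) token embeddings
    mask:   (batch, seq), 1 for real tokens and 0 for padding
    w_proj: (dim, att_dim) and v_score: (att_dim,) are learned parameters
    """
    scores = np.tanh(hidden @ w_proj) @ v_score               # (batch, seq)
    scores = np.where(np.asarray(mask, dtype=bool), scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)       # stable softmax
    weights = np.exp(scores)                                  # padding gets 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * hidden).sum(axis=1)          # (batch, dim)


def concat_cls(layer_outputs, last_n=4):
    """Concatenate the first-token (CLS) vector of the last `last_n` layers.

    layer_outputs: list of (batch, seq, dim) arrays, one per layer.
    """
    return np.concatenate([h[:, 0, :] for h in layer_outputs[-last_n:]], axis=1)
```

Both functions assume every sequence contains at least one real token; the pooled vector would then feed a final linear layer for regression or classification.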

Preprocessing

In some models, I added the context as tokens, like this:

[subgrp=A][context=A47]HUMAN NECESSITIES. FURN...
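
A prefix in the format shown above could be built along these lines. This is a sketch under assumptions: the helper names are hypothetical, and reading `subgrp` as the first letter of the context code is inferred from the single example, not confirmed by the post.

```python
def build_context_prefix(context_code, context_text):
    """Prefix the context description with section and context markers."""
    section = context_code[0]  # assumed: first letter of the code, "A" in "A47"
    return f"[subgrp={section}][context={context_code}]{context_text}"


def marker_tokens(context_codes):
    """Collect the distinct marker strings so they can be registered as tokens."""
    tokens = set()
    for code in context_codes:
        tokens.add(f"[subgrp={code[0]}]")
        tokens.add(f"[context={code}]")
    return sorted(tokens)


print(build_context_prefix("A47", "HUMAN NECESSITIES."))
print(marker_tokens(["A47", "A61", "B01"]))
```

With a Hugging Face tokenizer, such markers could be registered through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so that each marker maps to a single learned embedding. Whether the original pipeline did exactly this is not stated in the post.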

What didn't work

Pseudo labeling - data augmentation or using the test data
MLM pretraining

Model details

Exp ID	CV score	Task	Backbone	Head	Weight
exp122	0.8133	Cls	anferico/bert-for-patents	Attention	0.8588
exp094	0.8143	Reg	anferico/bert-for-patents	CLS-Token	0.3617
exp127	0.8174	Cls
