23rd Place: Simple Solution

604. The Learning Agency Lab - PII Data Detection | pii-detection-removal-from-educational-data

Start: 2024-01-17 End: 2024-04-23 Data security and privacy algorithm competition

Author: nishimoto

Rank: 21st

Votes: 12

Thanks for the competition! My solution is simple and uses no new methods, but I would like to share it.

Baseline

As a baseline, I mainly used the Kaggle Notebook below. Thanks, @emiz6413!
https://www.kaggle.com/code/emiz6413/train-deberta-v3-single-model-lb-0-966

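All models are DeBERTa-v3-large. I trained on the competition training set plus the MPWare data, keeping only 30% of the negative samples.
For validation, I split the training data by document ID: documents with document % 4 != 2 were used for training, and documents with document % 4 == 2 for validation.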
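The split and the negative downsampling described above can be sketched roughly as follows. This is only an illustration, not the actual training code: it assumes each document is a dict with `document` and `labels` fields, that a "negative sample" means a document whose labels are all "O", and the function names and the seed are made up for the example.

```python
import random

def split_by_document(docs):
    """Hold out documents whose id satisfies document % 4 == 2."""
    train = [d for d in docs if d["document"] % 4 != 2]
    valid = [d for d in docs if d["document"] % 4 == 2]
    return train, valid

def downsample_negatives(docs, keep_frac=0.3, seed=42):
    """Keep all documents with at least one non-"O" label, plus a random
    keep_frac share of the all-"O" documents (assumed meaning of "negative")."""
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    positives = [d for d in docs if any(label != "O" for label in d["labels"])]
    negatives = [d for d in docs if all(label == "O" for label in d["labels"])]
    n_keep = int(len(negatives) * keep_frac)
    return positives + rng.sample(negatives, n_keep)
```

Splitting on the document ID keeps the validation fold fixed across runs, so models trained on different data variants can be compared on the same documents.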

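Ensemble

To build the ensemble, I trained models on modified versions of the training data:

Ensemble1: added hard examples (samples that DeBERTa predicted incorrectly) and used 25% of the negative samples as training data.
Ensemble2: removed from the MPWare data any documents that were short (len_tokens < 100) or entirely negative (all labels "O").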
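The two data changes above could look something like this. It is a sketch under assumptions: `len_tokens` is taken to be the length of each document's `tokens` list, a "hard example" is taken to be a document where any predicted label differs from the gold label, and the `predictions` dict (document ID to predicted labels) is a hypothetical format.

```python
def filter_external(docs, min_tokens=100):
    """Drop documents that are short or contain only "O" labels."""
    kept = []
    for d in docs:
        too_short = len(d["tokens"]) < min_tokens
        all_negative = all(label == "O" for label in d["labels"])
        if not (too_short or all_negative):
            kept.append(d)
    return kept

def find_hard_examples(docs, predictions):
    """Return documents whose predicted labels differ from the gold labels.

    predictions: dict mapping document id -> list of predicted labels
    (hypothetical format). Documents without a prediction are skipped.
    """
    return [
        d for d in docs
        if d["document"] in predictions and predictions[d["document"]] != d["labels"]
    ]
```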

Post-processing

I reset the predicted label to "O" for certain words (such as "Mr.", "Dr." ...) and for abbreviations whose first letter is not capitalized.
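A minimal sketch of this kind of rule-based cleanup is shown below. The word list contains only the two examples named above, since the full list is not given. Restricting the capitalization check to `NAME_STUDENT` labels is my assumption here (emails and URLs are normally lowercase), and the function name is hypothetical.

```python
# Only the two examples named in the post; the full list is not given.
TITLE_WORDS = {"Mr.", "Dr."}

def postprocess(tokens, labels, blocked=TITLE_WORDS):
    """Reset a predicted label to "O" for blocked words, and (assumption)
    for name predictions on tokens that do not start with an uppercase letter."""
    cleaned = []
    for token, label in zip(tokens, labels):
        if label != "O":
            is_blocked = token in blocked
            is_lowercase_name = (
                label.endswith("NAME_STUDENT") and not token[:1].isupper()
            )
            if is_blocked or is_lowercase_name:
                label = "O"
        cleaned.append(label)
    return cleaned
```

For example, with tokens `["Dr.", "Jane", "doe", "jane@example.com"]` and predictions `["B-NAME_STUDENT", "B-NAME_STUDENT", "I-NAME_STUDENT", "B-EMAIL"]`, the first and third labels are reset to "O" and the other two are kept.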

What didn't work for me

Data augmentation
Longformer-base and Longformer-large
Funnel-transformer (good on validation, poor on the public leaderboard)
MLM (good on validation, poor on the public leaderboard)
PEFT
SiFT
RandomMask

Related links

Baseline Notebook: DeBERTa-v3-single-model LB 0.966

Other solutions from the same competition

1st place solution - Ensemble of diverse Deberta architectures and postprocessing

2nd place solution

3rd Place Solution

4th place solution - Llama3 🦙 70B is all you need

5th place solution