The Learning Agency Lab - PII Data Detection | pii-detection-removal-from-educational-data
Thanks to Kaggle and the competition hosts for all their hard work, and congratulations to all participants! Thanks also to my teammates @hustzx, @xiamaozi11, and @dylanzhao2012 for their excellent ideas and contributions!

Our solution shows strong consistency between cross-validation (CV) and the leaderboard (LB). It is an ensemble of several Deberta-v3-large models and uses no pre-processing, post-processing, or thresholds.

First, thanks to @nbroad for the dataset pii-dd-mistral-generated (very useful!).

We compiled the competition's PII vocabulary and saved it as pii_cp.json, and filtered pii_200k_parsed.csv. From these we built a larger mixed vocabulary, pii_mix.json, and generated a large amount of data with the following prompt.
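The vocabulary-building step might look like the sketch below. The (label, value) record layout is an assumption for illustration; the authors' actual parsing of pii_200k_parsed.csv is not shown in this writeup.

```python
import json
from collections import defaultdict

def build_pii_vocab(records):
    """Group de-duplicated PII surface strings by label.

    `records` is an iterable of (label, value) pairs, e.g. parsed from
    pii_200k_parsed.csv (the exact column layout of that file is an
    assumption here).
    """
    vocab = defaultdict(set)
    for label, value in records:
        value = value.strip()
        if value:
            vocab[label].add(value)
    # JSON cannot serialize sets, so convert each to a sorted list
    return {label: sorted(values) for label, values in vocab.items()}

sample = [
    ("NAME_STUDENT", "Alice Zhang"),
    ("NAME_STUDENT", "Bob Lee"),
    ("EMAIL", "alice@example.com"),
    ("NAME_STUDENT", "Alice Zhang"),  # duplicate, kept once
]
vocab = build_pii_vocab(sample)
# e.g. json.dump(vocab, open("data/pii_cp.json", "w"), ensure_ascii=False)
```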
```python
prompt_template = """Assuming you are a student enrolled in a massively open online course, please help me continue writing essay based on the given topic or designing your own topic that details your experience of applying a specific tool or approach to address a complex challenge.
This essay should not only narrate the process but also critically analyze the effectiveness of the chosen tool or approach, reflecting on its strengths and potential limitations.
If you mention your personal information, you can use information such as the following 7 types:
NAME_STUDENT - The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
EMAIL - A student's email address.
USERNAME - A student's username on any platform.
ID_NUM - A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
PHONE_NUM - A phone number associated with a student.
URL_PERSONAL - A URL that might be used to identify a student.
STREET_ADDRESS - A full or partial street address that is associated with the student, such as their home address.
The information:
{info}
The topic:
{topic}
Please continue writing the essay:
"""
```
```python
import json
import random

# Load the competition PII vocabulary and the larger mixed vocabulary
with open('data/pii_cp.json', 'r', encoding='utf-8') as f:
    pii_cp = json.load(f)
with open('data/pii_mix_v1.json', 'r', encoding='utf-8') as f:
    pii_mix = json.load(f)

NAME_STUDENT = pii_cp['NAME_STUDENT']

# Collect short essay openings from train/test as generation topics,
# masking known student names with a placeholder
topics = []
for path in ('data/train.json', 'data/test.json'):
    with open(path, 'r', encoding='utf-8') as f:
        jdata = json.load(f)
    for d in jdata:
        text = d['full_text']
        ts = text.split('\n\n')[:random.choice([1, 2])]
        t = '\n\n'.join(ts)
        if len(t) > 256:
            continue
        for name in NAME_STUDENT:
            t = t.replace(name, 'NAME_STUDENT')
        topics.append(t)

# Also let the generator design its own topic
topics += ['Please design your own topic'] * 4000
```
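Each generation request then fills `{info}` and `{topic}` in the template. The sketch below shows one plausible way to do this; `make_prompt` and its sampling scheme are hypothetical, not the authors' actual generation code.

```python
import random

def make_prompt(template, pii_vocab, topic, k=3):
    """Fill the prompt template with k randomly sampled PII values.

    Hypothetical sketch: pick k PII types from the mixed vocabulary
    (e.g. pii_mix.json) and one surface string for each.
    """
    types = random.sample(sorted(pii_vocab), k=min(k, len(pii_vocab)))
    info = "\n".join(f"{t}: {random.choice(pii_vocab[t])}" for t in types)
    return template.format(info=info, topic=topic)

# Toy vocabulary and a shortened template for demonstration
template = "The information:\n{info}\nThe topic:\n{topic}\n"
vocab = {"EMAIL": ["alice@example.com"], "USERNAME": ["alice01"]}
prompt = make_prompt(template, vocab, "Design thinking", k=2)
```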
We used a multi-layer DebertaV2ForTokenClassification with a weighted CrossEntropy loss, which improved model stability and made it insensitive to thresholds. Every PII label is weighted 10x relative to the majority 'O' label (the last index):

```python
import numpy as np
import torch
import torch.nn as nn

# 10x weight on every PII label, 1x on the majority 'O' label
weight = [10.] * (num_labels - 1) + [1.]
loss_fct = nn.CrossEntropyLoss(weight=torch.from_numpy(np.array(weight)).float())
```
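To see what the weighting does, here is a toy pure-Python illustration (the three-label set and the probabilities are made up): with a 10:1 weight, confidently predicting 'O' on a token whose true label is PII costs roughly an order of magnitude more than the reverse mistake, so the model learns high recall without any decision threshold.

```python
import math

# Toy illustration (hypothetical 3-label set): weighted cross-entropy
# for a single token is -w[label] * log(p[label]).
def weighted_ce(probs, label, weights):
    return -weights[label] * math.log(probs[label])

weights = [10.0, 10.0, 1.0]   # [B-PII, I-PII, O], as in the 10:1 scheme above
probs = [0.2, 0.1, 0.7]       # the model puts most of its mass on 'O'

loss_if_pii = weighted_ce(probs, 0, weights)  # true label is B-PII: heavily punished
loss_if_o = weighted_ce(probs, 2, weights)    # true label is 'O': small loss
```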
```python
args = TrainingArguments(
    output_dir=f'{task}/{fold}',
    fp16=True,
    max_grad_norm=10,
    weight_decay=0.01,
    learning_rate=2e-5,
    adam_epsilon=1e-6,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    report_to="none",
    logging_steps=500,
    metric_for_best_model="fbeta",
    greater_is_better=True,
)
```
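The `metric_for_best_model="fbeta"` above refers to the competition metric, micro F-beta with beta = 5, which weights recall 25x as heavily as precision. A minimal sketch (the tp/fp/fn counts here are made-up numbers for illustration):

```python
# F_beta from aggregated true-positive / false-positive / false-negative
# counts; beta = 5 makes missed PII (fn) far more costly than spurious
# detections (fp).
def fbeta(tp, fp, fn, beta=5.0):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Same total error count, but low recall hurts far more than low precision
score_low_recall = fbeta(tp=80, fp=0, fn=20)
score_low_precision = fbeta(tp=80, fp=20, fn=0)
```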
| PII vocabulary | Data | Model | Folds | LB score |
|---|---|---|---|---|
| pii_cp.json | Nb: pii_cp_data + nb | deberta-v3-large | 1 | 0.962 |
| pii_mix.json | mm: pii_cp_data + Qwen_7B_1k + Mistral_7B_1k | deberta-v3-large | 2 | 0.960 |
| pii_mix.json | mm6: pii_cp_data + Mistral_7B_8k | deberta-v3-large | 1 | 0.966 |
| pii_mix.json | mm8: pii_cp_data + Qwen_14B_8k | deberta-v3-large | 1 | 0.964 |
Our final LB score was 0.96670, while our best-scoring notebook (which we did not submit) reached 0.96816, close to 6th place. It is simply an ensemble of the 5 single models above, so the inference code we share is this highest-scoring version.
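Since no thresholds are used, the ensemble can be as simple as averaging per-token class probabilities over the models and taking the argmax. The sketch below is a hypothetical illustration of this idea, not the authors' inference code.

```python
# Threshold-free ensembling: average each token's class probabilities
# across models, then pick the argmax class per token.
def ensemble_predict(model_probs):
    """model_probs: list of per-model [seq_len][num_labels] probabilities."""
    n_models = len(model_probs)
    num_labels = len(model_probs[0][0])
    preds = []
    for i in range(len(model_probs[0])):          # over token positions
        avg = [sum(m[i][c] for m in model_probs) / n_models
               for c in range(num_labels)]
        preds.append(max(range(num_labels), key=avg.__getitem__))
    return preds

# Two toy models, two tokens, two labels (e.g. 'B-NAME_STUDENT' and 'O')
m1 = [[0.7, 0.3], [0.2, 0.8]]
m2 = [[0.6, 0.4], [0.1, 0.9]]
preds = ensemble_predict([m1, m2])  # → [0, 1]
```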
Relying on no pre-processing, post-processing, or thresholds pushed us to keep designing more robust solutions.