
12th place solution

The Learning Agency Lab - PII Data Detection | pii-detection-removal-from-educational-data

Start: 2024-01-17 · End: 2024-04-23 · Data security & privacy · Data algorithm competition

Competition: PII Detection and Removal (Kaggle)

Rank: 12th

Team members: Decalogue, zx, xiamaozi11, Dylan Zhao

Published: 2024-04-27

12th Place Solution

Thanks to Kaggle and the competition hosts for their hard work, and congratulations to all participants! Thanks to my teammates @hustzx, @xiamaozi11, @dylanzhao2012 for their great ideas and work!

Our solution shows strong consistency between cross-validation (CV) and the leaderboard (LB). It is an ensemble of multiple Deberta-v3-large models and uses no pre-processing, no post-processing, and no thresholds.

Dataset

First, thanks to @nbroad for providing the dataset pii-dd-mistral-generated (very useful!).

We compiled the competition's PII vocabulary and saved it as pii_cp.json, and filtered pii_200k_parsed.csv. From these we built a larger mixed vocabulary, pii_mix.json, and generated a large amount of data with the prompt below.
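The write-up does not include the vocabulary-extraction code. As a rough sketch (assuming the competition's train.json schema of parallel `tokens` and BIO `labels` lists; the function name is ours), collecting surface strings per PII type might look like:

```python
from collections import defaultdict

def build_pii_vocab(docs):
    """Collect surface strings per PII type from BIO-labelled documents.

    Each doc is assumed to carry parallel 'tokens' and 'labels' lists,
    with labels like 'B-NAME_STUDENT' / 'I-NAME_STUDENT' / 'O'.
    """
    vocab = defaultdict(set)

    def flush(ptype, toks):
        # Close out the current entity span, if any.
        if ptype:
            vocab[ptype].add(' '.join(toks))

    for d in docs:
        ptype, toks = None, []
        for tok, lab in zip(d['tokens'], d['labels']):
            if lab.startswith('B-'):
                flush(ptype, toks)
                ptype, toks = lab[2:], [tok]
            elif lab.startswith('I-') and ptype == lab[2:]:
                toks.append(tok)
            else:
                flush(ptype, toks)
                ptype, toks = None, []
        flush(ptype, toks)
    return {k: sorted(v) for k, v in vocab.items()}
```

The result would then be dumped to pii_cp.json with `json.dump`.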

Prompt template

prompt_template = """Assuming you are a student enrolled in a massively open online course, please help me continue writing essay based on the given topic or designing your own topic that details your experience of applying a specific tool or approach to address a complex challenge.
This essay should not only narrate the process but also critically analyze the effectiveness of the chosen tool or approach, reflecting on its strengths and potential limitations.

If you mention your personal information, you can use information such as the following 7 types:
NAME_STUDENT - The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
EMAIL - A student's email address.
USERNAME - A student's username on any platform.
ID_NUM - A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
PHONE_NUM - A phone number associated with a student.
URL_PERSONAL - A URL that might be used to identify a student.
STREET_ADDRESS - A full or partial street address that is associated with the student, such as their home address.

The information:
{info}

The topic:
{topic}

Please continue writing the essay:
"""

Topic generation

import json
import random

with open('data/pii_cp.json', 'r', encoding='utf-8') as f:
    pii_cp = json.load(f)
with open('data/pii_mix_v1.json', 'r', encoding='utf-8') as f:
    pii_mix = json.load(f)
NAME_STUDENT = pii_cp['NAME_STUDENT']

# Take the first one or two paragraphs of each essay as a candidate topic,
# skipping long ones and masking student names with a placeholder.
topics = []
for path in ['data/train.json', 'data/test.json']:
    with open(path, 'r', encoding='utf-8') as f:
        jdata = json.load(f)
    for d in jdata:
        text = d['full_text']
        ts = text.split('\n\n')[:random.choice([1, 2])]
        t = '\n\n'.join(ts)
        if len(t) > 256:
            continue
        for name in NAME_STUDENT:
            t = t.replace(name, 'NAME_STUDENT')
        topics.append(t)

# Let the model invent its own topic for a share of the generations.
topics += ['Please design your own topic'] * 4000

Generated data

  • mm: pii_cp_data + Qwen_7B_1k + Mistral_7B_1k (using pii_cp.json)
  • mm6: pii_cp_data + Mistral_7B_8k (using pii_mix.json)
  • mm8: pii_cp_data + Qwen_14B_8k (using pii_mix.json)

Model

We use a multi-layer DebertaV2ForTokenClassification with a weighted CrossEntropy loss, which improves model stability and makes it insensitive to thresholds.
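The post does not define "multi-layer"; one common reading is a learned weighted average over the last few hidden layers feeding the token-classification head. A sketch under that assumption (the class name, layer count, and softmax weighting are our guesses, not the team's code):

```python
import torch
import torch.nn as nn

class MultiLayerPooler(nn.Module):
    """Learned weighted average of the last n_layers hidden states."""

    def __init__(self, n_layers=4):
        super().__init__()
        self.n_layers = n_layers
        self.layer_weights = nn.Parameter(torch.ones(n_layers))

    def forward(self, all_hidden_states):
        # all_hidden_states: tuple of (batch, seq_len, hidden) tensors,
        # one per transformer layer, as returned by the backbone with
        # output_hidden_states=True.
        stacked = torch.stack(all_hidden_states[-self.n_layers:], dim=0)
        w = torch.softmax(self.layer_weights, dim=0)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```

The pooled states would replace the final hidden state as input to the classification head.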

import numpy as np
import torch
import torch.nn as nn

# Upweight every PII label 10x; the majority 'O' class (last label) keeps weight 1.
weight = [10.] * (num_labels - 1) + [1.]
loss_fct = nn.CrossEntropyLoss(weight=torch.from_numpy(np.array(weight)).float())

Training arguments

args = TrainingArguments(
    output_dir=f'{task}/{fold}', 
    fp16=True,
    max_grad_norm=10,
    weight_decay=0.01,
    learning_rate=2e-5,
    adam_epsilon=1e-6,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2, 
    per_device_eval_batch_size=2, 
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    report_to="none",
    logging_steps=500,
    metric_for_best_model="fbeta",
    greater_is_better=True
)
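`metric_for_best_model="fbeta"` refers to the competition metric, micro F-beta with beta = 5, which weights recall far above precision. A token-level approximation (the official score is entity-level, and the `o_label` id and function name are our assumptions) might be:

```python
import numpy as np

def compute_metrics(eval_pred, beta=5.0, o_label=12):
    # Token-level micro F-beta over non-'O' predictions; label -100 marks
    # special-token positions excluded from the loss.
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    mask = labels != -100
    preds, labels = preds[mask], labels[mask]
    tp = ((preds == labels) & (labels != o_label)).sum()
    fp = ((preds != labels) & (preds != o_label)).sum()
    fn = ((preds != labels) & (labels != o_label)).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    return {'fbeta': (1 + b2) * precision * recall / denom if denom else 0.0}
```

With beta = 5, a model that recovers all entities but makes some false positives still scores far higher than one that misses entities, which is why recall-friendly class weighting pays off.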

Single-model LB scores

| PII vocab | Data | Model | Folds | LB |
| --- | --- | --- | --- | --- |
| pii_cp.json | nb: pii_cp_data + nb | deberta-v3-large | 1 | 0.962 |
| pii_mix.json | mm: pii_cp_data + Qwen_7B_1k + Mistral_7B_1k | deberta-v3-large | 2 | 0.960 |
| pii_mix.json | mm6: pii_cp_data + Mistral_7B_8k | deberta-v3-large | 1 | 0.966 |
| pii_mix.json | mm8: pii_cp_data + Qwen_14B_8k | deberta-v3-large | 1 | 0.964 |

In the end we scored 0.96670 on the LB, while our highest-scoring notebook (which we did not select for submission) reached 0.96816, close to 6th place. That notebook is simply an ensemble of the 5 single models above, so the inference code we share is this highest-scoring version.
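We share no ensembling details beyond "an ensemble of the 5 single models"; plain softmax averaging per token would be one thresholds-free way to combine them (the function name and shapes below are illustrative, not the team's code):

```python
import numpy as np

def ensemble_predict(per_model_logits):
    # per_model_logits: list of (batch, seq_len, num_labels) arrays,
    # one per model. Average softmax probabilities across models, then
    # take the argmax per token: no thresholds, matching the write-up.
    probs = []
    for logits in per_model_logits:
        e = np.exp(logits - logits.max(-1, keepdims=True))
        probs.append(e / e.sum(-1, keepdims=True))
    return np.mean(probs, axis=0).argmax(-1)
```

Averaging probabilities rather than hard labels lets a confident model outvote uncertain ones without introducing any tunable cutoff.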

Avoiding any dependence on pre-processing, post-processing, or thresholds pushes us to keep designing more robust solutions.
