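```html Public 9th, Private 25th Solution - PII Data Detection

Public 9th, Private 25th Solution: CV and LB Agreed, but the PB Score Dropped

Author: Li Zhecheng (李哲成)

Competition: PII Detection - Removing Personally Identifiable Information from Educational Data

Rank: Public 9th | Private 25th

First of all, thanks to Kaggle and THE LEARNING AGENCY LAB for hosting this competition, and to every member of the team for their effort. The result was not what we hoped for, but we learned a lot and will keep working. Congratulations to all the winners!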

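Full Code

This is the GitHub repository for the competition; it contains almost all of our code: https://github.com/Lizhecheng02/Kaggle-PII_Data_Detection

Fine-tuning

AWP (Adversarial Weight Perturbation)

We wrote a CustomTrainer around a custom AWP class to make the model more robust. Our team uses this method regularly in NLP competitions, and it has worked well. (The code is in the models directory of the GitHub repository.)
Wandb Sweep

This tool lets us try many combinations of hyperparameters and pick the ones that give the best fine-tuning results. (The code is in the models directory of the GitHub repository.)
Replace \n\n with | in all documents

With this replacement we trained one set of models using 4-fold cross-validation, which scored 0.977 on the LB. The LB improved, but the PB did not.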

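Post-processing

Fix wrongly ordered labels for student names. (B, B -> B, I)
Pay special attention to \n inside addresses; it should be tagged with an I label.
Filter email and phone-number predictions, dropping obviously wrong results that do not belong to either category. (No significant improvement)
Handle titles such as Dr. being predicted with a B label. (No significant improvement)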

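def pp(new_pred_df):
    # Repair BIO label sequences span by span.
    # Assumes a default 0..n-1 integer index and the columns
    # "document", "token", "token_str", "label" and "score".
    df = new_pred_df.copy()
    i = 0
    while i < len(df):
        st = i  # first row of the current span
        doc = df.loc[st, "document"]
        tok = df.loc[st, "token"]
        pred_tok = df.loc[st, "label"]
        if pred_tok == 'O':
            i += 1
            continue
        lab = pred_tok.split('-')[1]
        cur_doc = doc
        cur_lab = lab
        last_tok = tok
        cur_tok = last_tok

        # Extend the span while rows stay in the same document, keep the
        # same entity type, and have consecutive token indices.
        while i < len(df) and cur_doc == doc and cur_lab == lab and last_tok == cur_tok:
            last_tok = cur_tok + 1
            i += 1
            # Check the bound before reading row i, so that a span ending
            # on the last row does not raise a KeyError.
            if i >= len(df) or df.loc[i, "label"] == 'O':
                break
            cur_doc = df.loc[i, "document"]
            cur_tok = df.loc[i, "token"]
            cur_lab = df.loc[i, "label"].split('-')[1]

        # If the previous row is a "\n" token and the row before it carries
        # the same entity type in the same document, treat this span as a
        # continuation: tag the newline and the whole span with the I- label.
        if (st - 2 >= 0
                and df.loc[st - 2, "document"] == df.loc[st, "document"]
                and df.loc[st - 1, "token_str"] == '\n'
                and df.loc[st - 2, "label"] != 'O'
                and df.loc[st - 2, "label"].split('-')[1] == lab):
            df.loc[st - 1, "label"] = 'I-' + lab
            df.loc[st - 1, "score"] = 1
            for j in range(st, i):
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab
            continue

        # Otherwise force the span into B, I, I, ... order.
        for j in range(st, i):
            if j == st:
                if df.loc[j, "label"] != 'B-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'B-' + lab
            else:
                if df.loc[j, "label"] != 'I-' + lab:
                    df.loc[j, "score"] = 1
                    df.loc[j, "label"] = 'I-' + lab

        # Drop NAME_STUDENT spans that contain a two-character token made of
        # a capital letter followed by a period (for example "T.").
        if lab == 'NAME_STUDENT' and any(len(item) == 2 and item[0].isupper() and item[1] == "." for item in df.loc[st:i-1, 'token_str']):
            for j in range(st, i):
                df.loc[j, "score"] = 0
                df.loc[j, "label"] = 'O'

    return df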

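Ensemble

Average Ensemble

We average the predicted probabilities to get the final result. Because recall matters more than precision, I set the threshold to 0.0 so as not to miss any potentially correct prediction.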

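# final_token_pred[text_id][word_idx] holds the averaged class probabilities.
# Index 12 is taken to be the 'O' class and indices 0-11 the PII classes.
for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        probs = final_token_pred[text_id][word_idx]
        pred = probs.argmax(-1)                 # best class overall
        pred_without_O = probs[:12].argmax(-1)  # best PII class
        # Use the best PII class only when the 'O' probability is below the
        # threshold. With a threshold of 0.0 and non-negative probabilities
        # this branch does not fire, so the plain argmax is used.
        if probs[12] < 0.0:
            final_pred = pred_without_O
        else:
            final_pred = pred
        tmp_score = probs[final_pred]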

Voting Ensemble

In the final submission we ensembled 7 models; a label is accepted as a correct prediction when at least two models predict it.

import numpy as np

# Turn each model's per-token probabilities into a one-hot vote and add the
# votes to final_token_pred (assumed to start from zero vectors).
for tmp_pred in single_pred:
    for text_id in tmp_pred:
        for word_idx in tmp_pred[text_id]:
            max_id = tmp_pred[text_id][word_idx].argmax(-1)
            tmp_pred[text_id][word_idx] = np.zeros(tmp_pred[text_id][word_idx].shape)
            tmp_pred[text_id][word_idx][max_id] = 1.0
        for word_idx in tmp_pred[text_id]:
            final_token_pred[text_id][word_idx] += tmp_pred[text_id][word_idx]

# Keep the most-voted label when at least two models chose it;
# otherwise fall back to index 12 ('O').
for text_id in final_token_pred:
    for word_idx in final_token_pred[text_id]:
        pred = final_token_pred[text_id][word_idx].argmax(-1)
        if final_token_pred[text_id][word_idx][pred] >= 2:
            final_pred = pred
        else:
            final_pred = 12
        tmp_score = final_token_pred[text_id][word_idx][final_pred]

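Inference

Dual-GPU Inference

Using T4*2 GPUs can double inference speed compared with a single GPU. When ensembling 8 models, the largest usable max_length is 896; with 7 models, max_length can be set to 1024, which is the better value. (The code is in the submissions directory of the GitHub repository.)
Converting non-English characters (lowered the LB)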

import string

def replace_non_english_chars(text):
    # Map accented Latin letters to their closest ASCII equivalents.
    mapping = {
        'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'å': 'a',
        'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
        'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
        'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'ø': 'o',
        'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
        'ÿ': 'y',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss'
    }

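    result = []
    for char in text:
        if char not in string.ascii_letters:
            # Look up the lowercase form, then restore the original case
            # so that capitalized names stay capitalized.
            replacement = mapping.get(char.lower())
            if replacement:
                result.append(replacement.upper() if char.isupper() else replacement)
            else:
                result.append(char)
        else:
            result.append(char)

    return ''.join(result)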

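Two-stage LLM (unsuccessful)

We used the GPT-4 API to label about 10,000 non-student names, because student names are the most common label type. We hoped to improve the model's accuracy on this particular label type.

I tried fine-tuning a Mistral-7b model on the name-related labels, but the LB score dropped significantly.

So I tried few-shot generation with Mistral-7b instead, to judge whether content predicted as "name student" is actually a name. (Here we cannot expect the model to tell whether a name belongs to a student; we can only rule out predictions that are clearly not names.)

The prompt is below. It gave a very small LB improvement, less than 0.001.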

f"I'll give you a name, and you need to tell me if it's a normal person name, cited name or even not a name. Do not consider other factors.\nExample:\n- Is Matt Johnson a normal person name? Answer: Yes\n- Is Johnson. T a normal person name? Answer: No, this is likely a cited name.\n- Is Andsgjdu a normal person name? Answer: No, it is even not a name.\nNow the question is:\n- Is {name} a normal person name? Answer:"

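Submission

Models	LB	PB	Selected
7 single models, each above 0.974 on the LB	0.978	0.964	Yes
Two 4-fold cross-validation models, with LB scores of 0.977 and 0.974	0.978	0.961	Yes
Ensemble of 3 single models (LB 0.979) plus one set of 4-fold cross-validation models (LB 0.977), using the voting ensemble	0.979	0.963	Yes
Ensemble of 2 single models	0.972	0.967	No
Ensemble of 4 single models	0.979	0.967	No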

Code

LB 0.978 PB 0.964 https://www.kaggle.com/code/lizhecheng/ensemble-replace-and-no-replace
LB 0.978 PB 0.961 https://www.kaggle.com/code/lizhecheng/ensemble-replace-and-no-replace/notebook
LB 0.979 PB 0.963 https://www.kaggle.com/code/lizhecheng/vote-ensemble-replace-and-no-replace

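Conclusion

Thanks to my teammates. We have known each other through Kaggle for more than half a year, and I am lucky to be learning and improving together with you: @rdxsun, @bianshengtao, @xuanmingzhang777, @tonyarobertson.

Going too far is as bad as not going far enough. - Confucius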

```
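The fine-tuning section mentions replacing \n\n with | in every document before training. A minimal sketch of that step follows, assuming each document is a dict with a `tokens` list. The helper name `replace_double_newlines` and the exact-match rule are illustrative assumptions, not taken from the repository.

```python
def replace_double_newlines(doc, old="\n\n", new="|"):
    """Return a shallow copy of doc in which tokens equal to `old` become `new`.

    Hypothetical helper: only tokens that match `old` exactly are replaced,
    and every other key of the document is carried over unchanged.
    """
    out = dict(doc)
    out["tokens"] = [new if tok == old else tok for tok in doc["tokens"]]
    return out


example = {"tokens": ["Hello", "\n\n", "world"], "labels": ["O", "O", "O"]}
cleaned = replace_double_newlines(example)
print(cleaned["tokens"])
```

Because tokens are replaced one for one, the label list stays aligned with the token list, and the same function would have to be applied at inference time so that training and test inputs match.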

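The voting rule from the ensemble section (accept a label only when at least two models agree, otherwise predict O) can be illustrated on plain label strings. This is a simplified sketch rather than the team's code: `vote`, `min_votes` and `default` are hypothetical names, and ties are broken by first occurrence here instead of by class index.

```python
from collections import Counter


def vote(labels, min_votes=2, default="O"):
    """Return the most common label if it has at least min_votes votes.

    `labels` holds one predicted label per model for a single token.
    When no label reaches min_votes, `default` is returned instead.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else default


agreed = vote(["B-NAME_STUDENT", "B-NAME_STUDENT", "O"])
split = vote(["B-EMAIL", "O", "B-NAME_STUDENT"])
print(agreed, split)
```

In the first call two models agree, so their label is kept; in the second call every label has a single vote, so the token falls back to O.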