11th Place Solution: Trust CV

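First, I would like to thank Kaggle and NBME for organizing such a great competition, and congratulate all the winners. I started working on it about 10 days before the deadline, initially aiming for a silver medal, so finishing with gold was a thrill. My solution relies on plain engineering rather than any secret trick, and builds on the top-scoring baseline. Given the time constraint, I had to rely entirely on my CV strategy, which I adapted from the baseline (5-fold GroupKFold).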

Tokenizer

The baseline had a few minor tokenizer-related problems, which I found and fixed along the way.

False negatives (FN) in labels

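The baseline's create_label function does not correctly handle sentences that end with a single keyword: after tokenization, these phrases are labeled as negative. To fix this, I changed the following code: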

From

                if start_idx == -1:
                    start_idx = end_idx

to

                if start_idx == -1:
                    start_idx = end_idx - 1
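To make the effect of this one-line change concrete, here is a minimal sketch. The surrounding loop is a simplified reconstruction, not the baseline's exact code: `label_tokens`, its `fixed` flag, and the offsets in the usage note are hypothetical, and special tokens are assumed to report the offset (0, 0).

```python
def label_tokens(offsets, start, end, fixed=True):
    """Mark the tokens covering characters [start, end) with 1.

    Simplified stand-in for the labeling loop; `fixed` toggles the
    one-line change described above.
    """
    labels = [0] * len(offsets)
    start_idx = end_idx = -1
    for idx, (tok_s, tok_e) in enumerate(offsets):
        # The first token that starts after `start` means the span
        # began in the previous token.
        if start_idx == -1 and start < tok_s:
            start_idx = idx - 1
        # The first token whose end reaches `end` closes the span.
        if end_idx == -1 and end <= tok_e:
            end_idx = idx + 1
    if start_idx == -1:
        # No later token starts after `start`: the span lies in the last
        # real token. Without the fix the slice below is empty.
        start_idx = end_idx - 1 if fixed else end_idx
    if start_idx != -1 and end_idx != -1:
        for i in range(start_idx, end_idx):
            labels[i] = 1
    return labels
```

For the text "chest pain" with offsets [(0, 0), (0, 5), (5, 10), (0, 0)] and the annotation "pain" at characters 6 to 10, the original assignment labels nothing, while the corrected one marks the " pain" token.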

Trimmed offsets

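As noted in the discussion, the RoBERTa tokenizer by default returns offsets with whitespace trimmed. To work around this, set trim_offsets=False when loading the tokenizer. However, I found that for some models (e.g. Electra, Ernie, Albert, Funnel, MPNet) the offsets are still trimmed even with trim_offsets set to False. I wrote a small function to correct this manually: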

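def fix_offsets(offset_mapping, text):
    # Re-attach the whitespace that trimmed offsets drop, so that each
    # token starts where the previous one ended.
    n = len(offset_mapping)
    if n == 0:
        return []
    if n == 1:
        return [offset_mapping[0]]
    fixed = [offset_mapping[0]]  # first entry is copied unchanged
    last_e = 0
    for i in range(1, n-1):
        s, e = offset_mapping[i]
        # A gap before a non-space character means the offset was trimmed.
        if text[s] != ' ' and s != last_e:
            s = last_e
        last_e = e
        fixed.append((s, e))
    fixed.append(offset_mapping[n-1])  # last entry is copied unchanged
    return fixed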
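One way to tell whether a given tokenizer still needs this correction is to look for whitespace that no token covers. The helper below is a hypothetical sketch, not part of the original solution, and it assumes special tokens are reported with an empty span such as (0, 0).

```python
def has_trimmed_offsets(offset_mapping, text):
    """Return True if whitespace between two tokens is covered by no token."""
    last_e = 0
    for s, e in offset_mapping:
        if s == e:
            # Assumed convention: empty spans are special tokens.
            continue
        # A whitespace-only gap between the previous end and this start
        # means the leading space was trimmed from this token.
        if s > last_e and text[last_e:s].isspace():
            return True
        last_e = e
    return False
```

With the text "chest pain", the offsets [(0, 0), (0, 5), (6, 10), (0, 0)] are reported as trimmed, while [(0, 0), (0, 5), (5, 10), (0, 0)] are not.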

Misaligned annotations

In the baseline's get_results function, one line adds 1 to every index:

result = np.where(char_prob > th)[0] + 1

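This assumes every token starts with a space, which can drop the first character. To fix this, I removed the "+ 1" and stripped the spaces from both ends of each span instead: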

import itertools

import numpy as np

def my_get_results(char_logits, texts, th=0):
    results = []
    for i, char_prob in enumerate(char_logits):
        # Indices of characters predicted positive (no "+ 1" shift).
        result = np.where(char_prob > th)[0]
        # Group consecutive indices into runs.
        result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
        temp = []
        for r in result:
            s, e = min(r), max(r)
            # Strip spaces from both ends, without letting the bounds cross.
            while s < e and texts[i][s] == ' ':
                s += 1
            while e > s and texts[i][e] == ' ':
                e -= 1
            temp.append(f"{s} {e+1}")
        result = temp
        result = ";".join(result)
        results.append(result)
    return results
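The groupby expression in my_get_results is compact, so here is the same grouping idea written out on its own. This is an illustrative sketch: `consecutive_runs` is a hypothetical name, and it returns half-open (start, end) pairs rather than the formatted strings used above.

```python
from itertools import groupby

def consecutive_runs(indices):
    """Split sorted integer indices into half-open (start, end) runs."""
    runs = []
    # Within a run of consecutive integers, value minus position is
    # constant, so that difference works as the grouping key.
    for _, group in groupby(enumerate(indices), key=lambda p: p[1] - p[0]):
        values = [v for _, v in group]
        runs.append((values[0], values[-1] + 1))
    return runs
```

For example, the indices [1, 2, 3, 7, 8, 12] split into the spans (1, 4), (7, 9) and (12, 13).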

Model

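My models are simple and trained with the Hugging Face Trainer. With little time for hyperparameter tuning, I used the settings below and left everything else at its default: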

For large models: lr=1e-5, bs=4, weight_decay=0, warmup_ratio=0.2, epoch=10. For models that do not fit in 24GB RAM (the xlarge class), I used bs=2, lr=5e-6.
For base models: lr=3e-5, bs=16, weight_decay=0, warmup_ratio=0.2, epoch=10.
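These settings could be collected in a small lookup like the sketch below. The keys follow Hugging Face TrainingArguments parameter names. Two things are assumptions: that bs means the per-device batch size, and that the xlarge models kept the same weight decay, warmup ratio and epoch count as the large ones.

```python
# Values shared by every model size (assumed to apply to xlarge as well).
COMMON = {"weight_decay": 0.0, "warmup_ratio": 0.2, "num_train_epochs": 10}

# Size-specific values; "bs" is assumed to be per_device_train_batch_size.
SETTINGS = {
    "base": {"learning_rate": 3e-5, "per_device_train_batch_size": 16},
    "large": {"learning_rate": 1e-5, "per_device_train_batch_size": 4},
    "xlarge": {"learning_rate": 5e-6, "per_device_train_batch_size": 2},
}

def training_kwargs(size):
    """Return keyword arguments for one model size ("base", "large", "xlarge")."""
    return {**COMMON, **SETTINGS[size]}
```

The resulting dictionary can be unpacked into TrainingArguments together with an output directory, leaving every other option at its default.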

I combined 14 models to create pseudo-labels, with ensemble weights chosen by Optuna in the range (0, 1).
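The post does not say how the weighted predictions were combined, so the sketch below assumes a normalized weighted average of per-character probabilities. `blend` is a hypothetical helper. In an Optuna study, each weight could be drawn with trial.suggest_float(name, 0, 1) and the blended predictions scored against the CV labels.

```python
def blend(model_probs, weights):
    """Weighted average of per-character probabilities from several models.

    model_probs: one list of probabilities per model, all the same length.
    weights: one non-negative weight per model (e.g. sampled in (0, 1)).
    """
    total = sum(weights)
    n_chars = len(model_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, model_probs)) / total
        for j in range(n_chars)
    ]
```

Dividing by the sum of the weights keeps the blended values on the same scale as the inputs, so a single threshold can still be applied afterwards.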
