0.577 single model with full code | Winning solution

Author: Abhishek Thakur (Grandmaster) | Competition rank: 14th place

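The first step was to train an MLM (masked language model) on the old datasets. I took all the data from the previous Feedback competitions and fine-tuned deberta-v3-large using the run_mlm.py script that ships with the transformers library. The final perplexity was around 4.5. I could have trained longer, but I didn't.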
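As a reminder of what the reported number means, perplexity is the exponential of the mean cross-entropy loss on the masked tokens. The sketch below only illustrates that relationship; the function name `perplexity_from_loss` is made up for this example and is not part of run_mlm.py.

```python
import math


def perplexity_from_loss(mean_masked_lm_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_masked_lm_loss)


# A perplexity of about 4.5 corresponds to an eval loss of about ln(4.5).
loss_for_ppl_4_5 = math.log(4.5)
print(round(loss_for_ppl_4_5, 3), round(perplexity_from_loss(loss_for_ppl_4_5), 3))
```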

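Next comes the more interesting part. I was inspired by @nbroad's kernel, which showed how to treat this problem as a token classification task. That kernel is what got me interested in the competition, and I decided to give it a try. The idea is simple but effective.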

The essay text is represented as follows:

[CLS]some text [CLS_{discourse_type}] some valid discourse text [END_{discourse_type}] ..... [SEP]

The mappings:

disc_types = [
    "Claim",
    "Concluding Statement",
    "Counterclaim",
    "Evidence",
    "Lead",
    "Position",
    "Rebuttal",
]
cls_tokens_map = {label: f"[CLS_{label.upper()}]" for label in disc_types}
end_tokens_map = {label: f"[END_{label.upper()}]" for label in disc_types}
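To make the layout concrete, here is a minimal sketch of how an essay could be wrapped with these marker tokens. It assumes discourses are given as non-overlapping character spans; the helper `mark_discourses` and the sample essay are hypothetical and not taken from the original code.

```python
from typing import List, Tuple

disc_types = [
    "Claim",
    "Concluding Statement",
    "Counterclaim",
    "Evidence",
    "Lead",
    "Position",
    "Rebuttal",
]
cls_tokens_map = {label: f"[CLS_{label.upper()}]" for label in disc_types}
end_tokens_map = {label: f"[END_{label.upper()}]" for label in disc_types}


def mark_discourses(essay: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, discourse_type) character span in its markers.

    Assumption: spans do not overlap. [CLS] and [SEP] are written by hand
    here only to mirror the layout shown above.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(essay[cursor:start])  # text outside any discourse
        pieces.append(
            f"{cls_tokens_map[label]} {essay[start:end]} {end_tokens_map[label]}"
        )
        cursor = end
    pieces.append(essay[cursor:])  # trailing text after the last discourse
    return "[CLS]" + "".join(pieces) + "[SEP]"


essay = "Intro. Phones are useful. The end."
marked = mark_discourses(essay, [(7, 25, "Claim")])
print(marked)
```

In a real pipeline the marker strings would presumably also have to be registered with the tokenizer as additional tokens so they are not split into word pieces; the source does not show that step.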

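I prepared the data as follows:

Convert the data to the old format (I have made my kernel for this part public)
For each token that is not part of a valid discourse, add the O label as its discourse type
For each token that is not part of a valid discourse, add -100 as its discourse effectiveness label
If a token is one of the discourse CLS or END tokens, or is part of a valid discourse, use its discourse type id as the label
If a token is one of the discourse CLS or END tokens, or is part of a valid discourse, use its discourse effectiveness id as the label
Discourse effectiveness labels are kept only for the CLS_{discourse_type} and END_{discourse_type} tokens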
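The labelling rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original preprocessing: the id mappings, the effectiveness names (Ineffective, Adequate, Effective), the segment format, and the helper `build_labels` are all hypothetical.

```python
IGNORE_INDEX = -100  # label value the loss is expected to skip

# Assumed id assignments; the source does not give the actual mappings.
type_to_id = {"O": 0, "Claim": 1, "Evidence": 2}
eff_to_id = {"Ineffective": 0, "Adequate": 1, "Effective": 2}


def build_labels(segments):
    """Build per-token discourse type ids and effectiveness labels.

    segments: list of (tokens, discourse_type, effectiveness) tuples, where
    discourse_type and effectiveness are None for text outside a discourse.
    Two marker tokens (CLS_x and END_x) are assumed around each discourse.
    """
    input_types, input_labels = [], []
    for tokens, disc_type, effectiveness in segments:
        if disc_type is None:
            # Outside any valid discourse: type O, effectiveness ignored.
            input_types += [type_to_id["O"]] * len(tokens)
            input_labels += [IGNORE_INDEX] * len(tokens)
            continue
        # Markers and inner tokens all carry the discourse type id.
        input_types += [type_to_id[disc_type]] * (len(tokens) + 2)
        # Only the two marker positions keep the effectiveness label.
        eff = eff_to_id[effectiveness]
        input_labels += [eff] + [IGNORE_INDEX] * len(tokens) + [eff]
    return input_types, input_labels


segments = [
    (["Intro", "."], None, None),
    (["Phones", "are", "useful", "."], "Claim", "Adequate"),
]
types, labels = build_labels(segments)
print(types)
print(labels)
```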

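Now, for each essay, we have:

input_ids
input_types (discourse type ids, one per input id)
input_labels (the actual labels; most are -100, and only the start and end positions of each discourse carry a real label)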
attention_mask

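Now it was time to train the model. I made one key change to the model's embedding layer, which gave a nice boost: add another embedding layer for the discourse types (input_types) and add its output to the original embeddings: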

class DebertaV2Embeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        pad_token_id = getattr(config, "pad_token_id", 0)
        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
        self.word_embeddings = nn.Embedding(config.vocab_size, self.embedding_size, padding_idx=pad_token_id)
        self.disc_type_embeddings = nn.Embedding(9, self.embedding_size)
        .
        .
        .

    def forward(
        self, input_ids=None, token_type_ids=None, disc_type_ids=None, position_ids=None, mask=None, inputs_embeds=None
    ):
        .
        .
        if self.config.disc_type_vocab_size > 0:
            disc_type_embeddings = self.disc_type_embeddings(disc_type_ids)
            embeddings += disc_type_embeddings
        .
        .
        .
        return embeddings

Apart from that, I used a polynomial learning rate scheduler and PyTorch's AdamW optimizer.
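For readers unfamiliar with polynomial decay, the schedule can be sketched in plain Python as below. The warmup handling, power, and learning rates are assumptions chosen for illustration; the post does not state the actual hyperparameters. The transformers library provides a scheduler of this shape, `get_polynomial_decay_schedule_with_warmup`, though the post does not say whether that exact helper was used.

```python
def polynomial_decay_lr(
    step: int,
    total_steps: int,
    lr_init: float,
    lr_end: float = 0.0,
    power: float = 1.0,
    warmup_steps: int = 0,
) -> float:
    """Learning rate at `step` under polynomial decay with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 up to lr_init.
        return lr_init * step / warmup_steps
    if step >= total_steps:
        return lr_end
    # Fraction of the decay phase that is still ahead of us.
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end


# Hypothetical values, only to show the shape of the curve.
for step in (0, 25, 50, 75, 100):
    print(step, polynomial_decay_lr(step, total_steps=100, lr_init=1e-5))
```

With power set to 1.0 this reduces to plain linear decay; larger powers drop the learning rate faster early on.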

At inference time, I simply average the probabilities of the CLS_{discourse_type} and END_{discourse_type} tokens to get the final label.
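A minimal sketch of that averaging step, assuming the model yields one logit vector per token and that there are three effectiveness classes. The logits and the helper names `softmax` and `average_marker_probs` are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_marker_probs(cls_logits, end_logits):
    """Average class probabilities at a discourse's CLS_x and END_x positions."""
    cls_probs = softmax(cls_logits)
    end_probs = softmax(end_logits)
    return [(c + e) / 2.0 for c, e in zip(cls_probs, end_probs)]


# Hypothetical logits at the two marker positions of one discourse.
probs = average_marker_probs([0.2, 1.5, 0.1], [0.0, 0.9, 1.0])
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```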

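That's it. Getting a good score with a single model (5 folds) was not too complicated. Running a few variations and averaging the models got us 14th place on the leaderboard.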
