Summary

Two-stage pipeline: retrieval + re-ranking
The pipeline is shown in the diagram below:

[Figure: pipeline diagram]

1. Retrieval

Input data:
- Topic: `title + description + [SEP-Depth] + level + [SEP-context] + context + [SEP-children] + children`
- Content: `title + description + text + [SEP-Kind] + kind`
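
As a rough illustration of the two templates above, the sketch below assembles the input strings from plain dicts. The field order and the special tokens come from the templates; the dict-based records, the single-space joiner, the skipping of empty fields, and the helper names `build_topic_text` / `build_content_text` are assumptions made for this example, not the original code.

```python
def _join(parts):
    """Join non-empty parts with single spaces (the joiner is an assumption)."""
    return " ".join(str(p) for p in parts if p not in (None, ""))


def build_topic_text(topic):
    # title + description + [SEP-Depth] + level + [SEP-context] + context
    # + [SEP-children] + children
    return _join([
        topic.get("title"),
        topic.get("description"),
        "[SEP-Depth]", topic.get("level"),
        "[SEP-context]", topic.get("context"),
        "[SEP-children]", topic.get("children"),
    ])


def build_content_text(content):
    # title + description + text + [SEP-Kind] + kind
    return _join([
        content.get("title"),
        content.get("description"),
        content.get("text"),
        "[SEP-Kind]", content.get("kind"),
    ])


# Hypothetical records, for illustration only.
topic = {"title": "Fractions", "description": "", "level": 2,
         "context": "Math > Arithmetic", "children": "Adding fractions"}
content = {"title": "Intro", "description": "Video intro",
           "text": None, "kind": "video"}
topic_text = build_topic_text(topic)
content_text = build_content_text(content)
```

Missing fields are simply dropped here, so the special tokens still mark where each segment starts.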
Train/validation split: StratifiedGroupKFold (y=channel, group=topic_id); only 1 fold was used
Model:
- Bi-Encoder (Sentence-Transformers)
- Loss: NT-Xent loss (paper link)
- Pretrained models:
  - (1) xlm-roberta-base
  - (2) sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- Tokenizer: added special tokens ([SEP-Depth], etc.)
- batch_size: 256, max_len=128
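
To make the loss concrete, here is a minimal pure-Python sketch of an in-batch NT-Xent loss. It assumes a one-directional (topic to content) form in which the diagonal of the batch similarity matrix holds the positive pairs and every other content in the batch acts as a negative. The temperature value and the function name `nt_xent_loss` are assumptions; the write-up does not state them, and the actual training code presumably used a tensor library rather than Python lists.

```python
import math


def nt_xent_loss(sim, temperature=0.05):
    """In-batch NT-Xent over a square similarity matrix.

    sim[i][j] is the cosine similarity between topic i and content j.
    The diagonal holds the positive pairs; off-diagonal entries are negatives.
    The default temperature is an assumed value.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / n


# Well-separated pairs should give a lower loss than indistinguishable ones.
separated = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], temperature=1.0)
uniform = nt_xent_loss([[0.5, 0.5], [0.5, 0.5]], temperature=1.0)
```

With a batch size of 256, each topic would see 255 in-batch negatives under this formulation.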

| Pretrained model | Training data | Rec@10 | Rec@50 | f2@10 | pub@10 | pri@10 |
|---|---|---|---|---|---|---|
| xlm-roberta-base | train | 76.8 | 91.1 | 50.3 | 46.9 | 46.9 |
| paraphrase-multilingual-mpnet-base-v2 | train | 78.5 | 91.5 | 51.5 | 47.2 | 47.4 |
| paraphrase-multilingual-mpnet-base-v2 | train+valid | 93.3 | 99.0 | 62.1 | 48.9 | 49.5 |
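
The column definitions are not spelled out here, so the following sketch rests on assumptions: Rec@k is taken to be the fraction of a topic's true contents found in the top-k retrieved list, and f2@k the F-beta score with beta=2 computed on the same top-k list. The function names are made up for this example.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the first k ranked ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(ranked[:k]))
    return hits / len(relevant)


def f2_at_k(ranked, relevant, k):
    """F-beta with beta=2 (recall weighted above precision) on the top k."""
    relevant = set(relevant)
    predicted = set(ranked[:k])
    tp = len(predicted & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(relevant)
    return 5 * precision * recall / (4 * precision + recall)


# Hypothetical ranking for one topic.
ranked = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c4", "c9"}
r2 = recall_at_k(ranked, relevant, 2)
r4 = recall_at_k(ranked, relevant, 4)
f2 = f2_at_k(ranked, relevant, 2)
```

Per-topic values like these would then be averaged over all validation topics to get a single number per column.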

2. Candidate selection

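Compute embeddings with each model, then compute the cosine similarity between every topic and every content item.
For each model, select the top 50 by cosine similarity -> select the duplicated candidates.
Top 10: public=53.4, private=55.4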
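
The steps above can be sketched in plain Python as follows. How the duplicated candidates are selected is not described in detail, so this sketch assumes it means keeping content ids that appear in more than one model's top-k list. That interpretation, the `min_models` parameter, and all function names are assumptions. A real run over all topics and contents would use batched matrix operations rather than Python loops.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(topic_vec, content_vecs, k):
    """Return the k content ids most similar to the topic, best first."""
    scored = sorted(content_vecs.items(),
                    key=lambda item: cosine(topic_vec, item[1]),
                    reverse=True)
    return [cid for cid, _ in scored[:k]]


def shared_candidates(per_model_lists, min_models=2):
    """Keep ids nominated by at least `min_models` models, in first-seen order."""
    counts = Counter(cid for ids in per_model_lists for cid in set(ids))
    seen, kept = set(), []
    for ids in per_model_lists:
        for cid in ids:
            if counts[cid] >= min_models and cid not in seen:
                seen.add(cid)
                kept.append(cid)
    return kept


# Toy embeddings from two hypothetical models for the same topic.
list_a = top_k([1.0, 0.1], {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}, k=2)
list_b = top_k([1.0, 0.1], {"c1": [0.0, 1.0], "c2": [1.0, 0.0], "c3": [1.0, 1.0]}, k=2)
candidates = shared_candidates([list_a, list_b])
```

Setting `min_models=1` turns the same helper into a de-duplicated union of the per-model lists, which is the other plausible reading.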

3. Re-ranking

Input data:

    title + description + [SEP-Depth] + level + [SEP-context] + context + \
    [SEP-children] + children + [SEP] + \
    title + description + text + [SEP-Kind] + kind

Train/validation split: same as stage 1
Model:
- Cross-Encoder
- Loss: BCE loss
- Adversarial training: FGM
- batch_size: 128, max_len=256
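
To show how BCE loss and FGM fit together, here is a toy pure-Python sketch. A linear scorer over a single embedding vector stands in for the cross-encoder, which is a large simplification. The FGM step perturbs the embedding by `r = epsilon * g / ||g||`, where `g` is the gradient of the loss with respect to the embedding. The `epsilon` value, the toy scorer, and all names are assumptions; the write-up does not give the FGM settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def score(weights, embedding):
    """Toy stand-in for the cross-encoder: sigmoid of a dot product."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)))


def fgm_perturbation(weights, embedding, label, epsilon=1.0):
    """FGM step: r = epsilon * g / ||g||_2, with g = dLoss/dEmbedding."""
    p = score(weights, embedding)
    # For sigmoid + BCE, dLoss/dz = p - y and dz/dEmbedding = weights.
    grad = [(p - label) * w for w in weights]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


weights = [0.5, -1.0]    # hypothetical model parameters
embedding = [1.0, 2.0]   # hypothetical input embedding
label = 1

clean_loss = bce_loss(score(weights, embedding), label)
r = fgm_perturbation(weights, embedding, label)
adv_embedding = [x + d for x, d in zip(embedding, r)]
adv_loss = bce_loss(score(weights, adv_embedding), label)
```

In a typical FGM training loop, the gradients from the clean and the perturbed forward passes are accumulated before the optimizer step, and the embedding is restored afterwards.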

32nd place solution
