3rd Place Solution

Authors: heng (Grandmaster) and teammates xiamaozi11, syzong, sayoulala, yzheng21
Competition: Learning Equality - Curriculum Recommendations

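First of all, I would like to thank the organizers for hosting this high-quality competition, and my excellent teammates @xiamaozi11 @syzong @sayoulala @yzheng21; all of us put a great deal of hard work into it. I learned a lot from the excellent notebooks and discussions, and essentially every method we used came from the Kaggle community. Thanks to these generous and smart Kagglers!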

Summary

CV strategy
Stage 1: Retriever
Stage 2: Ranker
Finding the threshold
Post-processing
Ensembling

Training Pipeline

[Figure: Training Pipeline]

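CV Strategy

We held out only 4,000 random topics (category != 'source') as validation data. These topics were never used in any training stage. This simple CV strategy turned out to be surprisingly stable.

In the last month of the competition, we reduced the hold-out set from 4,000 topics to 1,000. The results remained fairly consistent until we started ensembling.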
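The hold-out split described above can be sketched as follows. This is a minimal illustration rather than the team's actual code: the function name `split_holdout_topics`, the random seed, and the toy DataFrame are assumptions, and only the `id` and `category` columns of `topics.csv` are relied on.

```python
import pandas as pd

def split_holdout_topics(topics: pd.DataFrame, n_holdout: int = 4000, seed: int = 42):
    """Return (train_topics, holdout_topics) with disjoint rows (hypothetical helper)."""
    # Only topics whose category is not 'source' are eligible for the hold-out set.
    eligible = topics[topics["category"] != "source"]
    holdout = eligible.sample(n=n_holdout, random_state=seed)
    # Everything not held out, including all 'source' topics, stays in training.
    train = topics.drop(index=holdout.index)
    return train, holdout

# Toy data standing in for topics.csv (made-up values, for illustration only).
toy = pd.DataFrame({
    "id": [f"t_{i}" for i in range(10)],
    "category": ["source", "aligned", "supplemental", "source", "aligned"] * 2,
})
train_topics, holdout_topics = split_holdout_topics(toy, n_holdout=3, seed=0)
```

Fixing the seed keeps the held-out topics identical across experiments, which is what makes scores from different models comparable.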

Retriever

We used the unsupervised SimCSE (Simple Contrastive Learning of Sentence Embeddings: https://github.com/princeton-nlp/SimCSE) method to train the retriever models.

Training the retriever

Only the positive pairs from correlations.csv are used for unsupervised SimCSE training
For the validation set, 100 negative samples in the same language are randomly selected for each validation topic
Content text format: title [SEP] kind [SEP] description [SEP] text, maxlen = 256 (string level)
Topic text format: title [SEP] channel [SEP] category [SEP] level [SEP] language [SEP] description [SEP] context [SEP] parent_description [SEP] children_description, maxlen = 256 (string level)
simcse_unsup_loss

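import torch
import torch.nn.functional as F

def simcse_unsup_loss(feature_topic: torch.Tensor, feature_content: torch.Tensor) -> torch.Tensor:
    # Row i of feature_topic is paired with row i of feature_content;
    # all other rows in the batch act as in-batch negatives.
    y_true = torch.arange(feature_topic.size(0), device=feature_topic.device)
    # Pairwise cosine similarity matrix of shape (batch, batch).
    sim = F.cosine_similarity(feature_topic.unsqueeze(1), feature_content.unsqueeze(0), dim=2)
    sim = sim / 0.05  # temperature
    # cross_entropy already averages over the batch.
    return F.cross_entropy(sim, y_true)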

Source: https://github.com/yangjianxin1/SimCSE/blob/master/model.py
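To see what this loss rewards, here is a small NumPy re-implementation of the same arithmetic (cosine similarity, temperature 0.05, cross-entropy against the diagonal), run on toy embeddings. It is a sanity-check sketch, not the training code: the function name `in_batch_contrastive_loss` and the random toy vectors are assumptions made for illustration.

```python
import numpy as np

def in_batch_contrastive_loss(feature_topic, feature_content, temperature=0.05):
    """Cross-entropy over in-batch cosine similarities; row i should match column i."""
    # L2-normalise so that the dot product equals cosine similarity.
    t = feature_topic / np.linalg.norm(feature_topic, axis=1, keepdims=True)
    c = feature_content / np.linalg.norm(feature_content, axis=1, keepdims=True)
    logits = (t @ c.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
topic = rng.normal(size=(4, 8))
# Content embeddings that sit close to their paired topic embeddings.
aligned = topic + 0.01 * rng.normal(size=(4, 8))
# The same content embeddings, deliberately paired with the wrong topics.
shuffled = np.roll(aligned, shift=1, axis=0)

loss_aligned = in_batch_contrastive_loss(topic, aligned)
loss_shuffled = in_batch_contrastive_loss(topic, shuffled)
```

Correctly paired embeddings give a much lower loss than mis-paired ones, so minimising the loss pulls each topic towards its own content and away from the other contents in the batch.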

Example training code:

for step, (inputs_topic, inputs_content, labels) in enumerate(train_loader):
    # Collate the tokenized topic and content batches and move them to the device.
    inputs_topic = collate(inputs_topic)
    for k, v in inputs_topic.items():
        inputs_topic[k] = v.to(device)
    inputs_content = collate(inputs_content)
    for k, v in inputs_content.items():
        inputs_content[k] = v.to(device)
    batch_size = labels.size(0)
    with torch.cuda.amp.autocast(enabled=CFG.apex):
        # The same encoder embeds both sides of each (topic, content) pair.
        feature_topic = model(inputs_topic)
        feature_content = model(inputs_content)
        loss = simcse_unsup_loss(feature_topic, feature_content)
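The `inputs_topic` and `inputs_content` batches above are tokenized from text strings in the formats listed under "Training the retriever". A minimal sketch of how such strings might be assembled is shown below. Several details are assumptions, not taken from the write-up: the helper names, the exact separator spacing (`" [SEP] "`), treating missing fields as empty strings, and reading "string level" as truncation to 256 characters before tokenization. The derived fields `context`, `parent_description`, and `children_description` are named in the write-up but not defined there, so they are simply looked up as given.

```python
def join_fields(fields, max_chars=256):
    """Join fields with ' [SEP] ' and truncate the resulting string (assumed behaviour)."""
    # Missing values become empty strings so that field positions stay fixed.
    text = " [SEP] ".join("" if f is None else str(f) for f in fields)
    return text[:max_chars]

def build_content_text(row, max_chars=256):
    # Field order follows the content text format in the write-up.
    keys = ["title", "kind", "description", "text"]
    return join_fields([row.get(k) for k in keys], max_chars)

def build_topic_text(row, max_chars=256):
    # Field order follows the topic text format in the write-up.
    keys = ["title", "channel", "category", "level", "language", "description",
            "context", "parent_description", "children_description"]
    return join_fields([row.get(k) for k in keys], max_chars)

# Made-up rows, represented as plain dicts for illustration.
content_row = {"title": "Fractions", "kind": "video", "description": None, "text": "x" * 500}
topic_row = {"title": "Algebra", "language": "en"}

content_text = build_content_text(content_row)
topic_text = build_topic_text(topic_row)
```

Keeping empty fields in place (rather than dropping them) means the model always sees the same number of `[SEP]` markers, so each position keeps a consistent meaning.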

Performance on the 1,000-topic validation data:

Model	F2@5	Top-50 max positive score