5th place solution (Training Strategy)

Author: Takamichi Toda (tird) | Competition: Rainforest Connection Species Audio Detection | Rank: 5th place

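Congratulations to all the participants, and many thanks to the organizers for hosting this competition! It was a very hard but fun competition :)

In this thread I will introduce our approach to the training strategy.
The model ensemble part will be written by my teammate.

Our team ensembled each member's best models.

My model is a ResNet18 with an SED head. It scores LWLRAP Public LB=0.949 / Private LB=0.951, and I trained it on Google Colab using Theo Viel's npz dataset (32 kHz, 128 mels). Many thanks to Theo Viel!!

Our approach has 3 stages.
My teammates differ in some details, such as the base model and hyperparameters, but the overall strategy is roughly the same.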

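Stage 1: Pre-training

※ I don't think this part is important. My teammate Ahmet skipped it.

This stage transfers what the model learned on ImageNet to spectrograms.

Theo Viel's npz dataset can be treated as images of size 128x3751.
I crop each image to a width of 512 around the sound located by t_min and t_max.
I train on these crops using tp_train plus 30 samples drawn from fp_train.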
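
The cropping step can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the crop is centered on the midpoint of the labeled interval and clamped to the spectrogram bounds, and the name crop_around_label is hypothetical. The seconds-to-frame mapping mirrors the 3751 * (t / 60) conversion used later in the post.

```python
import numpy as np

N_MELS, N_FRAMES, CROP = 128, 3751, 512  # spectrogram shape and crop width from the post
CLIP_SECONDS = 60

def crop_around_label(spec, t_min, t_max, crop=CROP):
    """Return a (n_mels, crop) slice centered on the labeled interval.

    Centering and clamping are assumptions made for this sketch.
    """
    n_frames = spec.shape[1]
    # Convert seconds to frame indices
    head = int(n_frames * (t_min / CLIP_SECONDS))
    tail = int(n_frames * (t_max / CLIP_SECONDS))
    middle = (head + tail) // 2
    # Clamp the window start so the crop stays inside the spectrogram
    start = min(max(middle - crop // 2, 0), n_frames - crop)
    return spec[:, start:start + crop], start

spec = np.random.rand(N_MELS, N_FRAMES).astype(np.float32)  # stand-in for one npz spectrogram
patch, start = crop_around_label(spec, t_min=44.5, t_max=45.1)
```

Clamping keeps every crop exactly 512 frames wide, even when the labeled sound sits at the very start or end of the recording.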

Parameters:

Adam
learning_rate=1e-3
CosineAnnealingLR(T_max=10)
epoch=50
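
To show what these settings imply, here is a small sketch of the resulting learning-rate curve. It uses the closed-form cosine annealing expression given in the PyTorch documentation for CosineAnnealingLR. It assumes the scheduler is stepped once per epoch and that the minimum learning rate keeps its default of 0; neither detail is stated in the post.

```python
import math

BASE_LR = 1e-3   # learning_rate from the list above
T_MAX = 10       # half-period of the cosine, in epochs
EPOCHS = 50
ETA_MIN = 0.0    # assumption: default minimum learning rate

def cosine_annealing_lr(epoch, base_lr=BASE_LR, t_max=T_MAX, eta_min=ETA_MIN):
    """Closed form: eta_min + (base_lr - eta_min) * (1 + cos(pi * epoch / t_max)) / 2."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# One learning rate per epoch, assuming one scheduler step per epoch
schedule = [cosine_annealing_lr(e) for e in range(EPOCHS)]
```

Under these assumptions the learning rate falls from 1e-3 to 0 over the first 10 epochs, climbs back to 1e-3 by epoch 20, and keeps oscillating with a 20-epoch period for the rest of the 50 epochs.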

Stages 2 and 3 both start from these trained weights.

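Stage 2: Pseudo-label re-labeling

The purpose of stage 2 is to improve the model and then use it to generate pseudo labels.

It starts from the weights trained in stage 1.

I think the key point is to compute the loss, and therefore the gradient, only on labeled frames. Positive labels are sampled only from tp_train.csv and negative labels only from fp_train.csv.
I set positive labels to 1 and negative labels to -1.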

import numpy as np

# train_tp / train_fp are the DataFrames read from tp_train.csv / fp_train.csv.
# Columns 1, 3, 4, 5, 6 are species_id, t_min, f_min, t_max, f_max.
tp_dict = {}
for recording_id, df in train_tp.groupby("recording_id"):
    tp_dict[recording_id+"_posi"] = df.values[:, [1,3,4,5,6]]

fp_dict = {}
for recording_id, df in train_fp.groupby("recording_id"):
    fp_dict[recording_id+"_nega"] = df.values[:, [1,3,4,5,6]]

def extract_seq_label(label, value):
    seq_label = np.zeros((24, 3751))  # (species, time frame)
    middle = np.ones(24) * -1         # -1 means the species has no label in this clip
    for species_id, t_min, f_min, t_max, f_max in label:
        # Convert seconds to frame indices on the 3751-frame time axis
        h, t = int(3751*(t_min/60)), int(3751*(t_max/60))
        m = (t + h)//2
        middle[species_id] = m
        seq_label[species_id, h:t] = value
    return seq_label, middle.astype(int)

# Extract the positive labels and their middle points
fname = "00204008d" + "_posi"
posi_label, posi_middle = extract_seq_label(tp_dict[fname], 1)

# Extract the negative labels and their middle points
fname = "00204008d" + "_nega"
nega_label, nega_middle = extract_seq_label(fp_dict[fname], -1)

The loss function is as follows:

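import torch
import torch.nn as nn

# Assumed here; in the original notebook `device` is defined elsewhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def rfcx_2nd_criterion(outputs, targets):
    # targets: (batch, 24, frames) with 1 = positive, -1 = negative, 0 = unlabeled
    clipwise_preds_att_ti = outputs["clipwise_preds_att_ti"]
    # A class counts as labeled in a clip if any frame carries that label
    posi_label = ((targets == 1).sum(2) > 0).float().to(device)
    nega_label = ((targets == -1).sum(2) > 0).float().to(device)
    posi_y = torch.ones(clipwise_preds_att_ti.shape).to(device)
    nega_y = torch.zeros(clipwise_preds_att_ti.shape).to(device)
    posi_loss = nn.BCEWithLogitsLoss(reduction="none")(clipwise_preds_att_ti, posi_y)
    nega_loss = nn.BCEWithLogitsLoss(reduction="none")(clipwise_preds_att_ti, nega_y)
    # Mask out every class without a label so it contributes no loss or gradient
    posi_loss = (posi_loss * posi_label).sum()
    nega_loss = (nega_loss * nega_label).sum()
    loss = posi_loss + nega_loss
    return loss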
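
To check the masking numerically, here is a NumPy re-implementation of the same idea. It is a sketch for illustration, not the author's code: the names bce_with_logits and masked_clip_loss are hypothetical, and the toy input uses 3 classes and 8 frames instead of 24 and 3751.

```python
import numpy as np

def bce_with_logits(logits, y):
    """Numerically stable element-wise binary cross-entropy on raw logits."""
    return np.maximum(logits, 0) - logits * y + np.log1p(np.exp(-np.abs(logits)))

def masked_clip_loss(logits, targets):
    """logits: (batch, classes); targets: (batch, classes, frames) with values in {1, 0, -1}."""
    # A class is positive / negative in a clip if any frame carries that label
    posi_mask = ((targets == 1).sum(axis=2) > 0).astype(np.float64)
    nega_mask = ((targets == -1).sum(axis=2) > 0).astype(np.float64)
    posi_loss = (bce_with_logits(logits, 1.0) * posi_mask).sum()
    nega_loss = (bce_with_logits(logits, 0.0) * nega_mask).sum()
    return posi_loss + nega_loss

targets = np.zeros((1, 3, 8))
targets[0, 0, 2:5] = 1    # class 0: positive frames
targets[0, 1, 1:4] = -1   # class 1: negative frames; class 2 stays unlabeled

# Only the logit of the unlabeled class 2 differs between the two calls
loss_a = masked_clip_loss(np.array([[0.0, 0.0, 5.0]]), targets)
loss_b = masked_clip_loss(np.array([[0.0, 0.0, -5.0]]), targets)
```

Both calls return the same loss, because the prediction for the unlabeled class is multiplied by a zero mask. That is the property the criterion relies on: the model is neither rewarded nor penalized for classes that have no annotation in the clip.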

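The images are cropped with a sliding window and stacked.
I set the window size to 512 and slide it across the full 60 seconds of audio with a small overlap. Consecutive windows overlap by 49 pixels, since important sounds may lie on a split boundary.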
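
The windowing can be sketched as below. This is a minimal illustration under stated assumptions: the window size, overlap, and spectrogram width come from the post, but the helper names and the handling of the last window (clamped to the end of the clip rather than padded) are my own choices.

```python
import numpy as np

N_FRAMES, WINDOW, OVERLAP = 3751, 512, 49  # values from the post
STRIDE = WINDOW - OVERLAP                  # 463 frames between window starts

def window_starts(n_frames=N_FRAMES, window=WINDOW, stride=STRIDE):
    """Start indices covering the whole clip; the last window is clamped to the end (assumption)."""
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] + window < n_frames:
        starts.append(n_frames - window)
    return starts

def stack_windows(spec, window=WINDOW, stride=STRIDE):
    """Crop the spectrogram along time and stack into (n_windows, n_mels, window)."""
    starts = window_starts(spec.shape[1], window, stride)
    return np.stack([spec[:, s:s + window] for s in starts])

spec = np.zeros((128, N_FRAMES), dtype=np.float32)  # stand-in for one 60 s spectrogram
batch = stack_windows(spec)
```

With these numbers a 3751-frame clip yields 8 windows, so one recording becomes a stack of 8 crops that can be passed through the model together.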
