top 7th solution - `sumix` augmentation did all the work

MSU-YSDA-HSE 7th place solution: sumix augmentation was the key breakthrough

Author: tingir (Kaggle Master)
Posted: May 25, 2023, 20:45 UTC
Competition rank: 7th place, BirdClef 2023
Team members: Danil Gaynanov, Mihail Dremin, slime

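Highlights

Key ingredients of our best inference solution:

Ensemble of 19 models
A mix of efficientnet_b2 from torchvision and rexnet150 from timm
OpenVINO to speed up inference
The submitted models were trained with knowledge distillation on the full dataset (marked with the rkl prefix)

Five models with the rkl-rx-wdfbn-rx-ema-200-full prefix score 0.7471 on the private test set and 0.83194 on the public test set (rx stands for rexnet150; these are the distilled models).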

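Custom augmentations

Imagine you are at a noisy party: you can still focus on one particular group of people talking. This is similar to the task of picking out a specific sound source from audio. So what if a recording contains both crying and laughing? Then it contains both behaviors, and the same holds for bird sounds. Based on this we propose the sumup augmentation: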

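import torch


def sumup(waves: torch.Tensor, labels: torch.Tensor):
    # Pair every sample with a random partner from the same batch.
    batch_size = len(labels)
    perm = torch.randperm(batch_size)

    # Overlay the two waveforms at equal volume.
    waves = waves + waves[perm]

    # Union of the two multi-hot label vectors, clipped back into [0, 1].
    return {
        "waves": waves,
        "labels": torch.clip(labels + labels[perm], min=0, max=1)
    }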
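
To make the effect on a batch concrete, here is a toy run with two one-hot clips. This is only an illustration: the tensors and the two-class setup are made up, and `sumup_with_perm` is a hypothetical variant that takes the permutation as an argument (instead of calling `torch.randperm`) so the result is reproducible.

```python
import torch


def sumup_with_perm(waves: torch.Tensor, labels: torch.Tensor, perm: torch.Tensor):
    """sumup with the permutation passed in (hypothetical helper for a reproducible demo)."""
    return {
        "waves": waves + waves[perm],
        "labels": torch.clip(labels + labels[perm], min=0, max=1),
    }


# Two toy "clips" of 4 samples each, one class per clip (illustrative values).
waves = torch.tensor([[1.0, 1.0, 1.0, 1.0], [0.5, -0.5, 0.5, -0.5]])
labels = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
perm = torch.tensor([1, 0])  # swap the two clips

out = sumup_with_perm(waves, labels, perm)
print(out["waves"])   # both rows are now the sum of the two clips
print(out["labels"])  # both rows now carry both classes
```

After mixing, every sample is labeled with the classes of both source clips at full weight, no matter how loud each clip is.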

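Going further: sumix

When one bird is clearly louder than the other, the recording still contains both birds, but the probability assigned to the weaker signal can reasonably be lower. For this case we propose sumix, an augmentation for the audio domain: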

def sumix(waves: torch.Tensor, labels: torch.Tensor, max_percent: float = 1.0, min_percent: float = 0.3):
    batch_size = len(labels)
    perm = torch.randperm(batch_size)
    # Independent per-sample volume coefficients, uniform in [min_percent, max_percent).
    coeffs_1 = torch.rand(batch_size, device=waves.device).view(-1, 1) * (
        max_percent - min_percent
    ) + min_percent
    coeffs_2 = torch.rand(batch_size, device=waves.device).view(-1, 1) * (
        max_percent - min_percent
    ) + min_percent
    # Label weight is 1 for coefficients >= 0.5 and falls linearly (as 2 * coeff) below that.
    label_coeffs_1 = torch.where(coeffs_1 >= 0.5, 1, 1 - 2 * (0.5 - coeffs_1))
    label_coeffs_2 = torch.where(coeffs_2 >= 0.5, 1, 1 - 2 * (0.5 - coeffs_2))
    labels = label_coeffs_1 * labels + label_coeffs_2 * labels[perm]

    # Weighted sum of each waveform and its randomly chosen partner.
    waves = coeffs_1 * waves + coeffs_2 * waves[perm]
    return {
        "waves": waves,
        "labels": torch.clip(labels, 0, 1)
    }

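Compared with sumup, sumix also varies the volume, which acts as an additional audio augmentation. Our experiments suggest it is applicable to audio classification tasks in general. Note that we used linear label weights, but the best curve for the label weights is still an open question; with min_percent=0.3 the slope may not be needed at all.

A note on noise: we tried adding background noise but got no improvement. The competition data already contains many noisy samples, so sumix and sumup naturally expand the variety of noise through mixing.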
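
The linear label weighting is easier to judge with numbers. The sketch below isolates the `torch.where` expression from sumix in a helper named `label_weight` (a name introduced here for illustration only) and evaluates it on a few hand-picked volume coefficients.

```python
import torch


def label_weight(coeffs: torch.Tensor) -> torch.Tensor:
    """Label weight as in sumix: 1 at or above 0.5, otherwise 1 - 2 * (0.5 - coeff) = 2 * coeff."""
    return torch.where(coeffs >= 0.5, torch.ones_like(coeffs), 1 - 2 * (0.5 - coeffs))


# Hand-picked volume coefficients inside the default range [0.3, 1.0].
coeffs = torch.tensor([0.3, 0.4, 0.5, 0.75, 1.0])
weights = label_weight(coeffs)
for c, w in zip(coeffs.tolist(), weights.tolist()):
    print(f"volume coeff {c:.2f} -> label weight {w:.2f}")
```

With the default min_percent=0.3 the label weight never drops below 0.6, so even the quietest mixed-in clip keeps a fairly strong target.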

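Training pipeline

concatmix augmentation (p=0.5), implementation code
sumix(min_percent=0.3, max_percent=1.0) (p=1.0)
mixup on mel spectrograms, Beta(1.5, 1.5) (p=1.0)
cutmix on mel spectrograms, Beta(1.5, 1.5) (p=0.5)
Optimizer: AdamW, constant learning rate 1e-3, weight decay 1e-2
Loss: per-class binary cross-entropy (on both primary and secondary labels)
Training uses a random contiguous 5-second crop; validation uses the first 5 seconds
Weight decay is turned off for all normalization layers
Exponential moving average (EMA) with decay 0.999

The CV of the 20-model ensemble correlated strongly with the leaderboard scores. The best CV ensemble scores 0.75021 on the private test set and 0.83328 on the public test set (our 3rd-place submission on the public leaderboard).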
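
Two of the items above, disabling weight decay for normalization layers and keeping an EMA of the weights, can be sketched roughly as follows. This is not our training code: `build_param_groups`, `ema_update`, the list of normalization types and the toy network are all assumptions made for the example, and copying buffers verbatim into the EMA model is just one possible choice.

```python
import copy

import torch
from torch import nn

# Assumed set of normalization layers; extend it to match the real backbone.
NORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.GroupNorm, nn.LayerNorm)


def build_param_groups(model: nn.Module, weight_decay: float = 1e-2):
    """Split parameters so that normalization layers get no weight decay."""
    decay, no_decay = [], []
    for module in model.modules():
        # recurse=False: only the parameters owned directly by this module.
        for param in module.parameters(recurse=False):
            if not param.requires_grad:
                continue
            (no_decay if isinstance(module, NORM_TYPES) else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = 0.999):
    """ema = decay * ema + (1 - decay) * current, applied to every parameter."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
    # Buffers (e.g. BatchNorm running statistics) are copied as-is in this sketch.
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)


# Toy stand-in for the real backbone.
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.BatchNorm2d(8), nn.ReLU(), nn.Conv2d(8, 4, 3))
groups = build_param_groups(model)
optimizer = torch.optim.AdamW(groups, lr=1e-3)  # constant learning rate, no scheduler
ema_model = copy.deepcopy(model).eval()

# One dummy training step followed by an EMA update.
loss = model(torch.randn(2, 1, 16, 16)).mean()
loss.backward()
optimizer.step()
ema_update(ema_model, model)
```

Validation and inference would then use `ema_model` rather than `model`.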

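Knowledge distillation

The models trained in the first stage serve as teachers for student models of the same architecture. The distillation loss is: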

$$L = 0.33 \times \text{BCE}_{\text{targets}} + 0.34 \times \text{KL}(p_{\text{student}} || p_{\text{teacher}}) + 0.33 \times \text{KL}(p_{\text{teacher}} || p_{\text{student}})$$
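
A rough PyTorch sketch of this loss is shown below. The formula does not say how the KL terms are computed for multi-label outputs, so the sketch assumes a per-class Bernoulli KL on sigmoid probabilities, averaged over classes, with no temperature; `bernoulli_kl` and `distillation_loss` are names made up for the example.

```python
import torch
import torch.nn.functional as F


def bernoulli_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Mean per-class KL(p || q) between Bernoulli distributions with probabilities p and q."""
    p = p.clamp(eps, 1 - eps)
    q = q.clamp(eps, 1 - eps)
    kl = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    return kl.mean()


def distillation_loss(student_logits, teacher_logits, targets):
    p_student = torch.sigmoid(student_logits)
    p_teacher = torch.sigmoid(teacher_logits).detach()  # no gradient flows into the teacher
    bce = F.binary_cross_entropy_with_logits(student_logits, targets)
    return (
        0.33 * bce
        + 0.34 * bernoulli_kl(p_student, p_teacher)  # KL(student || teacher)
        + 0.33 * bernoulli_kl(p_teacher, p_student)  # KL(teacher || student)
    )


# Random stand-ins for one batch with 264 classes.
torch.manual_seed(0)
student_logits = torch.randn(4, 264, requires_grad=True)
teacher_logits = torch.randn(4, 264)
targets = torch.randint(0, 2, (4, 264)).float()

loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```

When the student matches the teacher exactly, both KL terms vanish and only the 0.33-weighted BCE term remains.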

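Model details

rexnet150: dropout=0.3, drop_path=0.2
rexnet150-time: a variant with a temporal attention head, distilled from the base version
efficientnet-b2: dropout=0.3, drop_path=0.2
efficientnet-b2-time: a variant with a temporal attention head

With the techniques above, 5 rexnet150 models trained on the full dataset (batch_size=336, 200 epochs, a single A100) reach 0.7471 on the private test set. We then added efficientnet-b2 and the temporal attention heads; an ensemble of 5 models with attention heads scores 0.75151 on the private test set and 0.83261 on the public test set.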

What we did not try

External data other than BirdClef2021/2022
Self-supervised learning (SSL)
Sound event detection (SED) models
Using BirdNet
eca_nfnet models

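What did not work

Pretraining on BirdClef2021/2022 (no improvement, and it slowed down experiments)
Various background noises (including those from previous competitions' solutions)
Learning rate schedulers
Gaussian / colored noise
Models such as tf_efficientnet_b2_ns
SwinV2 / ConvNeXt architectures
Manifold mixup
Focal loss
Weighted sampler
SpecAugment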

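Inference speed-up

Tests in a Kaggle kernel (Intel Xeon 2.2GHz) showed that ONNX uses 2 CPU cores, while OpenVino speeds up inference noticeably through layer fusion and CPU-specific compilation.

Figure: inference time comparison

Appendix: MultiHeadAttentionClassifier implementation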

from math import sqrt

import torch
import torch.nn.functional as F
from torch import Tensor, nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(
        self,
        input_channel: int,
        head_size: int,
        num_heads: int,
        attention_dropout: float,
    ) -> None:
        super().__init__()
        hidden_dim = head_size * num_heads
        self.hidden_dim = hidden_dim

        self.head_size = head_size
        self.num_heads = num_heads

        self.key = nn.Linear(input_channel, hidden_dim)
        self.query = nn.Linear(input_channel, hidden_dim)

        self.attention_dropout = nn.Dropout(attention_dropout)
        self.sqrt_head_size = sqrt(head_size)

        self.value = nn.Linear(input_channel, hidden_dim)

    def transpose_for_scores(self, x: Tensor) -> Tensor:
        # (batch, seq, hidden_dim) -> (batch, num_heads, seq, head_size)
        new_x_shape = x.size()[:-1] + (self.num_heads, self.head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def get_key(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.key(x))

    def get_query(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.query(x))

    def get_value(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.value(x))

    def forward(self, x: Tensor) -> Tensor:
        key = self.get_key(x)
        query = self.get_query(x)
        value = self.get_value(x)

        attention_scores = torch.matmul(query, key.transpose(-1, -2)) 
        attention_scores /= self.sqrt_head_size
        attention_scores = F.softmax(attention_scores, dim=-1)
        attention_scores = self.attention_dropout(attention_scores)
        x = torch.matmul(attention_scores, value)
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(x.shape[:2] + (self.hidden_dim,))
        return x

class MultiHeadAttentionClassifier(MultiHeadSelfAttention):
    def __init__(
        self,
        input_channel: int,
        head_size: int = 32,
        num_heads: int = 24,
        attention_dropout: float = 0.3,
        num_classes: int = 264,
        dropout: float = 0.3
    ) -> None:
        super().__init__(input_channel, head_size, num_heads, attention_dropout)
        self.query = nn.Parameter(
            torch.empty(num_heads, num_classes, head_size),
            requires_grad=True
        )
        nn.init.normal_(self.query)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Conv1d(num_classes, num_classes, kernel_size=self.hidden_dim, groups=num_classes)
        )

    def get_query(self, x: Tensor) -> Tensor:
        return self.query

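    def forward(self, x: Tensor) -> Tensor:
        # Flatten the last two dims, then move channels last so each position is one token.
        x = x.view(x.shape[:-2] + (-1,))
        x = x.permute(0, 2, 1)

        out = super().forward(x)
        # squeeze(-1) drops only the length-1 conv output dim, so a batch of size 1 keeps its batch dim.
        return self.classifier(out).squeeze(-1)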
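
To follow the tensor shapes through this head without the class machinery, here is a stand-alone trace of the same computation with plain tensor ops. The input layout (batch, channels, freq, time) and the concrete sizes of the feature map are assumptions for the example; the head hyperparameters are the defaults from the class above. Dropout is left out.

```python
import torch
import torch.nn.functional as F

batch, channels, freq, time = 2, 64, 4, 10  # illustrative backbone output shape
num_heads, head_size, num_classes = 24, 32, 264
hidden = num_heads * head_size

features = torch.randn(batch, channels, freq, time)
tokens = features.flatten(2).permute(0, 2, 1)  # (batch, freq*time, channels)

key_proj = torch.nn.Linear(channels, hidden)
value_proj = torch.nn.Linear(channels, hidden)
# One query per head and class; a learned nn.Parameter in the real head.
class_queries = torch.randn(num_heads, num_classes, head_size)


def split_heads(x: torch.Tensor) -> torch.Tensor:
    # (batch, seq, hidden) -> (batch, heads, seq, head_size)
    return x.view(x.shape[0], x.shape[1], num_heads, head_size).permute(0, 2, 1, 3)


keys = split_heads(key_proj(tokens))
values = split_heads(value_proj(tokens))

# The class queries broadcast over the batch dimension.
scores = class_queries @ keys.transpose(-1, -2) / head_size ** 0.5  # (batch, heads, classes, seq)
attn = F.softmax(scores, dim=-1)  # each class attends over all positions
pooled = attn @ values  # (batch, heads, classes, head_size)
pooled = pooled.permute(0, 2, 1, 3).reshape(batch, num_classes, hidden)

# A grouped Conv1d with kernel_size == hidden acts as one small linear layer per class.
per_class_linear = torch.nn.Conv1d(num_classes, num_classes, kernel_size=hidden, groups=num_classes)
logits = per_class_linear(pooled).squeeze(-1)  # (batch, classes)
print(tokens.shape, scores.shape, pooled.shape, logits.shape)
```

Each class gets its own attention map over the positions and its own output weights, so the head produces one logit per class from a class-specific pooled feature.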
