BirdCLEF 2023 | birdclef-2023
Key techniques in our best inference solution:
efficientnet_b2 from torchvision and rexnet150 from timm (identified by the rkl prefix). Five of the models, using the rkl-rx-wdfbn-rx-ema-200-full prefix, scored 0.7471 on the private test set and 0.83194 on the public test set (rx stands for rexnet150; these models were optimized via distillation).
Imagine you are at a noisy party: you can easily focus on one particular conversation. This is similar to extracting a specific sound source from audio. But what about audio that contains both crying and laughter? It carries both behaviors, and the same applies to bird calls. Based on this observation we propose the sumup augmentation:
import torch

def sumup(waves: torch.Tensor, labels: torch.Tensor) -> dict:
    # Add each waveform to a randomly permuted waveform from the same batch
    # and take the union of the two multi-label targets.
    batch_size = len(labels)
    perm = torch.randperm(batch_size)
    waves = waves + waves[perm]
    return {
        "waves": waves,
        "labels": torch.clip(labels + labels[perm], min=0, max=1),
    }
When one bird's call in a clip is much louder than another's, the clip contains both species, but the predicted probability of the weaker signal may be low. To account for this we propose sumix, an augmentation tailored to the audio domain:
def sumix(
    waves: torch.Tensor,
    labels: torch.Tensor,
    max_percent: float = 1.0,
    min_percent: float = 0.3,
) -> dict:
    batch_size = len(labels)
    perm = torch.randperm(batch_size)
    # Per-sample random volume coefficients in [min_percent, max_percent].
    coeffs_1 = torch.rand(batch_size, device=waves.device).view(-1, 1) * (
        max_percent - min_percent
    ) + min_percent
    coeffs_2 = torch.rand(batch_size, device=waves.device).view(-1, 1) * (
        max_percent - min_percent
    ) + min_percent
    # Label weight: full weight when the volume coefficient is >= 0.5,
    # then a linear ramp down for quieter mixes.
    label_coeffs_1 = torch.where(coeffs_1 >= 0.5, 1, 1 - 2 * (0.5 - coeffs_1))
    label_coeffs_2 = torch.where(coeffs_2 >= 0.5, 1, 1 - 2 * (0.5 - coeffs_2))
    labels = label_coeffs_1 * labels + label_coeffs_2 * labels[perm]
    waves = coeffs_1 * waves + coeffs_2 * waves[perm]
    return {
        "waves": waves,
        "labels": torch.clip(labels, 0, 1),
    }
The advantage of sumix over sumup is that it also varies the volume, which acts as an extra audio augmentation. Our experiments suggest the strategy is useful across audio classification tasks. Note that we used a linear weighting scheme for the labels; the optimal label-weight curve remains an open question, and with min_percent=0.3 the slope may not even be necessary.
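To make the label weighting concrete: for a volume coefficient c, the piecewise rule above gives weight 1 when c >= 0.5 and 1 - 2*(0.5 - c) = 2c below that, so it is a simple ramp that never reaches zero because sumix draws c from [min_percent, 1.0]. A minimal sketch (illustrative, pure Python):

```python
def label_weight(c: float) -> float:
    # Piecewise-linear rule from sumix: full label weight when the mixed-in
    # signal keeps at least half its volume, a linear ramp (equal to 2*c) below.
    return 1.0 if c >= 0.5 else 1.0 - 2.0 * (0.5 - c)

label_weight(0.9)  # 1.0: loud enough, full label weight
label_weight(0.4)  # 0.8
label_weight(0.3)  # 0.6: the weakest possible weight with min_percent=0.3
```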
A note on noise handling: we tried adding background noise but saw no improvement. The competition data already contains many noisy samples, and sumix and sumup naturally expand the noise diversity through mixing.
Our full augmentation and training recipe:
- concatmix augmentation (probability 0.5)
- sumix(min_percent=0.3, max_percent=1.0) (probability 1.0)
- mixup augmentation, Beta(1.5, 1.5) (probability 1.0)
- cutmix augmentation, Beta(1.5, 1.5) (probability 0.5)
- AdamW, fixed learning rate 1e-3, weight decay 1e-2

The CV score of the 20-model ensemble correlated strongly with the leaderboard: the best-CV ensemble scored 0.75021 on the private test set and 0.83328 on the public test set (our 3rd-ranked public-leaderboard submission).
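As a sketch of how waveform-level mixup with Beta(1.5, 1.5) could look (a generic implementation, not necessarily the authors' exact code; the clipping of blended multi-label targets mirrors sumup/sumix above):

```python
import torch

def mixup(waves: torch.Tensor, labels: torch.Tensor, alpha: float = 1.5):
    # Sample a mixing coefficient lam ~ Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(len(labels))
    mixed_waves = lam * waves + (1.0 - lam) * waves[perm]
    # Multi-label targets: blend the two label vectors and clip to [0, 1].
    mixed_labels = torch.clip(lam * labels + (1.0 - lam) * labels[perm], 0, 1)
    return mixed_waves, mixed_labels
```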
We use the models trained in the first stage as teachers to guide student models of the same architecture. The distillation loss is defined as:
$$L = 0.33 \times \text{BCE}_{\text{targets}} + 0.34 \times \text{KL}(p_{\text{student}} || p_{\text{teacher}}) + 0.33 \times \text{KL}(p_{\text{teacher}} || p_{\text{student}})$$
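A possible PyTorch rendering of this loss (a sketch: it assumes softmax-normalized distributions for the KL terms and raw logits for the BCE term; since the task is multi-label, the authors may have normalized differently):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 targets: torch.Tensor) -> torch.Tensor:
    # BCE against the ground-truth multi-label targets.
    bce = F.binary_cross_entropy_with_logits(student_logits, targets)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input), hence:
    kl_s_t = F.kl_div(log_p_t, log_p_s, log_target=True,
                      reduction="batchmean")  # KL(student || teacher)
    kl_t_s = F.kl_div(log_p_s, log_p_t, log_target=True,
                      reduction="batchmean")  # KL(teacher || student)
    return 0.33 * bce + 0.34 * kl_s_t + 0.33 * kl_t_s
```

When student and teacher agree exactly, both KL terms vanish and the loss reduces to 0.33 times the BCE term.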
Distilled model variants:
- rexnet150: dropout=0.3, drop_path=0.2
- rexnet150-time: variant with a temporal attention head, distilled from the base version
- efficientnet-b2: dropout=0.3, drop_path=0.2
- efficientnet-b2-time: variant with a temporal attention head

With these techniques, 5 rexnet150 models trained on the full dataset (batch_size=336, 200 epochs, single A100) reached 0.7471. Further blending in efficientnet-b2 and the temporal attention mechanism, an ensemble of 5 models with attention heads scored 0.75151 on the private test set and 0.83261 on the public test set.
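The ensembling step can be sketched as averaging per-class sigmoid probabilities over the trained models (an illustrative sketch with tiny stand-in models; the write-up does not state the exact blending weights):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ensemble_probs(models: list, batch: torch.Tensor) -> torch.Tensor:
    # Average per-class sigmoid probabilities over the ensemble members.
    return torch.stack([m(batch).sigmoid() for m in models]).mean(dim=0)

# Stand-in models; the real ensemble uses rexnet150 / efficientnet-b2 variants.
models = [nn.Linear(16, 264).eval() for _ in range(5)]
probs = ensemble_probs(models, torch.randn(2, 16))  # shape: (2, 264)
```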
Benchmarks of eca_nfnet and tf_efficientnet_b2_ns models on Kaggle kernels (Intel Xeon 2.2 GHz) showed that ONNX makes use of the 2 available CPU cores, while OpenVINO, through layer fusion and CPU-specific compilation, delivers significantly faster inference.
[Figure: inference-time comparison]
from math import sqrt

import torch
import torch.nn.functional as F
from torch import Tensor, nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(
        self,
        input_channel: int,
        head_size: int,
        num_heads: int,
        attention_dropout: float,
    ) -> None:
        super().__init__()
        hidden_dim = head_size * num_heads
        self.hidden_dim = hidden_dim
        self.head_size = head_size
        self.num_heads = num_heads
        self.key = nn.Linear(input_channel, hidden_dim)
        self.query = nn.Linear(input_channel, hidden_dim)
        self.attention_dropout = nn.Dropout(attention_dropout)
        self.sqrt_head_size = sqrt(head_size)
        self.value = nn.Linear(input_channel, hidden_dim)

    def transpose_for_scores(self, x: Tensor) -> Tensor:
        # (batch, seq, hidden) -> (batch, num_heads, seq, head_size)
        new_x_shape = x.size()[:-1] + (self.num_heads, self.head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def get_key(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.key(x))

    def get_query(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.query(x))

    def get_value(self, x: Tensor) -> Tensor:
        return self.transpose_for_scores(self.value(x))

    def forward(self, x: Tensor) -> Tensor:
        key = self.get_key(x)
        query = self.get_query(x)
        value = self.get_value(x)
        # Scaled dot-product attention.
        attention_scores = torch.matmul(query, key.transpose(-1, -2))
        attention_scores /= self.sqrt_head_size
        attention_scores = F.softmax(attention_scores, dim=-1)
        attention_scores = self.attention_dropout(attention_scores)
        x = torch.matmul(attention_scores, value)
        # (batch, num_heads, seq, head_size) -> (batch, seq, hidden_dim)
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(x.shape[:2] + (self.hidden_dim,))
        return x
class MultiHeadAttentionClassifier(MultiHeadSelfAttention):
    def __init__(
        self,
        input_channel: int,
        head_size: int = 32,
        num_heads: int = 24,
        attention_dropout: float = 0.3,
        num_classes: int = 264,
        dropout: float = 0.3,
    ) -> None:
        super().__init__(input_channel, head_size, num_heads, attention_dropout)
        # Learned per-class queries replace the input-dependent query projection.
        self.query = nn.Parameter(
            torch.empty(num_heads, num_classes, head_size),
            requires_grad=True,
        )
        nn.init.normal_(self.query)
        # One grouped conv filter per class reduces hidden_dim to a single logit.
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Conv1d(num_classes, num_classes, kernel_size=self.hidden_dim,
                      groups=num_classes),
        )

    def get_query(self, x: Tensor) -> Tensor:
        # Broadcasts over the batch dimension inside the attention matmul.
        return self.query

    def forward(self, x: Tensor) -> Tensor:
        # Flatten backbone feature maps: (batch, C, H, W) -> (batch, H*W, C).
        x = x.view(x.shape[:-2] + (-1,))
        x = x.permute(0, 2, 1)
        out = super().forward(x)  # (batch, num_classes, hidden_dim)
        return self.classifier(out).squeeze()