17th Place Solution: file-level post-process

Author: Spoiler Alert
Competition rank: 17th

Feature Engineering

Log-mel spectrogram

sr = 32000
fmin = 20
fmax = sr // 2

n_channel = 128
n_fft = 2048
hop_length = 512
win_length = n_fft
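
As a quick sanity check on these settings, the sketch below works out the spectrogram shape for a single 5-second clip (the clip length used by the random-clip augmentation). It assumes centered STFT framing, where the frame count is 1 + n_samples // hop_length; the write-up does not state the framing mode, so treat that as an assumption.

```python
# Settings from the write-up.
sr = 32000
n_channel = 128      # number of mel bands
n_fft = 2048
hop_length = 512

clip_seconds = 5     # clip length used by the random-clip augmentation

n_samples = sr * clip_seconds
# Assumption: centered framing, so frames = 1 + n_samples // hop_length.
n_frames = 1 + n_samples // hop_length
# Linear-frequency bins of the STFT before the mel projection.
n_stft_bins = n_fft // 2 + 1

print((n_channel, n_frames))  # (128, 313)
```

Under that assumption, each training example is a 128 x 313 log-mel image computed from 1025 STFT bins.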

Data Augmentation

Random clip
Randomly crop 5-second clips from the audio and keep only the clips whose signal-to-noise ratio (SNR) is above 1e-3.

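import cv2
import numpy as np


def signal_noise_ratio(spec):
    # Work on a float32 copy: cv2.medianBlur with ksize=3 accepts float32 input.
    spec = spec.astype(np.float32)

    # Per-time-frame (column) and per-frequency-band (row) medians.
    col_median = np.median(spec, axis=0, keepdims=True)
    row_median = np.median(spec, axis=1, keepdims=True)

    # Keep only bins that stand out from both medians, then binarize.
    spec[spec < row_median * 1.25] = 0.0
    spec[spec < col_median * 1.15] = 0.0
    spec[spec > 0] = 1.0

    # Remove isolated pixels and close small gaps in the binary mask.
    spec = cv2.medianBlur(spec, 3)
    spec = cv2.morphologyEx(spec, cv2.MORPH_CLOSE, np.ones((3, 3), np.float32))

    # Fraction of bins marked as signal; spec.size covers 2-D and 3-D inputs
    # without the original try/except around spec.shape[2].
    snr = spec.sum() / spec.size

    return snr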
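
A minimal sketch of how the SNR filter might be combined with random cropping. The function name `random_clip`, the `score_fn` callback, the retry limit, and the zero-padding of short recordings are all hypothetical choices for illustration; the write-up only states that clips are 5 seconds long and must score above 1e-3.

```python
import numpy as np


def random_clip(audio, sr, score_fn, clip_seconds=5.0, threshold=1e-3,
                max_tries=10, rng=None):
    """Draw random fixed-length clips until one scores above `threshold`.

    `score_fn` maps a waveform clip to a scalar, for example a log-mel
    transform followed by an SNR estimate. If no clip passes within
    `max_tries`, the last candidate is returned (a hypothetical fallback).
    """
    rng = np.random.default_rng() if rng is None else rng
    clip_len = int(sr * clip_seconds)
    if len(audio) <= clip_len:
        # Short recordings: zero-pad to the clip length (assumed behaviour).
        return np.pad(audio, (0, clip_len - len(audio)))
    clip = None
    for _ in range(max_tries):
        start = rng.integers(0, len(audio) - clip_len + 1)
        clip = audio[start:start + clip_len]
        if score_fn(clip) > threshold:
            break
    return clip
```

In practice `score_fn` would wrap the spectrogram computation and the SNR function, so that the threshold is applied to the same representation the model sees.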

MixUp
Apply mixup on the LogMelSpec with lambda drawn from a beta(8.0, 8.0) distribution.
Using beta(0.4, 0.4) or beta(1.0, 1.0) yields more true positives, but also more false positives.
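
A minimal sketch of mixup on two log-mel spectrograms. It assumes the standard formulation in which the labels are interpolated with the same weight as the inputs; the write-up does not say how labels were combined, and the function name `mixup` is illustrative.

```python
import numpy as np


def mixup(spec_a, label_a, spec_b, label_b, alpha=8.0, rng=None):
    """Blend two (spectrogram, label) pairs with a Beta(alpha, alpha) weight."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    spec = lam * spec_a + (1.0 - lam) * spec_b
    # Assumption: labels are mixed with the same weight as the inputs.
    label = lam * label_a + (1.0 - lam) * label_b
    return spec, label, lam
```

Beta(8, 8) is concentrated around 0.5 (standard deviation about 0.12), so most mixed examples are close to an even blend. Beta(0.4, 0.4) instead puts most of its mass near 0 and 1, so one of the two clips usually dominates.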
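Noise
Overlay up to 4 noises on the waveform, each with an independent probability and amplitude.
The noises are extracted from the training samples.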

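import numpy as np
from librosa.core import spectrum
from scipy.ndimage import binary_dilation, binary_erosion


def signal_noise_split(audio, n_fft=2048, hop_length=512, win_length=2048):
    # Magnitude spectrogram (power=1.0); _spectrogram is a private librosa helper.
    S, _ = spectrum._spectrogram(y=audio, power=1.0, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

    # Mark bins that exceed 3x both their row median and their column median.
    col_median = np.median(S, axis=0, keepdims=True)
    row_median = np.median(S, axis=1, keepdims=True)
    S[S < row_median * 3] = 0.0
    S[S < col_median * 3] = 0.0
    S[S > 0] = 1

    # Erosion followed by dilation (an opening) drops small isolated blobs.
    S = binary_erosion(S, structure=np.ones((4, 4)))
    S = binary_dilation(S, structure=np.ones((4, 4)))

    # A frame counts as signal if any frequency bin is active; widen it in time.
    indicator = S.any(axis=0)
    indicator = binary_dilation(indicator, structure=np.ones(4), iterations=2)

    # Expand the frame-level indicator to a sample-level mask.
    mask = np.repeat(indicator, hop_length)
    mask = binary_dilation(mask, structure=np.ones(win_length - hop_length), origin=-(win_length - hop_length)//2)
    mask = mask[:len(audio)]
    signal = audio[mask]
    noise = audio[~mask]
    return signal, noise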
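
A minimal sketch of the overlay step, assuming a list of pre-extracted noise waveforms (`noise_bank`). The per-slot probability of 0.5 and the gain range are placeholder values, not taken from the write-up, which only states that up to 4 noises are added with independent probability and amplitude.

```python
import numpy as np


def add_noises(audio, noise_bank, max_noises=4, prob=0.5,
               amp_range=(0.1, 0.5), rng=None):
    """Overlay up to `max_noises` noise clips, each with its own coin flip and gain."""
    rng = np.random.default_rng() if rng is None else rng
    out = audio.astype(np.float32, copy=True)
    for _ in range(max_noises):
        if rng.random() >= prob:
            continue  # this noise slot is skipped
        noise = noise_bank[rng.integers(len(noise_bank))]
        if len(noise) == 0:
            continue
        # Tile or crop the noise so it matches the length of the audio.
        reps = -(-len(out) // len(noise))  # ceiling division
        noise = np.tile(noise, reps)[:len(out)]
        # Hypothetical gain range; each slot draws its own amplitude.
        out += rng.uniform(*amp_range) * noise
    return out
```

Drawing the coin flip and the gain separately for each slot means a clip can receive anywhere from 0 to 4 noise layers at different levels.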

Model

CNN
9-layer CNN.
Within each ConvBlock2D, average pooling over frequency is applied first, followed by max pooling over time.
Each ConvBlock2D contains a SqueezeExcitationBlock.
Pixel shuffle: (n_channel, n_freq, n_time) -> (n_channel * 2, n_freq / 2, n_time).
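
One plausible reading of the pixel-shuffle step is a rearrangement that folds adjacent frequency pairs into the channel axis. The sketch below shows that reshape in NumPy; the exact ordering of the resulting channels is an assumption, and `fold_frequency` is a hypothetical name.

```python
import numpy as np


def fold_frequency(x):
    """(n_channel, n_freq, n_time) -> (n_channel * 2, n_freq // 2, n_time)."""
    c, f, t = x.shape
    assert f % 2 == 0, "n_freq must be even"
    x = x.reshape(c, f // 2, 2, t)   # split frequency into adjacent pairs
    x = x.transpose(0, 2, 1, 3)      # move the pair index next to the channels
    return x.reshape(c * 2, f // 2, t)
```

With this layout, output channel 2 * c + k at frequency j holds input channel c at frequency 2 * j + k, so no values are discarded: frequency resolution is traded for channels while the time axis is untouched.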
CRNN
9-layer CNN followed by a 2-layer bidirectional GRU.
CNN + Transformer Encoder
9-layer CNN followed by an encoder with 8 attention heads.

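Training

Label smoothing: alpha 0.05
Balanced sampler: randomly select at most 150 samples per bird species
5-fold cross-validation stratified by ebird_code
Loss function: BCEWithLogitsLoss
Optimizer: Adam(lr=1e-3)
LR scheduler: CosineAnnealingLR(T_max=10)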
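
A minimal sketch of two of the items above: the balanced sampler and label smoothing. Both function names are hypothetical. The smoothing formula shown (1 becomes 1 - alpha/2, 0 becomes alpha/2) is one common form for binary targets and is an assumption; the write-up gives only alpha = 0.05. Whether the sampler was re-drawn every epoch is also not stated.

```python
import random
from collections import defaultdict


def balanced_sample(ebird_codes, max_per_class=150, seed=None):
    """Return shuffled sample indices with at most `max_per_class` per species."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, code in enumerate(ebird_codes):
        by_class[code].append(idx)
    chosen = []
    for indices in by_class.values():
        k = min(max_per_class, len(indices))
        chosen.extend(rng.sample(indices, k))
    rng.shuffle(chosen)
    return chosen


def smooth_labels(targets, alpha=0.05):
    """Binary label smoothing (assumed form): 1 -> 1 - alpha/2, 0 -> alpha/2."""
    return [t * (1.0 - alpha) + 0.5 * alpha for t in targets]
```

Capping each species at 150 keeps frequent species from dominating a pass over the data, while species with fewer recordings contribute everything they have.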

Post-process

If the model makes a high-confidence prediction for any bird species, then
