2nd place solution

First of all, thanks to Kaggle and the organizers. I learned a lot from this experience.

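1. Model

Early on, feeding a bs×4×H×W input to a 2D-CNN gave worse results than the one-image approach. I started to wonder why.

Because the labels (LPD, GPD, etc.) are position-sensitive, we need to pay attention to the channel dimension. A 2D-CNN is not good at capturing positional information along the channel dimension, because there is no padding in the channel direction. That is why, for a 2D-CNN, the spectrograms have to be concatenated into a single image rather than fed in as a bs×16×H×W image (double banana montage). I therefore decided to use a 3D-CNN for the spectrograms and a 2D-CNN for the raw EEG signals.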

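The overall scheme is as follows:
(scheme diagram)

mne 0.5-20Hz means the signal is filtered with the MNE toolkit.
scipy.signal means the signal is filtered with scipy.signal.
For the reshape operator and the STFT parameters, see the code below.
The numbers in brackets are the final ensemble weights.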

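1.1 x3d-l (spectrogram model)

After the double banana montage, clipping to ±1024 and 0.5-20 Hz filtering, spectrograms are extracted with an STFT and fed to a 3D-CNN (x3d-l).

The input is a 16-channel spectrogram image.

x3d-l: CV 0.21+, public LB 0.25, private LB 0.29.

The STFT code is below; it is used as an nn.Module: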

import torch
import torch.nn as nn
import torchaudio


class Transform50s(nn.Module):
    def __init__(self, ):
        super().__init__()
        self.wave_transform = torchaudio.transforms.Spectrogram(n_fft=512, win_length=128, hop_length=50, power=1)

    def forward(self, x):
        image = self.wave_transform(x)
        image = torch.clip(image, min=0, max=10000) / 1000
        n, c, h, w = image.size()
        image = image[:, :, :int(20 / 100 * h + 10), :]
        return image

class Transform10s(nn.Module):
    def __init__(self, ):
        super().__init__()
        self.wave_transform = torchaudio.transforms.Spectrogram(n_fft=512, win_length=128, hop_length=10, power=1)

    def forward(self, x):
        image = self.wave_transform(x)
        image = torch.clip(image, min=0, max=10000) / 1000
        n, c, h, w = image.size()
        image = image[:, :, :int(20 / 100 * h + 10), :]
        return image

class Model(nn.Module):
    def __init__(self):
        super().__init__()

        model_name = "x3d_l"
        self.net = torch.hub.load('facebookresearch/pytorchvideo',
                                  model_name, pretrained=True)

        self.net.blocks[5].pool.pool = nn.AdaptiveAvgPool3d(1)
        # self.net.blocks[5]=nn.Identity()
        # self.net.avgpool = nn.Identity()
        self.net.blocks[5].dropout = nn.Identity()
        self.net.blocks[5].proj = nn.Identity()
        self.net.blocks[5].activation = nn.Identity()
        self.net.blocks[5].output_pool = nn.Identity()

    def forward(self, x):
        x = self.net(x)
        return x

class Net(nn.Module):
    def __init__(self, num_classes=1):
        super().__init__()

        self.preprocess50s = Transform50s()
        self.preprocess10s = Transform10s()

        self.model = Model()

        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(2048, 6, bias=True)
        self.drop = nn.Dropout(0.5)

    def forward(self, eeg):
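        # do preprocess; eeg is (bs, 16, 10000), the 16 double banana channels
        bs = eeg.size(0)
        eeg_50s = eeg                         # full window
        eeg_10s = eeg[:, :, 4000:6000]        # central 2000 samples
        x_50 = self.preprocess50s(eeg_50s)
        x_10 = self.preprocess10s(eeg_10s)
        x = torch.cat([x_10, x_50], dim=1)    # stack both views: 32 spectrogram channels
        x = torch.unsqueeze(x, dim=1)         # the 32 channels become the depth axis of the 3D-CNN
        x = torch.cat([x, x, x], dim=1)       # replicate to the 3 input channels x3d_l expects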
        x = self.model(x)
        # x = self.pool(x)
        x = x.view(bs, -1)
        x = self.drop(x)
        x = self.fc(x)
        return x

x3d-l spectrogram model
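To make the tensor shapes in the code above concrete, here is a small back-of-the-envelope sketch. It assumes a 200 Hz sampling rate (10000 samples for the full window, 2000 for the central crop) and torchaudio's default `center=True` padding; the helper name `spectrogram_shape` is made up for illustration.

```python
def spectrogram_shape(n_samples, n_fft=512, hop_length=50):
    """Shape (freq_bins, frames) after the crop in Transform50s / Transform10s.

    Assumes a one-sided STFT with center=True (the torchaudio default).
    """
    full_bins = n_fft // 2 + 1                   # 257 one-sided frequency bins
    kept_bins = int(20 / 100 * full_bins + 10)   # same crop as in forward()
    frames = n_samples // hop_length + 1         # center=True pads both ends
    return kept_bins, frames


SAMPLE_RATE = 200  # Hz, an assumption (10000 samples per 50 s window)

shape_50s = spectrogram_shape(10000, hop_length=50)
shape_10s = spectrogram_shape(2000, hop_length=10)

bin_width_hz = SAMPLE_RATE / 512                  # frequency resolution per bin
top_freq_hz = (shape_50s[0] - 1) * bin_width_hz   # centre of the highest kept bin

print(shape_50s, shape_10s, top_freq_hz)
```

Under these assumptions both views come out with the same height and width, which is why they can be concatenated along the channel axis, and the kept bins reach a little above the 20 Hz filter cutoff.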

1.2 One-image model

This is a 2D vision model, efficientnetb5: public LB 0.262490, private LB 0.304877.

For the 2D model, like many others, I concatenated the 16 spectrogram channels into a single image.

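        # image: (n, 16, h, w) -> two groups of 8 channels stacked vertically: (n, 2, 8*h, w)
        image = torch.reshape(image, shape=[n, 2, -1, w])
        x1 = image[:, 0:1, ...]
        x2 = image[:, 1:2, ...]
        # place the two groups side by side: (n, 1, 8*h, 2*w)
        image = torch.cat([x1, x2], dim=-1)
        # repeat to 3 channels for the pretrained 2D backbone
        image = torch.cat([image, image, image], dim=1)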
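To see what this tiling does, here is a NumPy mirror of the snippet run on a toy input. It is only an illustration: the function name `to_one_image` and the toy sizes are made up, and NumPy is used instead of torch just to keep the example light.

```python
import numpy as np


def to_one_image(image: np.ndarray) -> np.ndarray:
    """NumPy mirror of the torch snippet: (n, 16, h, w) -> (n, 3, 8*h, 2*w)."""
    n, c, h, w = image.shape
    image = image.reshape(n, 2, -1, w)           # two groups of 8 channels, stacked vertically
    x1, x2 = image[:, 0:1], image[:, 1:2]
    image = np.concatenate([x1, x2], axis=-1)    # the two groups side by side
    return np.concatenate([image] * 3, axis=1)   # 3 identical colour channels


# Toy input: channel k is filled with the value k so the tiling is visible.
h, w = 4, 5
toy = np.broadcast_to(np.arange(16).reshape(1, 16, 1, 1), (1, 16, h, w)).copy()
tiled = to_one_image(toy)

left_column = tiled[0, 0, ::h, 0]    # first row of each tile in the left half
right_column = tiled[0, 0, ::h, w]   # first row of each tile in the right half
print(tiled.shape, left_column, right_column)
```

Channels 0 to 7 end up stacked top to bottom in the left half and channels 8 to 15 in the right half, so neighbouring channels become neighbouring image rows that the 2D convolutions can see.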

Checking the submission scores, combining these two spectrogram models gives private LB 0.28.

1.3 EEG model

Treat the EEG (bs×16×10000) as an image. Expanding dim=1 gives bs×1×16×10000, but the time dimension is too large, so I reshape it.

The model is defined as follows:

import timm
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self,):
        super(Net, self).__init__()
        self.model = timm.create_model('efficientnet_b5', pretrained=True, in_chans=3)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, out_features=6, bias=True)
        self.dropout = nn.Dropout(p=0.5)

    def extract_features(self, x):
        feature1 = self.model.forward_features(x)
        return feature1

    def forward(self, x):
        bs = x.size(0)
        # split each 10000-sample channel into 10 interleaved rows of 1000 samples
        reshaped_tensor = x.view(bs, 16, 1000, 10)
        reshaped_and_permuted_tensor = reshaped_tensor.permute(0, 1, 3, 2)
        reshaped_and_permuted_tensor = reshaped_and_permuted_tensor.reshape(bs, 16 * 10, 1000)
        x = torch.unsqueeze(reshaped_and_permuted_tensor, dim=1)  # (bs, 1, 160, 1000)

        x = torch.cat([x, x, x], dim=1)  # (bs, 3, 160, 1000)
        bs = x.size(0)

        x = self.extract_features(x)
        x = self.pool(x)
        x = x.view(bs, -1)
        x = self.dropout(x)
        x = self.fc(x)
        return x

The EEG is thus reshaped to bs×3×160×1000. With efficientnetb5: public LB 0.230978, private LB 0.282873.
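The reshape in `forward` is easy to misread, so here is a NumPy mirror of it on a toy tensor whose values encode their own position. The function name `eeg_to_rows` is made up for this illustration.

```python
import numpy as np


def eeg_to_rows(x: np.ndarray) -> np.ndarray:
    """NumPy mirror of view/permute/reshape in Net.forward: (bs, 16, 10000) -> (bs, 160, 1000)."""
    bs = x.shape[0]
    x = x.reshape(bs, 16, 1000, 10)       # sample index t*10 + k -> position [t, k]
    x = x.transpose(0, 1, 3, 2)           # -> position [k, t]
    return x.reshape(bs, 16 * 10, 1000)   # row index = channel*10 + k


# Toy input: value = channel*10000 + sample index.
toy = np.arange(16 * 10000).reshape(1, 16, 10000)
rows = eeg_to_rows(toy)

print(rows.shape)
print(rows[0, 0, :3], rows[0, 1, :3], rows[0, 10, :3])
```

Each channel becomes 10 adjacent rows, where row k holds every 10th sample starting at offset k. The image therefore keeps all samples while shortening the time axis from 10000 to 1000 columns.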

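There are 3 such models with slight differences: efficientnetb5 with mne filtering, efficientnetb5 with scipy.signal filtering, and hgnetb5 with mne filtering. Their scores differ slightly as well.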

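1.4 Dual-head model (EEG + spectrogram)

I use x3d-l to extract spectrogram features (with Transform50s only) and efficientnetb5 to extract raw EEG features, as shown in the scheme diagram, then concatenate the final features. Public LB 0.24, private LB 0.29 (I am not sure about these numbers).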

2. Preprocessing

2.1 Double banana montage; the EEG is 16×10000
2.2 Filter to 0.5-20 Hz
2.3 Clip to ±1024
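The three steps above could look roughly like the following sketch of the scipy.signal variant. Several details are assumptions rather than the original code: the electrode pairs are the standard double banana chains in an arbitrary order, the sampling rate is taken as 200 Hz, and the filter order and the zero-phase `sosfiltfilt` call are guesses.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Standard double banana (longitudinal bipolar) pairs; the channel order is an assumption.
PAIRS = [
    ("Fp1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1"),   # left temporal chain
    ("Fp1", "F3"), ("F3", "C3"), ("C3", "P3"), ("P3", "O1"),   # left parasagittal chain
    ("Fp2", "F4"), ("F4", "C4"), ("C4", "P4"), ("P4", "O2"),   # right parasagittal chain
    ("Fp2", "F8"), ("F8", "T4"), ("T4", "T6"), ("T6", "O2"),   # right temporal chain
]


def preprocess(raw, fs=200.0, order=4):
    """raw: dict mapping electrode name -> 1-D array. Returns a (16, T) array."""
    montage = np.stack([raw[a] - raw[b] for a, b in PAIRS])    # 2.1 bipolar differences
    sos = butter(order, [0.5, 20.0], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, montage, axis=-1)              # 2.2 zero-phase band-pass
    return np.clip(filtered, -1024.0, 1024.0)                  # 2.3 clip


# Random stand-in data, only to show the shapes.
rng = np.random.default_rng(0)
electrodes = sorted({name for pair in PAIRS for name in pair})
raw = {name: rng.normal(scale=2000.0, size=10000) for name in electrodes}
eeg = preprocess(raw)
print(eeg.shape, float(np.abs(eeg).max()))
```

The sketch follows the order of the list (montage, filter, clip). The write-up does not say whether clipping happens before or after filtering in the original pipeline.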

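3. Training

3.1 Stage 1: 15 epochs, loss weight voters_num/20, AdamW with learning rate 0.001, cosine learning rate schedule.
3.2 Stage 2: 5 epochs, loss weight 1, only samples with voters_num≥6, AdamW with learning rate 0.0001, cosine schedule.
3.3 Use eeg_label_offset_seconds. I randomly choose one offset for each eegid, and average the targets by eegid in each training iteration.
3.4 Augmentation: mirror the EEG, i.e. flip the left and right brain data.
3.5 10 folds; then move 1000 samples from the validation set to the training set, keeping 709 samples for validation. Validation uses vote_num≥6.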
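The two-stage recipe in 3.1 and 3.2 can be summarised in a few lines. This is a minimal sketch, not the original training loop: the function names are made up, expressing the stage 2 filter as a zero weight is just one way to write it, and decaying the learning rate all the way to 0 is an assumption.

```python
import math


def sample_weight(stage: int, voters_num: int) -> float:
    """Per-sample loss weight for the two training stages."""
    if stage == 1:
        return voters_num / 20                 # more annotators -> larger weight
    return 1.0 if voters_num >= 6 else 0.0     # stage 2: only samples with >= 6 voters


def cosine_lr(epoch: float, total_epochs: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr (min_lr=0 is an assumption)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


STAGES = {1: dict(epochs=15, base_lr=1e-3), 2: dict(epochs=5, base_lr=1e-4)}

for stage, cfg in STAGES.items():
    lrs = [cosine_lr(e, cfg["epochs"], cfg["base_lr"]) for e in range(cfg["epochs"])]
    print(f"stage {stage}: first lr={lrs[0]:.2e}, last lr={lrs[-1]:.2e}")
```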

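4. Ensemble

I think combining these models is enough to reach the current score. However, my score comes from a 6-model ensemble, and the improvement is small (private LB 0.28 → 0.27).

The final ensemble includes:

2 spectrogram models (x3d, efficientnetb5), both with mne filtering
3 raw EEG models (efficientnetb5 with mne filtering, efficientnetb5 with butter filtering, 1 hgnetb5 with mne filtering)
1 EEG-spectrogram hybrid model, with butter filtering

The weights are [0.1, 0.1, 0.2, 0.2, 0.2, 0.2].

Note: with-mne.filter means the filtering is done with the mne library, and butter filtering means it is done with scipy.signal. The only purpose is to add diversity.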
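A weighted blend with these weights can be sketched as below. Two things are assumptions here: that the weights map to the models in the order they are listed, and that the blend averages class probabilities (rather than logits). The predictions are made up for illustration.

```python
import math

# Weights as given in the write-up; the mapping to models assumes the listed order.
WEIGHTS = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]


def ensemble(per_model_probs, weights=WEIGHTS):
    """Weighted average of per-model class probabilities for one sample."""
    if len(per_model_probs) != len(weights):
        raise ValueError("need exactly one prediction per weight")
    n_classes = len(per_model_probs[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, per_model_probs))
        for k in range(n_classes)
    ]


# Made-up predictions over 6 classes from 6 models.
uniform = [1 / 6] * 6
confident = [0.7, 0.1, 0.05, 0.05, 0.05, 0.05]
blended = ensemble([uniform, uniform, confident, confident, confident, confident])
print([round(p, 4) for p in blended])
```

Because the weights sum to 1 and each input is a probability vector, the blended output is again a valid probability vector.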

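Some thoughts

I think the raw EEG matters more in this task. Feeding raw EEG data to a 2D vision model is somewhat like how a human reads an EEG trace: look along both the time and the channel dimensions!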
