8th Place Solution for the Open Problems – Single-Cell Perturbations Competition

Author: aper (Kaggle Master)
Published: December 6, 2023

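Congratulations to all the winners, and thanks to Kaggle for organizing such an interesting competition. Thanks also to the other Kagglers who shared their ideas and notebooks.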

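Context

Overview of the approach

Model

I designed and tuned a simple neural network through a series of experiments, each aimed at lowering the CV and LB scores. The model inputs were built with feature augmentation and SMILES fragments.

The model architecture is as follows: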

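import torch
import torch.nn as nn


class SingleCellModel(nn.Module):
    def __init__(self, dim_size, mul_ratio, labels=18211):
        super().__init__()
        # Each input is a vector of length `labels`, projected down to dim_size.
        self.ce_layer = nn.Linear(labels, dim_size)
        self.sm_layer = nn.Linear(labels, dim_size)

        # The two projections are concatenated, so the hidden width doubles.
        hidden_size = dim_size * 2
        self.fc1 = nn.Linear(hidden_size, hidden_size * mul_ratio)
        self.fc2 = nn.Linear(hidden_size * mul_ratio, hidden_size)

        self.act = nn.GELU()
        self.out = nn.Linear(hidden_size, labels)

    def forward(self, cell_type, sm_name):
        # Encode the cell-type and compound feature vectors separately.
        x1 = self.act(self.ce_layer(cell_type))
        x2 = self.act(self.sm_layer(sm_name))

        x = torch.cat([x1, x2], dim=-1)

        # Expand-then-contract MLP block.
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))

        # Linear head: one prediction per target.
        x = self.out(x)
        return x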

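Data augmentation

Training relied on a deliberate data augmentation scheme. At first I used only the mean for cell_type and sm_name as input features. I then explored other statistics (median, minimum, maximum, and quantiles) and found that the median significantly improved the LB score.

I also experimented with mixing statistics: one variant randomly picks the mean or the median with 50% probability each, and another randomly picks among the mean, median, Q1, and Q2 with 25% probability each.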
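The random mixing described above can be sketched as follows. This is a minimal illustration, not the author's code: the helpers `group_stats` and `sample_features`, the toy data, and the choice to group by cell type only are all assumptions. Only the mean/median mix is shown, because the post does not define Q1 and Q2 precisely; adding two quantile entries to `stat_fns` would give a four-way mix with 25% probability each.

```python
import numpy as np

# Hypothetical helper: per-group statistics of the target matrix.
def group_stats(targets, groups, stat_fns):
    """Return {stat_name: {group: vector}} computed over rows sharing a group."""
    table = {name: {} for name in stat_fns}
    for g in np.unique(groups):
        rows = targets[groups == g]
        for name, fn in stat_fns.items():
            table[name][g] = fn(rows, axis=0)
    return table

# Hypothetical helper: draw one statistic per row, uniformly at random.
def sample_features(table, groups, rng):
    """For every row, pick a statistic at random and look up its group vector."""
    names = sorted(table)
    picks = rng.integers(len(names), size=len(groups))
    return np.stack([table[names[p]][g] for p, g in zip(picks, groups)])

# Toy data (made up): 6 samples, 4 targets, 2 cell types.
rng = np.random.default_rng(0)
targets = rng.normal(size=(6, 4))
cell_type = np.array(["B", "B", "B", "T", "T", "T"])

# Two statistics, so each is chosen with probability 0.5.
stat_fns = {"mean": np.mean, "median": np.median}
table = group_stats(targets, cell_type, stat_fns)
features = sample_features(table, cell_type, rng)  # shape (6, 4)
```

Re-drawing `features` every epoch would make the network see a different statistic for the same sample over time, which is one plausible way to use this as augmentation.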

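Validation strategy

I used K-fold cross-validation stratified by cell_type and tried 5, 10, 15, and 20 folds. The 10-fold and 15-fold splits gave better LB scores.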
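A stratified split of this kind can be sketched in plain NumPy. The function `stratified_folds`, the toy cell-type labels, and the seed are assumptions made for illustration; the post does not say which implementation was used.

```python
import numpy as np

def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample a fold id so every fold gets a near-equal share of each label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for label in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == label))
        # Deal the shuffled indices of this label round-robin across folds.
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of

# Toy labels (made up): 4 cell types with 30 samples each.
cell_type = np.repeat(["B", "NK", "T4", "Treg"], 30)
fold_of = stratified_folds(cell_type, n_splits=10)

fold_sizes = []
for k in range(10):
    val_idx = np.flatnonzero(fold_of == k)
    train_idx = np.flatnonzero(fold_of != k)
    # A model would be trained on train_idx and scored on val_idx here.
    fold_sizes.append(len(val_idx))
```

In practice, scikit-learn's `StratifiedKFold(n_splits=10, shuffle=True, random_state=...)` provides the same kind of split without hand-written code.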

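Submission details

LB scores of the models trained with each augmentation:

Median / 0.549
Mean and median mix / 0.549
Mean, median, Q1, and Q2 mix / 0.551

The final submission was a weighted average of these three models (weights 0.35/0.35/0.3), which improved the LB score to 0.547. Because the models score similarly but produce different prediction distributions, the ensemble generalizes better than any single model.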
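The weighted average can be sketched as below. The function `weighted_blend` and the constant toy predictions are hypothetical, and the order in which the weights map to the three models is an assumption (the post lists only 0.35/0.35/0.3).

```python
import numpy as np

def weighted_blend(preds, weights):
    """Blend equally shaped prediction arrays with weights that sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if len(preds) != len(weights):
        raise ValueError("need exactly one weight per prediction array")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    # Stack to (n_models, n_rows, n_targets) and contract the model axis.
    return np.tensordot(weights, np.stack(preds), axes=1)

# Toy predictions (made up): 3 models, 2 rows, 4 targets.
pred_median = np.full((2, 4), 1.0)
pred_mean_median = np.full((2, 4), 2.0)
pred_four_way = np.full((2, 4), 3.0)

blended = weighted_blend(
    [pred_median, pred_mean_median, pred_four_way],
    weights=[0.35, 0.35, 0.30],
)
```

Checking that the weights sum to 1 keeps the blended predictions on the same scale as the individual models.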

What didn't work

Pseudo-labeling
Dropout
Normalization
Data filtering (control group, etc.)

References

Learning single-cell perturbation responses using neural optimal transport (Nature Methods paper)
OP2 - Feature Augment & Fragments of SMILES (Kaggle notebook)