13th place brief writeup (public 107th)

Updated 2020/03/19

Thanks to the hosts and to everyone behind the helpful discussions. Special thanks to @hengck23 for his generous sharing.

I was lucky to get a gold medal. I had expected the public/private split to be random, so I did nothing special to handle unseen graphemes.

It seems that post-processing tricks targeting macro-averaged recall were really important.

Model

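Modified SE-ResNeXt50. Each stride-2 convolution is replaced with a stride-1 convolution + maxblur-pool, mostly following @hengck23's suggestions.
Stage 1: train with stem-conv (stride=2); stage 2: replace the stem-conv with 3 convs (stride=1).
Optimizer: Adam + ReduceLROnPlateau, about 200 epochs.
5-fold ensemble.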
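The "max + blur" downsampling mentioned above can be sketched on a single-channel feature map as follows. This is a loop-based NumPy illustration, not the author's network code: the function name `max_blur_pool`, the 2x2 max window, and the 3x3 binomial blur kernel are all assumptions chosen for illustration.

```python
import numpy as np

def max_blur_pool(x: np.ndarray) -> np.ndarray:
    """Anti-aliased 2x downsampling of a 2-D feature map (hypothetical helper).

    Step 1 takes a dense 2x2 max with stride 1.
    Step 2 applies a 3x3 binomial blur evaluated at stride-2 positions.
    """
    h, w = x.shape
    # Step 1: 2x2 max with stride 1; edge-pad bottom/right so the size is kept.
    p = np.pad(x, ((0, 1), (0, 1)), mode="edge")
    dense_max = np.maximum.reduce([p[:h, :w], p[1:, :w], p[:h, 1:], p[1:, 1:]])

    # Step 2: blur with outer([1, 2, 1], [1, 2, 1]) / 16, sampled every 2 pixels.
    k = np.array([1.0, 2.0, 1.0])
    kernel = np.outer(k, k) / 16.0
    q = np.pad(dense_max, 1, mode="reflect")
    out = np.zeros(((h + 1) // 2, (w + 1) // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(q[2 * i:2 * i + 3, 2 * j:2 * j + 3] * kernel)
    return out

feature_map = np.random.default_rng(0).normal(size=(8, 8))
print(max_blur_pool(feature_map).shape)
```

The point of the design is that the stride-2 subsampling only happens after a low-pass filter, which makes the downsampled output less sensitive to one-pixel shifts of the input.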

Loss

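Cross-entropy for grapheme root, vowel diacritic, and consonant diacritic.
- 2 of the 5 folds were fine-tuned with OHEM loss.
Multi-label binary cross-entropy, used to classify whether each decomposed grapheme component is present.
- A grapheme (or its root, vowel, or consonant) can be decomposed into parts as shown in the figure. There are 61 possible parts (excluding the '0' part). Below is the exact code I used to generate the multi-hot labels.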

import numpy as np
import pandas as pd

# Load data (C.datadir is a path-like config value defined elsewhere)
train = pd.read_csv(C.datadir/'train.csv')
train_labels = train[['grapheme_root', 'vowel_diacritic', 'consonant_diacritic']].values.astype(np.int64)

# Component labels: every distinct character that appears in class_map.csv
parts_pre = pd.read_csv(C.datadir/'class_map.csv').component
parts = np.sort(np.unique(np.concatenate([list(e) for e in parts_pre])))
parts = parts[parts != '0']  # '0' carries no meaning
print("parts:", parts)  # shows the 61 parts

# Multi-hot label: 1 if the part occurs in the grapheme string, else 0
train_labels_comp = []
for grapheme in train['grapheme'].values:
    train_labels_comp.append([part in list(grapheme) for part in parts])
train_labels_comp = np.array(train_labels_comp).astype(np.int64)
print("train_labels_comp.shape", train_labels_comp.shape)
if True:  # debug
    print("train_labels_comp", train_labels_comp[0].tolist())
    print("train_labels_comp", train_labels_comp[1].tolist())
    print("train_labels_comp", train_labels_comp[2].tolist())
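To make the loss terms in this section concrete, here is a minimal NumPy sketch of an OHEM-style cross-entropy and a multi-label binary cross-entropy on logits. It is not the author's training code: the function names, the `keep_ratio` parameter, and its default of 0.5 are hypothetical, and the writeup does not say how the individual losses were weighted or how many hard examples were kept.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy_per_sample(logits, targets):
    # One loss value per sample; targets are integer class indices.
    return -log_softmax(logits)[np.arange(len(targets)), targets]

def ohem_cross_entropy(logits, targets, keep_ratio=0.5):
    # Online hard example mining: average only the largest per-sample losses.
    # keep_ratio is an assumed hyperparameter, not a value from the writeup.
    losses = cross_entropy_per_sample(logits, targets)
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    return np.sort(losses)[-n_keep:].mean()

def multilabel_bce(logits, multi_hot):
    # Stable BCE-with-logits: max(x, 0) - x * y + log(1 + exp(-|x|)).
    x, y = logits, multi_hot
    return np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x))))

rng = np.random.default_rng(0)
root_logits = rng.normal(size=(8, 168))      # 168 grapheme-root classes
root_targets = rng.integers(0, 168, size=8)
comp_logits = rng.normal(size=(8, 61))       # 61 component parts
comp_targets = rng.integers(0, 2, size=(8, 61))
print(ohem_cross_entropy(root_logits, root_targets))
print(multilabel_bce(comp_logits, comp_targets))
```

With `keep_ratio=1.0` the OHEM variant reduces to the ordinary mean cross-entropy, so it can be switched on only for a fine-tuning phase.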

Preprocessing and augmentation

100% the same as @hengck23's suggestions.

Preprocessing: resize only, using cv2.INTER_AREA.
Augmentation: OneOf(basic augmentations) + DropBlock.
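As a rough illustration of DropBlock used as an image-level augmentation, the sketch below zeroes out random square blocks of a grayscale image. The function name, `block_size=16`, and `drop_prob=0.1` are hypothetical values, not the author's settings, and no rescaling of the kept pixels is applied here.

```python
import numpy as np

def drop_block(image, block_size=16, drop_prob=0.1, rng=None):
    """Zero out random square blocks (hypothetical helper).

    Returns the augmented image and the boolean keep-mask.
    Assumes block_size <= min(height, width).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Block top-left corners are sampled only where a full block fits.
    valid_h, valid_w = h - block_size + 1, w - block_size + 1
    # Seed rate chosen so that roughly drop_prob of the pixels get dropped.
    gamma = drop_prob * (h * w) / (block_size ** 2) / (valid_h * valid_w)
    seeds = rng.random((valid_h, valid_w)) < gamma

    keep = np.ones((h, w), dtype=bool)
    for top, left in zip(*np.nonzero(seeds)):
        keep[top:top + block_size, left:left + block_size] = False

    out = image.copy()
    out[~keep] = 0
    return out, keep

image = np.random.default_rng(0).integers(1, 256, size=(128, 128)).astype(np.uint8)
augmented, keep = drop_block(image, rng=np.random.default_rng(1))
print("fraction of pixels kept:", keep.mean())
```

Dropping contiguous blocks rather than isolated pixels removes whole strokes of a grapheme, which forces the model to rely on the remaining parts.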

Post-processing

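The predictions for the decomposed grapheme components are not used.
Expected recall is maximized by argmax( softmax(logits) / np.power(class_count, 1) ), where the exponent is the coefficient n discussed below. Note that the final submission used a coefficient of 1.15 for the grapheme root (this made no difference to the private score).
- There may be a better threshold optimization, because the confusion matrix still looks asymmetric after the adjustment above. I gave up on improving it further because of the small sample size.
- The figure below plots macro-averaged recall vs. coefficient n for my model (not the final submission model). You can confirm that the peak is around n = 1. You can also see that the improvement is not that large compared with the private leaderboard.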
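The class-count adjustment can be sketched in a few lines of NumPy. The reasoning is that a correct prediction for class c adds 1 / count(c) to that class's recall, so if the softmax outputs are treated as calibrated probabilities, picking the class with the largest p(c) / count(c) maximizes expected macro recall. The function names are hypothetical, and taking `class_count` from the training-set label frequencies is an assumption; the writeup does not say where the counts come from.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def recall_adjusted_argmax(logits, class_count, n=1.0):
    # n = 0 gives the plain argmax; n = 1 divides by the raw class counts.
    probs = softmax(logits)
    return np.argmax(probs / np.power(class_count, n), axis=1)

# Toy example: class 0 is frequent (900 samples), class 1 is rare (100 samples).
logits = np.array([[2.0, 1.0]])
class_count = np.array([900.0, 100.0])
print(recall_adjusted_argmax(logits, class_count, n=0.0))  # plain argmax
print(recall_adjusted_argmax(logits, class_count, n=1.0))  # favors the rare class
```

In the toy example the plain argmax picks the frequent class, while the adjusted rule switches to the rare class because its probability is large relative to its count. Sweeping `n` on held-out data is one way to reproduce a score-vs-coefficient curve like the one described above.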