9th place solution : 6 Transformers + 2 LightGBMs

Authors: nyanp, tito | Competition rank: 9th place

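First of all, thank you to the organizers for hosting this great competition! It was one of the toughest competitions I have taken part in, but the effort was well worth it.

Our team's (tito @its7171 and nyanp @nyanpn) solution is an ensemble of the following models: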

tito's Transformer: LB 0.813
nyanp's SAINT+ Transformer: LB 0.808
nyanp's LightGBM: LB 0.806

A simple blend of these models scored LB 0.814 / Private 0.816.
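
As a rough illustration of what a simple blend can look like, the sketch below averages the per-row probabilities of several models. The function name blend, the equal default weights, and the sample numbers are assumptions for illustration only; the weights actually used are not stated in this write-up.

```python
from typing import List, Optional, Sequence


def blend(predictions: Sequence[Sequence[float]],
          weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted average of per-row probabilities from several models."""
    n_models = len(predictions)
    if weights is None:
        # Assumption: equal weights when none are given.
        weights = [1.0] * n_models
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities for three rows, one list per model.
transformer = [0.9, 0.2, 0.6]
saint = [0.8, 0.3, 0.5]
lgbm = [0.7, 0.1, 0.7]
blended = blend([transformer, saint, lgbm])
```

Because AUC depends only on the ranking of predictions, some teams blend ranks instead of raw probabilities; the sketch above shows the plain averaging variant.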

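Pipeline

We converted the entire train.csv into hdf5 format and loaded only the user_ids that appear during inference into numpy arrays (saving 97% of RAM compared with holding all the training data in RAM). We estimate the overhead of the hdf I/O at about 45 minutes. Accepting this overhead is what allowed us to combine tito's large Transformer with nyanp's feature engineering pipeline.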
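
The load-on-demand idea can be sketched as a small cache that reads a user's history only the first time that user shows up in a test batch. This is a minimal sketch, not our actual pipeline: the class name UserHistoryCache and the loader callback are hypothetical, and a plain dict stands in for the hdf5 file so the example runs without extra dependencies.

```python
from typing import Callable, Dict, Iterable, List


class UserHistoryCache:
    """Keeps in memory only the users that have appeared so far."""

    def __init__(self, loader: Callable[[int], List[int]]):
        # Hypothetical loader: in a real pipeline this would read one
        # user's rows from the hdf5 file.
        self._loader = loader
        self._cache: Dict[int, List[int]] = {}
        self.disk_reads = 0

    def get_batch(self, user_ids: Iterable[int]) -> Dict[int, List[int]]:
        """Return the history of every user in a batch, loading misses once."""
        unique_ids = set(user_ids)
        for uid in unique_ids:
            if uid not in self._cache:
                self._cache[uid] = self._loader(uid)
                self.disk_reads += 1
        return {uid: self._cache[uid] for uid in unique_ids}


# Stand-in for the on-disk store: user_id -> list of past content_ids.
on_disk = {1: [10, 11], 2: [20], 3: [30, 31, 32]}
cache = UserHistoryCache(lambda uid: list(on_disk.get(uid, [])))

batch = cache.get_batch([1, 3, 1])  # user 2 is never loaded
```

The trade-off is the one described above: memory stays proportional to the users seen at inference time, at the cost of extra I/O on every cache miss.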

Pipeline Diagram

Transformer (tito, LB 0.814)

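This is an encoder-only Transformer model, based on @claverru's nice kernel.

Summary

Trained and predicted only on answered_correctly for the last question in the sequence.
All features are concatenated (only the positional encoding is added).
Lectures are used, in timestamp order.
Window size: 300-600
Batch size: 1000
drop_out: 0
n_encoder_layers: 3-5
Data augmentation: content_ids are replaced with a dummy ID at a certain rate
Only one question is kept in the last task to avoid leakage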
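
Two of the items above, the fixed window and the dummy-ID augmentation, can be sketched in a few lines. This is only an illustration under stated assumptions: DUMMY_ID = 0, the helper names, and the rate used in the demo are all hypothetical, since the actual dummy ID and replacement rate are not given here.

```python
import random
from typing import List

DUMMY_ID = 0  # assumption: an id reserved for "unknown content"


def last_window(sequence: List[int], window_size: int) -> List[int]:
    """Keep only the most recent window_size items (window_size must be > 0)."""
    return sequence[-window_size:]


def augment_content_ids(content_ids: List[int], rate: float,
                        rng: random.Random) -> List[int]:
    """Replace each content_id with DUMMY_ID with probability rate."""
    return [DUMMY_ID if rng.random() < rate else cid for cid in content_ids]


rng = random.Random(42)
history = [101, 102, 103, 104, 105]
window = last_window(history, 3)
augmented = augment_content_ids(window, rate=0.1, rng=rng)
```

The augmentation is meant for training only; at inference time the real content_ids would be passed through unchanged.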

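Features

Whether each feature is fed as an embedding or as a dense input was decided by CV.

(embedding) content id
(embedding) part id
(embedding) same task question size
(dense) answered_correctly
(dense) had_explanation
(dense) elapsed time
(dense) lag time
(dense) diff of timestamp from the last question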
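
As one example of how such a feature might be derived, the sketch below computes "same task question size" for a single user's sequence, read here as the number of questions that share the row's task_container_id. That reading, the function name, and the assumption that a task_container_id is not reused within one user's sequence are mine, not confirmed by the list above.

```python
from collections import Counter
from typing import List


def same_task_question_size(task_container_ids: List[int]) -> List[int]:
    """For each row, count the rows sharing its task_container_id.

    Assumes the ids belong to a single user and are not reused
    for different bundles within that user's sequence.
    """
    counts = Counter(task_container_ids)
    return [counts[t] for t in task_container_ids]


# One question, then a bundle of three, then a bundle of two.
sizes = same_task_question_size([0, 1, 1, 1, 2, 2])
```

Since the resulting values are small integers, feeding them through an embedding layer rather than as a dense value is a natural option, which matches the (embedding) tag in the list.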

Combined model

To avoid the overhead of calling model.predict() multiple times when ensembling, I built a combo model that connects four models.

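import tensorflow as tf

# One shared input; each sub-model reads its own trailing window of the sequence.
inputs = tf.keras.Input(shape=(input_shape, n_features))
out1 = model1(inputs[:, -window_size1:, :])
out2 = model2(inputs[:, -window_size2:, :])
out3 = model3(inputs[:, -window_size3:, :])
out4 = model4(inputs[:, -window_size4:, :])
# A single predict() call on combo_model returns all four outputs.
combo_model = tf.keras.Model(inputs, [out1, out2, out3, out4])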

SAINT+ (nyanp, LB 0.808)

d_model = 256
window_size = 200
n_layers = 3
attention dropout = 0.03
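question, part, and lag are embedded in the encoder
response, elapsed time, and has_explanation are embedded in the decoder

To prevent leakage, in addition to the upper triangular attention mask, I also masked the loss for every question except the first one in each task_container_id. During training, questions with the same task_container_id are shuffled within each batch, and the loss weights are adjusted to reduce the impact of the mask. This mask improved LB by 0.0003.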

(Note: I think indirect leakage still remains, but I spent 80% of my time on LightGBM
