2nd Place | Single LightGBM and Target Encoding

649. Playground Series - Season 5, Episode 4 | playground-series-s5e4

Start: 2025-04-01 End: 2025-04-30 Audio and video processing, data algorithm competition

Author: Farukcan Saglam (greysky)
Published: 2025-05-01
Competition rank: 2nd

My solution consists of a single LightGBM model and target encoding of the features. I have already shared 90% of the work publicly.

I am not good at writing solution write-ups, so I am just sharing the data generation notebook.

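The final processed training dataset has 1552 features and 794868 rows. Fortunately, with careful data type conversion and by avoiding unnecessary copy operations, I was able to train the model on a Kaggle CPU. Training takes about 4 hours. I trained on the full data with 5 different seeds.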
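To see why dtype handling matters at this size: 794868 rows by 1552 features is about 1.23 billion values, which is roughly 9.9 GB as 64-bit floats and roughly 4.9 GB as 32-bit floats. The sketch below shows one way such a conversion could be done column by column, so that no second full copy of the frame is held at once. The helper name `downcast_in_place` and the choice of `float32` are assumptions for illustration; the write-up does not say which dtypes the author used.

```python
import numpy as np
import pandas as pd


def downcast_in_place(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns one at a time (hypothetical helper).

    Floats become float32; integers become the smallest integer type
    that can hold their values. Other columns are left unchanged.
    """
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = df[col].astype(np.float32)
        elif pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
    return df


# Tiny demonstration frame; the real data is far larger.
demo = pd.DataFrame({"x": [1.5, 2.5, 3.5], "n": [1, 2, 3]})
before = demo.memory_usage(deep=True).sum()
downcast_in_place(demo)
after = demo.memory_usage(deep=True).sum()
```

Working one column at a time keeps peak memory close to the size of the frame itself, which is the point of avoiding whole-frame copies.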

LightGBM hyperparameters

objective = 'l2'
metric = 'rmse'
n_iter = 12000
max_depth = 15
learning_rate = 0.008
num_leaves = 480
colsample_bytree = 0.25
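As a minimal sketch, the settings above can be collected into one parameter dictionary per seed. The seed values and the simple averaging of per-seed predictions are assumptions; the write-up only states that 5 different seeds were used. With `max_depth = 15` a tree could in principle hold up to 2^15 leaves, so `num_leaves = 480` is the limit that actually constrains tree size.

```python
# Hyperparameters exactly as listed in the write-up.
BASE_PARAMS = {
    "objective": "l2",
    "metric": "rmse",
    "n_iter": 12000,
    "max_depth": 15,
    "learning_rate": 0.008,
    "num_leaves": 480,
    "colsample_bytree": 0.25,
}

# Hypothetical seed values; the author's seeds are not given.
SEEDS = [0, 1, 2, 3, 4]


def params_for_seed(seed: int) -> dict:
    """Return a copy of the base parameters with the seed set."""
    return {**BASE_PARAMS, "seed": seed}


def average_predictions(per_seed_predictions):
    """Element-wise mean over the prediction lists of all seeds
    (assumed blending rule, not confirmed by the write-up)."""
    n = len(per_seed_predictions)
    return [sum(values) / n for values in zip(*per_seed_predictions)]


all_params = [params_for_seed(s) for s in SEEDS]

# With LightGBM installed, each dictionary could be passed to training,
# for example: lgb.train(params, lgb.Dataset(X, label=y))
```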

New features

Mul_Hpp_Elm = Host_Popularity_percentage * round(Episode_Length_minutes)
Mul_Gpp_Elm = Guest_Popularity_percentage * round(Episode_Length_minutes)
Rounded_Episode_Length_minutes = round(Episode_Length_minutes) // 2
Rounded_Host_Popularity_percentage = round(Host_Popularity_percentage) // 2
Rounded_Guest_Popularity_percentage = round(Guest_Popularity_percentage) // 2
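The five formulas above translate almost directly into pandas. The following is a sketch under the assumption that the data sits in a DataFrame with the competition's column names; the function name `add_new_features` is hypothetical. Columns are added in place rather than on a copy, in the spirit of the memory note earlier in the write-up.

```python
import pandas as pd


def add_new_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the two product features and the three coarse bins in place."""
    # Series.round() rounds halves to the nearest even integer,
    # the same rule as Python 3's built-in round().
    rounded_length = df["Episode_Length_minutes"].round()
    df["Mul_Hpp_Elm"] = df["Host_Popularity_percentage"] * rounded_length
    df["Mul_Gpp_Elm"] = df["Guest_Popularity_percentage"] * rounded_length

    # round(x) // 2 groups values into bins two units wide.
    for col in (
        "Episode_Length_minutes",
        "Host_Popularity_percentage",
        "Guest_Popularity_percentage",
    ):
        df[f"Rounded_{col}"] = df[col].round() // 2
    return df


example = pd.DataFrame(
    {
        "Episode_Length_minutes": [61.4],
        "Host_Popularity_percentage": [50.6],
        "Guest_Popularity_percentage": [20.2],
    }
)
add_new_features(example)
```

Missing values pass through as NaN in every derived column, so no separate imputation step is implied here.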

Target-encoded features | pair_size = [1, 2, 3, 4, 5, 6]

Podcast_Name
Episode_Length_minutes
Episode_Num
Episode_Sentiment
Host_Popularity_percentage
Guest_Popularity_percentage
Number_of_Ads
Publication_Day
Publication_Time
Rounded_Episode_Length_minutes
Rounded_Host_Popularity_percentage
Rounded_Guest_Popularity_percentage
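One plausible reading of `pair_size` is that the columns above are combined in groups of 1 to 6 and each group is target encoded. The sketch below enumerates such combinations with `itertools.combinations` and encodes one group with an out-of-fold mean, so that a row's encoding never uses its own target. The fold scheme, the fallback to the fold's training mean for unseen keys, and the function name `oof_target_encode` are all assumptions; the write-up does not describe the encoding procedure or say which combinations were kept.

```python
import itertools

import numpy as np
import pandas as pd


def oof_target_encode(df, cols, target, n_splits=5, seed=0):
    """Out-of-fold mean target encoding for the column group `cols`."""
    cols = list(cols)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits  # random fold id per row
    encoded = np.full(len(df), np.nan)
    for k in range(n_splits):
        train, valid = df[fold != k], df[fold == k]
        means = train.groupby(cols)[target].mean().rename("te").reset_index()
        # A left merge keeps the row order of `valid`.
        merged = valid[cols].merge(means, on=cols, how="left")
        # Keys unseen in the training folds fall back to the training mean.
        encoded[fold == k] = merged["te"].fillna(train[target].mean()).to_numpy()
    return encoded


def column_groups(columns, pair_sizes):
    """Yield every combination of `columns` for each requested size."""
    for size in pair_sizes:
        yield from itertools.combinations(columns, size)


toy = pd.DataFrame({"A": ["a", "a", "b", "b"], "y": [1.0, 3.0, 5.0, 7.0]})
# With as many folds as rows this becomes leave-one-out encoding.
toy_te = oof_target_encode(toy, ["A"], "y", n_splits=4)
groups = list(column_groups(["A", "B", "C"], [1, 2]))
```

In the toy frame each row receives the target of the other row that shares its key, which makes the leakage protection easy to check by hand.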

Descriptive statistics of the target encodings (column-wise)

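Mean, standard deviation, minimum, maximum (global aggregation)
Mean, standard deviation, minimum, maximum (aggregated by pair_size)
Mean, standard deviation, minimum, maximum (aggregated by source column)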
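The write-up does not spell out how these statistics are computed. One interpretation, sketched below, is that for every row the mean, standard deviation, minimum and maximum are taken across target-encoding columns: once over all of them, once per group size, and once per source column that takes part in the encoding. The `sources` mapping from each encoded column to its source columns, the output column names, and both function names are hypothetical.

```python
import pandas as pd


def row_stats(te: pd.DataFrame, cols, prefix: str) -> pd.DataFrame:
    """Row-wise mean/std/min/max over the selected encoding columns."""
    block = te[cols]
    return pd.DataFrame(
        {
            f"{prefix}_mean": block.mean(axis=1),
            # Sample std (ddof=1): NaN when only one column is selected.
            f"{prefix}_std": block.std(axis=1),
            f"{prefix}_min": block.min(axis=1),
            f"{prefix}_max": block.max(axis=1),
        }
    )


def describe_target_encodings(te: pd.DataFrame, sources: dict) -> pd.DataFrame:
    """Aggregate encodings globally, by group size, and by source column."""
    parts = [row_stats(te, list(te.columns), "TE_all")]
    for size in sorted({len(s) for s in sources.values()}):
        cols = [c for c, s in sources.items() if len(s) == size]
        parts.append(row_stats(te, cols, f"TE_size{size}"))
    for src in sorted({name for s in sources.values() for name in s}):
        cols = [c for c, s in sources.items() if src in s]
        parts.append(row_stats(te, cols, f"TE_src_{src}"))
    return pd.concat(parts, axis=1)


te_demo = pd.DataFrame(
    {"te_A": [1.0, 2.0], "te_B": [3.0, 6.0], "te_A_B": [5.0, 4.0]}
)
sources_demo = {"te_A": ("A",), "te_B": ("B",), "te_A_B": ("A", "B")}
stats_demo = describe_target_encodings(te_demo, sources_demo)
```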

Other solutions from the same competition

1st Place - RAPIDS cuML Stack - 3 Levels!

3rd Place - Target Encoding and 3 Levels

Rank 4 approach - lots of features, lots of simple models and a ridge blend!

5th place: 100 OOFs, laziness, and a blunder or two

6th Place: Select Feature Combinations based on RMSE Scores