11th Place Solution (LightGBM with meta features)

American Express - Default Prediction | amex-default-prediction

Start: 2022-05-25 End: 2022-08-24 Credit risk modeling competition

Author: Shaorooon
Originally published: 2022-08-25

※ This post was originally titled "12th Place Solution". After a ranking correction, it is now 11th place.

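Thank you to everyone who took part in the competition and to everyone involved in organizing it. I learned a lot from this competition.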

Scores and Results

My best submission
- Local CV: 0.79922
- Public LB: 0.80088
- Private LB: 0.80852
Final result
- Public: 6th → Private: 12th (11th after the ranking correction)

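Feature Engineering

Base features came from a public notebook
- AMEX Features - The Best of Both Worlds
Removed some features
- Features that were rounded up
- Duplicate features (e.g. groupby counts of all categorical features)
Added some features
- Features aggregated by time period
  - min, max, mean, std over the last 3 months and the last 6 months
- Ratio and difference features across the time series
  - e.g. last - latest_3month_mean, last_3month_mean / last_6month_mean
- Null count features
- Date features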
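- Meta features (the most important features!)
  - How they were built:
    1. Assign Train_labels to every row of the training data (before aggregating by cid, the customer ID) and train a model.
    2. Make OOF (out-of-fold) predictions on the training data.
    3. Aggregate the OOF predictions by time period.
  - With these features, my single model's Public LB score improved from 0.799 to 0.800.
  - Based on the approach in the DSB2019 2nd place solution.
    - DSB2019 2nd Place Solution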

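Validation Strategy

Used StratifiedKFold (k=5).
I considered the Public LB more important than local CV for estimating Private performance.
- The train set and the Public set are roughly the same size.
- In time, the Public data is closer to the Private data than the training data is.
- Adversarial validation also showed train to be farther from Private than Public is (a higher AUC means the two sets are easier to tell apart).
  - train/private: AUC 0.99
  - public/private: AUC 0.82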
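While focusing on the Public LB, I also checked local CV to decide whether a change was an improvement.
- I also checked local logloss, because amex_metric was not stable enough.
- Small changes were hard to detect on the Public LB.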
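
A rough sketch of reporting both metrics per fold is shown below. The `amex_metric` function follows the competition metric as it is commonly reimplemented in NumPy (the mean of a normalized weighted Gini and the default rate captured in the top 4%, with negatives weighted 20x); treat it as an approximation and check it against the official definition. The helper `cv_report`, the synthetic data, and the logistic regression stand-in model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold


def amex_metric(y_true, y_pred):
    """Approximate competition metric: 0.5 * (normalized Gini + top-4% capture)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    def weighted_gini(order_by):
        order = np.argsort(-order_by, kind="stable")  # descending
        t = y_true[order]
        w = np.where(t == 0, 20.0, 1.0)  # negatives count 20x
        random_share = np.cumsum(w / w.sum())
        lorentz = np.cumsum(t * w) / np.sum(t * w)
        return np.sum((lorentz - random_share) * w)

    # Share of all positives found in the top 4% (by weight) of predictions.
    order = np.argsort(-y_pred, kind="stable")
    t = y_true[order]
    w = np.where(t == 0, 20.0, 1.0)
    in_top = np.cumsum(w) <= int(0.04 * w.sum())
    captured = t[in_top].sum() / t.sum()

    gini = weighted_gini(y_pred) / weighted_gini(y_true)
    return 0.5 * (gini + captured)


def cv_report(X, y, n_splits=5, seed=42):
    """Per-fold (fold, amex_metric, logloss) using StratifiedKFold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    report = []
    for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
        model = LogisticRegression(max_iter=1000)  # stand-in for LightGBM
        model.fit(X[trn_idx], y[trn_idx])
        pred = model.predict_proba(X[val_idx])[:, 1]
        report.append((fold, amex_metric(y[val_idx], pred), log_loss(y[val_idx], pred)))
    return report


# Synthetic demo data with roughly a quarter positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=2000) > 1.0).astype(int)

report = cv_report(X_demo, y_demo)
for fold, amex, ll in report:
    print(f"fold {fold}: amex={amex:.4f} logloss={ll:.4f}")
```

Because part of the metric depends only on the top 4% of predictions, it can move noticeably between folds, which is one plausible reason to watch logloss alongside it.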

Model

LightGBM
- Used dart.
- Hyperparameters are the same as in the base notebook.
  - AMEX Features - The Best of Both Worlds
- Kept the model with the best amex metric (using a callback)
  - Link to the related discussion

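Feature Selection

Adversarial validation
- Removed features with high importance in the train/private adversarial validation.
  - Removed features: R_1, D_59, S_11, B_29
  - After removing them, the adversarial AUC was 0.8.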
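
The selection step can be illustrated as follows: fit the adversarial classifier, rank features by importance, drop the top ones, and measure the adversarial AUC again. This is only a sketch under assumptions; the helper names, the synthetic features, and the random forest (used here in place of LightGBM because it exposes `feature_importances_`) are not from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_fit(df_a, df_b, cols, seed=0):
    """Return (OOF AUC, importances sorted descending) for separating df_a from df_b."""
    X = pd.concat([df_a[cols], df_b[cols]], ignore_index=True)
    y = np.r_[np.zeros(len(df_a), dtype=int), np.ones(len(df_b), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    importances = pd.Series(clf.fit(X, y).feature_importances_, index=cols)
    return roc_auc_score(y, proba), importances.sort_values(ascending=False)


def drop_drifting_features(df_a, df_b, cols, n_drop):
    """Drop the n_drop most important adversarial features and re-measure the AUC."""
    auc_before, importances = adversarial_fit(df_a, df_b, cols)
    dropped = list(importances.index[:n_drop])
    kept = [c for c in cols if c not in dropped]
    auc_after, _ = adversarial_fit(df_a, df_b, kept)
    return dropped, kept, auc_before, auc_after


# Synthetic demo: only the hypothetical feature "f0" drifts between the two sets.
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(5)]
train_like = pd.DataFrame(rng.normal(size=(300, 5)), columns=cols)
private_like = pd.DataFrame(rng.normal(size=(300, 5)), columns=cols)
private_like["f0"] += 3.0

dropped, kept, auc_before, auc_after = drop_drifting_features(train_like, private_like, cols, n_drop=1)
print(dropped, round(auc_before, 3), round(auc_after, 3))
```

In practice the number of features to drop is a judgment call, since a drifting feature can still carry useful signal for the actual target.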

Other solutions from the same competition

1st solution(update github code)

2nd place solution - team JuneHomes (writeup)

3rd solution--simple is the best

5th Place Solution - Team 💳VISA💳(Summary&zakopuro's part)

9th Place Solution ( XGBoost+LGBM+NN )