13th place solution - pmts_year_1139T postprocess


Author: Yuya Shintani (team leader)
Teammates: fuku4ki, Kento Okumura
Competition rank: 13th
Published: 2024-05-28

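First of all, thank you to Kaggle and to the competition hosts. To be honest, we do not fully understand which of our choices actually worked, but here is a brief overview of our solution.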

Background

Business context: Home Credit - Credit Risk Model Stability competition
Data context: Home Credit data

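Solution overview

For feature engineering, we hand-crafted 772 features from all tables except credit_bureau_b1, other_1, deposit_1, and debitcard_1, and reduced them to 411 through manual and correlation-based feature selection.

For modeling, we built 10 models (LightGBM, XGBoost, CatBoost, and HistGradientBoostingClassifier), stacked their outputs with RidgeClassifier, and calibrated the probabilities with CalibratedClassifierCV. The final predictions are an average over 5 random seeds.

For post-processing, we applied a negative correction to the scores, with a separate offset for each year, based on the maximum value of pmts_year_1139T in the credit_bureau_a_2 table.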
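The correlation-based part of the selection could look roughly like the sketch below. It is not the team's code: the helper name drop_correlated, the 0.95 threshold, and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.95):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold.

    Hypothetical helper; the threshold is an illustrative value.
    """
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=500),  # near-duplicate of "a"
    "c": rng.normal(size=500),                  # unrelated feature
})
selected, dropped = drop_correlated(toy)
print(dropped)
```

A filter like this only removes redundancy; it says nothing about whether a feature is stable over time, which is why a manual pass is still needed.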

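Solution details

Feature engineering

Feature engineering was done mainly by @kentookumura. Early on we decided not to use the credit_bureau_b_2, other_1, deposit_1, and debitcard_1 tables because of their high missing rates. We hand-crafted features based on EDA and on solutions from past competitions, then cut them from 772 to 411 with a combination of manual and correlation-based selection. The key points that we think worked well are listed below: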

Era

import polars as pl

# Round first_birth_259D down to the nearest multiple of 10
df_base = df_base.with_columns(
    ((pl.col("first_birth_259D") / 10).floor() * 10).alias("era").cast(pl.Int32),
)

Age at start of employment

df = df.with_columns(
    ((pl.col("empl_employedfrom_271D") - pl.col("birth_259D")).dt.total_days() // 365).cast(pl.Int32).alias("agestartofemploymentA"),
)

Duration of employment

df_base = df_base.with_columns(
    (pl.col("first_birth_259D") - pl.col("first_agestartofemploymentA")).alias("durationofemploymentA"),
)

Date handling beyond the "D" suffix

def handle_dates(df):
    for col in df.columns:
        if col.endswith("D"):
            # Date columns: days relative to date_decision, stored as Float32
            df = df.with_columns(
                (pl.col(col) - pl.col("date_decision")).dt.total_days().cast(pl.Float32)
            )
        elif "year" in col:
            # Year columns: years relative to the year of date_decision
            df = df.with_columns(
                (pl.col(col) - pl.col("date_decision").dt.year()).cast(pl.Int32)
            )
    return df

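Merging the tax_registry tables

Some case_id values have information from more than one provider. We inferred which columns correspond to each other across the tables and merged them into a single table.

Aggregating string columns (mode and number of unique values)

For reproducibility, we computed the mode in Polars as follows. mode() can return several tied values, so sorting them and taking the first makes the result deterministic: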

pl.col(col).drop_nans().drop_nulls().mode().sort().first()

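Removing features that fluctuate strongly over the training period

We manually inspected the features and dropped those whose values varied widely across WEEK_NUM.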
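The check itself was manual, but a simple ranking can help decide which features to look at first. The sketch below is one possible heuristic, not the team's procedure: it compares the range of the weekly means with the overall standard deviation. The feature names and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
week = rng.integers(0, 50, n)  # stand-in for WEEK_NUM
df = pd.DataFrame({
    "WEEK_NUM": week,
    "stable_feat": rng.normal(0, 1, n),
    "drifting_feat": rng.normal(0, 1, n) + 0.1 * week,  # mean grows over time
})

# Mean of every feature within each week.
weekly_mean = df.groupby("WEEK_NUM").mean()

# Spread of the weekly means, scaled by each feature's overall std.
drift = (weekly_mean.max() - weekly_mean.min()) / df.drop(columns="WEEK_NUM").std()
ranked = drift.sort_values(ascending=False)
print(ranked)
```

Features at the top of such a ranking are candidates for a closer look, for example by plotting their weekly means, before deciding whether to drop them.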

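Modeling

Modeling was done mainly by @uplus26e7. For cross-validation we used StratifiedGroupKFold (n_splits=5) grouped by WEEK_NUM. We tried a range of GBDT models, which need neither feature scaling nor missing-value imputation. To get diverse models, we trained 10 models with different parameters, stacked them with RidgeClassifier, and finally calibrated the predicted probabilities with scikit-learn's CalibratedClassifierCV. The whole pipeline was averaged over 5 random seeds to produce the final predictions.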

Model	Local CV AUC (5-seed average)	Main parameters
XGBoost	0.8569388045
CatBoost	0.8543810988
LightGBM	0.8576141787	boosting="gbdt", extra_tree=True
LightGBM	0.8569003931	boosting="gbdt"
LightGBM	0.8068982316	boosting="rf"
LightGBM	0.7993733596	boosting="rf", extra_tree=True
LightGBM	0.8546620954	boosting="dart"
LightGBM	0.8517172791	boosting="dart", extra_tree=True
HistGradientBoostingClassifier	0.8496922895
LightGBM	0.8574351195	boosting="gbdt", extra_tree=True, data_sample_strategy="goss"
CalibratedClassifierCV (RidgeClassifier)	0.859322
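The stacking and calibration step can be sketched as follows. This is an illustration under stated assumptions rather than the team's pipeline: the base-model scores are simulated by a hypothetical helper, fake_base_predictions, and the calibration method (sigmoid) and cv=5 are guesses. RidgeClassifier has no predict_proba, so CalibratedClassifierCV calibrates its decision_function output.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score

n_train, n_test, n_models = 2000, 500, 10
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)


def fake_base_predictions(y, seed):
    """Simulated scores of 10 base models (hypothetical stand-in for real OOF/test predictions)."""
    r = np.random.default_rng(seed)
    noise = r.normal(0.0, 1.0, size=(len(y), n_models))
    return 1.0 / (1.0 + np.exp(-(1.5 * y[:, None] - 0.75 + noise)))


seed_preds = []
for seed in range(5):
    oof = fake_base_predictions(y_train, seed)          # training rows x models
    test = fake_base_predictions(y_test, 100 + seed)    # test rows x models
    # Ridge stacker wrapped in a calibrator that outputs probabilities.
    stacker = CalibratedClassifierCV(RidgeClassifier(), method="sigmoid", cv=5)
    stacker.fit(oof, y_train)
    seed_preds.append(stacker.predict_proba(test)[:, 1])

# Seed averaging: mean of the calibrated probabilities over the 5 runs.
final = np.mean(seed_preds, axis=0)
auc = roc_auc_score(y_test, final)
print(round(auc, 4))
```

In a real pipeline the two matrices would hold out-of-fold and test predictions from the trained GBDT models, with each seed corresponding to a full retraining.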

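Post-processing

Through careful EDA and experiments, @kentookumura found that pmts_year_1139T in the credit_bureau_a_2 table very likely reflects the year of the most recent date_decision, and that the transformation applied to the date columns was apparently not applied to it. Based on these findings, we implemented a post-processing step that applies a negative correction to the predicted score depending on the maximum value of pmts_year_1139T.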

import pandas as pd

submission = pd.read_csv("submission.csv")
pmts_year = ...  # maximum pmts_year_1139T for each case_id, aligned with the rows of submission
# Subtract a year-specific offset and clip at 0 so scores stay non-negative
submission.loc[pmts_year == 2020, "score"] = (submission.loc[pmts_year == 2020, "score"] - 0.07).clip(0)
submission.loc[pmts_year == 2021, "score"] = (submission.loc[pmts_year == 2021, "score"] - 0.06).clip(0)
submission.loc[pmts_year == 2022, "score"] = (submission.loc[pmts_year == 2022, "score"] - 0.02).clip(0)
submission.to_csv("submission.csv", index=False)
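For readers who want to see the whole step end to end, here is a toy version in which the per-case maximum is computed with a groupby and mapped onto the submission rows. The two small frames are made up, and the way pmts_year is derived here is an assumption, since the post does not show that part. Only the offsets come from the snippet above.

```python
import pandas as pd

# Toy stand-ins for credit_bureau_a_2 and the submission file.
bureau = pd.DataFrame({
    "case_id": [1, 1, 2, 3, 3, 4],
    "pmts_year_1139T": [2019.0, 2020.0, 2021.0, 2021.0, 2022.0, 2018.0],
})
submission = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "score": [0.05, 0.50, 0.30, 0.20, 0.10],
})

# Maximum year per case, aligned with the submission rows
# (cases without bureau rows get NaN and are left untouched).
max_year = bureau.groupby("case_id")["pmts_year_1139T"].max()
pmts_year = submission["case_id"].map(max_year)

offsets = {2020: 0.07, 2021: 0.06, 2022: 0.02}
for year, delta in offsets.items():
    mask = pmts_year == year
    submission.loc[mask, "score"] = (submission.loc[mask, "score"] - delta).clip(lower=0)

print(submission)
```

Case 1 shows the effect of the clip: its score of 0.05 is smaller than the 2020 offset, so it ends at 0 rather than going negative.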
