22nd place solution: I asked ChatGPT for feature transformation ideas


Author: Khánh Vũ
Posted: June 1, 2025
Competition rank: 22nd place

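Feature engineering and transformation

I'm not a fitness expert, so I asked ChatGPT for some feature transformation functions built on the original features: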

import numpy as np

def mifflin_st_jeor(row):
    """Basal metabolic rate (kcal/day). Height in cm, weight in kg."""
    if row["Sex"] == "male":
        s = 5
    else:  # female
        s = -161
    return 10 * row["Weight"] + 6.25 * row["Height"] - 5 * row["Age"] + s

def boer_lbm(row):
    """Lean body mass (kg)."""
    if row["Sex"] == "male":
        return 0.407 * row["Weight"] + 0.267 * row["Height"] - 19.2
    else:
        return 0.252 * row["Weight"] + 0.473 * row["Height"] - 48.3

def body_surface_area(row):
    """Mosteller body surface area (m²)."""
    return np.sqrt(row["Height"] * row["Weight"] / 3600)

def body_fat_pct(row):
    """Body fat percentage estimated from BMI, age, and sex (looks like the Deurenberg formula)."""
    bmi = row["BMI"]
    adj = 10.8 if row["Sex"] == "male" else 0
    return 1.2 * bmi + 0.23 * row["Age"] - adj - 5.4

def max_hr(age):
    """Age-based maximum heart rate."""
    return 220 - age

def vo2_est(hr, age):
    """Very rough HR→VO₂ regression (ml·kg⁻¹·min⁻¹)."""
    return 14.0 + 0.37 * hr - 0.006 * age

train_df["BMI"] = train_df["Weight"] / (train_df["Height"] / 100) ** 2
train_df["Total_Exertion"] = train_df["Duration"] * train_df["Heart_Rate"]
train_df["Heart_Effort"] = train_df["Heart_Rate"] / train_df["Duration"]
train_df["BMR"] = train_df.apply(mifflin_st_jeor, axis=1)
train_df["LBM"] = train_df.apply(boer_lbm, axis=1)
train_df["BSA"] = train_df.apply(body_surface_area, axis=1)
train_df["Body_Fat_Pct"] = train_df.apply(body_fat_pct, axis=1)
train_df["MHR"] = train_df["Age"].apply(max_hr)
train_df["pct_MHR"] = train_df["Heart_Rate"] / train_df["MHR"]
train_df["Training_Load"] = train_df["pct_MHR"] * train_df["Duration"]
train_df["VO2"] = vo2_est(train_df["Heart_Rate"], train_df["Age"])
train_df["Heat_Index"] = (train_df["Body_Temp"] - 37.0) * train_df["Duration"]
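
The row-wise apply calls above get slow on a large training set, and the same columns also have to be built for the test set. Below is a minimal sketch of a column-wise version with the same formulas. The add_features name and the tiny demo frame are my own, not part of the original notebook, and it assumes Sex is coded as "male"/"female".

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns, computed column-wise."""
    out = df.copy()
    is_male = out["Sex"].eq("male")  # assumption: Sex is "male" or "female"

    out["BMI"] = out["Weight"] / (out["Height"] / 100) ** 2
    out["Total_Exertion"] = out["Duration"] * out["Heart_Rate"]
    out["Heart_Effort"] = out["Heart_Rate"] / out["Duration"]
    # Mifflin-St Jeor: the sex-specific constant is the only branch
    out["BMR"] = (10 * out["Weight"] + 6.25 * out["Height"] - 5 * out["Age"]
                  + np.where(is_male, 5, -161))
    # Boer lean body mass: pick the male or female formula per row
    out["LBM"] = np.where(
        is_male,
        0.407 * out["Weight"] + 0.267 * out["Height"] - 19.2,
        0.252 * out["Weight"] + 0.473 * out["Height"] - 48.3,
    )
    out["BSA"] = np.sqrt(out["Height"] * out["Weight"] / 3600)
    out["Body_Fat_Pct"] = (1.2 * out["BMI"] + 0.23 * out["Age"]
                           - np.where(is_male, 10.8, 0.0) - 5.4)
    out["MHR"] = 220 - out["Age"]
    out["pct_MHR"] = out["Heart_Rate"] / out["MHR"]
    out["Training_Load"] = out["pct_MHR"] * out["Duration"]
    out["VO2"] = 14.0 + 0.37 * out["Heart_Rate"] - 0.006 * out["Age"]
    out["Heat_Index"] = (out["Body_Temp"] - 37.0) * out["Duration"]
    return out


# Tiny made-up frame, only to show the call; these are not competition rows.
demo = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [30, 40],
    "Height": [180.0, 160.0],
    "Weight": [80.0, 60.0],
    "Duration": [20.0, 10.0],
    "Heart_Rate": [100.0, 90.0],
    "Body_Temp": [40.0, 39.5],
})
demo_features = add_features(demo)
```

Because the function returns a copy, it can be called once on the training frame and once on the test frame without touching either input.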

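Cross-validation

I built the 5-fold split from 100 quantile bins of sklearn.neighbors.LocalOutlierFactor.negative_outlier_factor_. I'm not sure how much this helped, but it did pick up the non-physical data points and it reduced the variance of the RMSE across folds. The figure below shows the data points in bin 0 (purple) against the points outside that bin (yellow).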
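
One way to read the split described above is sketched here: score every row with LocalOutlierFactor, cut the scores into 100 quantile bins, and stratify the 5 folds on the bin label so each fold gets a similar share of outlying rows. This is my interpretation, not the original notebook. The synthetic data, the StandardScaler step, n_neighbors=20, and the use of StratifiedKFold are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric training features (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 4)), columns=["a", "b", "c", "d"])

# LOF is distance-based, so the features are scaled first (assumption).
X_scaled = StandardScaler().fit_transform(X)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_scaled)

# Inliers score close to -1; more negative values are more outlying.
scores = lof.negative_outlier_factor_

# 100 quantile bins; bin 0 holds the most negative (most outlying) scores.
lof_bin = pd.qcut(scores, q=100, labels=False, duplicates="drop")

# Stratify on the bin label so every fold sees each outlier level.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold = np.full(len(X), -1)
for k, (_, val_idx) in enumerate(skf.split(X, lof_bin)):
    fold[val_idx] = k
```

The fold array can then be reused by every model, so the out-of-fold predictions line up when they are blended later.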

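Modeling

Combining these engineered features with the original ones, I tuned three gradient boosting algorithms (XGB, LGBM, and CatBoost) with Optuna in Kaggle notebooks. Both of my submissions were RidgeCV ensembles. The best one scored 0.05873 in CV and 0.05851 on the private LB.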
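
For readers who have not built a RidgeCV ensemble before, here is a minimal sketch of the general idea: collect out-of-fold predictions from each base model, then fit RidgeCV on those columns to learn the blend weights. It is not the original pipeline. The data is synthetic, the three scikit-learn GradientBoostingRegressor configurations are stand-ins for the tuned XGB, LGBM, and CatBoost models, and the alpha grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression problem (assumption, not the competition data).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Stand-ins for the three tuned boosters (assumption).
base_models = {
    "gbr_depth2": GradientBoostingRegressor(max_depth=2, random_state=0),
    "gbr_depth3": GradientBoostingRegressor(max_depth=3, random_state=1),
    "gbr_depth4": GradientBoostingRegressor(max_depth=4, random_state=2),
}

# One shared split so the out-of-fold columns are aligned row by row.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_models.values()]
)

# RidgeCV picks the regularisation strength and learns one weight per model.
blender = RidgeCV(alphas=np.logspace(-3, 3, 13))
blender.fit(oof, y)
blend_pred = blender.predict(oof)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


single_rmse = [rmse(y, oof[:, j]) for j in range(oof.shape[1])]
blend_rmse = rmse(y, blend_pred)
```

At inference time each base model predicts on the test set and blender.predict is applied to the stacked test predictions. Note that blend_rmse here is measured on the same rows the blender was fit on, so it is an optimistic estimate.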
