664. Playground Series - Season 5, Episode 8 | playground-series-s5e8
Subtitle: a stack of meta-models selected via hill climbing.
Inspired by @yunsuxiaozi's 5-submission challenge, I decided to cap my submissions at the number of days in the month (31). That seems enough to run a few experiments and then focus on a specific approach. In future Playground Series competitions I may stick to this submissions == days-in-month limit.
My approach was to train a diverse set of base models, including linear models, boosted trees, and neural networks. The base models were trained on different feature sets and tuned with different parameters. I then ran hill climbing over these base models, and the selected models were used to train meta-learners.
Note that the base models fed into the meta-learners were not weighted in any way: any base model that received a positive weight from hill climbing was used as an input to the meta-learner.
I then ran a hill climb over the results of these meta-learners to obtain the final submission.
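Both hill-climbing passes follow the same greedy pattern: start from an empty blend and repeatedly add whichever model's OOF predictions most improve ROC AUC. A minimal sketch (the step size and iteration count here are illustrative, not the exact settings I used):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb(oof_preds, y, iters=100, step=0.05):
    """Greedy forward ensemble selection: at each step, add the model
    (with weight `step`) that most improves OOF ROC AUC."""
    n_models = oof_preds.shape[1]
    weights = np.zeros(n_models)
    blend = np.zeros(len(y))
    best_score = 0.0
    for _ in range(iters):
        best_j, best_new = -1, best_score
        for j in range(n_models):
            cand = blend + step * oof_preds[:, j]
            score = roc_auc_score(y, cand)
            if score > best_new:
                best_j, best_new = j, score
        if best_j < 0:
            break  # no model improves the blend any further
        blend += step * oof_preds[:, best_j]
        weights[best_j] += step
        best_score = best_new
    # normalized weights sum to 1, like the lists below
    return weights / weights.sum(), best_score
```

Models that never get picked end up with weight 0 and are dropped from the ensemble.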
Submission results:
Both of my selected submissions scored 0.97801 on the public leaderboard and the same on the private one; their CV scores differ only in the last digit.
Best ROC AUC Score: 0.97742733 - v7 | lb 0.97801 *
Best ROC AUC Score: 0.97742735 - v8 | lb 0.97801 *
I used the same CV split for all base models:
from sklearn.model_selection import StratifiedKFold

SEED = 208
FOLDS = 5
cv = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
For the meta-learners I used FOLDS = 10.
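A rough sketch of the stacking step: the meta-learner is trained on the base models' OOF predictions using its own 10-fold split, producing meta OOF predictions (for the next hill climb) and averaged test predictions. LogisticRegression is a stand-in here; I used several different meta-learner types:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_meta_learner(oof_preds, test_preds, y, folds=10, seed=208):
    """Train a meta-learner on base-model OOF predictions.
    Returns the meta-learner's own OOF predictions and the
    fold-averaged test predictions."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    meta_oof = np.zeros(len(y))
    meta_test = np.zeros(len(test_preds))
    for tr_idx, va_idx in cv.split(oof_preds, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(oof_preds[tr_idx], y[tr_idx])
        meta_oof[va_idx] = model.predict_proba(oof_preds[va_idx])[:, 1]
        meta_test += model.predict_proba(test_preds)[:, 1] / folds
    return meta_oof, meta_test
```

The meta OOF predictions are what the second hill climb operates on.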
Final ensemble weights (high → low):
0.2490 - cdeotte xgboost - orig as columns
0.1117 - ps-s5e8-xgboost-deep
0.0905 - nn-by-gpt5
0.0777 - CatBoostClassifier (ensemble ii)
0.0714 - xgboost and nn ensemble
0.0670 - CatBoostClassifier (ensemble iv)
0.0577 - LGBMClassifier-params_v12
0.0556 - LGBMClassifier-ii-TE-std
0.0516 - CatBoostClassifier (ensemble)
0.0353 - NeuralNetFastAI_BAG_L2-ii
0.0324 - CatBoostClassifier (ensemble hist-ii)
0.0284 - cdeotte xgboost - ensemble
0.0262 - SGDClassifier (stack light-ii)
0.0169 - RandomForestEntr_BAG_L2-ii
0.0117 - RandomForestClassifier (ensemble ii)
0.0098 - RandomForestEntr_BAG_L2-iii
0.0053 - RandomForestClassifier (ensemble)
0.0009 - HistGradientBoostingClassifier (ensemble histbook)
0.0009 - cdeotte xgboost - orig as rows
Final ensemble weights (high → low):
0.223215 - WeightedEnsemble_L2-l2
0.177151 - NeuralNetTorch_r79_BAG_L1-l2
0.142211 - NeuralNetTorch_BAG_L1-l2
0.101949 - LightGBM_r131_BAG_L1-l2
0.067658 - CatBoostClassifier (l2 boruta)
0.066139 - cdeotte xgboost more - orig as columns
0.064272 - nn-by-gpt5-more
0.057201 - NeuralNetFastAI_BAG_L1-l2
0.044992 - LightGBMXT_BAG_L1-l2
0.029317 - XGBoost_BAG_L1-l2
0.014647 - NeuralNetFastAI_r191_BAG_L1-l2
0.008044 - RandomForestGini_BAG_L2-iii-more
0.002062 - LightGBM_BAG_L1-iii-more
0.000923 - XGBoost_BAG_L1-iii-more
0.000220 - HistGradientBoostingClassifier (l2 boruta)
Note that "cdeotte xgboost more - orig as columns" and "nn-by-gpt5-more" here are @cdeotte's models trained on the OOFs of the selected base models. These models were adapted to use my own CV split and trained normally (without a full fit).
Depending on the model type, I used a combination of the following functions.
import numpy as np
import pandas as pd

def feature_engineer(df):
    df['has_debt'] = (df['balance'] < 0).astype(int)
    df['long_duration'] = (df['duration'] > 300).astype('category')
    df['duration_sqrt'] = np.sqrt(df['duration']).astype('float32')
    df['duration_log'] = np.log1p(df['duration'])
    df['duration_sin'] = np.sin(2*np.pi * df['duration'] / 540).astype('float32')
    df['duration_cos'] = np.cos(2*np.pi * df['duration'] / 540).astype('float32')
    df['balance_log'] = (np.sign(df['balance']) * np.log1p(np.abs(df['balance']))).astype('float32')
    df['balance_sin'] = np.sin(2*np.pi * df['balance'] / 1000).astype('float32')
    df['balance_cos'] = np.cos(2*np.pi * df['balance'] / 1000).astype('float32')
    df['age_sin'] = np.sin(2*np.pi * df['age'] / 10).astype('float32')
    df['pdays_sin'] = np.sin(2*np.pi * df['pdays'] / 7).astype('float32')
    df['duration_bin_20'] = pd.qcut(df['duration'], q=20, labels=False, duplicates='drop')
    df['balance_bin_20'] = pd.qcut(df['balance'], q=20, labels=False, duplicates='drop')
    # ref: https://www.kaggle.com/code/ganeshataqwa/0-96-classifying-bank-customers-let-s-do-it
    df['is_first_contact'] = np.where(df['pdays'] == -1, 1, 0)
    df['contact_ratio'] = df['campaign'] / (df['previous'] + 1)
    df['economic_stability'] = df['balance'] / df['age']
    high_months = ['mar', 'oct', 'sep', 'dec']
    df['is_high_conversion_month'] = np.where(df['month'].isin(high_months), 1, 0)
    df['is_short_call'] = np.where(df['duration'] <= 150, 1, 0)
    month_map = {
        'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
        'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
    }
    df['month_num'] = df['month'].map(month_map).astype(int)
    df['month_sin'] = np.sin(2*np.pi * df['month_num'] / 12).astype('float32')
    df = df.drop(['month_num'], axis=1)
    return df
Creating combinations of categorical features.
from itertools import combinations
from tqdm import tqdm

def pairwise_combinations(train, test, to_combine):
    encoded_columns = []
    pair_size = [2, 3]
    for r in pair_size:
        for cols in tqdm(list(combinations(to_combine, r))):
            col_name = '_'.join(cols)
            train[col_name] = train[list(cols)].astype(str).agg('_'.join, axis=1)
            train[col_name] = train[col_name].astype('category')
            test[col_name] = test[list(cols)].astype(str).agg('_'.join, axis=1)
            test[col_name] = test[col_name].astype('category')
            encoded_columns.append(col_name)
    print(len(encoded_columns), 'new features added')
    return train, test
to_combine = ['default', 'housing', 'loan', 'poutcome', 'balance', 'duration', 'previous']
X, X_test = pairwise_combinations(X, X_test, to_combine)
Interactions between numerical features.
import itertools

def add_interaction_features(df, features):
    data = df.copy()
    for f1, f2 in itertools.combinations(features, 2):
        data[f'{f1}_plus_{f2}'] = data[f1] + data[f2]
        data[f'{f1}_minus_{f2}'] = data[f1] - data[f2]
        data[f'{f1}_div_{f2}'] = data[f1] / (data[f2] + 1e-5)
        data[f'{f1}_times_{f2}'] = data[f1] * data[f2]
    return data
nums = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
X = add_interaction_features(X, nums)
X_test = add_interaction_features(X_test, nums)
I used the original data in several ways.
The first was suggested by @siukeitin in this comment: specifically, combining augmentation and post-processing.
model = Augmented(
    Postprocessed(LGBMClassifier, contrarian)(**light_params_v28), X_orig, y_orig
)
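`Augmented` and `Postprocessed` are the wrappers from that comment; roughly, the augmentation part appends the original dataset to the training data at fit time, while validation and prediction still happen on the synthetic data only. A hypothetical sketch of that idea (the class name and internals here are my own, not the actual implementation from the thread):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression

class AugmentedSketch(BaseEstimator, ClassifierMixin):
    """Hypothetical sketch: append the original data to every
    training fold; predict on the synthetic data as usual."""
    def __init__(self, estimator, X_orig, y_orig):
        self.estimator = estimator
        self.X_orig = X_orig
        self.y_orig = y_orig

    def fit(self, X, y):
        X_aug = np.vstack([np.asarray(X), np.asarray(self.X_orig)])
        y_aug = np.concatenate([np.asarray(y), np.asarray(self.y_orig)])
        self.model_ = clone(self.estimator).fit(X_aug, y_aug)
        return self

    def predict_proba(self, X):
        return self.model_.predict_proba(X)
```

The post-processing part (the `contrarian` function) then adjusts the predicted probabilities after inference; see @siukeitin's comment for the details.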
I also used the following function, which @jmascacibar discussed here: the original data dilemma.
def add_original_cols(df_train, df_test, df_orig, feats, target_col='y'):
    '''
    Add target-mean features, computed on the original data grouped by
    each feature, to the synthetic train/test data.
    ref: https://www.kaggle.com/competitions/playground-series-s5e8/discussion/597903
    '''
    train = df_train.copy()
    test = df_test.copy()
    tm = df_orig[target_col].mean()
    add_feats = []
    for feat in feats:
        if feat in df_orig.columns:
            name = f'{feat}_orig_target_mean'
            mapping = df_orig.groupby(feat)[target_col].mean()
            train[name] = train[feat].map(mapping)
            train[name] = train[name].fillna(tm)
            test[name] = test[feat].map(mapping)
            test[name] = test[name].fillna(tm)
            add_feats.append(name)
            print(f'Added {name} feature')
    print('\n---- Complete ----\n')
    print(f'Train, Test shape: {train.shape, test.shape}')
    return train, test
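The core of the function is a groupby target mean on the original data, mapped onto the synthetic rows, with the global target mean as the fallback for unseen values. A toy example (made-up frames, not the competition data):

```python
import pandas as pd

# Original data: category 'a' converts half the time, 'b' always.
df_orig = pd.DataFrame({'job': ['a', 'a', 'b', 'b'], 'y': [1, 0, 1, 1]})
# Synthetic data contains 'c', which the original data never saw.
train = pd.DataFrame({'job': ['a', 'b', 'c']})

tm = df_orig['y'].mean()                      # global mean: 0.75
mapping = df_orig.groupby('job')['y'].mean()  # a -> 0.5, b -> 1.0
train['job_orig_target_mean'] = train['job'].map(mapping).fillna(tm)
print(train['job_orig_target_mean'].tolist())  # [0.5, 1.0, 0.75]
```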
Finally, the only public model I used among my base models comes from @cdeotte; see his discussion post for more information:
This is my first write-up, so let me know if anything is unclear, and feel free to ask about anything I haven't covered.
That's it. Good luck everyone 🖖🏾.