14th Place Solution

Author: Matt OP (Grandmaster) | Competition rank: 14

Hi everyone, this was another really fun competition! There is a lot to cover in my solution, so I'll jump right in. Full code link: click here.

Encoding

We have quite a few categorical features, so I saw many different approaches that worked. Here is what I used:

LabelEncoder() on ["Gender", "OverTime", "MaritalStatus", "PerformanceRating"]
OneHotEncoder() on ["Department", "BusinessTravel"]
LeaveOneOutEncoder(sigma = 0.05) on ["EducationField", "JobRole"]
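
For readers unfamiliar with leave-one-out encoding, here is a minimal pandas sketch of the idea: each row gets the mean target of its category computed without that row, optionally multiplied by Gaussian noise. This is an illustration on made-up toy data, not the notebook's code. The helper name leave_one_out_encode, its seed argument, and the fallback to the global mean for single-row categories are assumptions of this sketch, and the exact noise handling inside LeaveOneOutEncoder may differ.

```python
import numpy as np
import pandas as pd


def leave_one_out_encode(categories, target, sigma=0.0, seed=0):
    """Per-category mean of `target`, excluding the current row (hypothetical helper)."""
    df = pd.DataFrame({"cat": categories, "y": target})
    grouped = df.groupby("cat")["y"]
    total = grouped.transform("sum")
    count = grouped.transform("count")
    # Categories with a single row have nothing left after exclusion:
    # mark them as NaN, then fall back to the global mean (an assumption).
    denom = (count - 1).where(count > 1)
    encoded = ((total - df["y"]) / denom).fillna(df["y"].mean())
    if sigma > 0:
        # Multiplicative noise to reduce target leakage on the training rows.
        rng = np.random.default_rng(seed)
        encoded = encoded * rng.normal(1.0, sigma, size=len(df))
    return encoded


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "JobRole": ["Sales", "Sales", "Sales", "Lab", "Lab", "Manager"],
    "Attrition": [1, 0, 1, 0, 0, 1],
})
toy["JobRole_loo"] = leave_one_out_encode(toy["JobRole"], toy["Attrition"])
print(toy)
```

With sigma = 0 the first "Sales" row is encoded as 0.5, because the other two "Sales" rows have targets 0 and 1. At prediction time the full per-category mean would be used instead, since test rows have no target to exclude.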

Outliers

A few values in the training set could be damaging to model building. Here is how I handled them:

train.at[527, "Education"] = 5
train.at[1535, "JobLevel"] = 5
train.at[1398, "DailyRate"] = train["DailyRate"].median()
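
The post does not say how these rows were found. One possible way, sketched here under that caveat, is to flag training values that fall outside the range observed in the original dataset. The helper name out_of_range_rows and the toy values below are made up for illustration.

```python
import pandas as pd


def out_of_range_rows(train, reference, columns):
    """Map each column to the train index labels outside the reference min/max."""
    flagged = {}
    for col in columns:
        lo, hi = reference[col].min(), reference[col].max()
        mask = (train[col] < lo) | (train[col] > hi)
        if mask.any():
            flagged[col] = train.index[mask].tolist()
    return flagged


# Toy frames, invented for illustration; they are not the competition data.
original_toy = pd.DataFrame({"Education": [1, 2, 3, 4, 5], "JobLevel": [1, 2, 3, 4, 5]})
train_toy = pd.DataFrame({"Education": [3, 15, 2], "JobLevel": [1, 2, 7]})

flagged = out_of_range_rows(train_toy, original_toy, ["Education", "JobLevel"])
print(flagged)
```

Flagged rows still need a manual look before being overwritten, since a value outside the original range is not necessarily wrong.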

Feature Engineering

@snnclsr had a great idea in Season 3, Episode 1: add a feature that indicates whether a row is generated:

train["is_generated"] = 1
test["is_generated"] = 1
original["is_generated"] = 0

I ended up using this feature because we are once again dealing with synthetic data, and it gave the CV score a small boost.
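
The flag only carries information once the original rows are stacked with the synthetic training rows. A minimal sketch of that step, assuming the two frames share the same columns (the toy frames and the name combined are made up here):

```python
import pandas as pd

# Toy stand-ins for the competition train set and the original dataset.
train_toy = pd.DataFrame({"Age": [30, 41], "Attrition": [0, 1]})
original_toy = pd.DataFrame({"Age": [25], "Attrition": [0]})

train_toy["is_generated"] = 1     # synthetic competition rows
original_toy["is_generated"] = 0  # rows from the original dataset

# Stack both sources into one training frame; the flag lets the model
# treat the two sources differently if their distributions diverge.
combined = pd.concat([train_toy, original_toy], ignore_index=True)
print(combined)
```

The test set keeps is_generated = 1 throughout, so predictions are made in the "synthetic" regime.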

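@craigmthomas also had a nice idea in Season 3, Episode 2: use the number of risk factors as a feature. It took quite a while, but I went through all the features carefully and looked closely at the Attrition rate for each one. I tried many different subsets, and this setup ended up giving the biggest CV improvement: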

def feature_risk_factors(df):
    df["risk_factors"] = df[[
    "RelationshipSatisfaction", "MonthlyIncome", 
    "BusinessTravel", "Department", "EducationField", 
    "Education", "JobInvolvement", "JobSatisfaction", 
    "StockOptionLevel", 
    "TrainingTimesLastYear", "WorkLifeBalance", "OverTime"
    ]].apply(
        lambda x: \
        0 + (1 if x.MonthlyIncome < 3000 else 0) + \
        (1 if x.BusinessTravel == "Travel_Frequently" else 0) + \
        (1 if x.Department == "Human Resources" else 0) + \
        (1 if x.EducationField in ["Human Resources", "Marketing"] else 0) + \
        (1 if x.Education == 1 else 0) + \
        (1 if x.JobInvolvement == 1 else 0) + \
        (1 if x.JobSatisfaction == 1 else 0) + \
        (1 if x.StockOptionLevel == 0 else 0) + \
        (1 if x.TrainingTimesLastYear == 0 else 0) + \
        (1 if x.WorkLifeBalance == 1 else 0) + \
        (1 if x.OverTime == 1 else 0),
        axis = 1
    )
    return df

This feature actually ended up being the most important feature in both CatBoost and XGBoost. Oddly, LGBM ranked it only 13th.
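
The row-wise apply above can be slow on larger frames. As an alternative sketch (not the notebook's code), the same count can be built from vectorized boolean columns. It assumes, as the original function does, that OverTime is already label-encoded to 0/1 while BusinessTravel, Department, and EducationField are still raw strings. The function name risk_factor_count and the toy rows are made up.

```python
import pandas as pd


def risk_factor_count(df):
    """Count how many of the 11 risk conditions hold for each row."""
    flags = [
        df["MonthlyIncome"] < 3000,
        df["BusinessTravel"] == "Travel_Frequently",
        df["Department"] == "Human Resources",
        df["EducationField"].isin(["Human Resources", "Marketing"]),
        df["Education"] == 1,
        df["JobInvolvement"] == 1,
        df["JobSatisfaction"] == 1,
        df["StockOptionLevel"] == 0,
        df["TrainingTimesLastYear"] == 0,
        df["WorkLifeBalance"] == 1,
        df["OverTime"] == 1,  # assumes OverTime was label-encoded to 0/1
    ]
    # Each flag is a boolean Series; summing them as integers gives the count.
    return sum(flag.astype(int) for flag in flags)


# Two invented rows: one meeting every condition, one meeting none.
toy = pd.DataFrame({
    "MonthlyIncome": [2500, 6000],
    "BusinessTravel": ["Travel_Frequently", "Travel_Rarely"],
    "Department": ["Human Resources", "Sales"],
    "EducationField": ["Marketing", "Life Sciences"],
    "Education": [1, 3],
    "JobInvolvement": [1, 3],
    "JobSatisfaction": [1, 4],
    "StockOptionLevel": [0, 1],
    "TrainingTimesLastYear": [0, 2],
    "WorkLifeBalance": [1, 3],
    "OverTime": [1, 0],
})
toy["risk_factors"] = risk_factor_count(toy)
print(toy["risk_factors"].tolist())
```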

Okay, okay, I didn't only use other people's feature engineering ideas. Here are the features I designed myself:

def feature_engineering(df):
    df["Dedication"] = df["YearsAtCompany"] + df["YearsInCurrentRole"] + df["TotalWorkingYears"]
    df["JobSkill"] = df["JobInvolvement"] * df["JobLevel"]
    df["Satisfaction"] = df["EnvironmentSatisfaction"] * df["RelationshipSatisfaction"]
    df["MonthlyRateIncome"] = df["MonthlyIncome"] * df["MonthlyRate"]
    df["HourlyDailyRate"] = df["HourlyRate"] * df["DailyRate"]
    return df
