45th Place Simple Solution - 4 model average | Winning Solution

Author: Nirjhar Roy | Published: 2020-02-12

Thanks to everyone for posting such great solutions. Thanks to the organizers and to my teammates @iru538 @tarandro @mobassir and @shahules.

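I joined the competition early, then left to take part in the PKU competition. After failing to get a medal in PKU, I came back to this competition and teamed up with Shahul and Mobassir. By then they had already trained more than 13 vanilla (public kernel) TF BERT + XLNet models, with scores between .385 and .387, but we could not break the .400 mark. I then applied the post-processing method below (taken from a public kernel):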

import numpy as np

# final_predictions, df_train and output_categories are defined earlier in the kernel.
test_preds = final_predictions
y_train = df_train[output_categories].values
for column_ind in range(30):
    curr_column = y_train[:, column_ind]
    values = np.unique(curr_column)
    # For each distinct training value, record its share of rows and the
    # cumulative share of all smaller values.
    map_quantiles = []
    for val in values:
        occurrence = np.mean(curr_column == val)
        cumulative = sum(el['occurrence'] for el in map_quantiles)
        map_quantiles.append({'value': val, 'occurrence': occurrence, 'cumulative': cumulative})

    # Snap predictions that fall inside the matching quantile band to that value.
    for quant in map_quantiles:
        pred_col = test_preds[:, column_ind]
        q1 = np.quantile(pred_col, quant['cumulative'])
        q2 = np.quantile(pred_col, min(quant['cumulative'] + quant['occurrence'], 1))
        pred_col[(pred_col >= q1) & (pred_col <= q2)] = quant['value']
        test_preds[:, column_ind] = pred_col
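
The idea behind this post-processing can be sketched in a compact, rank-based form. The block below is an illustrative variant, not the kernel's code: the function name `match_train_distribution` and the toy arrays are assumptions made for the example. It assigns each prediction to the training value whose cumulative share covers the prediction's rank, so the ordering of predictions is preserved while the output takes only values seen in training.

```python
import numpy as np

def match_train_distribution(train_col, pred_col):
    """Snap predictions onto the discrete training values, keeping each
    value's share of rows close to its share in the training column."""
    train_col = np.asarray(train_col, dtype=float)
    pred_col = np.asarray(pred_col, dtype=float)
    values, counts = np.unique(train_col, return_counts=True)
    # Cumulative share of training rows at or below each distinct value.
    upper = np.cumsum(counts) / counts.sum()
    # Rank-based position of every prediction in (0, 1].
    ranks = np.argsort(np.argsort(pred_col, kind="stable"), kind="stable")
    position = (ranks + 1) / len(pred_col)
    # First training value whose cumulative share covers that position.
    idx = np.searchsorted(upper, position, side="left")
    idx = np.minimum(idx, len(values) - 1)
    return values[idx]

# Toy data, for illustration only.
toy_train = np.array([0.0, 0.0, 0.5, 1.0])
toy_pred = np.array([0.2, 0.9, 0.4, 0.6])
toy_result = match_train_distribution(toy_train, toy_pred)
print(toy_result)
```

Because the mapping is monotone in the prediction rank, it cannot reverse the order of two predictions; it only introduces ties, which is why it is compatible with a rank-based metric.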

For the first time we broke through .401, and with models such as XLNet and RoBERTa we pushed the score further to .417.

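Meanwhile, Toru joined our team and brought a great single PyTorch model that scored .409 without post-processing. Since I am more comfortable with PyTorch, I used that model as a base and started training GPT2 and other PyTorch BERT models. With the PyTorch models, our score reached .410.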

After joining, Alexis pointed us to a post-processing trick used in the 1st place solution of Data Science Bowl 2018: "Data Science is Important- Bert says". This is our OptimizedRounder:

from functools import partial

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import spearmanr

class OptimizedRounder(object):
    def __init__(self, correlation):
        # Spearman score below which the loss adds a heavy penalty.
        self.correlation = correlation
        self.coef_ = 0
        self.score = 0

    def _kappa_loss(self, coef, X, y):
        # Despite the name, this loss is based on Spearman correlation, not kappa.
        a = X.copy()
        # The two thresholds in coef split the predictions into three bins.
        X_p = pd.cut(a, [-np.inf] + list(np.sort(coef)) + [np.inf], labels=[0, 1, 2])

        # Snap the lowest bin to 0 and the highest bin to 1; keep the middle bin as is.
        a[X_p == 0] = 0
        a[X_p == 2] = 1

        score = spearmanr(a, y).correlation
        self.score = score
        if score < self.correlation:
            # Penalise thresholds that score below the reference correlation.
            return -score + (self.correlation - score + 1) ** 10
        return -score

    def fit(self, X, y, coef_ini):
        loss_partial = partial(self._kappa_loss, X=X, y=y)
        self.coef_ = minimize(loss_partial, coef_ini, method='nelder-mead')

    def coefficients(self):
        return self.coef_['x']

    def score_fc(self):
        # Score from the most recent loss evaluation, not necessarily the best one.
        return self.score
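
As a rough illustration of how this kind of threshold search behaves, here is a self-contained sketch on synthetic data. It is a simplified function-style version, not the team's pipeline: the helper names `snap_extremes` and `fit_thresholds`, the starting thresholds (0.05, 0.95) and the random data are all assumptions made for the example, and the penalty term from the class above is left out.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def snap_extremes(pred, thresholds):
    """Set predictions at or below the lower threshold to 0 and above the upper to 1."""
    low, high = np.sort(thresholds)
    out = pred.copy()
    out[pred <= low] = 0.0
    out[pred > high] = 1.0
    return out

def fit_thresholds(pred, target, init=(0.05, 0.95)):
    """Search with Nelder-Mead for two thresholds that maximise Spearman correlation."""
    def loss(thresholds):
        score = spearmanr(snap_extremes(pred, thresholds), target).correlation
        # A constant column yields NaN; treat it as the worst possible score.
        return 1.0 if np.isnan(score) else -score
    result = minimize(loss, np.asarray(init, dtype=float), method="nelder-mead")
    return np.sort(result.x), -result.fun

# Synthetic column, for illustration only.
rng = np.random.default_rng(0)
target = rng.choice([0.0, 1 / 3, 2 / 3, 1.0], size=500)
pred = np.clip(target + rng.normal(0.0, 0.2, size=500), 0.0, 1.0)

baseline = spearmanr(pred, target).correlation
thresholds, fitted = fit_thresholds(pred, target)
print("baseline:", baseline, "fitted:", fitted, "thresholds:", thresholds)
```

In practice the thresholds would be fitted per target column on out-of-fold predictions and then applied to the test predictions; fitting and scoring on the same data, as in this toy example, overstates the gain.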

Alexis then started applying this post-processing to our PyTorch models, which gave a big improvement. We therefore dropped the TF models (they also used more GPU).

Our final selected solution was an ensemble of the following models:

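BERT_Base, 5 folds (separate question and answer models; three different architectures tried) + statistical features (which greatly improved the score).
Sequence model with statistical features.
DistilBERT embeddings sequence model.
DistilBERT embeddings LSTM model.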
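
A minimal sketch of averaging four models' predictions is shown below. It assumes equal weights and assumes the score is the mean column-wise Spearman correlation, as the use of `spearmanr` in the OptimizedRounder suggests; the function names and the synthetic predictions are hypothetical and stand in for the real model outputs.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_columnwise_spearman(y_true, y_pred):
    """Average the Spearman correlation over target columns."""
    scores = [
        spearmanr(y_true[:, i], y_pred[:, i]).correlation
        for i in range(y_true.shape[1])
    ]
    return float(np.nanmean(scores))

def average_ensemble(predictions, weights=None):
    """Average a list of (n_rows, n_targets) arrays, optionally weighted."""
    stacked = np.stack(predictions, axis=0)
    return np.average(stacked, axis=0, weights=weights)

# Synthetic stand-ins for four models' predictions, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.choice([0.0, 0.5, 1.0], size=(200, 3))
model_preds = [
    np.clip(y_true + rng.normal(0.0, 0.3, size=y_true.shape), 0.0, 1.0)
    for _ in range(4)
]

blend = average_ensemble(model_preds)
print("single model:", mean_columnwise_spearman(y_true, model_preds[0]))
print("4-model average:", mean_columnwise_spearman(y_true, blend))
```

Passing `weights` allows an uneven blend; whether that helps depends on validation scores, which this sketch does not model.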

Using GPT2 in place of the sequence model gave us a slightly better Private Score, but we did not select it as our final submission. Our highest-scoring kernel contained:

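BERT_Base, 5 folds (separate question and answer models; three different architectures tried).
GPT2 Base (10 folds, separate question and answer models).
Simple LSTM model.
Sequence model with USE features.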
