22nd place solution (NLP beginner)

Author: kurupical (Grandmaster) | Rank: 22nd | Published: 2020-02-12

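Thanks to Kaggle for hosting a great competition!

I am new to NLP, so I learned a lot from the excellent kernels: preprocessing, how to build a Bert model, and more. My code is based on this kernel by @idv2005 and this kernel by @adityaecdrid. Thank you very much!

Below is a summary of my work. (Sorry for my poor English.)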

Solution

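bert-base-cased, bert-base-uncased, xlnet-base-uncased, and Twin models (XLNet, Bert).
The Twin model idea comes from https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer. Thank you very much! A Twin model has one Bert for the question (QBert) and one Bert for the answer (ABert), and processes their sequence outputs with 3 custom heads: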
a. Average pooling of QBert and ABert => concatenate (shape=768*2) => dense(30)
b. Average pooling of QBert => dense(22), average pooling of ABert => dense(8), then concatenate the QBert dense output and the ABert dense output.
c. Concatenate the QBert/ABert sequence outputs (shape=512, 768*2) => Transformer => average pooling => dense(30)
Final output: a * 0.25 + b * 0.25 + c * 0.5
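The three heads and the weighted blend described above can be sketched roughly as follows. This is a minimal PyTorch sketch, not the author's code: the class name `TwinHeads`, the use of a single `nn.TransformerEncoderLayer` for the "Transformer" step, the number of attention heads, and the plain (unmasked) mean pooling are all assumptions made for illustration. The post does not say which output activation was used, so none is applied here.

```python
import torch
import torch.nn as nn


class TwinHeads(nn.Module):
    """Hypothetical heads over question/answer encoder sequence outputs."""

    def __init__(self, hidden=768, n_targets=30, n_question=22, n_heads=8):
        super().__init__()
        n_answer = n_targets - n_question
        # Head a: pooled question + pooled answer -> all targets
        self.head_a = nn.Linear(hidden * 2, n_targets)
        # Head b: question targets from QBert only, answer targets from ABert only
        self.head_b_q = nn.Linear(hidden, n_question)
        self.head_b_a = nn.Linear(hidden, n_answer)
        # Head c: one Transformer layer over the concatenated sequences (assumption)
        self.mixer = nn.TransformerEncoderLayer(
            d_model=hidden * 2, nhead=n_heads, batch_first=True
        )
        self.head_c = nn.Linear(hidden * 2, n_targets)

    def forward(self, q_seq, a_seq):
        # q_seq, a_seq: (batch, seq_len, hidden) sequence outputs of the two encoders
        q_mean, a_mean = q_seq.mean(dim=1), a_seq.mean(dim=1)
        out_a = self.head_a(torch.cat([q_mean, a_mean], dim=-1))
        out_b = torch.cat([self.head_b_q(q_mean), self.head_b_a(a_mean)], dim=-1)
        mixed = self.mixer(torch.cat([q_seq, a_seq], dim=-1))
        out_c = self.head_c(mixed.mean(dim=1))
        # Weighted blend of the three heads, as in the post
        return 0.25 * out_a + 0.25 * out_b + 0.5 * out_c
```

In a real model `q_seq` and `a_seq` would come from the two encoders, and padding tokens would normally be masked out before pooling; both are left out to keep the sketch short.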

Post-processing (thanks to this discussion)

import numpy as np
import pandas as pd

# Number of bins per target column; columns not listed here are left unbinned.
best_bin_dict = {
    'question_type_spelling': 4, 'question_type_instructions': 8, 'question_type_entity': 9,
    'question_type_definition': 8, 'question_type_consequence': 4, 'question_type_compare': 8,
    'question_type_choice': 8, 'question_opinion_seeking': 50, 'question_not_really_a_question': 4,
    'question_multi_intent': 9, 'question_interestingness_self': 9, 'question_interestingness_others': 100,
    'question_has_commonly_accepted_answer': 5, 'question_fact_seeking': 10, 'question_expect_short_answer': 50,
    'question_conversational': 8, 'question_body_critical': 100, 'question_asker_intent_understanding': 100,
    'answer_well_written': 9, 'answer_type_procedure': 50, 'answer_type_instructions': 9,
    'answer_relevance': 100, 'answer_level_of_information': 50,
}

# preds (per-model prediction arrays) and df_test (the test dataframe) are defined earlier in the notebook.
df_sub = pd.read_csv("../input/google-quest-challenge/sample_submission.csv")
pred_final = np.array(preds).sum(axis=0)
for i, col in enumerate(df_sub.columns[1:]):
    df_sub[col] = pred_final[:, i]
# Rows whose host is not one of the two English-language sites.
not_type_spelling_idx = df_test.query("host not in ['ell.stackexchange.com', 'english.stackexchange.com']").index

for col in df_sub.columns[1:]:
    if col in best_bin_dict:
        n_bins = best_bin_dict[col]
        # Snap predictions to n_bins equal-width bins labelled 0, 1/(n_bins-1), ..., 1.
        # [0] keeps the binned values; the bin edges returned by retbins are discarded.
        binned = pd.cut(df_sub[col].values, n_bins, retbins=True, labels=np.arange(n_bins)/(n_bins-1))[0]
        if col == "question_type_spelling":
            # Force question_type_spelling to 0 outside the English-language sites.
            binned[not_type_spelling_idx] = 0
        df_sub[col] = binned
df_sub.to_csv("submission.csv", index=False)
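To make the binning step easier to follow, here is a small self-contained sketch of what `pd.cut` does to one column of predictions. The helper name `snap_to_bins` and the toy values are made up for illustration. The competition metric is Spearman rank correlation, so collapsing nearly equal predictions into ties is presumably where the gain comes from.

```python
import numpy as np
import pandas as pd


def snap_to_bins(values, n_bins):
    """Replace each value by the label of its equal-width bin, scaled to [0, 1]."""
    labels = np.arange(n_bins) / (n_bins - 1)
    binned = pd.cut(np.asarray(values, dtype=float), n_bins, labels=labels)
    # pd.cut returns a Categorical; convert it back to plain floats
    return np.asarray(binned, dtype=float)


raw = np.array([0.10, 0.12, 0.50, 0.52, 0.90])
snapped = snap_to_bins(raw, n_bins=3)
print(snapped)  # nearby predictions now share one value
```

The bin edges are computed from the range of the predictions themselves, so the output always spans 0 to 1 regardless of how compressed the raw predictions are.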

What didn't work

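DistilBert
bert-large-cased, bert-large-uncased, bert-large-uncased-whole-masking
Category/host embeddings
L1Loss, L2Loss, Focal Loss (in the end I used BCELoss... but I don't understand why BCELoss works best, even though it is not symmetric in this case... ref => Understanding binary cross-entropy)
Auxiliary task: predicting the category.
Concatenating the CLS token output with the average pooling output.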
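The asymmetry mentioned in the BCELoss item can be checked numerically. The targets in this competition are soft labels between 0 and 1, and for such a target binary cross-entropy is still minimized when the prediction equals the target, but erring by the same amount in opposite directions costs different amounts. The function `bce` and the numbers below are an illustrative sketch, not the author's code.

```python
import math


def bce(p, y):
    """Binary cross-entropy of prediction p against a soft target y in [0, 1]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


y = 0.3
at_target = bce(0.3, y)  # prediction equal to the target
under = bce(0.2, y)      # prediction 0.1 below the target
over = bce(0.4, y)       # prediction 0.1 above the target
print(f"at target: {at_target:.4f}, under: {under:.4f}, over: {over:.4f}")
```

For a target of 0.3, predicting too low is penalized more than predicting too high by the same margin, unlike L1 or L2 loss, which treat both errors equally.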

Other

I spent about $200 on this competition.
