45th Place Solution, Learnings, Trials and What Could Have Been | 0.919/0.917 submitted too late :D

Author: Gaurav Rawat

Team members: Gaurav Rawat, Urvish, Ayaan Jang, Tony Mark Chris

Rank: 45th

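First of all, thanks to Kaggle for hosting this competition, which opened the door for us to experiment with large language models (LLMs) and to tackle problems considered solvable only by modern LLMs. Thanks to the community for generously sharing ideas and code; without them most of us could not have made this much progress. Thanks also to our team members @urvishp80 @ayaanjang @tonymarkchris, who joined along the way and helped out. We were lucky to settle our CV (cross-validation) setup in time, but we did not manage to submit in time the runs that scored 0.917/0.919 on the private LB. We did, however, pin down the former (0.917) in time, which is still great :)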

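Datasets

Training datasets

These are the datasets that worked best for us:

Chris's 60k dataset with context
context1 and context2 generated with @mbanaei's TF-IDF method, covering the 270k data and the Cohere data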

Validation

Thanks to @yeoyunsianggeremie for this dataset: https://www.kaggle.com/datasets/yeoyunsianggeremie/validation-500. It gave us a correlation between CV and LB.

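Retrieval/Inference

Openbook with MiniLM. We also tried mpnet-base-v2, which worked well, but it was not part of the final submission [0.834 LB score with openbook].
MB's TF-IDF technique on Wikipedia data from https://www.kaggle.com/code/nbroad/create-science-wikipedia-dataset - worked well
MB's TF-IDF technique on Cohere - worked well
We did not use MB's 270k dataset for retrieval at inference time
We found that the sort order in the MB notebook was reversed, i.e. ascending by score. A simple fix for this helped: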

context1 = f"{retrieved_articles[index][-1][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-4][2]}"
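The same ordering fix can be written as a small helper. This is only a sketch: it assumes each retrieved entry is a (doc_id, score, text) tuple sorted by ascending score, which matches the `[2]` text index and the negative indexing in the line above, but the tuple layout, the sample passages and the name `build_context` are hypothetical.

```python
def build_context(articles, top_k=4):
    """Join the top_k highest-scoring passages, best first.

    `articles` is assumed to be a list of (doc_id, score, text) tuples
    sorted by ascending score, so the best hits sit at the end.
    """
    best_first = articles[::-1][:top_k]
    return "\n".join(text for _, _, text in best_first)


# Hypothetical retrieval result for one question, ascending by score.
retrieved = [
    (7, 0.11, "passage D"),
    (3, 0.25, "passage C"),
    (9, 0.40, "passage B"),
    (1, 0.62, "passage A"),
]
context1 = build_context(retrieved)
```

Putting the strongest passage first matters when the tokenizer truncates the context: whatever is cut off is then the weakest evidence.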

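Training

CV 0.893 (on the 500 dataset): DeBERTa trained on the 60k data with context
CV 0.8963 (on the 500 dataset): DeBERTa trained on the context1 and context2 TF-IDF context data, combined by interleaving with Hugging Face datasets; see https://www.kaggle.com/code/gauravbrills/customdeberta-w-context-seq-mix/notebook for an example. Interleaved training seemed to help the model generalize a great deal
CV 0.896 (on the 500 dataset): DeBERTa trained only on context 1 generated via TF-IDF on the 60k
CV 0.836 (on the 500 dataset): DeBERTa trained on the 60k dataset with 15 layers frozen
CV 0.92: Llama, but trained on the 100 dataset; it did poorly on LB. Used PEFT for multiple-choice sequence classification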
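To illustrate what interleaving means here, below is a minimal pure-Python sketch. It assumes strict alternation that stops as soon as the shorter set runs out, which is how `interleave_datasets` in Hugging Face datasets behaves with its default arguments; the exact call used in the linked notebook is not reproduced here, and the row contents and the helper name are hypothetical.

```python
from itertools import chain


def interleave_first_exhausted(*datasets):
    """Alternate examples from each dataset; stop at the shortest one."""
    return list(chain.from_iterable(zip(*datasets)))


# Hypothetical stand-ins for the context1 and context2 training rows.
context1_rows = [{"src": "context1", "id": i} for i in range(3)]
context2_rows = [{"src": "context2", "id": i} for i in range(5)]

mixed = interleave_first_exhausted(context1_rows, context2_rows)
```

With this scheme every batch sees both context styles, rather than the model training on one style first and the other afterwards.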

[Figure: training plot of our best model (0.8963 CV) on the interleaved dataset; CV during training was 0.8917]

Parameters:

warmup_ratio=0.02,
learning_rate=5e-6,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
num_train_epochs=1,
FREEZE_LAYERS = 15
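How `FREEZE_LAYERS` is applied is not shown in this post. One common approach, sketched here as an assumption, is to turn off gradients for the embeddings and the first N encoder layers. The helper name `freeze_bottom_layers` is hypothetical, and whether the embeddings were frozen as well is a guess.

```python
def freeze_bottom_layers(embeddings, layers, num_frozen):
    """Disable gradients for the embeddings and the first num_frozen layers.

    Works on any objects exposing a PyTorch-style .parameters() iterator.
    Returns the number of parameter tensors that were frozen.
    """
    frozen = 0
    for module in [embeddings, *layers[:num_frozen]]:
        for param in module.parameters():
            param.requires_grad = False
            frozen += 1
    return frozen


# Assumed usage with a Hugging Face DeBERTa multiple-choice model:
# FREEZE_LAYERS = 15
# freeze_bottom_layers(model.deberta.embeddings,
#                      model.deberta.encoder.layer,
#                      FREEZE_LAYERS)
```

Freezing the lower layers cuts memory use and training time, which helps with a per-device batch size of 2, at the cost of less capacity to adapt.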

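Final Ensemble and Inference

Until we got the 500 dataset we struggled with ensembling. We simply mixed models with and without context, sometimes with Llama as well, but never felt confident about our submissions. Once we had the 500 dataset, we picked our top 3 models and ensembled them with a plain `average` and with `optuna`-tuned weights. https://www.kaggle.com/code/gauravbrills/t2-0-901-openbook-tfidf?scriptVersionId=146318816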
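The weighted ensemble can be sketched roughly as follows. This is not the code from the linked notebook: it assumes the metric is MAP@3, it takes per-model probability tables as nested lists, and it swaps the Optuna study for a plain exhaustive grid search to stay dependency-free. All function names are hypothetical.

```python
from itertools import product


def map_at_3(probs, labels):
    """Mean average precision at 3 for single-answer questions."""
    total = 0.0
    for row, label in zip(probs, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:3]
        if label in ranked:
            total += 1.0 / (ranked.index(label) + 1)
    return total / len(labels)


def blend(model_probs, weights):
    """Weighted average of per-model probability tables."""
    norm = sum(weights)
    num_questions = len(model_probs[0])
    num_choices = len(model_probs[0][0])
    return [
        [
            sum(w * m[q][c] for w, m in zip(weights, model_probs)) / norm
            for c in range(num_choices)
        ]
        for q in range(num_questions)
    ]


def search_weights(model_probs, labels, steps=5):
    """Exhaustive grid search over weights; a stand-in for an Optuna study."""
    grid = [i / (steps - 1) for i in range(steps)]
    best_score, best_weights = -1.0, None
    for weights in product(grid, repeat=len(model_probs)):
        if sum(weights) == 0:
            continue  # skip the all-zero combination
        score = map_at_3(blend(model_probs, weights), labels)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_score, best_weights
```

Passing equal weights to `blend` gives the plain average. Weights tuned on a 500-question validation set can overfit, so it is worth comparing the tuned blend against the plain average before trusting it.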

We were lucky to get some submissions in, but we ran out of time and missed the last few, which would have scored 0.917/0.919 on the LB.

CV	Private LB	Public LB	Description
0.9036	0.909	0.903	Weighted, slightly better, before deadline
0.9017	0.909	0.905	Selected before deadline
0.91	0.919297	0.917	Submitted too late 😃
0.908	0.917	0.912	Submitted too late 😃

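Things that did not work, or that might have worked

Rerankers: we tried several rerankers, including the ST marco family, but they did not help on the public LB, although they appear to have done well on the private LB. We also looked at colbert and https://huggingface.co/ibm/re2g-reranker-nq, which other teams tried as well, but for some reason we lost confidence in them.
Training on different contexts from openbook and TF-IDF: we also tried generating contexts to test the effect, but did not finish in time.
Dataset cleaning: we tried mwparserfromhell, LatexNodes2Text and Beautiful Soup, but only for fun; we did not have the patience to clean the wiki data from scratch. Some of the top teams seem to have explored and implemented this much better.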

from pylatexenc.latex2text import LatexNodes2Text
from bs4 import BeautifulSoup
import mwparserfromhell
#import wikitextparser as wtp

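def context_cleaner(text):
    # Converter from LaTeX markup to plain text
    l2t = LatexNodes2Text()
    #text = l2t.latex_to_text(text)
    # Parse the MediaWiki markup
    #text = wtp.parse(text)
    code = mwparserfromhell.parse(text)
    # Replace each top-level template with its plain-text rendering;
    # recursive=False avoids replacing a nested template whose parent is already gone
    for template in code.filter_templates(recursive=False):
        code.replace(template, l2t.latex_to_text(str(template)))
    text = str(code)
    # Drop section headings that carry no context
    text = text.replace("==References==", "").replace("==External links==", "")
    return text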

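FAISS with different embeddings, and plain cosine similarity via ST: FAISS indexes over different embeddings scored worse than TF-IDF, so we gave up on trying further embeddings even though FAISS was much faster. Cosine similarity was expensive to compute on the GPU, so we dropped it too, although we think its retrieval quality was good. All of this was tried with MiniLM; we did not get to BGE or e5 in time :(
LLMs: Urvish on our team got Llama working for multiple-choice sequence classification. But our inference took about 4-5 hours, so in the end we could not add it to the ensemble. We also could not improve Llama's CV score significantly.
@urvishp80's custom Llama model: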

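import torch.nn as nn
from transformers.modeling_outputs import SequenceClassifierOutput

class CustomLlamaModel(nn.Module):
    def __init__(self, backbone, num_labels, use_gradient_checkpointing=False):
        super().__init__()
        self.model = backbone
        # The KV cache is not needed for classification
        self.model.config.use_cache = False
        self.config = self.model.config
        self.num_labels = num_labels

        if use_gradient_checkpointing:
            self.model.gradient_checkpointing_enable()
        # MeanPooling is a helper defined outside this class
        self.pooler = MeanPooling()
        # Note: self.dense is defined but never used in forward()
        self.dense = nn.Linear(self.config.hidden_size, 1024)
        self.dropout = nn.Dropout(0.2)
        # One score per (question, choice) pair
        self.classifier = nn.Linear(self.config.hidden_size, 1, bias=False)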

    def forward(self, input_ids, attention_mask, token_type_ids=None, position_ids=None, head_mask=None,
                inputs_embeds=None, labels=None):
        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
        batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0]

        flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
        flat_attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
        flat_inputs_embeds = (
            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
            if inputs_embeds is not None
            else None
        )

        outputs = self.model.model(input_ids=flat_input_ids, attention_mask=flat_attention_mask, output_attentions=False,
            output_hidden_states=True)

        # hidden_states holds the output of every layer; take the last one
        last_hidden_state = outputs.hidden_states[-1]

        x = self.pooler(last_hidden_state, flat_attention_mask)
        x = self.dropout(x)
        logits = self.classifier(x)
        reshaped_logits = logits.view(batch_size, num_choices) 

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(reshaped_logits, labels)

        if self.model.config.output_attentions:
            attentions = outputs.attentions
        else:
            attentions = None

        if self.model.config.output_hidden_states:
            hidden_states = outputs.hidden_states
        else:
            hidden_states = None

        return SequenceClassifierOutput(
            loss=loss,
            logits=reshaped_logits,
            hidden_states=hidden_states,
            attentions=attentions,
        )
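`MeanPooling` is used by the model above but its definition is not part of this post. A common masked-mean implementation is sketched below as an assumption; it is not necessarily @urvishp80's actual code.

```python
import torch
import torch.nn as nn


class MeanPooling(nn.Module):
    """Average token embeddings, ignoring padded positions."""

    def forward(self, last_hidden_state, attention_mask):
        # (batch, seq_len) -> (batch, seq_len, 1), cast to the hidden dtype
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        # Sum only the embeddings of real (non-padding) tokens
        summed = (last_hidden_state * mask).sum(dim=1)
        # Clamp so a fully padded row cannot divide by zero
        counts = mask.sum(dim=1).clamp(min=1.0)
        return summed / counts
```

With a decoder-only backbone such as Llama there is no [CLS] token to read from, so averaging over the non-padding tokens is one simple way to get a fixed-size vector per choice.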

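In the end this was a great learning experience for all of us, and we still took away a lot from this competition that we can apply in the real world. I will update this post as I recall more of the things we tried, or maybe write a blog.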
