13th place solution: Ensemble context ensemble model

First of all, thanks to Kaggle for organizing this competition, and thanks to my teammates @natnitarach @pongtsu @kunato @yoyoismee 🔥🔥🔥🔥🔥

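We joined this competition in its first week. Along the way we dropped from the top 20 to nowhere about three times, and each time we managed to climb back. Every drop made us panic (like riding a roller coaster, LOL😅), but by the end of the competition we had secured a gold medal 🥇🎉.
Our solution ensembles contexts from two Wikipedia versions plus a 270K dataset, and ensembles DeBERTaV3 large models by taking the maximum probability.
I will describe it in two parts, retrieval and modeling. Retrieval had a large impact on our score.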

Retrieval
For retrieval, we embedded two Wikipedia sources with "bge-small-en" and built a Faiss FlatL2 index over the resulting vectors.

Wikipedia cohere 35M -> 35M vectors
Wikipedia 2023 -> split into chunks of 1000 characters each -> 21M vectors

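plus the 270K dataset
In total we have 35M + 21M = 56M vectors to search, which is huge! We split the large index into smaller indexes and merge the results at the end, a technique we call "Faiss Batch" (each Wikipedia version has about 6 indexes of 10GB each). The retrieval part alone therefore takes more than 100GB.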
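The 1000-character chunking used for the Wikipedia 2023 source can be sketched as follows. This is a minimal illustration, not the team's code: the function name `chunk_by_chars` is hypothetical, and it assumes consecutive, non-overlapping chunks, which the post does not state.

```python
def chunk_by_chars(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive blocks of at most chunk_size characters.

    Assumption: chunks do not overlap; the original pipeline may differ.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


article = "a" * 2500  # stand-in for the text of one Wikipedia article
chunks = chunk_by_chars(article)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Each chunk would then be embedded (here with "bge-small-en") and added to one of the index shards.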

At inference time, following the steps in the figure above, the process is:

For the Wikipedia datasets, we run a Faiss search to fetch roughly the top 100 articles (6 indexes × 15 neighbors).
Rerank with TF-IDF. In our experiments, TF-IDF reranking worked better than bge-reranker.
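To make the reranking step concrete, here is a small pure-Python sketch of TF-IDF reranking by cosine similarity. It is an illustration under assumptions, not the team's implementation: the function names, the tokenizer, the smoothed IDF formula, and fitting the IDF on the retrieved candidates only are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rerank(query, passages, top_n=3):
    """Return indices of the top_n passages by TF-IDF cosine similarity to the query."""
    docs = [Counter(tokenize(p)) for p in passages]
    n_docs = len(docs)
    # Smoothed inverse document frequency, computed over the candidates only.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(counts):
        # Terms unseen in the candidates are dropped; vectors are L2-normalized.
        vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    q_vec = vectorize(Counter(tokenize(query)))
    scores = []
    for doc in docs:
        d_vec = vectorize(doc)
        scores.append(sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items()))
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order[:top_n]


passages = [
    "The mitochondrion is the powerhouse of the cell.",
    "Paris is the capital of France.",
    "Cells contain organelles such as the mitochondrion and the nucleus.",
]
query = "Which organelle is the powerhouse of the cell?"
best = tfidf_rerank(query, passages, top_n=2)
print(best[0])  # 0, the passage sharing the most distinctive terms with the query
```

In practice the candidates would be the roughly 100 passages returned by the Faiss search, and only the highest-scoring ones would be passed on as context.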

Here is sample code for what we call Faiss Batch

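import ctypes
import gc
import os

import faiss
import numpy as np
import torch
from datasets import Dataset, load_from_disk

# `test` (a pandas DataFrame) and `query_vector` (the query embeddings)
# are defined earlier in the notebook.
# Assumption: libc is loaded this way; it is only used for malloc_trim below.
libc = ctypes.CDLL("libc.so.6")

ds = load_from_disk('/kaggle/input/rms1data')
test = Dataset.from_pandas(test)

dir_path = '/kaggle/input/allyouneedret'
index_list = sorted(os.listdir(dir_path))
print(index_list)

k = 10     # neighbors retrieved from each index shard
total = 0  # number of vectors in the shards processed so far
distance_list = []
indices_list = []
res = faiss.StandardGpuResources()
for index_name in index_list:
    index_path = os.path.join(dir_path, index_name)
    print(f"read index {index_path}")
    # Load one shard from disk and move it to GPU 0 for the search.
    index = faiss.read_index(index_path)
    index = faiss.index_cpu_to_gpu(res, 0, index)
    distances, indices = index.search(query_vector, k)
    # Shift shard-local ids by the number of vectors seen so far
    # so that they become ids into the full collection.
    indices_list.append(indices + total)
    distance_list.append(distances)
    total += index.ntotal
    print(total)
    # Release the shard before loading the next one.
    del index
    _ = gc.collect()
    libc.malloc_trim(0)
    torch.cuda.empty_cache()

# One row per query, k results per shard laid out side by side.
concatenated_indices = np.concatenate(indices_list, axis=1)
concatenated_distances = np.concatenate(distance_list, axis=1)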

The code is such a mess 😂
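The concatenated arrays hold k results per shard, not ordered across shards. If a single globally ranked list per query is wanted, one way to merge them is sketched below. The helper `merge_shard_results` is hypothetical and not part of the original code; in the pipeline described here, all candidates may instead simply be passed to the TF-IDF reranker.

```python
import numpy as np


def merge_shard_results(distances, indices, k):
    """Keep the k smallest-distance hits per query from concatenated shard results."""
    # L2 distances: smaller is better, so sort each row in ascending order.
    order = np.argsort(distances, axis=1)[:, :k]
    return (
        np.take_along_axis(distances, order, axis=1),
        np.take_along_axis(indices, order, axis=1),
    )


# Toy example: one query, two shards, 2 neighbors per shard.
concatenated_distances = np.array([[0.10, 0.40, 0.05, 0.30]])
concatenated_indices = np.array([[3, 7, 12, 15]])
top_d, top_i = merge_shard_results(concatenated_distances, concatenated_indices, k=3)
print(top_i.tolist())  # [[12, 3, 15]]
```

Because every shard is searched exactly with a flat index, the merged top-k matches what a single index over all vectors would return, as long as the requested k does not exceed the per-shard k.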

For the 270K dataset, we followed the TF-IDF approach from the discussion thread.

In the end we get 3 contexts, and each model predicts on each of the 3 contexts separately.
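The max-probability ensembling over these per-context predictions can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the probabilities are made up, and the five options A to E with a top-3 answer string are assumptions about the competition format rather than details given in this post.

```python
import numpy as np


def max_prob_ensemble(probs):
    """Combine predictions by taking the element-wise maximum probability.

    probs has shape (n_predictions, n_questions, n_options), with one slice
    per (model, context) pair.
    """
    return probs.max(axis=0)


def top3_answers(ensembled, options="ABCDE"):
    """Return the three highest-scoring option letters for each question."""
    order = np.argsort(-ensembled, axis=1)[:, :3]
    return [" ".join(options[j] for j in row) for row in order]


# One question, one model, three contexts (illustrative numbers only).
probs = np.array([
    [[0.10, 0.60, 0.10, 0.10, 0.10]],  # prediction with context 1
    [[0.05, 0.15, 0.70, 0.05, 0.05]],  # prediction with context 2
    [[0.30, 0.25, 0.20, 0.15, 0.10]],  # prediction with context 3
])
ensembled = max_prob_ensemble(probs)
print(top3_answers(ensembled))  # ['C B A']
```

Taking the maximum rather than the mean lets a single context that strongly supports an option dominate, which fits a setting where only some retrieved contexts contain the relevant passage.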

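Model
For modeling, we used DeBERTaV3 large models and ensembled them.
We followed the same training pipeline as Chris (How To Train Open Book Model - Part 1), unfroze the embedding layer, and used different maximum token lengths from 512 to 768. The training data consists of the 60K dataset with context plus about 5,000 question-answer pairs that we generated with chatGPT3.5.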

What didn't work

Compressing the index (IVF, PQ, etc.) reduced accuracy.
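We tried combining Platypus2-70B-instruct with our retrieval, but the runtime was too long (more than 9 hours at submission), so it was not feasible.
Applying PEFT to DeBERTaV3 did not work as expected.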

Improvements

Find a better source of Wikipedia text.
Improve the performance of our models.

GitHub code
https://github.com/nat-nischw/kaggel-llm-science-exam-2023
