603. LLM Prompt Recovery | llm-prompt-recovery
First of all, I would like to thank the hosts for organizing this great competition, and my teammate @andreivanenko for the excellent collaboration.
TL;DR: our solution is built around a mean prompt, slightly refined with Mistral 7B.
Initially, we trained a mean prompt on a public prompt dataset (https://www.kaggle.com/datasets/what5up/concat-prompts). The core idea of the training is to add one new word at a time to the current prompt so that it becomes more similar to the other prompts in the training set. The vocabulary consisted of all words that appear in the training prompts.
Training of the mean prompt started from the phrase "Rewrite this text" and used the following approach:
Through experiments we found that it works best to start generating the mean prompt with beam search and, once its accuracy stops improving, to switch to insertion and pruning. In addition, while probing the leaderboard we found that prompts longer than 128 tokens score around 0.41, so further pruning was necessary. Using all of the methods above, the score reached 0.68.
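As a rough illustration of the insertion and pruning steps (not our exact code; the real metric was embedding-based, while the toy `score` below just measures word overlap with the target prompts), a greedy version might look like this:

```python
def score(prompt, targets):
    # Toy stand-in for the real similarity metric: average word overlap
    # between the candidate prompt and each target prompt.
    words = set(prompt.lower().split())
    total = 0.0
    for t in targets:
        t_words = set(t.lower().split())
        total += len(words & t_words) / len(t_words)
    return total / len(targets)

def insert_step(prompt, vocab, targets):
    # Try inserting every vocabulary word at every position; keep the best gain.
    tokens = prompt.split()
    best, best_s = prompt, score(prompt, targets)
    for w in vocab:
        for i in range(len(tokens) + 1):
            cand = " ".join(tokens[:i] + [w] + tokens[i:])
            s = score(cand, targets)
            if s > best_s:
                best, best_s = cand, s
    return best

def prune_step(prompt, targets, max_tokens=128):
    # Repeatedly drop the token whose removal hurts the score least
    # until the prompt fits the 128-token budget.
    tokens = prompt.split()
    while len(tokens) > max_tokens:
        cands = [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]
        tokens = max(cands, key=lambda c: score(c, targets)).split()
    return " ".join(tokens)

# Vocabulary = all words from the training prompts, as described above.
targets = ["Rewrite this text in a formal style", "Improve this text please"]
vocab = sorted({w for t in targets for w in t.lower().split()})
improved = insert_step("Rewrite this text", vocab, targets)
```

One insertion pass is repeated until the score stops improving; pruning then enforces the 128-token limit observed from leaderboard probing.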
As the mean prompt improved on local data, we noticed a growing gap between local scores and the public leaderboard scores. We decided to reduce this bias by regenerating the prompts so that the local data better reflected actual leaderboard behavior.
To do this, we collected all the mean prompts we had previously submitted to the leaderboard into a list:
scores = [
    # 70 more mean prompts
    [0.60, "Improve the text to this."],
    [0.59, "rewrite this text tothepoint humanoid about around towards takes accompanying"],
    [0.59, "rewrite this text conveying human ensue somehow portrayal one further"],
    [0.59, "rewrite it make thee o e be how this described the text ideas lie plane ultimate"],
    [0.58, "Improve the text to this. Rewrite this text using this style."],
    [0.56, "Improve essay to this. rewrite this text using this style."],
]
To compare the local dataset with the leaderboard, we first evaluated our mean prompts on that dataset to obtain the corresponding scores, then compared those scores against the leaderboard scores using cosine similarity. After that, we started building our own data: on each iteration we sample 10 prompts at random and add them to the dataset if they improve the evaluation metric.
import random

from sklearn.metrics.pairwise import cosine_similarity

best_prompts_ids = []
best_score = 0
iteration = 0
while True:  # stopped manually once the metric plateaus
    sample_ids = random.sample(range(len(prompts_embs)), 10)
    candidate_prompts_ids = best_prompts_ids + sample_ids
    # Calculate the sharpened cosine similarity between the embeddings of the
    # mean prompts and the prompts from the generated dataset
    candidate_scores = (cosine_similarity(lb_prompts_embs, prompts_embs[candidate_prompts_ids, :]) ** 3).mean(axis=1).reshape(1, -1)
    # Compute our dataset evaluation metric, which is the cosine similarity
    # between the scores of our generated dataset and the LB scores
    cos_score = cosine_similarity(candidate_scores, lb_scores)[0][0]
    if cos_score > best_score:
        best_score = cos_score
        best_prompts_ids = candidate_prompts_ids
        print(f"Iteration: {iteration}, cos_similarity: {cos_score:.6f}")
    iteration += 1
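The two similarity computations in the loop can be illustrated in isolation. Below is a minimal sketch with random embeddings standing in for `prompts_embs` and `lb_prompts_embs` (the dimensions and the number of leaderboard prompts here are arbitrary assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(A, B):
    # Pairwise cosine similarity with sklearn-compatible shape: (len(A), len(B)).
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

lb_prompts_embs = rng.normal(size=(6, 32))    # embeddings of 6 probed mean prompts
prompts_embs = rng.normal(size=(100, 32))     # embeddings of candidate dataset prompts
lb_scores = np.array([[0.60, 0.59, 0.59, 0.59, 0.58, 0.56]])  # their LB scores

candidate_ids = list(range(10))
# Sharpening (cubing) the cosine similarity suppresses weak matches,
# so the score is dominated by close paraphrases.
candidate_scores = (cosine_similarity(lb_prompts_embs, prompts_embs[candidate_ids, :]) ** 3).mean(axis=1).reshape(1, -1)
cos_score = cosine_similarity(candidate_scores, lb_scores)[0][0]
```

The closer `cos_score` is to 1, the better the candidate subset reproduces the relative ordering of our probed leaderboard scores.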
In this way we created a new local training set, which improved the public leaderboard score from 0.68 to 0.70. The best mean prompt obtained this way is shown below:
rewrite also key essence since its cry thine that had then expressed in improve the underlying paragraphs from it more directly but with either в similar descriptive desired statement to best how you described such text it is da ultimate involves that an human maintains retell animistic this newly eventual presented than classic adult manner due please my would just fashion the following as follows device ss plea chefs poe us da formal piece while edit out any non grand local warera band gospel virtual salt park industry flair useless question oath sherlock taker for page transcript get four empowerment discuss name and height so frame it
In our experiments we found that appending LLM output to the mean prompt slightly improves accuracy both on the local dataset and on the leaderboard. We used Mistral 7B to generate prompts starting with "Improve this text by.", and appending the generated text to our best mean prompt gave an improvement of about 0.005.
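A sketch of this last step is below. The append logic itself is trivial; the Mistral call is shown in comments because it requires downloading the model, and the checkpoint name and generation settings there are assumptions, not our recorded configuration:

```python
def extend_mean_prompt(mean_prompt, seed, generated):
    # Append the seed phrase and the model's continuation to the mean prompt.
    return f"{mean_prompt} {seed} {generated}".strip()

# The continuation would come from Mistral 7B, e.g. (hypothetical settings):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# ids = tok("Improve this text by.", return_tensors="pt").input_ids
# out = model.generate(ids, max_new_tokens=32)
# generated = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

final_prompt = extend_mean_prompt(
    "rewrite also key essence since its cry thine",  # truncated best mean prompt
    "Improve this text by.",
    "adding clarity and a more natural flow",        # example continuation
)
```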