
4th Place: ST5 Tokenizer Attack!


Start: 2024-02-27 | End: 2024-04-16 | AIGC & Multimodal | Data Algorithm Competition

Author: Yusef A.  |  Rank: 4th Place  |  Competition: LLM Prompt Recovery  |  Published: 2024-04-17

Foreword

Hi everyone,

I'm thrilled to take 4th place and my first (!) solo gold medal. This was a really fun competition. I know the scoring metric may spark some (perhaps justified) debate, but for me it made the competition all the more interesting.

TL;DR

  • lucrarea
  • A mean prompt scoring 0.69 + Mistral 7B, using a simple response_prefix = "Modify this text by"

Core Idea

This golden word (Romanian for "the work") can stand in for almost any word such as "text" or "work". My guess is that the original T5 tokenizer was built on translation tasks that included German and Romanian, so vocabulary from both languages ended up in the original tokenizer and was later carried over into ST5.

Another thing people noticed is that the PyTorch and Keras (tf.keras) implementations of ST5 differ slightly. In particular, the Keras version lacks the sentinel tokens (<extra_id_X>) and has a maximum sequence length of only 128, which really matters in practice. The first time I wrote a prompt I expected to score well, my score dropped instead, and this was the reason. You can verify it with the following code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# model: PyTorch SentenceTransformer ST5; encoder: the Keras/TF ST5 encoder
# scores: {prompt: leaderboard score}; df_embeddings: dataset embeddings
for short_str, v in scores.items():
    torch_embeddings = model.encode(
        short_str, show_progress_bar=False, normalize_embeddings=True
    )
    keras_embeddings = encoder([short_str])
    x = keras_embeddings[0].numpy().reshape(1, -1)  # 2-D for cosine_similarity
    y = torch_embeddings.reshape(1, -1)

    # How closely the two implementations agree on this prompt
    keras_torch = cosine_similarity(x, y)[0]
    # Sharpened cosine similarity of each implementation against the dataset
    torch_score = np.abs(cosine_similarity(y, df_embeddings) ** 3).mean(axis=1)[0]
    keras_score = np.abs(cosine_similarity(x, df_embeddings) ** 3).mean(axis=1)[0]
    delta = v - torch_score

    # Print each row with the specified column width and alignment
    print(
        f"{v:<14} | {keras_torch[0]:<6.4f} | {torch_score:<10.4f} | {keras_score:<10.4f} | {delta:<10.4f}"
    )

My Attack

I used a similar (but slightly different) approach to attack the task. In fact, I hit 0.69 several times using only a mean prompt. This is my best one, and the one I ultimately submitted:

"""▁summarize▁this▁Save▁story▁sentence▁into▁simply▁alterISH▁textPotrivit▁vibe".▁Make▁it▁crystalnier▁essence▁Promote▁any▁emotional-growthfulness▁găsi▁casual/bod▁language▁serious'▁bingo▁peut▁brainstorm▁perhaps▁simply▁saying▁Dyna▁aimplinations▁note▁detailedhawkeklagte▁acest▁piece▁has▁movement▁AND▁OK▁aceasta▁puiss▁ReinIR▁when▁sendmepresenting▁cet▁today▁Th▁aprecia▁USABLE▁prote,lineAMA.▁Respondebenfalls▁behalf▁thenfeel▁mid▁Gov▁Th▁empABLE▁according▁(▁Packaging▁tone▁send▁pelucrarea▁aim▁thereof▁speechelllucrarea▁preferfully].▁Making▁or▁exertloweringlucrarealucrarealucrarealucrarealucrarea."""

How I Got This Prompt

First I studied the T5 tokenizer. It is a SentencePiece tokenizer that splits text into subwords, and it can be accessed like this:

from sentence_transformers import SentenceTransformer

st = SentenceTransformer('sentence-transformers/sentence-t5-base')
tokenizer = st.tokenizer
vocab = tokenizer.vocab
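From that vocabulary, one way to build a candidate word list for the search below is to keep only the word-initial pieces, i.e. those prefixed with the SentencePiece marker "▁". This is a hypothetical sketch shown on a toy dict, since the real `vocab` maps ~32k pieces to ids:

```python
# Toy stand-in for the T5 vocab dict (piece -> id); word-initial
# SentencePiece pieces start with "▁" and become space-prefixed words.
vocab = {"▁summarize": 21603, "▁text": 1499, "ISH": 20638, "▁lucrarea": 30000}
candidate_words = [piece.replace("▁", " ") for piece in vocab if piece.startswith("▁")]
print(candidate_words)  # [' summarize', ' text', ' lucrarea']
```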

To produce a good prompt, I did the following:

Building a Candidate Set That Matches the Public/Private Sets

Based on leaderboard feedback, I built a candidate set of roughly 1000 entries and validated it with code along these lines:

scores = {
    """▁summarize▁this▁Save▁etc▁sentence▁into▁simply▁alterISH▁text▁structure▁vibe".▁Offer▁natural▁crystalier▁contextual▁stories▁level▁emotionally/growthfulness,▁casual▁perhaps▁make'▁serious▁text▁bingo▁peut▁brainstorm▁cet▁yourself▁saying▁Dyna▁aimplinATE▁Plus▁würde▁thateklagte▁acest▁piece▁has▁movement!!!!Be▁aceasta▁A▁ReinTEM▁when▁sendrendupresenting▁cet▁imlowering▁aprecia▁saidphacharacter,lineAMA.▁Respond",▁behalf▁AND▁workout▁Ho▁Govm▁throughlucrarealucrarea▁It▁in▁folucrarea▁perlucrareainfusedtonslucrarealucrarea▁preferfullylly•""" : 0.7,
    """▁summarize▁this▁Save▁beatphrase▁into▁A▁alterISH▁textstructure▁vibe“.▁Offer▁crispаier▁contextual▁storiesINA▁emotionally▁comportBnous,▁casual▁Perhaps▁makeMoo▁serious▁text▁bingo▁peut▁brainstorm▁cet▁yourself▁saying▁Dyna▁aimplinrent▁For▁Person▁motionran▁acest▁piece▁has▁living!!!!!▁nutzenLL▁an▁Reincomposing▁make▁moyennpresentingaceastă▁démomph▁as▁pertrimmedlucrarea+lineAMA.▁Respond▁thereof▁behalf▁FROM▁avecallow▁GovOTHPlucrarearage▁it▁Falucrareaplucrareapedcullucrarealucrarea▁preferfully""" : 0.69,
    'summarize this Save/4phraseTM So Alterlate text shaping vibe? Offer slightly poeticibility Utilis stories continuing emotions REelemente it WITH casual Itslucrarea serious text bingo- brainstormDr yourself saying Dyna aimplindated Charakter würden appreciates dial THIS piece! Mission demonstrate Example TO cet ReinEPA make compuslucrareapresentinglucrarealucrarealucrarea as...... InlucrarealucrarealucrareaAMA. Respond thereof behalf....' : 0.666,
    "scrisese lucrarea rele provoace it lucrarea ideile alter this text jazz. caractere lucrarea dialog luand usuce someone make readucem sentinţă lucrarea. twist it story lucrarea more slogan material how rele this. dresat casual pentr lucrarea body echolls text would channel scena. revere umm modalitatea fr datat bingo me elaborate mission give. lucrarea ss dramatic wise refaci acesta body it tone would best posibil celui text transferate it poem together. slide etc lock lucrarea text yourself wise nanny" : 0.66,
    'summarize lucrarea inspired material somehow tweak lucrarea dialogue lucrarea convey this text appropriately caracter . goal would lucrarea experiencing it make consciously reprise prompt ]. creat tone text lucrarea . Example prospective ]. lucrarea übertragen ell it . celui text body rated saying s / strip . Ideas găsi how Enhanc Casual intended genre Send this Ainsi . symbolic eklagte writing aceasta loaded angle emulate text ! distilled More please slide above lucrarea ]. Bingo . . consideră breathing shaping text form . Anyone ABLE HOME т THER Strat aims Acesta .' : 0.66,
    'Textual improve bangor this text expressing way act ot somehow uss rh ve way piece make res ezine und legs aud item' : 0.63,
    'Improve the following text using the writing style of, maintaining the original meaning but altering the tone, diction, and stylistic elements to match the new style.' : 0.60,
    'Rewrite the text to reflect existing themes, provide a concise and engaging narration, and improvise passages to enhance its prose.' : 0.56
}

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def fitness(sample, scores):
    # How far the sharpened cosine similarities against `sample` deviate
    # from the known leaderboard scores
    score_losses = np.array(list(scores.values()))
    sims = np.abs(cosine_similarity(st.encode(list(scores.keys()), normalize_embeddings=True), sample)**3).mean(axis=-1)
    return np.abs(sims - score_losses).sum()

def find_best_sample(A, scores, sample_size=100, iterations=500):
    best_sample = None
    best_loss = float('inf')
    best_idx = None

    for _ in range(iterations):
        # Randomly select a subset of A
        sample_indices = np.random.choice(len(A), sample_size, replace=True)
        sample = A[sample_indices]

        # Calculate the loss for the current sample using the fitness function
        current_loss = fitness(sample, scores)

        # Update the best sample if the current one has a lower loss
        if current_loss < best_loss:
            best_loss = current_loss
            best_sample = sample
            best_idx = sample_indices

    return best_sample, best_loss, best_idx
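The same random-subset search idea can be seen on synthetic data. This is a toy sketch with a made-up fitness function, not the competition setup:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 8))   # stand-in for candidate embeddings
target = rng.normal(size=(8,))

def toy_fitness(sample):
    # Lower is better: distance between the subset mean and a target vector
    return np.linalg.norm(sample.mean(axis=0) - target)

best_loss, best_idx = float("inf"), None
for _ in range(500):
    idx = rng.choice(len(A), size=100, replace=True)
    loss = toy_fitness(A[idx])
    if loss < best_loss:
        best_loss, best_idx = loss, idx

print(f"best loss: {best_loss:.4f}")
```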

This gave me a candidate set whose score distribution tracks the real one more closely, which let me validate different prompts far more reliably.

A Greedy Word-by-Word Search for the Best Prompt

Next, I generated the final prompt by greedily appending one word at a time:

best_sentence = ""
while True:
    all_words = complete_all_words
    best_similarity = (np.abs(cosine_similarity(st.encode(best_sentence).reshape(1,-1), embeddings))**3).mean()

    if ADDED_NEW_WORD:
        print(f"Current Similarity: {best_similarity}")
        new_sentences = [best_sentence + word for word in complete_all_words]
        similarity_scores = (np.abs(cosine_similarity(st.encode(new_sentences, normalize_embeddings=False, show_progress_bar=False, batch_size=2048), embeddings))**3).mean(axis=1)
        
        max_index = np.argmax(similarity_scores)
        if similarity_scores[max_index] > best_similarity:
            best_similarity = similarity_scores[max_index]
            best_sentence = new_sentences[max_index]
            print(f"New Similarity: {best_similarity}\n{best_sentence}")
            ADDED_NEW_WORD = True
            all_words = list(np.array(complete_all_words)[np.argsort(best_similarity)[::-1]])
        else:
            print(f"No new words")
            ADDED_NEW_WORD = False

Essentially I was searching for the next word that raises the mean sharpened cosine similarity (the competition metric, mean |cosine similarity|³) across the whole dataset. Because every step runs through a sentence embedding model, this greedy search is somewhat tedious and slow; I ran the whole thing on a single P100 GPU.
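For reference, the quantity being maximized can be sketched in a few lines. This is my own minimal restatement of the scoring, assuming unit-normalized embeddings:

```python
import numpy as np

def sharpened_cosine_similarity(pred, targets):
    """Mean |cosine similarity|^3 between one predicted-prompt embedding
    and each true-prompt embedding in `targets` (one row per example)."""
    pred = pred / np.linalg.norm(pred)
    targets = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return np.mean(np.abs(targets @ pred) ** 3)

# A perfect match scores 1.0; an orthogonal prediction scores 0.0.
targets = np.array([[1.0, 0.0], [1.0, 0.0]])
print(sharpened_cosine_similarity(np.array([1.0, 0.0]), targets))  # 1.0
print(sharpened_cosine_similarity(np.array([0.0, 1.0]), targets))  # 0.0
```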

Model and Post-Processing

My actual token count was about 95, which left room for a few extra words, so I post-processed the prompt with Mistral 7B to boost the score. I also tried Gemma 1.1 (which was impressively good), but Mistral edged it out on my validation set, so I went with Mistral.
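A rough sketch of how the pieces might fit together; `build_final_prediction` and its exact format are my own hypothetical reconstruction, not the author's code. The LLM is forced to begin its answer with the `response_prefix`, and its short continuation is concatenated with the high-scoring mean prompt:

```python
RESPONSE_PREFIX = "Modify this text by"

def build_final_prediction(model_continuation: str, mean_prompt: str) -> str:
    # Hypothetical assembly: prefix-forced LLM continuation + the 0.69 mean prompt
    return f"{RESPONSE_PREFIX} {model_continuation.strip()} {mean_prompt}"

print(build_final_prediction("adding a playful tone.", "lucrarea lucrarea"))
# -> "Modify this text by adding a playful tone. lucrarea lucrarea"
```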

What Didn't Work

  • LoRA: low rank (2-4) worked best; anything higher overfit easily.
  • Embedding prediction + sentence-embedding inversion (paper / GitHub): performed reasonably well when paired with a fine-tuned LongT5 as the attack model.
  • Direct embedding prediction: an MLP with attention that takes the ST5 embeddings of the original and rewritten texts as input and predicts the target embedding. Token-level fine-tuning of my mean prompt toward the predicted embedding afterwards also helped a bit.
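The last bullet could look something like the following minimal sketch. The dimensions, attention layout, and all names here are my own assumptions, not the author's model:

```python
import torch
import torch.nn as nn

class EmbeddingPredictor(nn.Module):
    """Hypothetical head: given ST5 embeddings of the original and rewritten
    texts, predict the embedding of the prompt that produced the rewrite."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, original_emb, rewritten_emb):
        # Treat the two embeddings as a length-2 sequence and attend over it
        seq = torch.stack([original_emb, rewritten_emb], dim=1)  # (B, 2, D)
        attended, _ = self.attn(seq, seq, seq)
        pred = self.mlp(attended.mean(dim=1))
        return nn.functional.normalize(pred, dim=-1)  # unit-norm like ST5

model = EmbeddingPredictor()
out = model(torch.randn(4, 768), torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```

In practice such a head would be trained with a cosine loss against the true prompt embeddings.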

I hope you enjoyed this lucrarea.
