42nd place: a very simple solution

Author: Pavel Kazlou
Published: 2024-07-25
Competition rank: 42

The idea behind my solution is very simple: use document frequency to pick the most important tokens, then join the 25 most important ones with the "OR" operator.

Document frequency for CPC codes is easy and cheap to compute, so I did it inline in the submission code:

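import polars as pl

# Explode the per-patent code lists and count how many patents carry each code.
code_freq = pl.read_parquet('/kaggle/input/uspto-explainable-ai/patent_metadata.parquet')['cpc_codes'].explode().value_counts().sort('count', descending=True)

# Turn the frequency table into a lookup keyed by CPC code.
code_freq = code_freq.rows_by_key(key=["cpc_codes"], unique=True)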
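
To make the counting step concrete, here is a minimal pure-Python sketch of document frequency on made-up data. The patent IDs, the codes, and the helper name `document_frequency` are hypothetical and only illustrate the idea; this is not the submission code.

```python
from collections import Counter

# Hypothetical toy metadata: each patent lists its CPC codes.
patents = {
    "US-1": ["A61K31/00", "C07D213/00"],
    "US-2": ["A61K31/00"],
    "US-3": ["H04L9/00", "A61K31/00"],
}

def document_frequency(docs):
    """Count, for every code, the number of documents that contain it."""
    df = Counter()
    for codes in docs.values():
        # set() ensures a code is counted once per document.
        df.update(set(codes))
    return dict(df)

toy_code_freq = document_frequency(patents)
print(toy_code_freq)
```

Because a code appears at most once per patent, counting exploded codes and counting documents give the same number.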

Titles are a bit heavier to process, but still not worth a separate notebook:

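import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# binary=True counts each word at most once per title, so column sums are document frequencies.
# Integer max_df/min_df are absolute document counts: keep words found in 10 to 10000 titles.
vectorizer = CountVectorizer(max_df=10000, min_df=10, binary=True)

titles_df = pl.scan_parquet('/kaggle/input/uspto-explainable-ai/patent_data/*')\
        .select(['publication_number', 'title'])\
        .collect()

word_title_freq = vectorizer.fit_transform(titles_df['title'])

# Sum each vocabulary column to get one document frequency per word.
word_title_freq = dict(
    zip(
        vectorizer.get_feature_names_out(),
        np.squeeze(np.asarray(word_title_freq.sum(axis=0)))
    )
)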
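
What the vectorizer computes can be sketched without scikit-learn. The snippet below is a rough stand-in on invented titles: it lowercases, tokenizes with the same pattern as the `CountVectorizer` default, counts each word once per title, and prunes by absolute document counts. The function name `title_document_frequency` and the sample titles are assumptions for illustration.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default: words of two or more characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def title_document_frequency(titles, min_df=1, max_df=None):
    """Return {word: number of titles containing it}, pruned by min_df/max_df."""
    df = Counter()
    for title in titles:
        # set() mirrors binary=True: a word counts once per title.
        df.update(set(TOKEN.findall(title.lower())))
    upper = len(titles) if max_df is None else max_df
    return {word: n for word, n in df.items() if min_df <= n <= upper}

toy_titles = [
    "Method for wireless data transmission",
    "Wireless charging method",
    "Data storage device and method",
]
toy_title_freq = title_document_frequency(toy_titles, min_df=2)
print(toy_title_freq)
```

With `min_df=2`, words seen in only one title are dropped, which is the same kind of pruning `min_df=10` performs on the full corpus.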

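With these global document frequencies for CPC codes and title words in hand, I simply computed local document frequencies over the neighbor patents and scored each token's importance as local_document_frequency / global_document_frequency: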

from collections import Counter

# `row` and `analyzer` come from the surrounding notebook code, which is not shown.
flat_list_codes = [code for codes in row['cpc_codes'] for code in codes]
codes_counter = Counter(flat_list_codes)
# Codes missing from the global table fall back to a frequency of 1.
weighted_codes = [(f'cpc:{elem}', count / code_freq.get(elem, 1)) for (elem, count) in codes_counter.items()]

titles = [title for title in row['title'] if title]
flat_list_title_words = [token.text for title in titles for token in analyzer(title)]
title_words_counter = Counter(flat_list_title_words)
# Words outside the vocabulary fall back to 10000, the max_df value.
weighted_title_words = [(f'ti:{elem}', count / word_title_freq.get(elem, 10000)) for (elem, count) in title_words_counter.items()]
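
A small worked example of the ratio, using invented numbers: a code that is rare globally but shared by most neighbors scores far higher than a code that is common everywhere. The helper `weigh_tokens` and all frequencies below are hypothetical.

```python
from collections import Counter

def weigh_tokens(local_tokens, global_df, prefix, default_df):
    """Score each token as local count divided by global document frequency."""
    local_df = Counter(local_tokens)
    return [
        (f"{prefix}:{token}", count / global_df.get(token, default_df))
        for token, count in local_df.items()
    ]

# Hypothetical global document frequencies.
global_code_freq = {"A61K31/00": 5000, "C07D213/00": 50}

# Hypothetical CPC codes of three neighbor patents.
neighbor_codes = [
    ["A61K31/00", "C07D213/00"],
    ["A61K31/00", "C07D213/00"],
    ["A61K31/00"],
]
flat_codes = [code for codes in neighbor_codes for code in codes]

toy_weights = dict(weigh_tokens(flat_codes, global_code_freq, "cpc", 1))
print(toy_weights)
```

Here the common code scores 3 / 5000 and the rare one 2 / 50, so the rare code ranks first even though it appears in fewer neighbors.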

Then just sort all tokens by importance and combine them with OR:

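selected_operands = sorted(
    weighted_codes + weighted_title_words,
    key=lambda x: x[1],
    reverse=True)[:25]

# Each operand is a (token, weight) pair; only the token text goes into the query.
return ' OR '.join(token for token, _ in selected_operands)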
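
The ranking and joining step can be tried in isolation. The sketch below wraps it in a hypothetical `build_query` function and runs it on made-up weights with a smaller cutoff, so the effect of truncation is visible.

```python
def build_query(weighted_tokens, top_k=25):
    """Keep the top_k highest-weighted tokens and join them with OR."""
    ranked = sorted(weighted_tokens, key=lambda pair: pair[1], reverse=True)
    return " OR ".join(token for token, _ in ranked[:top_k])

# Hypothetical (token, weight) pairs.
toy_weighted = [
    ("cpc:A61K31/00", 0.0006),
    ("ti:pyridine", 0.02),
    ("cpc:C07D213/00", 0.04),
]
toy_query = build_query(toy_weighted, top_k=2)
print(toy_query)
```

With `top_k=2` the lowest-weighted token is dropped and the remaining two are joined in descending order of weight.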

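I tried the same approach on pairs of codes, and the score improved only slightly. For claims, abstracts, and descriptions the approach failed completely. This is probably because it ignores frequency within a single document: codes never repeat within one patent and title words rarely do, but that no longer holds for long texts.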
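
The distinction between document-level presence and raw occurrence counts can be shown on a toy case. This is only an illustration of the hypothesis above, with invented token lists, and not a reproduction of the failed experiments.

```python
from collections import Counter

def presence_counts(docs):
    """Number of documents containing each token (repeats inside a document ignored)."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return counts

def occurrence_counts(docs):
    """Total number of occurrences of each token across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts

# Hypothetical long texts: "battery" dominates the first document.
toy_docs = [
    ["battery"] * 20 + ["device"],
    ["device", "housing"],
]
toy_presence = presence_counts(toy_docs)
toy_occurrences = occurrence_counts(toy_docs)
print(toy_presence["battery"], toy_occurrences["battery"])
```

Under presence counting, "battery" looks no more important than "housing", although it occurs twenty times as often. For codes and titles the two views nearly coincide, which fits the observation that the method works there.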

Competition: USPTO Explainable AI
