47th place solution

Author: HayatoFujihara (MASTER)
Competition rank: 47
Posted: 2024-07-25

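Thanks to the hosts and participants for this competition. I learned a lot.

My solution is as follows.

Query check of single words and CPCs

I checked relevance by running a single-term query for each word and each CPC code obtained from TF-IDF, and scoring the results of each query against the target with ap50_true. This improved the score by about 0.04.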

ti_scores = []
for j, word in enumerate(topk_words):
    # Single-term query on the ti field for one TF-IDF word
    ti_query = "ti:" + word
    cand = whoosh_utils.execute_query(ti_query, qp, searcher)
    ti_score = ap50_true(cand, target)
    # Tiny rank-based bonus so that ties favor words ranked higher by TF-IDF
    ti_score += (len(topk_words) - j) * 0.00001
    ti_scores.append(ti_score)

cpc_scores = []
for j, cpc in enumerate(topk_cpc):
    # Same check for each CPC code on the cpc field
    cpc_query = "cpc:" + cpc
    cand = whoosh_utils.execute_query(cpc_query, qp, searcher)
    cpc_score = ap50_true(cand, target)
    cpc_score += (len(topk_cpc) - j) * 0.00001
    cpc_scores.append(cpc_score)
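
The helper ap50_true is not shown in this post. As a rough sketch only, assuming it computes average precision over the first 50 returned candidates against the set of target patents, it could look like the following. The name ap_at_k, its signature, and the choice of normalizer are assumptions for illustration, not the author's code.

```python
def ap_at_k(candidates, targets, k=50):
    """Average precision of the first k candidates against a target set.

    Hypothetical stand-in for ap50_true; the normalizer min(len(targets), k)
    is an assumption.
    """
    target_set = set(targets)
    if not target_set:
        return 0.0
    hits = 0
    total = 0.0
    seen = set()
    for rank, cand in enumerate(candidates[:k], start=1):
        # Count each target once, at the first rank where it appears
        if cand in target_set and cand not in seen:
            hits += 1
            total += hits / rank
        seen.add(cand)
    return total / min(len(target_set), k)
```

With a score like this, a query that returns no target patents scores 0, so the small rank-based bonus in the loops above only decides the order among words or CPC codes whose scores are otherwise equal.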

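Difficulty assessment

When meta_i is split into 5 chunks and the top CPC codes obtained from TF-IDF are the same in every chunk, the score tends to drop significantly.

Perhaps there are neighboring patents that do not have the CPC.

When this condition is met, I add two-word searches to give words more weight. In the code below, the difficult case also keeps more title words (100 instead of 30) and fewer CPC codes (30 instead of 50).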

# Split meta_i into up to 5 chunks of 10 rows each
meta_i_list = []
for j in range(5):
    start_index = j*10
    end_index = min(start_index + 10, len(meta_i))
    if start_index >= len(meta_i):
        break
    meta_i_list.append(meta_i[start_index:end_index])

cpc_mat_list_d = [cpc_cv_tfidf.transform(m.get_column("cpc")) for m in meta_i_list]
cpc_idx_list_d = []
for cpc_mat_d in cpc_mat_list_d:
    X_cpc_d, cpc_idx_d = select_top_k_columns(cpc_mat_d, k=4)
    cpc_idx_list_d.append(cpc_idx_d)
cpc_idx_list_d = np.unique(cpc_idx_list_d)
print(len(cpc_idx_list_d))
# Difficult case: all chunks share the same top-4 CPC codes
difficulty = len(cpc_idx_list_d) <= 4

if difficulty:
    X_ti, idx = select_top_k_columns(ti_mat, k=100)
    X_cpc, cpc_idx = select_top_k_columns(cpc_mat, k=30)
else:
    X_ti, idx = select_top_k_columns(ti_mat, k=30)
    X_cpc, cpc_idx = select_top_k_columns(cpc_mat, k=50)
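
The check above can be tried on toy data. The following is a minimal sketch, assuming select_top_k_columns picks the k columns with the largest total TF-IDF weight; that helper is not shown in the post, so top_k_column_indices and is_difficult here are hypothetical stand-ins that work on a dense NumPy matrix.

```python
import numpy as np

def top_k_column_indices(mat, k):
    """Hypothetical stand-in: indices of the k columns with the largest sums."""
    col_sums = np.asarray(mat).sum(axis=0)
    k = min(k, col_sums.shape[0])
    return np.argsort(-col_sums, kind="stable")[:k]

def is_difficult(cpc_matrix, chunk_size=10, n_chunks=5, k=4):
    """True when every chunk of rows yields the same top-k CPC columns."""
    idx_per_chunk = []
    for j in range(n_chunks):
        chunk = cpc_matrix[j * chunk_size:(j + 1) * chunk_size]
        if chunk.shape[0] == 0:
            break
        idx_per_chunk.append(top_k_column_indices(chunk, k))
    if not idx_per_chunk:
        return False
    # If the union of the per-chunk top-k sets still has at most k members,
    # all chunks agree on their top CPC codes
    distinct = np.unique(np.concatenate(idx_per_chunk))
    return len(distinct) <= k
```

For a matrix whose rows all put their weight on the same four columns, is_difficult returns True; when each chunk of ten rows favors a different group of columns, it returns False.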

Random judgment

The search space is very large, so making random decisions about which words to use helped improve the score.

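def move_random(self):
    # Keep probability drawn uniformly from {0.65, 0.70, ..., 0.90}
    p = 0.65 + 0.05 * np.random.choice(range(6))
    # Independently keep (1) or drop (0) each word
    self.use = np.random.binomial(1, p, len(self.words))
    # Resample if nothing was kept, so at least one word stays in use
    while len(self.words) >= 1 and np.count_nonzero(self.use == 1) == 0:
        self.use = np.random.binomial(1, p, len(self.words))

    return self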
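
For experimenting outside the class, the same idea can be written as a standalone function. This is only a sketch: random_use_mask and select_words are hypothetical names, and it uses a seeded NumPy Generator so runs are reproducible, which the method above does not do.

```python
import numpy as np

def random_use_mask(n_words, rng):
    """Random 0/1 mask over n_words with at least one 1 when n_words >= 1."""
    # Keep probability is one of 0.65, 0.70, 0.75, 0.80, 0.85, 0.90
    p = 0.65 + 0.05 * rng.integers(0, 6)
    use = rng.binomial(1, p, n_words)
    # Resample until at least one word is kept
    while n_words >= 1 and not use.any():
        use = rng.binomial(1, p, n_words)
    return use

def select_words(words, use):
    """Return the words whose mask entry is 1."""
    return [w for w, u in zip(words, use) if u == 1]

rng = np.random.default_rng(0)
words = ["battery", "electrode", "anode", "cathode"]
mask = random_use_mask(len(words), rng)
kept = select_words(words, mask)
```

Because each call draws a new keep probability and a new mask, repeated calls visit different subsets of the candidate words, which is one way to cover a large search space cheaply.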
