12th Place Solution | Winning Solutions

12th Place Solution

Author: yanqiangmiffy
Team: xlyhq
Rank: 13th on LB, 12th on PB

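Introduction

In the patent matching dataset, participants must judge how similar two phrases are, one the anchor and the other the target, and output their similarity under a given semantic context as a score in the range 0-1.

Our team ID is xlyhq; we ranked 13th on LB and 12th on PB. Many thanks to @heng zheng, @pythonlan, @leolu1998, and @syzong. Thanks to the hard work and dedication of these four teammates, we were fortunate enough to win a gold medal.

Our core ideas are similar to those of the other top teams, so here we mainly share our journey through the competition, the concrete results of the related experiments, and some interesting attempts.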

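Text processing

The dataset mainly consists of the anchor, target, and context fields, plus additional text that can be concatenated to them. During the competition we mainly tried the following concatenation schemes:

v1: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
v2: test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], which additionally concatenates the code itself, e.g. A47
v3: test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context'] + '[SEP]' + test['context_text'], which pulls in more text for concatenation by also including the subclasses under A47, e.g. A47B and A47C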

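import re

import pandas as pd
from tqdm import tqdm

# Top-level CPC section letters mapped to their section titles
context_mapping = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}

# CPC code titles (the columns used here are 'code' and 'title')
titles = pd.read_csv('./input/cpc-codes/titles.csv')

def process(text):
    # Remove parenthesised, braced, and bracketed remarks from a title
    return re.sub(r"\(.*?\)|\{.*?}|\[.*?]", "", text)

def get_context(cpc_code):
    # Keep entries whose code has at most 4 characters and contains cpc_code,
    # i.e. the code itself plus its subclasses (A47 -> A47, A47B, A47C, ...)
    cpc_data = titles[(titles['code'].map(len) <= 4) & (titles['code'].str.contains(cpc_code))]
    texts = cpc_data['title'].values.tolist()
    texts = [process(text) for text in texts]
    # Prefix with the section title, then join everything with ';'
    return ";".join([context_mapping[cpc_code[0]]] + texts)

def get_cpc_texts():
    # Build a lookup from each context code to its context text.
    # `train` is assumed to be the training DataFrame loaded earlier.
    cpc_texts = dict()
    for code in tqdm(train['context'].unique()):
        cpc_texts[code] = get_context(code)
    return cpc_texts

cpc_texts = get_cpc_texts()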

This concatenation scheme brings a large improvement, but the text gets longer: with the maximum length set to 300, training becomes slower.
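To make the context-text step concrete, here is a minimal sketch that runs the same idea on a tiny in-memory table instead of titles.csv. The title rows, the example phrase pair, and the helper names `strip_brackets` and `build_context_text` are made up for illustration. The `.strip()` call is an extra tidy-up that the code above does not perform.

```python
import re

import pandas as pd

# Hypothetical stand-in for titles.csv; these rows are invented for illustration.
titles = pd.DataFrame({
    "code": ["A47", "A47B", "A47C", "A47B1/00", "B60"],
    "title": ["FURNITURE (general)", "TABLES; DESKS", "CHAIRS; SOFAS",
              "Extensible tables", "VEHICLES IN GENERAL"],
})
section_names = {"A": "Human Necessities", "B": "Operations and Transport"}


def strip_brackets(text):
    # Drop parenthesised, braced, and bracketed remarks, like process() above.
    return re.sub(r"\(.*?\)|\{.*?\}|\[.*?\]", "", text)


def build_context_text(cpc_code):
    # Keep titles whose code has at most 4 characters and contains cpc_code.
    mask = (titles["code"].str.len() <= 4) & titles["code"].str.contains(cpc_code, regex=False)
    parts = [strip_brackets(t).strip() for t in titles.loc[mask, "title"]]
    # Section title first, then the matching titles, joined with ';'.
    return ";".join([section_names[cpc_code[0]]] + parts)


# A single made-up row with the three competition fields.
test = pd.DataFrame({
    "anchor": ["abatement"],
    "target": ["noise reduction"],
    "context": ["A47"],
})
test["context_text"] = test["context"].map(build_context_text)

# v1-style concatenation: anchor, target, and context text separated by [SEP].
test["text"] = test["anchor"] + "[SEP]" + test["target"] + "[SEP]" + test["context_text"]
print(test["text"].iloc[0])
```

The long `A47B1/00` entry is filtered out by the length check, which is what keeps the context text limited to the code and its four-character subclasses.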

v4: the core concatenation scheme: test['text'] = test['text'] + '[SEP]' + test['target_info']

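# concat target info
# Base text: anchor, target, and the CPC context text
test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]' + test['context_text']
# Collect every target that shares the same (anchor, context) pair
target_info = test.groupby(['anchor', 'context'])['target'].agg(list).reset_index()
# De-duplicate the targets in each group (set order may differ between runs)
target_info['target'] = target_info['target'].apply(lambda x: list(set(x)))
target_info['target_info'] = target_info['target'].apply(lambda x: ', '.join(x))
# Inspect how many targets each group contributes
target_info['target_info'].apply(lambda x: len(x.split(', '))).describe()

# Attach the joined target list to every row and append it to the text
del target_info['target']
test = test.merge(target_info, on=['anchor', 'context'], how='left')
test['text'] = test['text'] + '[SEP]' + test['target_info']
test.head()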

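This concatenation scheme greatly improves the model's CV and LB scores. Comparing v3 with v4, we found that what helps the model is choosing high-quality text to concatenate: v3 carries a lot of redundant information, whereas v4 contains a lot of key information at the entity level.

We were very lucky to discover this "magic trick" in the last few days of the competition; other teams in the gold zone mention it in their solutions as well.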
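The effect of v4 is easiest to see on a toy frame. The sketch below uses made-up rows and context texts, and it sorts the de-duplicated targets so the output is reproducible; sorting is our own assumption for the example and is not part of the code above.

```python
import pandas as pd

# Made-up rows for illustration only.
test = pd.DataFrame({
    "anchor": ["abatement", "abatement", "abatement", "acid absorption"],
    "target": ["noise reduction", "pollution control", "noise reduction", "acid uptake"],
    "context": ["A47", "A47", "A47", "B01"],
    "context_text": ["Human Necessities;FURNITURE"] * 3
                    + ["Operations and Transport;SEPARATION"],
})

# Base text, as in v1.
test["text"] = test["anchor"] + "[SEP]" + test["target"] + "[SEP]" + test["context_text"]

# One row per (anchor, context): its de-duplicated, sorted targets joined by ', '.
target_info = (
    test.groupby(["anchor", "context"])["target"]
    .agg(lambda targets: ", ".join(sorted(set(targets))))
    .rename("target_info")
    .reset_index()
)

# Every row receives the full target list of its (anchor, context) group.
test = test.merge(target_info, on=["anchor", "context"], how="left")
test["text"] = test["text"] + "[SEP]" + test["target_info"]
print(test["text"].tolist())
```

Each row for the same anchor and context now ends with the same list of sibling targets, which is the entity-level information the paragraph above refers to.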

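Cross-validation split

During the competition we tried different ways of splitting the data, including:

StratifiedGroupKFold: with this split the gap between CV and LB is smaller, and the score is slightly better
StratifiedKFold: the offline CV is relatively high
Other KFold variants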



        
        
            