2025 AI-Native Programming Challenge: AIOps Fault Localization Challenge
While competing in Tianchi's 2025 AI-Native Programming Challenge, I came to appreciate just how complex and demanding root cause analysis in AIOps is. This post walks through my complete technical evolution from a Baseline solution to a hybrid AI solution, in the hope that it offers some reference for others exploring the same path.
The core task of the competition is root cause analysis for distributed systems: using the multi-modal data (traces, logs, metrics) of Alibaba Cloud's observability platform, locate the root cause of a system fault quickly and accurately.
The e-commerce system consists of 10+ microservices deployed on a cloud-native architecture:
```
Frontend → Frontend-Proxy (gateway)
    ↓
Backend Services:
├── product-catalog
├── inventory
├── cart
├── checkout
├── payment
├── recommendation
├── ad
├── email
└── ... (other services)
```
The competition injects three broad classes of faults:
Alert triggering: the fault injected in each problem fires specific alerts, chiefly around the three golden signals:
The root cause localization challenge:
Input example:
```json
{
  "problem_id": "001",
  "time_range": "2025-06-05 07:54:36 ~ 2025-06-05 11:54:36",
  "candidate_root_causes": [
    "ad.cpu",
    "payment.Failure",
    "cart.networkLatency"
  ],
  "alarm_rules": ["frontend_avg_rt"]
}
```
Required output:
```json
{
  "problem_id": "001",
  "root_causes": ["ad.cpu"]
}
```
Scoring uses the F1-score, so you must select the minimal sufficient set of root causes from the candidate list.
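As a concrete illustration (my own sketch, not the official scorer), set-based F1 over predicted vs. ground-truth root causes can be computed as:

```python
def f1_score(predicted, truth):
    """Set-based F1 between predicted and ground-truth root cause sets."""
    predicted, truth = set(predicted), set(truth)
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)   # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Padding the answer with extra candidates lowers precision,
# which is why the task demands a *minimal* sufficient set:
print(f1_score(["ad.cpu"], ["ad.cpu"]))               # 1.0
print(f1_score(["ad.cpu", "ad.memory"], ["ad.cpu"]))  # ~0.667
```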
Pure rules with priority matching, no LLM involved.
The latency analysis algorithm:
```python
from datetime import timedelta

def analyze_latency_root_cause(anomaly_start, anomaly_end, candidates):
    # 1. Baseline window: the hour right before the anomaly
    baseline_start = anomaly_start - timedelta(hours=1)

    # 2. Find high-latency spans (exclusive time > 2 s)
    finder = FindRootCauseSpansRT(
        start_time=anomaly_start,
        end_time=anomaly_end,
        duration_threshold=2_000_000_000,  # 2 s, in nanoseconds
        normal_start_time=baseline_start,
        normal_end_time=anomaly_start
    )

    # 3. Use SLS diff_patterns() to single out the anomalous services
    top_95_percent_spans = finder.find_top_95_percent_spans()

    # 4. SPL query sketch, with extra business-logic detection
    query = """
    with t0 as (
        select spanName, serviceName, duration,
               if((spanId in top_spans), 'true', 'false') as anomaly_label,
               -- flag GC-pause signatures
               case
                   when spanName like '%GC%' or serviceName like '%cache%' then 'business_logic'
                   else 'normal'
               end as service_type
        from log
    )
    select diff_patterns(
        row(spanName, serviceName, service_type),
        array['anomaly_label'],
        'true', 'false'
    ) as ret from t0
    """

    # 5. Match candidates by priority; business-logic faults first
    for service in extracted_services:  # services surfaced by diff_patterns
        for business_suffix in [".LargeGc", ".CacheFailure"]:
            if f"{service}{business_suffix}" in candidates:
                # verify it really is a GC or cache problem
                if _detect_gc_or_cache_issues(service, anomaly_start, anomaly_end):
                    return [f"{service}{business_suffix}"]
        # then the regular performance problems
        for perf_suffix in [".cpu", ".memory", ".networkLatency"]:
            if f"{service}{perf_suffix}" in candidates:
                return [f"{service}{perf_suffix}"]
    return []
```
```python
def _detect_gc_or_cache_issues(service, start_time, end_time):
    """Detect the business-logic signatures of GC pauses or cache failures."""
    # GC signature
    gc_query = f"""
        select count(*) as gc_count,
               avg(cast(duration as double)) as avg_gc_duration
        from log
        where serviceName = '{service}'
          and (spanName like '%GC%' or
               attributes['gc.phase'] is not null)
          and __time__ >= '{start_time}'
          and __time__ <= '{end_time}'
    """
    # Cache-failure signature
    cache_query = f"""
        select count(*) as cache_error_count,
               count(distinct traceId) as affected_traces
        from log
        where serviceName = '{service}'
          and (spanName like '%cache%' or
               attributes['cache.operation'] is not null)
          and statusCode > 1
          and __time__ >= '{start_time}'
          and __time__ <= '{end_time}'
    """
    # Run both queries and combine the signals
    gc_result = _query_sls(gc_query)
    cache_result = _query_sls(cache_query)
    return bool(gc_result or cache_result)
```
The error analysis algorithm:
```python
def analyze_error_root_cause(start_time, end_time, candidates):
    # Count error traces per service
    query = """
        statusCode > 1 |
        select serviceName, count(*) as error_count
        from log
        group by serviceName
        order by error_count desc
    """
    # Priority matching, most-erroring service first
    for service in services_by_error_count:  # query results, ordered by error_count
        for suffix in [".Failure", ".Unreachable"]:
            if f"{service}{suffix}" in candidates:
                return [f"{service}{suffix}"]
    return []
```
diff_patterns() is an advanced function provided by Alibaba Cloud SLS for contrasting an anomalous group of data against a normal one:

```
diff_patterns(
    columns,        -- the columns to analyze
    label_columns,  -- the label column
    pos_label,      -- label value of the anomalous group
    neg_label       -- label value of the normal group
)
```
Example output:

```
Pattern: {serviceName: 'ad', spanName: 'GetAds'}
Anomaly count: 450 (95%)
Normal count: 20 (5%)
```
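For intuition, here is a rough Python analogue (my own simplification, single-column patterns only) of what diff_patterns computes: the share of each value among anomalous rows vs. normal rows, ranked by how over-represented it is in the anomalous group:

```python
from collections import Counter

def diff_patterns_1d(rows, label_key, pos_label, pattern_key):
    """Toy contrast miner: for each value of pattern_key, compare its share
    of the positive (anomalous) rows against the negative (normal) rows."""
    pos = Counter(r[pattern_key] for r in rows if r[label_key] == pos_label)
    neg = Counter(r[pattern_key] for r in rows if r[label_key] != pos_label)
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    # rank by over-representation among anomalies
    return sorted(
        ((v, pos[v] / n_pos, neg.get(v, 0) / n_neg) for v in pos),
        key=lambda t: t[1] - t[2],
        reverse=True,
    )

rows = (
    [{"label": "true", "serviceName": "ad"}] * 95
    + [{"label": "true", "serviceName": "cart"}] * 5
    + [{"label": "false", "serviceName": "ad"}] * 10
    + [{"label": "false", "serviceName": "cart"}] * 90
)
top = diff_patterns_1d(rows, "label", "true", "serviceName")[0]
print(top)  # ('ad', 0.95, 0.1) -> 'ad' dominates the anomalous spans
```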
A span's exclusive time is its total duration minus the time covered by its child spans:

```python
def get_time_overlap(parent, child):
    # overlap between the two [start_time, end_time] intervals
    start = max(parent.start_time, child.start_time)
    end = min(parent.end_time, child.end_time)
    return max(0, end - start)

def calculate_exclusive_duration(span, child_spans):
    # note: time where sibling children overlap each other is double-counted here
    child_overlap_time = 0
    for child in child_spans:
        child_overlap_time += get_time_overlap(span, child)
    return span.duration - child_overlap_time
```
Domain knowledge distilled from a large number of observed cases:

```python
# Latency class: CPU problems dominate, but business-logic faults need special attention
LATENCY_PRIORITY = [
    ".cpu",             # 1. CPU resource problems (40%)
    ".LargeGc",         # 2. business-logic GC pauses (25%)
    ".memory",          # 3. memory pressure (20%)
    ".CacheFailure",    # 4. cache failures (10%)
    ".networkLatency"   # 5. network latency (5%)
]
# Error class: functional failures dominate; business-logic faults need deeper analysis
ERROR_PRIORITY = [
    ".Failure",         # 1. service functional failure (50%)
    ".CacheFailure",    # 2. cache service failure (20%)
    ".Unreachable",     # 3. service unreachable (15%)
    # note: FloodHomepage and NodeKiller are special cases that need dedicated detection
]
# Detection priority for business-logic faults
BUSINESS_LOGIC_PRIORITY = [
    ".LargeGc",         # 1. large-scale GC pauses: hits JVM services
    ".CacheFailure",    # 2. cache invalidation: hits cache-dependent services
    ".FloodHomepage",   # 3. homepage flood attack: shows up as abnormal frontend traffic
    ".NodeKiller"       # 4. node termination: hits services at the Pod level
]
```
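These lists feed a simple first-match rule; a minimal sketch (match_by_priority is my own name for the matching step described above):

```python
def match_by_priority(services, candidates, priority_list):
    """Services are assumed pre-ranked by suspicion (e.g. by error count);
    within each service, fault types are tried in priority order."""
    for service in services:
        for suffix in priority_list:
            candidate = f"{service}{suffix}"
            if candidate in candidates:
                return candidate
    return None

LATENCY_PRIORITY = [".cpu", ".LargeGc", ".memory", ".CacheFailure", ".networkLatency"]
result = match_by_priority(
    ["ad", "cart"],
    {"cart.networkLatency", "ad.cpu", "payment.Failure"},
    LATENCY_PRIORITY,
)
print(result)  # ad.cpu -- ".cpu" outranks ".networkLatency" for the top suspect
```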
| Metric | Result |
|---|---|
| Leaderboard A score | 80 (8/10) |
| Latency-class accuracy | 100% (5/5) ✓ |
| Error-class accuracy | 60% (3/5) |
Pros:
Cons:
A fully LLM-driven, 4-stage reasoning pipeline:

```
Stage 1: LLM classifies the problem type (error/latency)
    ↓
Stage 2: collect the evidence (trace statistics)
    ↓
Stage 3: LLM interprets the evidence, returns candidate root causes
    ↓
Stage 4: LLM reasons over everything, returns the final answer
```
```python
def analyze_with_llm(problem_id, time_range, candidates, alarm_rules):
    # Stage 1: LLM classifies the type
    analysis_type = llm_client.call_llm(
        prompt=f"Alarm rules: {alarm_rules}. Is this a latency or an error problem?",
        system_prompt="You are a root cause analysis expert..."
    )
    # Stage 2: collect the evidence
    if analysis_type == "latency":
        evidence = _collect_latency_evidence(...)  # ❌ problem spot
    else:
        evidence = _collect_error_evidence(...)
    # Stage 3: LLM interprets the evidence
    llm_candidates = llm_client.call_llm(
        prompt=f"Evidence: {evidence}. Analyze the likely root causes.",
        system_prompt="Return a list of candidate root causes..."
    )  # ❌ returned 23 candidates instead of 1!
    # Stage 4: LLM final reasoning
    final_result = llm_client.call_llm(
        prompt=f"Candidates: {llm_candidates}. Pick the most likely root cause.",
        system_prompt="Weigh everything and decide..."
    )  # ❌ often returned "unknown"!
    return final_result
```
Problem 1: technical debt, the latency analysis was never implemented
```python
def _collect_latency_evidence_simplified(self, start_time, end_time):
    # ❌ a simplified stub that returns empty data!
    return {
        'type': 'latency',
        'services': [],  # the empty list leaves the LLM nothing to analyze
        'note': 'Simplified version - actual implementation needed'
    }
```

Result: every latency-class problem (5/5) failed!
Problem 2: the LLM's output format was uncontrolled

```python
# Prompt: "Analyze the evidence and return the likely root causes"
# Expected: ["ad.cpu"]
# Actual: ["ad.cpu", "ad.memory", "ad.LargeGc", ..., "payment.Failure"]  (23 of them!)
```

Cause: the prompt was too loose, and the LLM let its imagination run wild.
Problem 3: over-reasoning backfires

```
Stage 3: identifies correctly → "ad.cpu"
Stage 4: "thinks it over again" → "unknown" ❌
```

In Stage 4 the LLM gilded the lily and overturned its own correct answer.
| Metric | Result |
|---|---|
| Leaderboard A score | 30 (3/10) |
| Latency-class accuracy | 0% (0/5) ❌ |
| Error-class accuracy | 60% (3/5) |
Worse than the Baseline! But it did show that the LLM has potential on error-class problems.
Taking the V1/V2 lessons on board, V3 adopts a hybrid strategy:

```python
class AIRootCauseAgentV3:
    def analyze_root_cause(self, problem_id, time_range, alarm_rules, candidates):
        # Stage 1: classify the problem type with rules (no LLM!)
        analysis_type = self._determine_analysis_type(alarm_rules)
        # Stage 2: branch on the type
        if analysis_type == "latency" and BASELINE_AVAILABLE:
            # latency class: reuse the Baseline directly
            return analyze_latency_root_cause(start, end, candidates)
        else:
            # error class: LLM + rules
            return self._analyze_error_with_llm(...)
```
Simplify the LLM's job: it only identifies the service name; it never decides the fault type.

```python
def _analyze_error_with_llm(self, evidence, candidates):
    # Step 1: collect error statistics
    evidence = {
        'services': [
            {'service': 'frontend', 'error_count': 100, 'error_rate': 0.8},
            {'service': 'checkout', 'error_count': 95, 'error_rate': 0.9},
            {'service': 'payment', 'error_count': 80, 'error_rate': 1.0}
        ]
    }
    # Step 2: the LLM only names the service
    system_prompt = """You are a root cause analysis expert.
**Strict rules**:
1. Return only a service name (e.g. "payment", "cart")
2. Never return a full root cause (no .cpu/.Failure suffixes)
3. Pick the service with the highest error_rate
4. Return exactly one service name
Response format: payment
"""
    service_name = llm_client.call_llm(
        prompt=f"Evidence:\n{json.dumps(evidence, indent=2)}\n\nService name:",
        system_prompt=system_prompt,
        temperature=0.1,  # low temperature for more determinism
        max_tokens=20     # cap the output length
    )
    # returns: "payment"
    # Step 3: apply the priority rules
    priority_list = [".Failure", ".Unreachable", ".CacheFailure"]
    for suffix in priority_list:
        candidate = f"{service_name}{suffix}"
        if candidate in candidates:
            return candidate
    return None
```
```python
def _fallback_to_evidence(self, evidence, candidates):
    # Fallback 1: try the top-5 services in the evidence
    for service_data in evidence['services'][:5]:
        service = service_data['service']
        root_cause = self._apply_priority_rules(service, candidates)
        if root_cause:
            return root_cause
    # Fallback 2: match directly against the candidates
    for candidate in candidates:
        if any(keyword in candidate for keyword in ['Failure', 'cpu']):
            return candidate
    # Fallback 3: return the first candidate (better than nothing)
    return candidates[0] if candidates else "unknown"
```
| Metric | V3 result | Baseline |
|---|---|---|
| Leaderboard A score | 70 (7/10) | 80 |
| Latency-class accuracy | 100% (5/5) ✓ | 100% |
| Error-class accuracy | 40% (2/5) | 60% |
Detailed comparison:
| Problem ID | Type | V3 result | Baseline | Ground truth | V3 correct? |
|---|---|---|---|---|---|
| 004 | error | checkout.Failure | payment.Failure | ? | ❌ |
| 005 | latency | ad.cpu | ad.cpu | ad.cpu | ✅ |
| 006 | latency | ad.cpu | ad.cpu | ad.cpu | ✅ |
| 007 | error | ad.Failure | ad.Failure | ad.Failure | ✅ |
| 008 | latency | ad.cpu | ad.cpu | ad.cpu | ✅ |
| 009 | latency | recommendation.cpu | recommendation.cpu | recommendation.cpu | ✅ |
| 010 | error | checkout.Failure | cart.Failure | ? | ❌ |
| 011 | latency | ad.cpu | ad.cpu | ad.cpu | ✅ |
| 012 | error | ad.Failure | ad.Failure | ad.Failure | ✅ |
| 013 | error | checkout.Failure | payment.Failure | ? | ❌ |
Core problem: on error-class problems, the LLM tends to blame an upstream service for errors that actually originate downstream.
For example:
Lesson: about 90% of latency problems are CPU/memory resource problems, not network latency
```
# ❌ V2's mistake: the LLM's gut call
LLM: "latency problem" → "networkLatency"  (semantically plausible, but wrong!)
# ✓ The Baseline gets it right: domain rules
Rule: latency priority = [cpu, memory, LargeGc, networkLatency]
```
| Version | Prompt | Result |
|---|---|---|
| V1 | "Analyze the evidence and return the root causes" | Returned 23 candidates ❌ |
| V2 | "Return only 1 root cause" | Returned 1 ✓ |
| V3 | "Return only the service name, with no suffix" | Accuracy improved ✓ |
Decoding-parameter tuning:
```
# V1: temperature=0.3, max_tokens=2000
# V2: temperature=0.1, max_tokens=50
# V3: temperature=0.1, max_tokens=20
```
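Putting the prompt and decoding constraints together, the V3 request can be sketched as a plain payload builder (build_llm_request and the model name are illustrative, not the competition's actual client):

```python
import json

def build_llm_request(evidence, model="some-chat-model"):
    """Assemble a tightly constrained chat request: low temperature for
    determinism, a tiny max_tokens so the model can only emit a service name."""
    system_prompt = (
        "You are a root cause analysis expert.\n"
        'Rules: return ONLY one service name (e.g. "payment"), '
        "with no fault suffix such as .cpu or .Failure."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Evidence:\n{json.dumps(evidence)}\n\nService name:"},
        ],
        "temperature": 0.1,  # near-deterministic sampling
        "max_tokens": 20,    # hard cap: no room for explanations
    }

req = build_llm_request({"services": [{"service": "payment", "error_rate": 1.0}]})
print(req["temperature"], req["max_tokens"])  # 0.1 20
```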
Pure rules (Baseline): 80 points, accurate but inflexible
Pure LLM (V1/V2): 20-30 points, flexible but inaccurate
Hybrid strategy (V3/V4): 70-90 points, both accurate and flexible ✓
Statistical analysis: rank by error counts → easily misled by upstream symptoms
Call-chain analysis: find the leaf nodes → reaches the true root cause ✓
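To make the contrast concrete, here is a small self-contained sketch (the span structure is hypothetical): every service on the path reports errors, so error counts point at the frontend, but only the deepest erroring span marks the origin:

```python
def find_error_leaves(spans):
    """Return the services of erroring spans none of whose children also
    error: the deepest points where failures originate, not where they pile up."""
    children = {}
    for s in spans:
        children.setdefault(s["parent"], []).append(s)
    return [
        s["service"]
        for s in spans
        if s["error"]
        and not any(c["error"] for c in children.get(s["id"], []))
    ]

# frontend -> checkout -> payment; the error originates in payment
spans = [
    {"id": "1", "parent": None, "service": "frontend", "error": True},
    {"id": "2", "parent": "1", "service": "checkout", "error": True},
    {"id": "3", "parent": "2", "service": "payment", "error": True},
    {"id": "4", "parent": "1", "service": "ad", "error": False},
]
print(find_error_leaves(spans))  # ['payment']
```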
| Version | Tokens per problem | Leaderboard A (10 problems) | Leaderboard B (196 problems) |
|---|---|---|---|
| V1 | ~2500 | 25K | 490K |
| V2 | ~100 | 1K | 19.6K |
| V3 | ~30 | 300 | 5.9K |
| V4 | ~50 | 500 | 9.8K |
Token consumption came way down!
Leaderboard A (10 problems):
- Baseline: ~5 minutes
- V1: ~3 minutes (the 4 LLM stages run in parallel)
- V3: ~6 minutes (the Baseline latency analysis is slow)
- V4: ~8 minutes (extra call-chain queries)
Leaderboard B (196 problems), estimated:
- V4: a bit over 2 hours
```python
# Store historically correct cases
case_db = {
    "latency + frontend_avg_rt + ad high latency": "ad.cpu",
    "error + overall_error_count + payment high errors": "payment.Failure",
}
# For a new problem, look up similar cases first
similar = find_similar_cases(current_problem, case_db)
if similar and similar[0].score > 0.9:
    return similar[0].root_cause
```
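find_similar_cases is left unimplemented in the sketch above; a minimal version (my own, using stdlib difflib string similarity as the matching signal) could look like:

```python
from collections import namedtuple
from difflib import SequenceMatcher

Case = namedtuple("Case", ["score", "root_cause"])

def find_similar_cases(problem_signature, case_db, threshold=0.6):
    """Rank stored cases by fuzzy similarity between signature strings."""
    scored = [
        Case(SequenceMatcher(None, problem_signature, sig).ratio(), rc)
        for sig, rc in case_db.items()
    ]
    return sorted([c for c in scored if c.score >= threshold], reverse=True)

case_db = {
    "latency + frontend_avg_rt + ad high latency": "ad.cpu",
    "error + overall_error_count + payment high errors": "payment.Failure",
}
hits = find_similar_cases("latency + frontend_avg_rt + ad high latency", case_db)
print(hits[0].root_cause)  # ad.cpu -- an exact signature match scores 1.0
```

In a real system the signature would come from structured features (alarm rule, problem type, top anomalous service) rather than raw strings, or from embeddings.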
```python
# Build the service dependency graph
import networkx as nx

G = nx.DiGraph()
for trace in traces:
    G.add_edge(trace.parent, trace.service, weight=trace.error_rate)

# Use a GNN to predict the fault propagation path
# (sketch only: GCN stands in for a graph model, e.g. from torch_geometric)
gnn_model = GCN(input_dim=128, hidden_dim=64, output_dim=2)
fault_prob = gnn_model(G)
root_service = argmax(fault_prob)
```
```python
results = {
    "baseline": baseline_analyze(),  # weight 0.3
    "trace": trace_analyze(),        # weight 0.3
    "llm": llm_analyze(),            # weight 0.2
    "gnn": gnn_analyze(),            # weight 0.2
}
final = weighted_vote(results)
```
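weighted_vote is left undefined above; one straightforward interpretation (my own sketch, with the weights passed in explicitly) sums each analyzer's weight behind its answer and returns the heaviest prediction:

```python
from collections import defaultdict

def weighted_vote(results, weights):
    """Each analyzer casts its weight for its predicted root cause;
    the prediction with the largest total weight wins."""
    scores = defaultdict(float)
    for name, prediction in results.items():
        if prediction:  # analyzers may abstain with None
            scores[prediction] += weights[name]
    return max(scores, key=scores.get) if scores else None

results = {"baseline": "ad.cpu", "trace": "ad.cpu", "llm": "ad.memory", "gnn": None}
weights = {"baseline": 0.3, "trace": 0.3, "llm": 0.2, "gnn": 0.2}
print(weighted_vote(results, weights))  # ad.cpu (total 0.6 vs 0.2)
```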
For beginners:
Advanced optimization:
Questions and discussion are welcome! If this post helped you, please give it a like~