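Tianchi 2025 AIOps Challenge: A Technical Evolution That Started at 15 Points

Related competition: 2025 AI-Native Programming Challenge: AIOps Fault Localization Challenge
Published from Zhejiang Province · 2025-11-04 16:07:18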

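Foreword

While taking part in the Tianchi 2025 AI-Native Programming Challenge, I came to appreciate how complex and difficult root cause analysis in AIOps really is. This post walks through my full technical evolution from the Baseline approach to a hybrid AI approach, in the hope that it is a useful reference for others exploring the same path.

Background

The core task of the competition is root cause analysis for a distributed system: using multimodal data (Trace, Log, Metric) from the Alibaba Cloud observability platform, locate the root cause of a system fault quickly and accurately.

System architecture

The e-commerce system consists of more than ten microservices deployed on a cloud-native architecture: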

Frontend → Frontend-Proxy (gateway)
   ↓
Backend Services:
├── product-catalog (product catalog)
├── inventory (inventory management)
├── cart (shopping cart)
├── checkout (checkout)
├── payment (payment)
├── recommendation (recommendations)
├── ad (advertising)
├── email (email)
└── ... (other services)

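Fault types

The competition injects three broad categories of faults:

Performance problems (latency)
- cpu: high CPU usage
- memory: memory pressure
- networkLatency: network latency
- LargeGc: large-scale JVM GC
Service faults (errors)
- Failure: service functional failure
- Unreachable: service unreachable
- CacheFailure: cache failure
Specific business-logic faults (the focus of Leaderboard A)
- LargeGc: simulates a large-scale garbage collection in a service, causing a brief pause
- CacheFailure: simulates a cache service fault in which the cache is invalidated or unreachable
- FloodHomepage: floods the frontend homepage with a large volume of requests
- NodeKiller: terminates a specified Kubernetes node as a chaos engineering test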

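Key points for root cause analysis on Leaderboard A

Alert triggering: the fault injected for each problem triggers specific alerts, mainly on the three golden signals:

- Request latency (Latency)
- Request traffic (Traffic)
- Error rate (Errors)

Challenges in root cause localization:

- A fault may be injected at a single point or at several points
- The minimal sufficient set of root causes has to be selected from the candidate set
- Business-logic faults often show up as indirect effects, so the call chain has to be analyzed in depth

Input and output format

Input example: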

{
  "problem_id": "001",
  "time_range": "2025-06-05 07:54:36 ~ 2025-06-05 11:54:36",
  "candidate_root_causes": [
    "ad.cpu",
    "payment.Failure",
    "cart.networkLatency"
  ],
  "alarm_rules": ["frontend_avg_rt"]
}

Required output:

{
  "problem_id": "001",
  "root_causes": ["ad.cpu"]
}
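
As a minimal sketch of handling the two formats above: the helper names `parse_problem` and `format_answer` are illustrative and not part of any competition tooling, and the timestamps are parsed as naive datetimes because the post does not state a time zone.

```python
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_problem(raw: str) -> dict:
    """Parse one problem record and split time_range into two datetimes."""
    problem = json.loads(raw)
    start_text, end_text = (part.strip() for part in problem["time_range"].split("~"))
    problem["start"] = datetime.strptime(start_text, TIME_FORMAT)
    problem["end"] = datetime.strptime(end_text, TIME_FORMAT)
    return problem

def format_answer(problem_id: str, root_causes) -> str:
    """Serialize an answer in the required output shape."""
    return json.dumps({"problem_id": problem_id, "root_causes": list(root_causes)})

raw = """{
  "problem_id": "001",
  "time_range": "2025-06-05 07:54:36 ~ 2025-06-05 11:54:36",
  "candidate_root_causes": ["ad.cpu", "payment.Failure", "cart.networkLatency"],
  "alarm_rules": ["frontend_avg_rt"]
}"""
problem = parse_problem(raw)
answer = format_answer(problem["problem_id"], ["ad.cpu"])
```

Parsing the window once up front keeps later steps, such as computing a baseline window before the anomaly, working on datetime objects rather than strings.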

Scoring uses the F1-score, and the answer must be the minimal sufficient set of root causes chosen from the candidate set.
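
To make the scoring pressure concrete, here is a small sketch of a set-based F1 for a single problem. The post does not spell out how the organizers compute or aggregate the score, so treating each answer as a set and scoring it with the standard precision/recall formula is an assumption.

```python
def f1_for_problem(predicted, expected):
    """Set-based F1 between a predicted and an expected root cause set."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)

exact = f1_for_problem(["ad.cpu"], ["ad.cpu"])
padded = f1_for_problem(["ad.cpu", "ad.memory"], ["ad.cpu"])
empty = f1_for_problem([], ["ad.cpu"])
```

Under this assumption, padding the answer with an extra candidate lowers precision, which is why returning many candidates is penalized even when the true root cause is among them.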

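Technical evolution

Stage 1: Baseline approach (15 points on Leaderboard B)

Core idea

Pure rules plus priority matching, with no dependence on an LLM.

Latency analysis algorithm: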

from datetime import timedelta

def analyze_latency_root_cause(anomaly_start, anomaly_end, candidates):
    # 1. Baseline window: the hour before the anomaly
    baseline_start = anomaly_start - timedelta(hours=1)

    # 2. Find high-latency spans (exclusive time > 2 seconds)
    finder = FindRootCauseSpansRT(
        start_time=anomaly_start,
        end_time=anomaly_end,
        duration_threshold=2000000000,  # 2 seconds, in nanoseconds
        normal_start_time=baseline_start,
        normal_end_time=anomaly_start
    )

    # 3. Use SLS diff_patterns() to identify anomalous services
    top_95_percent_spans = finder.find_top_95_percent_spans()

    # 4. Example SPL query, extended to detect business-logic faults
    query = """
    with t0 as (
        select spanName, serviceName, duration,
               if((spanId in top_spans), 'true', 'false') as anomaly_label,
               -- detect GC pause signatures
               case
                   when spanName like '%GC%' or serviceName like '%cache%' then 'business_logic'
                   else 'normal'
               end as service_type
        from log
    )
    select diff_patterns(
        row(spanName, serviceName, service_type),
        array['anomaly_label'],
        'true', 'false'
    ) as ret from t0
    """
    # extracted_services: service names taken from the diff_patterns() result
    # of `query` (query execution and result parsing are omitted in this excerpt)

    # 5. Match candidate root causes by priority, including business-logic faults
    priority_list = [".cpu", ".memory", ".LargeGc", ".CacheFailure", ".networkLatency"]
    for service in extracted_services:
        # Check business-logic faults first
        for business_suffix in [".LargeGc", ".CacheFailure"]:
            if f"{service}{business_suffix}" in candidates:
                # Verify that it really is a GC or cache problem
                if _detect_gc_or_cache_issues(service, anomaly_start, anomaly_end):
                    return [f"{service}{business_suffix}"]

        # Then check ordinary performance problems
        for perf_suffix in [".cpu", ".memory", ".networkLatency"]:
            if f"{service}{perf_suffix}" in candidates:
                return [f"{service}{perf_suffix}"]

    return []

def _detect_gc_or_cache_issues(service, start_time, end_time):
    """Detect business-logic signatures of GC pauses or cache faults."""
    # GC signature
    gc_query = f"""
    select count(*) as gc_count,
           avg(cast(duration as double)) as avg_gc_duration
    from log
    where serviceName = '{service}'
      and (spanName like '%GC%' or
           attributes['gc.phase'] is not null)
      and __time__ >= '{start_time}'
      and __time__ <= '{end_time}'
    """

    # Cache fault signature
    cache_query = f"""
    select count(*) as cache_error_count,
           count(distinct traceId) as affected_traces
    from log
    where serviceName = '{service}'
      and (spanName like '%cache%' or
           attributes['cache.operation'] is not null)
      and statusCode > 1
      and __time__ >= '{start_time}'
      and __time__ <= '{end_time}'
    """

    # Run the queries and inspect the results
    # (_query_sls is the SLS query helper; it is not shown in this post)
    gc_result = _query_sls(gc_query)
    cache_result = _query_sls(cache_query)

    # Caveat: count(*) always yields a row, so depending on what _query_sls
    # returns, checking gc_count / cache_error_count > 0 may be the safer test
    return bool(gc_result or cache_result)

Error analysis algorithm:

def analyze_error_root_cause(start_time, end_time, candidates):
    # Query all error traces
    query = """
    statusCode > 1 |
    select serviceName, count(*) as error_count
    from log
    group by serviceName
    order by error_count desc
    """
    # services_by_error_count: service names from the result of `query`,
    # most errors first (query execution is omitted in this excerpt)

    # Priority matching
    for service in services_by_error_count:
        for suffix in [".Failure", ".Unreachable"]:
            if f"{service}{suffix}" in candidates:
                return [f"{service}{suffix}"]

    return []
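
The excerpt above leaves out the step that turns query results into a ranked service list. A minimal in-memory sketch of that step follows, assuming spans are available as dictionaries with `serviceName` and `statusCode` fields (the same field names the SLS queries use); `rank_services_by_errors` and `match_error_candidate` are illustrative names.

```python
from collections import Counter

def rank_services_by_errors(spans):
    """Count error spans (statusCode > 1) per service, most errors first."""
    counts = Counter(s["serviceName"] for s in spans if s["statusCode"] > 1)
    return [service for service, _ in counts.most_common()]

def match_error_candidate(spans, candidates, suffixes=(".Failure", ".Unreachable")):
    """Return the first candidate matching a ranked service and a suffix."""
    for service in rank_services_by_errors(spans):
        for suffix in suffixes:
            if f"{service}{suffix}" in candidates:
                return [f"{service}{suffix}"]
    return []

spans = (
    [{"serviceName": "payment", "statusCode": 2}] * 3
    + [{"serviceName": "checkout", "statusCode": 2}] * 2
    + [{"serviceName": "frontend", "statusCode": 1}]
)
ranked = rank_services_by_errors(spans)
picked = match_error_candidate(spans, ["checkout.Failure", "payment.Unreachable"])
```

Because the ranking is purely count-based, a service that merely relays failures from a dependency can still come out on top, which is the weakness the later stages try to address.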

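Key techniques

diff_patterns(): anomaly pattern mining

An advanced function provided by Alibaba Cloud SLS that compares an anomalous group of data against a normal group and reports how they differ:

diff_patterns(
    columns,           -- columns to analyze
    label_columns,     -- label column(s)
    pos_label,         -- label value of the anomalous group
    neg_label          -- label value of the normal group
)

Example result: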

Pattern: {serviceName: 'ad', spanName: 'GetAds'}
Anomaly count: 450 (95%)
Normal count: 20 (5%)
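
To build intuition for what such a comparison reports, here is a plain-Python illustration that counts each pattern in the two groups and ranks patterns by the difference in their share. This is not the algorithm SLS uses internally, only a hedged stand-in for the idea; `pattern_diff` and the sample rows are made up for the example.

```python
from collections import Counter

def pattern_diff(rows, keys, label_key="anomaly_label"):
    """Rank patterns by (share among anomalous rows) - (share among normal rows)."""
    anomalous, normal = Counter(), Counter()
    for row in rows:
        pattern = tuple(row[k] for k in keys)
        (anomalous if row[label_key] == "true" else normal)[pattern] += 1
    total_a = sum(anomalous.values()) or 1
    total_n = sum(normal.values()) or 1
    report = []
    for pattern in set(anomalous) | set(normal):
        share_a = anomalous[pattern] / total_a
        share_n = normal[pattern] / total_n
        report.append((pattern, share_a, share_n, share_a - share_n))
    return sorted(report, key=lambda item: item[3], reverse=True)

rows = (
    [{"serviceName": "ad", "spanName": "GetAds", "anomaly_label": "true"}] * 9
    + [{"serviceName": "cart", "spanName": "GetCart", "anomaly_label": "true"}]
    + [{"serviceName": "ad", "spanName": "GetAds", "anomaly_label": "false"}]
    + [{"serviceName": "cart", "spanName": "GetCart", "anomaly_label": "false"}] * 9
)
report = pattern_diff(rows, ["serviceName", "spanName"])
top_pattern = report[0][0]
```

In this toy data the `ad`/`GetAds` pattern dominates the anomalous group and is rare in the normal group, so it ranks first.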

Exclusive time calculation

A span's exclusive time = its total time minus the time spent in all of its child spans

def calculate_exclusive_duration(span, child_spans):
    total_duration = span.duration
    child_overlap_time = 0

    for child in child_spans:
        overlap = get_time_overlap(span, child)
        child_overlap_time += overlap

    return total_duration - child_overlap_time
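
One detail the function above does not cover: when child spans run in parallel and overlap each other, summing per-child overlaps counts the shared time twice. The sketch below is a variant, not the code used in the competition, that merges child intervals before subtracting; spans are represented as hypothetical `(start, end)` tuples.

```python
def exclusive_duration(span, children):
    """Span duration minus the union of child intervals clipped to the span."""
    start, end = span
    clipped = sorted(
        (max(c_start, start), min(c_end, end))
        for c_start, c_end in children
        if min(c_end, end) > max(c_start, start)
    )
    covered = 0
    cur_start = cur_end = None
    for c_start, c_end in clipped:
        if cur_end is None or c_start > cur_end:
            # Close the previous merged interval and start a new one
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = c_start, c_end
        else:
            # Overlapping child: extend the current merged interval
            cur_end = max(cur_end, c_end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return (end - start) - covered

# Two overlapping children plus one that runs past the parent's end
result = exclusive_duration((0, 100), [(10, 40), (30, 60), (90, 120)])
```

Here the children cover 10..60 and 90..100 after merging and clipping, so 60 of the parent's 100 time units are attributed to children.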

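Priority rules

Domain knowledge distilled from a large number of cases:

# Latency: CPU problems are the most common, but business-logic faults need special attention
LATENCY_PRIORITY = [
    ".cpu",           # 1. CPU resource problem (40%)
    ".LargeGc",       # 2. Business-logic GC pause (25%)
    ".memory",        # 3. Memory pressure (20%)
    ".CacheFailure",  # 4. Cache fault (10%)
    ".networkLatency" # 5. Network latency (5%)
]

# Errors: functional failures are the most common; business-logic faults need deeper analysis
ERROR_PRIORITY = [
    ".Failure",       # 1. Service functional failure (50%)
    ".CacheFailure",  # 2. Cache service fault (20%)
    ".Unreachable",   # 3. Service unreachable (15%)
    # Note: FloodHomepage and NodeKiller are special scenarios and need separate detection
]

# Detection priority for business-logic faults
BUSINESS_LOGIC_PRIORITY = [
    ".LargeGc",       # 1. Large-scale GC pause: affects JVM services
    ".CacheFailure",  # 2. Cache failure: affects services that depend on the cache
    ".FloodHomepage", # 3. Homepage flood: shows up as abnormal frontend traffic
    ".NodeKiller"     # 4. Node termination: affects services at the Pod level
]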

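Results in practice

Metric	Result
Leaderboard A score	80 points (8/10)
Latency accuracy	100% (5/5) ✓
Error accuracy	60% (3/5)

Pros:

✓ Very high accuracy on latency problems
✓ No LLM needed, so it runs fast
✓ Results are stable and reproducible

Cons:

✗ Error-type problems are identified poorly (upstream services are easily misjudged)
✗ Inflexible and hard to adapt to new fault types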

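Stage 2: AI V1 approach - a failed attempt

Core idea

A four-stage reasoning pipeline that relies entirely on the LLM:

Stage 1: LLM determines the problem type (error/latency)
   ↓
Stage 2: Collect evidence (trace statistics)
   ↓
Stage 3: LLM interprets the evidence and returns candidate root causes
   ↓
Stage 4: LLM reasons over everything and returns the final result

Implementation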

def analyze_with_llm(problem_id, time_range, candidates, alarm_rules):
    # Stage 1: LLM determines the type
    analysis_type = llm_client.call_llm(
        prompt=f"Alarm rules: {alarm_rules}, decide whether this is a latency or an error problem",
        system_prompt="You are a root cause analysis expert..."
    )

    # Stage 2: collect evidence
    if analysis_type == "latency":
        evidence = _collect_latency_evidence(...)  # ❌ problem spot
    else:
        evidence = _collect_error_evidence(...)

    # Stage 3: LLM interprets the evidence
    llm_candidates = llm_client.call_llm(
        prompt=f"Evidence: {evidence}, analyze the possible root causes",
        system_prompt="Return a list of candidate root causes..."
    )  # ❌ returned 23 candidates instead of 1!

    # Stage 4: LLM overall reasoning
    final_result = llm_client.call_llm(
        prompt=f"Candidates: {llm_candidates}, choose the most likely root cause",
        system_prompt="Make an overall judgment..."
    )  # ❌ often returned "unknown"!

    return final_result

Why it failed

Problem 1: technical debt - latency analysis was never implemented

def _collect_latency_evidence_simplified(self, start_time, end_time):
    # ❌ Simplified version that returns empty data!
    return {
        'type': 'latency',
        'services': [],  # the empty list leaves the LLM nothing to analyze
        'note': 'Simplified version - actual implementation needed'
    }

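Result: every latency problem (5/5) failed!

Problem 2: uncontrolled LLM output format

# Prompt: "Analyze the evidence and return the possible root causes"
# Expected: ["ad.cpu"]
# Actual: ["ad.cpu", "ad.memory", "ad.LargeGc", ..., "payment.Failure"] (23 of them!)

Cause: the prompt was not specific enough, so the LLM let its imagination run wild.

Problem 3: over-reasoning backfires

Stage 3: correctly identified → "ad.cpu"
Stage 4: "rethinks" → "unknown" ❌

In stage 4 the LLM gilded the lily and overturned the correct answer.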

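Results in practice

Metric	Result
Leaderboard A score	30 points (3/10)
Latency accuracy	0% (0/5) ❌
Error accuracy	60% (3/5)

Worse than the Baseline! But it showed that the LLM has potential on error-type problems.

Stage 3: AI V3 approach (70 points) - the hybrid strategy works

Drawing on the lessons of V1/V2, V3 adopts a hybrid strategy:

Latency problems → Baseline analysis (100% accurate) ✓
Error problems → LLM identification + rule priorities ✓

Core architecture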

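from datetime import datetime

class AIRootCauseAgentV3:
    def analyze_root_cause(self, problem_id, time_range, alarm_rules, candidates):
        # Split "start ~ end" into two datetimes
        start, end = (
            datetime.strptime(part.strip(), "%Y-%m-%d %H:%M:%S")
            for part in time_range.split("~")
        )

        # Stage 1: decide the type with rules (no LLM!)
        analysis_type = self._determine_analysis_type(alarm_rules)

        # Stage 2: branch on the type
        if analysis_type == "latency" and BASELINE_AVAILABLE:
            # Latency: use the Baseline directly
            return analyze_latency_root_cause(start, end, candidates)
        else:
            # Errors: LLM + rules
            return self._analyze_error_with_llm(...)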
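
The post does not show `_determine_analysis_type`, so the following is only a guess at what a rule-based version could look like. The token sets are assumptions derived from the two alarm rule names that appear in this post (`frontend_avg_rt` and `overall_error_count`), and checking error tokens before latency tokens is an arbitrary choice.

```python
LATENCY_TOKENS = {"rt", "latency", "duration"}      # assumed naming convention
ERROR_TOKENS = {"error", "errors", "fail", "failure"}  # assumed naming convention

def determine_analysis_type(alarm_rules):
    """Classify a problem as 'latency' or 'error' from alarm rule names."""
    tokens = {tok for rule in alarm_rules for tok in rule.lower().split("_")}
    if tokens & ERROR_TOKENS:
        return "error"
    if tokens & LATENCY_TOKENS:
        return "latency"
    # Anything unrecognized falls into the error branch, like the else above
    return "error"

latency_case = determine_analysis_type(["frontend_avg_rt"])
error_case = determine_analysis_type(["overall_error_count"])
```

Matching whole underscore-separated tokens rather than substrings avoids accidental hits such as the "rt" inside "cart".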

Improved error analysis

Simplify the LLM's task: have it identify only the service name, and do not let it decide the fault type

import json

def _analyze_error_with_llm(self, evidence, candidates):
    # Step 1: collect error statistics (example values shown for illustration)
    evidence = {
        'services': [
            {'service': 'frontend', 'error_count': 100, 'error_rate': 0.8},
            {'service': 'checkout', 'error_count': 95, 'error_rate': 0.9},
            {'service': 'payment', 'error_count': 80, 'error_rate': 1.0}
        ]
    }

    # Step 2: the LLM only identifies the service name
    system_prompt = """You are a root cause analysis expert.

**Strict rules**:
1. Return only a service name (e.g. "payment", "cart")
2. Do not return a full root cause (do not append .cpu/.Failure, etc.)
3. Choose the service with the highest error_rate
4. Return exactly one service name

Output format: payment
"""

    service_name = llm_client.call_llm(
        prompt=f"Evidence:\n{json.dumps(evidence, indent=2)}\n\nReturn the service name:",
        system_prompt=system_prompt,
        temperature=0.1,  # low temperature for more deterministic output
        max_tokens=20     # cap the output length
    ).strip()  # drop stray whitespace or newlines around the name
    # Returns: "payment"

    # Step 3: apply the priority rules
    priority_list = [".Failure", ".Unreachable", ".CacheFailure"]
    for suffix in priority_list:
        candidate = f"{service_name}{suffix}"
        if candidate in candidates:
            return candidate

    return None
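
Even with a strict prompt, the reply may arrive quoted, capitalized, or with a suffix attached. A small guard like the one below can map the raw reply onto a service that actually appears in the candidate list. This is a sketch, not part of the original solution; `normalize_service_name` is a hypothetical helper and the token pattern assumes service names made of letters, digits, and hyphens.

```python
import re

def normalize_service_name(raw_reply, candidates):
    """Return the first token in the reply that names a candidate service."""
    # Map lowercase service name -> name as written in the candidate list
    services = {}
    for candidate in candidates:
        service = candidate.split(".", 1)[0]
        services[service.lower()] = service
    for token in re.findall(r"[A-Za-z][A-Za-z0-9-]*", raw_reply):
        match = services.get(token.lower())
        if match:
            return match
    return None

candidates = ["payment.Failure", "cart.cpu", "product-catalog.Failure"]
cleaned = normalize_service_name(' "Payment.Failure"\n', candidates)
```

Returning `None` for anything unrecognized gives the caller a clear signal to move on to a fallback instead of building a candidate string from free text.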

Multi-level fallback mechanism

def _fallback_to_evidence(self, evidence, candidates):
    # Fallback 1: try the top 5 services in the evidence
    for service_data in evidence['services'][:5]:
        service = service_data['service']
        root_cause = self._apply_priority_rules(service, candidates)
        if root_cause:
            return root_cause

    # Fallback 2: match directly against the candidates
    for candidate in candidates:
        if any(keyword in candidate for keyword in ['Failure', 'cpu']):
            return candidate

    # Fallback 3: return the first candidate (better than nothing)
    return candidates[0] if candidates else "unknown"

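Results in practice

Metric	V3 result	Baseline
Leaderboard A score	70 points (7/10)	80 points
Latency accuracy	100% (5/5) ✓	100%
Error accuracy	40% (2/5)	60%

Detailed comparison:

Problem ID	Type	V3 result	Baseline	Correct answer	V3 correct?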
004	error	checkout.Failure	payment.Failure	?	❌
005	latency	ad.cpu	ad.cpu	ad.cpu	✅
006	latency	ad.cpu	ad.cpu	ad.cpu	✅
007	error	ad.Failure	ad.Failure	ad.Failure	✅
008	latency	ad.cpu	ad.cpu	ad.cpu	✅
009	latency	recommendation.cpu	recommendation.cpu	recommendation.cpu	✅
010	error	checkout.Failure	cart.Failure	?	❌
011	latency	ad.cpu	ad.cpu	ad.cpu	✅
012	error	ad.Failure	ad.Failure	ad.Failure	✅
013	error	checkout.Failure	payment.Failure	?	❌

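Core problem: on error-type problems, the LLM tends to mistake errors caused downstream for a fault in an upstream service.

For example:

Actual: payment fault → checkout call fails → frontend reports errors
What the LLM sees: frontend has the most errors → misjudges it as checkout.Failure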

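Key technical takeaways

1. Domain rules > LLM semantic understanding

Lesson: 90% of latency problems are CPU/Memory resource problems, not network latency

# ❌ V2's mistake: LLM judging by intuition
LLM: "latency problem" → "networkLatency" (semantically plausible, but wrong!)

# ✓ What the Baseline gets right: domain rules
Rule: latency priority = [cpu, memory, LargeGc, networkLatency]

2. The power of prompt engineering

Version	Prompt	Result
V1	"Analyze the evidence and return the root causes"	Returned 23 candidates ❌
V2	"Return only 1 root cause"	Returned 1 ✓
V3	"Return only the service name, with no suffix"	Accuracy improved ✓

Parameter tuning:

# V1: temperature=0.3, max_tokens=2000
# V2: temperature=0.1, max_tokens=50
# V3: temperature=0.1, max_tokens=20

3. Hybrid strategies win

Pure rules (Baseline): 80 points - accurate but inflexible
Pure LLM (V1/V2): 20-30 points - flexible but inaccurate
Hybrid (V3/V4): 70-90 points - both accurate and flexible ✓

4. Call chain analysis is the key to root cause localization

Statistical analysis: look at error counts → easily misled by upstream services
Call chain analysis: find the leaf nodes → find the true root cause ✓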
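
A minimal sketch of the leaf-node idea: within each trace, keep only the error spans that have no erroring child, then rank services by how often they are such a leaf. The span fields `spanId` and `parentSpanId` are assumed names, `statusCode > 1` follows the error convention used in the queries earlier, and `deepest_error_services` is an illustrative helper rather than the V4 implementation.

```python
from collections import Counter, defaultdict

def deepest_error_services(spans):
    """Rank services by the number of error spans with no erroring child."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentSpanId"]].append(span)

    def is_error(span):
        return span["statusCode"] > 1

    leaves = Counter()
    for span in spans:
        kids = children.get(span["spanId"], [])
        if is_error(span) and not any(is_error(kid) for kid in kids):
            leaves[span["serviceName"]] += 1
    return [service for service, _ in leaves.most_common()]

# frontend -> checkout -> payment all fail; frontend -> ad succeeds
spans = [
    {"spanId": "a", "parentSpanId": None, "serviceName": "frontend", "statusCode": 2},
    {"spanId": "b", "parentSpanId": "a", "serviceName": "checkout", "statusCode": 2},
    {"spanId": "c", "parentSpanId": "b", "serviceName": "payment", "statusCode": 2},
    {"spanId": "d", "parentSpanId": "a", "serviceName": "ad", "statusCode": 0},
]
leaf_services = deepest_error_services(spans)
```

In this trace, `frontend` and `checkout` both have an erroring child and are filtered out, so only `payment` remains, whereas a plain error count would treat all three services alike.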

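Performance and cost analysis

Token consumption

Version	Tokens per problem	Leaderboard A (10 problems)	Leaderboard B (196 problems)
V1	~2500	25K	490K
V2	~100	1K	19.6K
V3	~30	300	5.9K
V4	~50	500	9.8K

Token consumption dropped sharply after V1!

Running time

Leaderboard A, 10 problems:
- Baseline: ~5 minutes
- V1: ~3 minutes (4-stage LLM, run in parallel)
- V3: ~6 minutes (the Baseline latency analysis is slow)
- V4: ~8 minutes (adds call chain queries)

Leaderboard B, 196 problems (estimate):
- V4: a little over 2 hours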

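Future optimization directions

1. Case-Based Reasoning (case library)

# Store historical cases that were answered correctly
case_db = {
    "latency + frontend_avg_rt + ad high latency": "ad.cpu",
    "error + overall_error_count + payment high errors": "payment.Failure",
}

# For a new problem, look up similar cases first
similar = find_similar_cases(current_problem, case_db)
# confidence of the best match (attribute name is illustrative)
if similar and similar[0].confidence > 0.9:
    return similar[0].root_cause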
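
One possible way to make the lookup concrete is to describe each case as a set of features and use Jaccard similarity as the confidence. Both choices are assumptions made for this sketch, as are the feature names and the `lookup_case` helper; the 0.9 threshold is taken from the snippet above.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

CASES = [
    ({"latency", "frontend_avg_rt", "ad_high_latency"}, "ad.cpu"),
    ({"error", "overall_error_count", "payment_high_errors"}, "payment.Failure"),
]

def lookup_case(features, candidates, cases=CASES, threshold=0.9):
    """Return the root cause of the most similar stored case, if similar enough."""
    best_score, best_cause = 0.0, None
    for case_features, cause in cases:
        score = jaccard(features, case_features)
        # Only reuse a case whose answer is allowed for the current problem
        if score > best_score and cause in candidates:
            best_score, best_cause = score, cause
    return best_cause if best_score >= threshold else None

hit = lookup_case({"latency", "frontend_avg_rt", "ad_high_latency"}, ["ad.cpu", "cart.cpu"])
miss = lookup_case({"latency", "frontend_avg_rt"}, ["ad.cpu"])
```

With such a high threshold only near-identical cases are reused, so the case library acts as a shortcut for repeats and everything else still goes through the normal analysis.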

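2. Graph Neural Network (GNN)

import networkx as nx

# Build the service dependency graph
G = nx.DiGraph()
for trace in traces:
    G.add_edge(trace.parent, trace.service, weight=trace.error_rate)

# Use a GNN to predict the fault propagation path
# (GCN stands for a graph convolutional network model that is not defined in this post)
gnn_model = GCN(input_dim=128, hidden_dim=64, output_dim=2)
fault_prob = gnn_model(G)
root_service = argmax(fault_prob)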

3. Multi-model ensemble

results = {
    "baseline": baseline_analyze(),      # weight 0.3
    "trace": trace_analyze(),            # weight 0.3
    "llm": llm_analyze(),                # weight 0.2
    "gnn": gnn_analyze(),                # weight 0.2
}

final = weighted_vote(results)
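
`weighted_vote` is not defined in the post. A minimal sketch of one way it could work, assuming each analyzer returns a single root cause string or `None` and that the weights are passed in explicitly using the values from the comments above:

```python
from collections import defaultdict

WEIGHTS = {"baseline": 0.3, "trace": 0.3, "llm": 0.2, "gnn": 0.2}

def weighted_vote(results, weights=WEIGHTS):
    """Sum analyzer weights per proposed root cause and return the top one."""
    scores = defaultdict(float)
    for analyzer, cause in results.items():
        if cause:  # skip analyzers that returned nothing
            scores[cause] += weights.get(analyzer, 0.0)
    if not scores:
        return None
    return max(scores.items(), key=lambda item: item[1])[0]

final = weighted_vote({
    "baseline": "payment.Failure",
    "trace": "payment.Failure",
    "llm": "checkout.Failure",
    "gnn": "checkout.Failure",
})
```

Here the two higher-weight analyzers agree on `payment.Failure`, which outweighs the other two combined. Ties are resolved in favor of whichever cause was proposed first, so a real implementation would likely want an explicit tie-break rule.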

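Summary and reflections

Key takeaways

- Technical debt comes first: V1's unimplemented latency analysis led to a crushing defeat, which shows that completeness matters more than fancy features
- Domain knowledge is irreplaceable: the CPU-first priority rule is simple, yet more reliable than the LLM's "intelligence"
- A hybrid strategy is the best solution: it makes full use of the determinism of rules and the flexibility of the LLM
- Call chain analysis is the core of root cause localization: leaf-node detection solved the upstream misjudgment problem
- Prompt engineering makes or breaks the result: after trimming from 2500 tokens to 20 tokens, accuracy actually improved

Advice for participants

Getting started:

- First analyze 1-2 problems by hand in the Cloud Monitor 2.0 console to understand the data structures
- Implement the Baseline approach to secure the 15-point base score
- Introduce the LLM gradually rather than going all-in from the start

Advanced optimization:

- Error problems: implement call chain tracing
- Latency problems: keep the Baseline approach
- Build a few-shot example library
- Use multi-level fallbacks to guarantee no empty results

Questions and discussion are welcome! If this post helped you, please give it a like~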
