Co-Pilot / Assistive
Updated a month ago

advanced-evaluation

muratcankoylan
7.4k
muratcankoylan/Agent-Skills-for-Context-Engineering/skills/advanced-evaluation
Agent score: 82

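💡 Summary

A comprehensive skill for implementing production-grade LLM-as-a-Judge evaluation systems, covering bias mitigation, rubric design, and scoring methods.

🎯 Target audience

  • AI/ML engineers building evaluation pipelines
  • Prompt engineers optimizing model outputs
  • Product managers overseeing AI quality
  • Researchers comparing model performance

🤖 AI roast: This is a brilliant textbook chapter that unfortunately forgot to include the textbook itself, or the course announcement.

Security analysis: medium risk

The README describes prompting LLMs and processing their outputs, which implies access to external API services. The main risk is injecting untrusted data (prompts/responses) into these external calls, which could lead to data leakage or unexpected costs. Mitigation: apply strict input validation, sanitization, and budget/rate limits to all external LLM API calls.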


name: advanced-evaluation
description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

When to Activate

Activate this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

Direct Scoring: A single LLM rates one response on a defined scale.

  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift, inconsistent scale interpretation

Pairwise Comparison: An LLM compares two responses and selects the better one.

  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias, length bias

Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape

LLM judges exhibit systematic biases that must be actively mitigated:

Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.

Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.

Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge the limitation when that is not possible.

Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.

Authority Bias: A confident, authoritative tone is rated higher regardless of accuracy. Mitigation: Require evidence citation, add a fact-checking layer.
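One rough way to check whether a judge shows the length bias described above is to correlate response length with the score it assigned. The sketch below is an illustrative diagnostic, not part of the original skill; the function name `length_score_correlation`, the word-count length measure, and the sample data are all assumptions.

```python
from math import sqrt


def length_score_correlation(responses, scores):
    """Pearson correlation between response length (in words) and judge score.

    Hypothetical diagnostic: a value near +1 suggests the judge may be
    rewarding length. It does not prove bias, since longer answers can
    also be genuinely better.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need two or more (response, score) pairs of equal length")
    lengths = [len(r.split()) for r in responses]
    mean_len = sum(lengths) / len(lengths)
    mean_score = sum(scores) / len(scores)
    cov = sum((l - mean_len) * (s - mean_score) for l, s in zip(lengths, scores))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_score = sum((s - mean_score) ** 2 for s in scores)
    if var_len == 0 or var_score == 0:
        return 0.0  # correlation is undefined without variation; report none
    return cov / sqrt(var_len * var_score)


# Illustrative data only: scores rise with word count.
sample_responses = [
    "ok",
    "a bit longer answer",
    "a much longer and more detailed answer here",
]
sample_scores = [2, 3, 5]
r = length_score_correlation(sample_responses, sample_scores)
```

A strongly positive `r` on a sample where human raters see no quality difference would be a reason to add the explicit "ignore length" instruction or a length-normalized score.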

Metric Selection Framework

Choose metrics based on the evaluation task structure:

| Task Type | Primary Metrics | Secondary Metrics |
|-----------|-----------------|-------------------|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
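To make the metrics concrete, here is a minimal sketch of unweighted Cohen's κ for judge-versus-human labels, plus a per-criterion disagreement count for spotting the systematic patterns mentioned above. It uses only the standard library; the function names and the `(criterion, judge_label, human_label)` record shape are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements_by_criterion(records):
    """Count judge/human disagreements per criterion.

    `records` is a list of (criterion, judge_label, human_label) tuples;
    this record shape is an assumption made for the example.
    """
    counts = Counter()
    for criterion, judge_label, human_label in records:
        if judge_label != human_label:
            counts[criterion] += 1
    return counts


# Illustrative labels only.
kappa = cohens_kappa(
    ["pass", "pass", "fail", "fail"],
    ["pass", "fail", "fail", "fail"],
)
by_criterion = disagreements_by_criterion([
    ("accuracy", "pass", "fail"),
    ("accuracy", "pass", "fail"),
    ("tone", "pass", "pass"),
])
```

If disagreements cluster on one criterion, as with `accuracy` in this toy data, that criterion's rubric is the first place to look.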

Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
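The pattern above maps naturally onto a small data structure. The following is a sketch under stated assumptions: the `Criterion` class and `weighted_score` helper are hypothetical names, and dividing by the total weight (so weights need not sum to 1) is a design choice of this example rather than something the skill prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float  # relative importance, 0-1

    def __post_init__(self):
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError(f"weight for {self.name!r} must be in [0, 1]")


def weighted_score(criteria, scores):
    """Weight-normalized average of per-criterion scores.

    `scores` maps criterion name -> numeric score. Normalizing by the
    total weight is an assumption, so weights need not sum to 1.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


# Illustrative criteria and scores only.
criteria = [
    Criterion("accuracy", "Claims are factually correct", 0.6),
    Criterion("clarity", "Response is easy to follow", 0.4),
]
overall = weighted_score(criteria, {"accuracy": 4, "clarity": 3})
```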

Scale Calibration:

  • 1-3 scales: Binary with neutral option, lowest cognitive load
  • 1-5 scales: Standard Likert, good balance of granularity and reliability
  • 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics

Prompt Structure for Direct Scoring:

You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Write a justification grounded in that evidence
3. Then assign a score according to the rubric (1-{max} scale)
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.

Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
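A minimal sketch of how the prompt template and the justification requirement might be wired together. The source only says the judge returns "structured JSON", so the schema used here (`scores` as a list of objects with `criterion`, `justification`, `score`) is an assumption, as are the function names. No model call is made; the parser works on whatever raw string a judge returns.

```python
import json


def build_direct_scoring_prompt(prompt, response, criteria, max_score=5):
    """Render a direct-scoring prompt; `criteria` is a list of (name, description, weight)."""
    criteria_text = "\n".join(
        f"- {name} (weight {weight}): {description}"
        for name, description, weight in criteria
    )
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        f"## Original Prompt\n{prompt}\n\n"
        f"## Response to Evaluate\n{response}\n\n"
        f"## Criteria\n{criteria_text}\n\n"
        "## Instructions\n"
        "For each criterion, give an evidence-based justification first, "
        f"then a score on a 1-{max_score} scale.\n\n"
        "## Output Format\n"
        'JSON: {"scores": [{"criterion": str, "justification": str, "score": int}], '
        '"summary": str}'
    )


def parse_direct_scores(raw_json, max_score=5):
    """Validate judge output against the assumed schema; return {criterion: score}."""
    data = json.loads(raw_json)
    result = {}
    for item in data["scores"]:
        score = item["score"]
        # Reject scores that arrive without a justification.
        if not item.get("justification", "").strip():
            raise ValueError(f"missing justification for {item['criterion']!r}")
        if not isinstance(score, int) or not 1 <= score <= max_score:
            raise ValueError(f"score out of range for {item['criterion']!r}: {score}")
        result[item["criterion"]] = score
    return result


# Illustrative inputs only; the judge reply is hand-written, not model output.
rendered = build_direct_scoring_prompt(
    "What is the capital of France?",
    "Paris.",
    [("accuracy", "Claims are factually correct", 1.0)],
)
parsed = parse_direct_scores(
    '{"scores": [{"criterion": "accuracy", '
    '"justification": "Paris is correct.", "score": 5}], "summary": "Correct."}'
)
```

Rejecting unjustified scores at parse time keeps the "scoring without justification" failure from silently entering downstream metrics.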

Pairwise Comparison Implementation

Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.

Position Bias Mitigation Protocol:

  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence

Prompt Structure for Pairwise Comparison:

You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.

Confidence Calibration: Confidence scores should reflect position consistency:

  • Both passes agree: confidence = average of individual confidences
  • Passes disagree: confidence = 0.5, verdict = TIE
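The protocol and confidence rules above can be sketched as a single function. The `judge` callable is hypothetical: it stands in for an LLM call and is assumed to return a positional verdict (`"first"`, `"second"`, or `"tie"`) with a confidence in [0, 1]. The two stub judges exist only to show the consistent and inconsistent cases.

```python
def compare_with_position_swap(judge, prompt, response_a, response_b):
    """Run a pairwise judge twice with swapped positions and reconcile.

    `judge(prompt, first, second)` is a hypothetical callable returning
    ("first" | "second" | "tie", confidence in [0, 1]).
    """
    # Pass 1: A in first position. Pass 2: B in first position.
    verdict_1, conf_1 = judge(prompt, response_a, response_b)
    verdict_2, conf_2 = judge(prompt, response_b, response_a)

    # Map positional verdicts back to response labels.
    winner_1 = {"first": "A", "second": "B", "tie": "TIE"}[verdict_1]
    winner_2 = {"first": "B", "second": "A", "tie": "TIE"}[verdict_2]

    if winner_1 == winner_2:
        # Passes agree: average the individual confidences.
        return {"winner": winner_1, "confidence": (conf_1 + conf_2) / 2, "consistent": True}
    # Passes disagree: fall back to TIE with confidence 0.5.
    return {"winner": "TIE", "confidence": 0.5, "consistent": False}


def always_first(prompt, first, second):
    """Stub judge with pure position bias."""
    return "first", 0.9


def picks_correct(prompt, first, second):
    """Stub judge that prefers whichever response contains '4'."""
    return ("first" if "4" in first else "second"), 0.8


biased = compare_with_position_swap(always_first, "What is 2 + 2?", "4", "5")
stable = compare_with_position_swap(picks_correct, "What is 2 + 2?", "4", "5")
```

The position-biased stub flips its answer when the order flips, so the result collapses to a TIE; the content-based stub picks A in both passes.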

Rubric Generation

Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.

Rubric Components:

  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative text for each level (optional but valuable)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application

Strictness Calibration:

  • Lenient: Lower bar for passing scores, appropriate for encouraging iteration
  • Balanced: Fair, typical expectations for production use
  • Strict: High standards, appropriate for safety-critical or high-stakes evaluation

Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.
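As a rough illustration of the rubric components, the sketch below models level descriptions, characteristics, edge cases, and strictness as plain dataclasses and renders them as text for a judge prompt. The class names, field names, and the example rubric content are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RubricLevel:
    score: int
    description: str  # clear boundary for this level
    characteristics: list = field(default_factory=list)  # observable features


@dataclass
class Rubric:
    criterion: str
    levels: list
    edge_cases: list = field(default_factory=list)
    strictness: str = "balanced"  # "lenient" | "balanced" | "strict"

    def render(self):
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Rubric for {self.criterion} (strictness: {self.strictness})"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"{level.score}: {level.description}")
            lines.extend(f"  - {c}" for c in level.characteristics)
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {e}" for e in self.edge_cases)
        return "\n".join(lines)


# Illustrative rubric content only.
rubric = Rubric(
    criterion="code readability",
    levels=[
        RubricLevel(5, "Clear throughout", ["Descriptive variable and function names"]),
        RubricLevel(1, "Hard to follow", ["Cryptic names", "No comments"]),
    ],
    edge_cases=["Code with no comments: judge naming and structure only"],
    strictness="strict",
)
text = rubric.render()
```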

Practical Guidance

Evaluation Pipeline Design

Production evaluation systems require multiple layers:

┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘
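The layered design in the diagram can be approximated as a chain of stage functions that each take and return a shared state dict. This is a sketch only: every stage body below is a stand-in with hard-coded values, the averaging step is a placeholder for a real mitigation such as position swapping, and the spread-based confidence heuristic is an assumption of this example, not a method from the skill.

```python
def run_pipeline(state, stages):
    """Pass a shared state dict through each stage in order."""
    for stage in stages:
        state = stage(state)
    return state


def load_criteria(state):
    # Stand-in for reading rubrics and weights from storage.
    return {**state, "criteria": {"accuracy": 0.6, "clarity": 0.4}}


def primary_scorer(state):
    # Stand-in for LLM judge calls; three repeated scores per criterion, fixed here.
    return {**state, "raw_scores": {"accuracy": [4, 4, 5], "clarity": [3, 3, 3]}}


def bias_mitigation(state):
    # Placeholder step: average repeated runs. A pairwise setup would
    # swap positions and check consistency at this stage instead.
    averaged = {name: sum(v) / len(v) for name, v in state["raw_scores"].items()}
    return {**state, "scores": averaged}


def confidence_scoring(state):
    # Assumed heuristic: confidence falls as repeated runs spread apart.
    # The divisor 4 is the maximum possible spread on a 1-5 scale.
    spread = max(max(v) - min(v) for v in state["raw_scores"].values())
    return {**state, "confidence": max(0.0, 1.0 - spread / 4)}


result = run_pipeline(
    {"prompt": "What is 2 + 2?", "response": "4"},
    [load_criteria, primary_scorer, bias_mitigation, confidence_scoring],
)
```

Keeping each layer as a separate function makes it straightforward to swap the scorer between direct and pairwise modes without touching the other stages.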

Common Anti-Patterns

Anti-pattern: Scoring without justification

  • Problem: Scores lack grounding, difficult to debug or improve
  • Solution: Always require evidence-based justification before score

Anti-pattern: Single-pass pairwise comparison

  • Problem: Position bias corrupts results
  • Solution: Always swap positions and check consistency

Anti-pattern: Overloaded criteria

  • Problem: Criteria measuring multiple things are unreliable
  • Solution: One criterion = one measurable aspect

Anti-pattern: Missing edge case guidance

  • Problem: Evaluators handle ambiguous cases inconsistently
  • Solution: Include edge cases in rubrics with explicit guidance


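Five-dimension analysis
Clarity: 9/10
Novelty: 7/10
Practicality: 9/10
Completeness: 8/10
Maintainability: 8/10
Pros and cons

Pros

  • Synthesizes academic research into actionable patterns
  • Provides detailed bias mitigation strategies
  • Includes practical prompt templates and pipeline design

Cons

  • Heavily theoretical, lacking directly runnable code examples
  • Assumes considerable prior knowledge of LLM evaluation
  • Provides no concrete implementation or dependency list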

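Disclaimer: This content comes from an open-source GitHub project and is provided for display and scoring analysis only.

Copyright belongs to the original author, muratcankoylan.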