Establish eval.yaml lifecycle and CI integration for skill validation #34814

@PureWeen

Summary

Based on a comprehensive 7-round evaluation of the try-fix skill (PR #34807), we've established best practices for how eval.yaml files should evolve over time. This issue tracks the work to formalize and implement these practices across all 15 skills.

Current State

Proposed Eval Lifecycle

CI Strategy (Tiered)

| Trigger | Tool | Time | Purpose |
| --- | --- | --- | --- |
| Every PR touching `.github/skills/` | `skill-validator check` | 0.1s | Static analysis gate |
| Weekly scheduled | `evaluate --runs 1` | ~22 min | Behavioral drift detection |
| Before skill merge | `evaluate --runs 3-5` | ~90 min | Statistical confidence |
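
A minimal GitHub Actions sketch of the first two tiers, assuming `skill-validator` is already available on the runner (install step elided) and that `evaluate` is invoked as a `skill-validator` subcommand; the workflow name, cron schedule, and timeout are illustrative, not decided:

```yaml
# .github/workflows/skill-validation.yml -- illustrative sketch, not a final workflow
name: skill-validation

on:
  pull_request:
    paths:
      - ".github/skills/**"   # Tier 1: static gate on every skill PR (~0.1s)
  schedule:
    - cron: "0 6 * * 1"       # Tier 2: weekly behavioral drift detection

jobs:
  static-check:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Fast static analysis gate; assumes skill-validator is on PATH
      - run: skill-validator check

  weekly-eval:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    timeout-minutes: 45       # a single run takes ~22 min; leave headroom
    steps:
      - uses: actions/checkout@v4
      # One evaluation run per week, compared against the stored baseline
      - run: skill-validator evaluate --runs 1
```

The pre-merge tier (`evaluate --runs 3-5`, ~90 min) is probably better exposed as a manually dispatched job than a blocking PR check, given its runtime.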

Eval Update Triggers

- `SKILL.md` changes → mandatory eval review
- Production failure → add a regression scenario from the real prompt (see the fragment after this list)
- New anti-pattern discovered → add an `output_not_contains` assertion
- Quarterly → audit all evals for drift
- Model upgrade → recalibrate baselines
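
For the production-failure trigger, the added scenario might look like the fragment below. The schema (`scenarios`, `prompt`, `assertions`) is an assumption for illustration, not the documented eval.yaml spec, and the prompt text is invented:

```yaml
# Hypothetical eval.yaml fragment: regression scenario captured from a real failure.
# Field names are assumptions, not the documented schema.
scenarios:
  - id: regression-device-rotation-crash
    # Verbatim user prompt that triggered the bad behavior in production
    prompt: "App crashes with NullReferenceException on device rotation. Fix it."
    assertions:
      # Anti-pattern observed in the failure transcript
      - output_not_contains: "unable to reproduce"
```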

Eval Writing Best Practices (from the 7-round try-fix cycle)

1. Start with zero `output_contains` assertions -- vocabulary assertions overfit immediately (scored 0.52)
2. Use `output_not_contains` for anti-patterns -- more stable than expected-vocabulary checks
3. Use rubric items for quality -- LLM-judge criteria measure improvement, not compliance
4. Vary prompt structure -- uniform formats train the eval to test template-following
5. Front-load critical constraints -- prompt order = execution order
6. Negative triggers must be unambiguously off-topic -- "review this fix" is too close to "fix this"
7. Version with a changelog header -- lightweight tracking of eval evolution (all seven practices are illustrated in the skeleton after this list)
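
A skeleton exercising all seven practices, again under an assumed schema (`version`, `changelog`, `rubric`, and `expect_activation` are illustrative key names; the prompts and IDs are invented):

```yaml
# Hypothetical eval.yaml skeleton -- key names are illustrative assumptions
version: "1.2"
changelog:                        # practice 7: lightweight version tracking
  - "1.2: swapped output_contains vocabulary checks for rubric items"
  - "1.1: added unambiguous negative trigger"
  - "1.0: initial scenarios"

scenarios:
  # Practice 4: vary prompt structure (terse imperative here, verbose report elsewhere)
  - id: build-break-terse
    # Practice 5: the critical constraint leads the prompt
    prompt: "Do not touch the csproj. Fix the CS0103 error in HomePage.xaml.cs."
    assertions:
      # Practices 1-2: anti-patterns only, no expected-vocabulary checks
      - output_not_contains: "edit the csproj"
    rubric:
      # Practice 3: LLM-judge criteria measure quality, not keyword compliance
      - "Identifies the root cause before proposing a change"
      - "Verifies the fix by rebuilding"

  # Practice 6: negative trigger is clearly off-topic, not a near-miss like "review this fix"
  - id: negative-release-notes
    prompt: "Summarize the release notes for the last three releases."
    expect_activation: false
```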

Streamlined Evaluation Protocol (max 4 rounds)

- Round 1: all evaluators assess fresh → identify issues
- Round 2: fixes applied → delta review only
- Round 3: final consensus → KEEP/IMPROVE/REMOVE
- Round 4 (if needed): address IMPROVE items → final verdict

Work Items

- Add `skill-validator check` to CI for PRs touching `.github/skills/`
- Create `eval.yaml` for the pr-review skill (highest priority -- orchestrator)
- Create `eval.yaml` for the write-ui-tests skill
- Create `eval.yaml` for the pr-finalize skill
- Document the eval lifecycle in `.github/skills/README.md`
- Request an `--internal-only` flag from the dotnet/skills team for orchestrator skills
- Set up the weekly scheduled eval run

Tooling Improvements (for the dotnet/skills team)

Based on running `skill-validator` 9+ times during the try-fix evaluation:

1. `--internal-only` flag -- internal skills get penalized for plugin non-activation (-3.7% overall vs +51.7% isolated)
2. Static `eval.yaml` validation in `check` -- catch spec-conformance issues early (e.g., skill name in prompts)
3. Timeout progress indicators -- distinguish "still working" from "stuck in a loop"
4. Overfitting context for infrastructure skills -- technique items *are* the intended behavior
5. Results-directory auto-naming -- `--results-dir auto` plus a `--latest` flag

Empirical Data (try-fix evaluation)

| Metric | Value |
| --- | --- |
| Isolated improvement | +51.7% |
| Plugin improvement | +1.3% (not activated -- by design) |
| Overfitting | 0.40 (Moderate) |
| Scenarios | 6 |
| Evaluation rounds | 7 (target: 4 max) |

Related
