## Summary

Based on a comprehensive 7-round evaluation of the `try-fix` skill (PR #34807), we've established best practices for how `eval.yaml` files should evolve over time. This issue tracks the work to formalize and implement these practices across all 15 skills.
## Current State
## Proposed Eval Lifecycle
### CI Strategy (Tiered)
| Trigger | Tool | Time | Purpose |
| --- | --- | --- | --- |
| Every PR touching `.github/skills/` | `skill-validator check` | 0.1s | Static analysis gate |
| Weekly scheduled | `evaluate --runs 1` | ~22 min | Behavioral drift detection |
| Before skill merge | `evaluate --runs 3-5` | ~90 min | Statistical confidence |
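The fast PR-gate tier could be wired up roughly as follows. This is a sketch, not the actual CI config: the workflow name, runner, and checkout action version are assumptions; only the trigger path and the `skill-validator check` command come from the table above.

```yaml
# Hypothetical workflow sketch for the tier-1 static gate.
name: skill-validation
on:
  pull_request:
    paths:
      - '.github/skills/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Static analysis gate (~0.1s)
        run: skill-validator check .github/skills/
```

The weekly `evaluate --runs 1` tier would live in a separate scheduled (`on: schedule`) workflow so its ~22 min runtime never blocks PRs.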
### Eval Update Triggers
- `SKILL.md` changes → mandatory eval review
- Production failure → add regression scenario from real prompt
- New anti-pattern discovered → add `output_not_contains` assertion
- Quarterly → audit all evals for drift
- Model upgrade → recalibrate baselines
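For instance, a production failure could land in the eval as a regression scenario plus a changelog entry. The field names below are a hypothetical sketch of an `eval.yaml` shape for illustration, not the actual schema; the angle-bracket placeholders stand in for real content.

```yaml
# Hypothetical eval.yaml fragment; field names and layout are assumed.
changelog:
  - "add regression scenario from production failure (real prompt, verbatim)"
scenarios:
  - name: regression-from-production
    prompt: "<the real user prompt that exposed the failure>"
    assertions:
      output_not_contains:
        - "<the anti-pattern the failure surfaced>"
```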
### Eval Writing Best Practices (from 7-round try-fix cycle)

- Start with 0 `output_contains` -- vocabulary assertions overfit immediately (scored 0.52)
- Use `output_not_contains` for anti-patterns -- more stable than expected vocabulary
- Use rubric items for quality -- LLM-judge criteria measure improvement, not compliance
- Vary prompt structure -- uniform formats train the eval to test template-following
- Front-load critical constraints -- prompt order = execution order
- Negative triggers must be unambiguously off-topic -- "review this fix" is too close to "fix this"
- Version with a changelog header -- lightweight tracking of eval evolution
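Taken together, a scenario written to these guidelines might look like the following sketch. The schema and the `parse()` crash example are assumptions for illustration, not the real `eval.yaml` format.

```yaml
# Hypothetical scenario illustrating the practices above; schema assumed.
scenarios:
  - name: fix-crash-minimal
    # Front-loaded constraint: prompt order = execution order.
    prompt: "Do not change the public API. Fix the crash in parse()."
    assertions:
      # No output_contains: vocabulary assertions overfit (0.52).
      output_not_contains:
        - "suppress the exception"   # known anti-pattern
      rubric:
        # LLM-judge items measure improvement, not compliance.
        - "Identifies the root cause before proposing a patch"
        - "Keeps the change scoped to the reported crash"
```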
### Streamlined Evaluation Protocol (max 4 rounds)

- Round 1: All evaluators assess fresh → identify issues
- Round 2: Fixes applied → delta review only
- Round 3: Final consensus → KEEP/IMPROVE/REMOVE
- Round 4 (if needed): Address IMPROVE items → final verdict
## Work Items

- Add `skill-validator check` to CI for PRs touching `.github/skills/`
- `pr-review` skill (highest priority -- orchestrator)
- `write-ui-tests` skill
- `pr-finalize` skill
- `.github/skills/README.md`
- `--internal-only` flag from dotnet/skills team for orchestrator skills
### Tooling Improvements (for dotnet/skills team)

Based on running `skill-validator` 9+ times during the try-fix evaluation:
- `--internal-only` flag -- internal skills get penalized for plugin non-activation (-3.7% overall vs +51.7% isolated)
- Static `eval.yaml` validation in `check` -- catch spec-conformance issues early (e.g., skill name in prompts)
- Timeout progress indicators -- distinguish "still working" from "stuck in loop"
- Overfitting context for infrastructure skills -- technique items ARE the intended behavior
- Results directory auto-naming -- `--results-dir auto` with a `--latest` flag
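The spec-conformance lint mentioned above (catching the skill name inside scenario prompts) is straightforward to sketch. This is an assumed implementation, not the validator's actual code, and the scenario dictionary shape is hypothetical:

```python
# Sketch of a static eval.yaml lint: flag scenarios whose prompt
# mentions the skill by name. Prompts should trigger the skill
# behaviorally, so a literal name mention is a spec-conformance issue.
# The scenario structure here is assumed, not the real schema.

def find_name_leaks(skill_name: str, scenarios: list[dict]) -> list[str]:
    """Return names of scenarios whose prompt contains the skill name."""
    return [
        s.get("name", "<unnamed>")
        for s in scenarios
        if skill_name.lower() in s.get("prompt", "").lower()
    ]

scenarios = [
    {"name": "direct-mention", "prompt": "Use try-fix to repair this test."},
    {"name": "behavioral", "prompt": "This test fails intermittently; repair it."},
]
print(find_name_leaks("try-fix", scenarios))  # ['direct-mention']
```

Running this inside `skill-validator check` would keep the gate at static-analysis speed while catching the prompt-leak issue before any evaluation run.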
## Empirical Data (try-fix evaluation)

| Metric | Value |
| --- | --- |
| Isolated improvement | +51.7% |
| Plugin improvement | +1.3% (not activated -- by design) |
| Overfitting | 0.40 (Moderate) |
| Scenarios | 6 |
| Evaluation rounds | 7 (target: 4 max) |
## Related