## Summary

Based on a comprehensive 7-round evaluation of the `try-fix` skill (PR #34807), we've established best practices for how `eval.yaml` files should evolve over time. This issue tracks the work to formalize and implement these practices across all 15 skills.
## Current State
## Proposed Eval Lifecycle
### CI Strategy (Tiered)
| Trigger | Tool | Time | Purpose |
| --- | --- | --- | --- |
| Every PR touching `.github/skills/` | `skill-validator check` | 0.1s | Static analysis gate |
| Weekly scheduled | `evaluate --runs 1` | ~22 min | Behavioral drift detection |
| Before skill merge | `evaluate --runs 3-5` | ~90 min | Statistical confidence |
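The fast PR-gate tier could be wired up roughly as follows. This is a sketch, not the actual CI config: the workflow name, runner, and checkout action version are assumptions; only the trigger path and the `skill-validator check` command come from the table above.

```yaml
# Hypothetical workflow sketch for the tier-1 static gate.
name: skill-validation
on:
  pull_request:
    paths:
      - '.github/skills/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Static analysis gate (~0.1s)
        run: skill-validator check .github/skills/
```

The weekly `evaluate --runs 1` tier would live in a separate scheduled (`on: schedule`) workflow so its ~22 min runtime never blocks PRs.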
### Eval Update Triggers
- `SKILL.md` changes → mandatory eval review
- Production failure → add regression scenario from real prompt
- New anti-pattern discovered → add `output_not_contains` assertion
- Quarterly → audit all evals for drift
- Model upgrade → recalibrate baselines
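For instance, a production failure could land in the eval as a regression scenario plus a changelog entry. The field names below are a hypothetical sketch of an `eval.yaml` shape for illustration, not the actual schema; the angle-bracket placeholders stand in for real content.

```yaml
# Hypothetical eval.yaml fragment; field names and layout are assumed.
changelog:
  - "add regression scenario from production failure (real prompt, verbatim)"
scenarios:
  - name: regression-from-production
    prompt: "<the real user prompt that exposed the failure>"
    assertions:
      output_not_contains:
        - "<the anti-pattern the failure surfaced>"
```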
### Eval Writing Best Practices (from 7-round try-fix cycle)

- Start with 0 `output_contains` -- vocabulary assertions overfit immediately (scored 0.52)
- Use `output_not_contains` for anti-patterns -- more stable than expected vocabulary
- Use rubric items for quality -- LLM-judge criteria measure improvement, not compliance
- Vary prompt structure -- uniform formats train the eval to test template-following
- Front-load critical constraints -- prompt order = execution order
- Negative triggers must be unambiguously off-topic -- "review this fix" is too close to "fix this"
- Version with a changelog header -- lightweight tracking of eval evolution
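Taken together, a scenario written to these guidelines might look like the following sketch. The schema and the `parse()` crash example are assumptions for illustration, not the real `eval.yaml` format.

```yaml
# Hypothetical scenario illustrating the practices above; schema assumed.
scenarios:
  - name: fix-crash-minimal
    # Front-loaded constraint: prompt order = execution order.
    prompt: "Do not change the public API. Fix the crash in parse()."
    assertions:
      # No output_contains: vocabulary assertions overfit (0.52).
      output_not_contains:
        - "suppress the exception"   # known anti-pattern
      rubric:
        # LLM-judge items measure improvement, not compliance.
        - "Identifies the root cause before proposing a patch"
        - "Keeps the change scoped to the reported crash"
```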
### Streamlined Evaluation Protocol (max 4 rounds)

- Round 1: All evaluators assess fresh → identify issues
- Round 2: Fixes applied → delta review only
- Round 3: Final consensus → KEEP/IMPROVE/REMOVE
- Round 4 (if needed): Address IMPROVE items → final verdict
## Work Items

- Add `skill-validator check` to CI for PRs touching `.github/skills/`
- `pr-review` skill (highest priority -- orchestrator)
- `write-ui-tests` skill
- `pr-finalize` skill
- `.github/skills/README.md`
- `--internal-only` flag from dotnet/skills team for orchestrator skills
### Tooling Improvements (for dotnet/skills team)

Based on running `skill-validator` 9+ times during the try-fix evaluation:
- `--internal-only` flag -- internal skills get penalized for plugin non-activation (-3.7% overall vs +51.7% isolated)
- Static `eval.yaml` validation in `check` -- catch spec-conformance issues early (e.g., skill name in prompts)
- Timeout progress indicators -- distinguish "still working" from "stuck in loop"
- Overfitting context for infrastructure skills -- technique items ARE the intended behavior
- Results directory auto-naming -- `--results-dir auto` with a `--latest` flag
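The spec-conformance lint mentioned above (catching the skill name inside scenario prompts) is straightforward to sketch. This is an assumed implementation, not the validator's actual code, and the scenario dictionary shape is hypothetical:

```python
# Sketch of a static eval.yaml lint: flag scenarios whose prompt
# mentions the skill by name. Prompts should trigger the skill
# behaviorally, so a literal name mention is a spec-conformance issue.
# The scenario structure here is assumed, not the real schema.

def find_name_leaks(skill_name: str, scenarios: list[dict]) -> list[str]:
    """Return names of scenarios whose prompt contains the skill name."""
    return [
        s.get("name", "<unnamed>")
        for s in scenarios
        if skill_name.lower() in s.get("prompt", "").lower()
    ]

scenarios = [
    {"name": "direct-mention", "prompt": "Use try-fix to repair this test."},
    {"name": "behavioral", "prompt": "This test fails intermittently; repair it."},
]
print(find_name_leaks("try-fix", scenarios))  # ['direct-mention']
```

Running this inside `skill-validator check` would keep the gate at static-analysis speed while catching the prompt-leak issue before any evaluation run.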
## Empirical Data (try-fix evaluation)

| Metric | Value |
| --- | --- |
| Isolated improvement | +51.7% |
| Plugin improvement | +1.3% (not activated -- by design) |
| Overfitting | 0.40 (Moderate) |
| Scenarios | 6 |
| Evaluation rounds | 7 (target: 4 max) |
## Related