Add eval.yaml for verify-tests-fail-without-fix skill by PureWeen · Pull Request #34815 · dotnet/maui

PureWeen · 2026-04-04T23:20:27Z

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.

Context

This is an internal orchestrator-invoked skill used by pr-review to verify tests catch bugs
Follows eval best practices established during the try-fix evaluation cycle (PR Improve try-fix skill: add eval.yaml and fix prompt issues #34807)
Part of eval coverage expansion tracked in issue Establish eval.yaml lifecycle and CI integration for skill validation #34814

Eval Design

6 scenarios covering both verification modes, negative trigger, edge cases, regressions
0 output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
14 output_not_contains -- anti-pattern guards for common mistakes
1 expect_activation: false -- native spec field for negative trigger
Realistic timeouts (60s-900s depending on scenario complexity)

Scenarios

Happy path: full verification -- Tests two-phase workflow (fail without fix, pass with fix)
Happy path: verify failure only -- Tests test-creation mode (no fix needed)
Negative trigger -- Documentation question should not invoke verification
Regression: semantic inversion -- Tests passing without fix = FAILED verification (not success!)
Edge case: no test files -- PR without tests can't be verified
Regression: no manual git commands -- Script handles file revert/restore, not raw git

- 6 scenarios covering both verification modes, negative trigger, edge cases - Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting) - Tests the critical 'pass without fix = FAILED verification' semantic inversion - Production-aware prompt design with varied structure - Follows eval best practices from try-fix evaluation cycle (PR #34807) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-04-04T23:20:37Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815

Or

Run remotely in PowerShell:

iex "& { $(irm https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"

Copilot

Pull request overview

Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.

Changes:

Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.

Copilot · 2026-04-04T23:23:46Z

+      - type: output_not_contains
+        value: "no fix files"


In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).

Suggested change

- type: output_not_contains

value: "no fix files"

Copilot · 2026-04-04T23:23:46Z

+        value: "git checkout HEAD"
+      - type: output_not_contains
+        value: "git restore"
+      - type: output_not_contains
+        value: "git stash"


These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.

Suggested change

value: "git checkout HEAD"

- type: output_not_contains

value: "git restore"

- type: output_not_contains

value: "git stash"

value: "run git checkout HEAD"

- type: output_not_contains

value: "run git restore"

- type: output_not_contains

value: "run git stash"

eval.yaml: - Add positive assertions to scenarios 1, 2, 5 - Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans - Narrow scenario 6 assertions to action phrases - Add scenario 7: RequireFullVerification flag usage - Add scenario 8: inverted semantics during execution SKILL.md: - Add Activation Guard section - Add inverted pass/fail semantics warning Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

SKILL.md: - Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report) eval.yaml: - Add scenario 9: PR label automation - Add scenario 10: script auto-detection of test files - Now 10 scenarios total Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kubaflo · 2026-04-07T10:57:56Z

🔍 Multimodel Code Review — PR #34815

Reviewed with Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex in parallel.

Model	Verdict	Issues
🟣 Opus 4.6	✅ Clean	0 — eval well-structured, assertions well-targeted
🟢 GPT 5.2	⚠️ 4 issues	2 High, 2 Medium
🔵 GPT 5.3 Codex	⚠️ 2 issues	2 Medium

🔴 High (GPT only — not consensus)

1. Possibly obsolete PR label references (s/ai-reproduction-*)
SKILL.md Step 4 and eval scenario 9 reference s/ai-reproduction-confirmed / s/ai-reproduction-failed labels and claim the script auto-manages them. GPT found evidence these may be superseded by s/agent-gate-* per agent-labels.md, and the verification script itself may not contain label logic. Worth verifying current label policy.

2. SKILL.md "Determine Mode" may not match script semantics
SKILL.md says fix files = "non-test code changes in src/". The script may detect "any non-test files since merge-base" (not src/-scoped). -RequireFullVerification is described as selecting mode, but may actually just prevent silent fallback. Worth aligning wording.

🟡 Medium (GPT + Codex consensus)

3. PR description says "0 output_contains" but eval has 5
The eval has output_contains for "verify", "fail", "test", "RequireFullVerification", and "label". Suggest updating the description to reflect the actual assertion mix.

4. Scenario 8: TimeoutException treated as unconditional verification success
A TimeoutException - element not found could indicate a broken test (wrong selector) rather than valid bug reproduction. The assertions forbid saying "test is broken" or "verification failed". SKILL.md troubleshooting itself acknowledges: "Element not found → Wrong AutomationId, app crashed". Consider allowing the agent to caveat ambiguous failures.

ℹ️ Minor Observations

Timeout inconsistency: Scenarios 6 and 7 have 900s timeouts but are interpretive (not execution). Could be reduced to 120s.
Broad output_contains: "verify", "fail", "test", "label" are trivially satisfiable — rubric items do the real evaluation work, which is fine, but these could be tightened or removed.
Opus dissent: Found no issues. Considers the loose output_contains intentional (rubric-compensated) and the SKILL.md additions internally consistent.

Bottom Line

The eval design is solid — good scenario coverage, inverted-semantics regression guards, and a proper negative trigger. Main actionable items:

Fix the PR description (0 → 5 output_contains) — easy fix
Verify label policy — may be stale references
Consider scenario 8 ambiguity — TimeoutException ≠ guaranteed bug reproduction

Review performed by Copilot CLI using Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex

eval.yaml: - Scenario 6: Narrow 'git checkout HEAD' to 'I will run git checkout' (script uses git checkout internally, banning the substring causes false failures) - Scenario 8: Replace ambiguous TimeoutException with clear assertion failure (TimeoutException could indicate broken test, not bug reproduction) - Scenario 9: Remove incorrect label automation claim (script doesn't manage PR labels), replace with verification output format scenario - Copilot inline review: removed conflicting 'no fix files' assertion from s2 SKILL.md: - Step 1: Fix 'non-test code changes in src/' to 'detected by script from git diff' - Step 4: Remove false claim that script auto-manages PR labels Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-04-07T20:20:25Z

All review comments addressed ✅

From kubaflo's multimodel review:

🔴 High #1 — Obsolete PR label references: ✅ Fixed

Removed false claim in SKILL.md Step 4 that the script auto-manages PR labels (it doesn't — verified by grep of verify-tests-fail.ps1)
Replaced scenario 9 (label automation) with a verification output format scenario — no longer asserts about labels

🔴 High #2 — "Determine Mode" wording: ✅ Fixed

Changed non-test code changes in src/ → non-test code changes detected by the script from the git diff (matches actual script behavior)

🟡 Medium #3 — PR description "0 output_contains" is wrong: ✅ Fixed in v3

Already corrected in previous round — PR description says 5 output_contains (now 4 after removing the label one)

🟡 Medium #4 — Scenario 8 TimeoutException ambiguity: ✅ Fixed

Replaced ambiguous TimeoutException - element not found with a clear assertion failure (Assert.That(rect.Height, Is.GreaterThan(0))) that unambiguously indicates bug reproduction
Removed output_not_contains: "test needs to be fixed" — agent should be allowed to caveat truly ambiguous failures

From Copilot inline review:

Scenario 6 git checkout assertion: ✅ Fixed

output_not_contains: "git checkout HEAD" → "I will run git checkout" (the script itself uses git checkout internally on lines 734/764 — banning the substring would cause false failures)

Scenario 2 "no fix files" ban: Already removed in v2

Current stats: 10 scenarios, 4 `output_contains`, 16 `output_not_contains`, 10 rubric blocks

kubaflo · 2026-04-07T20:24:27Z

Follow-up Code Review — PR #34815 (v4)

Reviewing changes since my earlier multimodel review (commit 028eb16 → 596fc53).

All prior review feedback addressed ✅

Issue	Resolution	Verified
🔴 Obsolete label references	SKILL.md Step 4 removed label claim; scenario 9 replaced with verification output format scenario	✅ No more `s/ai-reproduction-*` references in SKILL.md
🔴 "Determine Mode" wording	Changed to "non-test code changes detected by the script from the git diff"	✅ Accurate
🟡 PR desc "0 output_contains"	v4 header says 4 `output_contains` — matches actual file	✅
🟡 Scenario 8 TimeoutException ambiguity	Replaced with clear assertion failure (`Assert.That(rect.Height, Is.GreaterThan(0))`) + removed overly restrictive `output_not_contains` for "test needs to be fixed"	✅ No longer rewards accepting broken tests
Scenario 6 git assertion conflict	`"git checkout HEAD"` → `"I will run git checkout"` (avoids false failure when script itself uses git checkout)	✅

v4 Changes Look Correct

SKILL.md:

Step 4 now just says "Report the result to the invoking orchestrator" — no false label claims ✅
Step 1 wording accurately describes script behavior ✅

eval.yaml:

Version bumped to v4 with changelog comment ✅
Scenario 8 prompt now uses unambiguous assertion failure — much better signal ✅
Scenario 9 repurposed to test verification output format explanation — useful scenario ✅
Stats: 10 scenarios, 4 output_contains, 16 output_not_contains, 10 rubric blocks ✅

Verdict: LGTM

Confidence: high
All multimodel review feedback addressed. The eval is well-structured with good scenario coverage. Ready for merge.

Follow-up review by Copilot CLI

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [Microsoft.Maui.Controls](https://github.com/dotnet/maui) | nuget | patch | `10.0.51` -> `10.0.60` | --- ### Release Notes <details> <summary>dotnet/maui (Microsoft.Maui.Controls)</summary> ### [`v10.0.60`](https://github.com/dotnet/maui/releases/tag/10.0.60): .NET 10 SR6 10.0.60 [Compare Source](dotnet/maui@10.0.51...10.0.60) #### What's Changed .NET MAUI 10.0.60 introduces significant improvements across all platforms with focus on quality, performance, and developer experience. This release includes 242 commits with various improvements, bug fixes, and enhancements. #### Ai Agents - Add eval.yaml for verify-tests-fail-without-fix skill by [@PureWeen](https://github.com/PureWeen) in dotnet/maui#34815 #### Blazor - \[blazorwebview] align `SupportedOSPlatform` attribute with templates by [@jonathanpeppers](https://github.com/jonathanpeppers) in dotnet/maui#25073 #### Border - \[Testing] Refactoring Feature Matrix UITest Cases for Border Control by [@HarishKumarSF4517](https://github.com/HarishKumarSF4517) in dotnet/maui#34349 - Fix LayoutCycleException from nested Borders on Windows by [@Oxymoron290](https://github.com/Oxymoron290) in dotnet/maui#34337 <details> <summary>🔧 Fixes</summary> - [LayoutCycleException caused by nested Borders in ControlTemplates](dotnet/maui#32406) </details> #### Button - \[iOS] Button RTL text and image overlap - fix by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#29041 - \[Android] Button with corner radius shadow broken on Android device - fix by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#29339 <details> <summary>🔧 Fixes</summary> - [\[Android\] Button with corner radius shadow broken on Android device](dotnet/maui#20596) </details> - \[iOS] Preserve AlwaysTemplate rendering mode in Button.ResizeImageIfNecessary by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#25107 <details> <summary>🔧 Fixes</summary> - [\[iOS\] TintColor on UIButton image no longer working when button made visible](dotnet/maui#25093) </details> - \[Android] Implemented Material3 support for ImageButton by [@Dhivya-SF4094](https://github.com/Dhivya-SF4094) in dotnet/maui#33649 <details> <summary>🔧 Fixes</summary> - [Implement Material3 support for ImageButton](dotnet/maui#33648) </details> - Fixed CI failure : Restore BackButtonBehavior IsEnabled after CanExecute changes by [@Shalini-Ashokan](https://github.com/Shalini-A...

Updated [Microsoft.Maui.Controls](https://github.com/dotnet/maui) from 10.0.51 to 10.0.60. <details> <summary>Release notes</summary> _Sourced from [Microsoft.Maui.Controls's releases](https://github.com/dotnet/maui/releases)._ ## 10.0.60 ## What's Changed .NET MAUI 10.0.60 introduces significant improvements across all platforms with focus on quality, performance, and developer experience. This release includes 242 commits with various improvements, bug fixes, and enhancements. ## Ai Agents - Add eval.yaml for verify-tests-fail-without-fix skill by @PureWeen in dotnet/maui#34815 ## Blazor - [blazorwebview] align `SupportedOSPlatform` attribute with templates by @jonathanpeppers in dotnet/maui#25073 ## Border - [Testing] Refactoring Feature Matrix UITest Cases for Border Control by @HarishKumarSF4517 in dotnet/maui#34349 - Fix LayoutCycleException from nested Borders on Windows by @Oxymoron290 in dotnet/maui#34337 <details> <summary>🔧 Fixes</summary> - [LayoutCycleException caused by nested Borders in ControlTemplates](dotnet/maui#32406) </details> ## Button - [iOS] Button RTL text and image overlap - fix by @kubaflo in dotnet/maui#29041 - [Android] Button with corner radius shadow broken on Android device - fix by @kubaflo in dotnet/maui#29339 <details> <summary>🔧 Fixes</summary> - [[Android] Button with corner radius shadow broken on Android device](dotnet/maui#20596) </details> - [iOS] Preserve AlwaysTemplate rendering mode in Button.ResizeImageIfNecessary by @kubaflo in dotnet/maui#25107 <details> <summary>🔧 Fixes</summary> - [[iOS] TintColor on UIButton image no longer working when button made visible](dotnet/maui#25093) </details> - [Android] Implemented Material3 support for ImageButton by @Dhivya-SF4094 in dotnet/maui#33649 <details> <summary>🔧 Fixes</summary> - [Implement Material3 support for ImageButton](dotnet/maui#33648) </details> - Fixed CI failure : Restore BackButtonBehavior IsEnabled after CanExecute changes by @Shalini-Ashokan in dotnet/maui#34668 ## Checkbox - [iOS/MacCatalyst] Fix CheckBox foreground color not resetting when set to null by @Ahamed-Ali in dotnet/maui#34284 <details> ... (truncated) Commits viewable in [compare view](dotnet/maui@10.0.51...10.0.60). </details> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Microsoft.Maui.Controls&package-manager=nuget&previous-version=10.0.51&new-version=10.0.60)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 4, 2026 23:20

Copilot started reviewing on behalf of PureWeen April 4, 2026 23:21 View session

Copilot AI reviewed Apr 4, 2026

View reviewed changes

github-actions Bot mentioned this pull request Apr 5, 2026

[repo-status] 📊 Daily Repo Status - April 5, 2026 #34817

Closed

PureWeen mentioned this pull request Apr 6, 2026

Establish eval.yaml lifecycle and CI integration for skill validation #34814

Open

7 tasks

kubaflo added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label Apr 7, 2026

kubaflo approved these changes Apr 7, 2026

View reviewed changes

PureWeen merged commit a38e0bb into main Apr 7, 2026
4 of 5 checks passed

PureWeen deleted the skill-eval/verify-tests-fail-without-fix branch April 7, 2026 20:54

github-actions Bot mentioned this pull request Apr 8, 2026

[repo-status] Daily Repo Status - April 8, 2026 🌟 #34871

Closed

PureWeen mentioned this pull request Apr 8, 2026

Improve code-review eval: add regression detection scenarios #34884

Open

github-actions Bot mentioned this pull request Apr 12, 2026

[repo-status] Daily Repo Status - April 12, 2026 🌟 #34922

Closed

dependabot Bot mentioned this pull request Apr 29, 2026

Bump Microsoft.Maui.Controls from 10.0.51 to 10.0.60 asbjorjo/PursuitTimer#55

Merged

PureWeen added this to the .NET 10 SR6 milestone Apr 29, 2026

This was referenced Apr 30, 2026

Bump Microsoft.Maui.Controls from 10.0.51 to 10.0.60 nilsauf/FTMS_Viewer#20

Merged

Bump Microsoft.Maui.Controls from 8.0.3 to 10.0.60 VladislavAntonyuk/ContinueOnPC#44

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add eval.yaml for verify-tests-fail-without-fix skill#34815

Add eval.yaml for verify-tests-fail-without-fix skill#34815
PureWeen merged 4 commits intomainfrom
skill-eval/verify-tests-fail-without-fix

PureWeen commented Apr 4, 2026

Uh oh!

github-actions Bot commented Apr 4, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

kubaflo commented Apr 7, 2026

Uh oh!

PureWeen commented Apr 7, 2026

Uh oh!

kubaflo commented Apr 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

PureWeen commented Apr 4, 2026

Summary

Context

Eval Design

Scenarios

Uh oh!

github-actions Bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo commented Apr 7, 2026

🔍 Multimodel Code Review — PR #34815

🔴 High (GPT only — not consensus)

🟡 Medium (GPT + Codex consensus)

ℹ️ Minor Observations

Bottom Line

Uh oh!

PureWeen commented Apr 7, 2026

All review comments addressed ✅

From kubaflo's multimodel review:

From Copilot inline review:

Current stats: 10 scenarios, 4 output_contains, 16 output_not_contains, 10 rubric blocks

Uh oh!

kubaflo commented Apr 7, 2026

Follow-up Code Review — PR #34815 (v4)

All prior review feedback addressed ✅

v4 Changes Look Correct

Verdict: LGTM

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Apr 4, 2026 •

edited

Loading

Current stats: 10 scenarios, 4 `output_contains`, 16 `output_not_contains`, 10 rubric blocks