Add eval.yaml for verify-tests-fail-without-fix skill #34815
- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:
curl -fsSL https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815
Or
iex "& { $(irm https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"
Pull request overview
Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.
Changes:
- Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.
    - type: output_not_contains
      value: "no fix files"
In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).
Suggested change (remove these lines):
    - type: output_not_contains
      value: "no fix files"
      value: "git checkout HEAD"
    - type: output_not_contains
      value: "git restore"
    - type: output_not_contains
      value: "git stash"
These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.
Suggested change, before:
      value: "git checkout HEAD"
    - type: output_not_contains
      value: "git restore"
    - type: output_not_contains
      value: "git stash"
Suggested change, after:
      value: "run git checkout HEAD"
    - type: output_not_contains
      value: "run git restore"
    - type: output_not_contains
      value: "run git stash"
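The false-failure risk described in this comment can be reproduced with a small sketch. It assumes `output_not_contains` behaves like a plain, case-sensitive substring test, which is a guess about skill-validator rather than something confirmed here. The two transcripts and the `violates_ban` helper are hypothetical and exist only for illustration.

```python
def violates_ban(output: str, banned: str) -> bool:
    """Assumed model of an output_not_contains check: True if the banned text appears."""
    return banned in output


# Hypothetical transcripts, invented for this example only.
# 1) Correct behavior: the agent runs the script, and the script's own
#    error message happens to mention "git checkout HEAD".
script_driven = (
    "Ran verify-tests-fail.ps1\n"
    "ERROR: could not revert fix files; try git checkout HEAD -- <file>\n"
)
# 2) Anti-pattern: the agent tells the user to revert files by hand.
manual_revert = "First, run git checkout HEAD -- src/Foo.cs to revert the fix yourself."

broad = "git checkout HEAD"       # original assertion value
narrow = "run git checkout HEAD"  # value proposed in the suggestion

# The broad ban rejects the correct, script-driven transcript (a false failure).
broad_rejects_correct = violates_ban(script_driven, broad)
# The narrow ban lets the script output through but still catches the manual instruction.
narrow_rejects_correct = violates_ban(script_driven, narrow)
narrow_rejects_manual = violates_ban(manual_revert, narrow)

print(broad_rejects_correct, narrow_rejects_correct, narrow_rejects_manual)
```

Under these assumptions the narrower phrase separates the two cases, although it would still miss differently worded manual instructions, so it is a heuristic rather than a guarantee.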
eval.yaml:
- Add positive assertions to scenarios 1, 2, 5
- Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans
- Narrow scenario 6 assertions to action phrases
- Add scenario 7: RequireFullVerification flag usage
- Add scenario 8: inverted semantics during execution

SKILL.md:
- Add Activation Guard section
- Add inverted pass/fail semantics warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md:
- Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report)

eval.yaml:
- Add scenario 9: PR label automation
- Add scenario 10: script auto-detection of test files
- Now 10 scenarios total

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
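The "Interpret Results" step and the inverted pass/fail semantics can be sketched as a small decision function. This is not the logic of verify-tests-fail.ps1. The `Mode` enum and `verification_passed` function are hypothetical names, and the rule that full verification also requires the tests to pass once the fix is applied is an assumption that this thread does not state explicitly.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    FULL = "full"                  # fix files detected: check without and with the fix
    FAILURE_ONLY = "failure_only"  # no fix files detected: only check that tests fail


def verification_passed(
    mode: Mode,
    failed_without_fix: bool,
    passed_with_fix: Optional[bool] = None,
) -> bool:
    """Hypothetical interpretation of results with inverted semantics."""
    # Inverted semantics: tests that PASS without the fix do not catch the bug,
    # so verification FAILS regardless of mode.
    if not failed_without_fix:
        return False
    if mode is Mode.FAILURE_ONLY:
        # Nothing further to check when no fix files exist.
        return True
    # Assumption: in full mode the same tests must pass once the fix is applied.
    return bool(passed_with_fix)


# A red test run without the fix is the desired outcome.
print(verification_passed(Mode.FAILURE_ONLY, failed_without_fix=True))
# A green test run without the fix means verification failed.
print(verification_passed(Mode.FAILURE_ONLY, failed_without_fix=False))
```

The point of the sketch is that a failing test run maps to a successful verification, which is the inversion the eval scenarios are meant to guard.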
🔍 Multimodel Code Review: PR #34815
Reviewed with Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex in parallel.
🔴 High (GPT only, not consensus)
1. Possibly obsolete PR label references (
2. SKILL.md "Determine Mode" may not match script semantics
🟡 Medium (GPT + Codex consensus)
3. PR description says "0
4. Scenario 8: TimeoutException treated as unconditional verification success
ℹ️ Minor Observations
Bottom Line
The eval design is solid: good scenario coverage, inverted-semantics regression guards, and a proper negative trigger. Main actionable items:
Review performed by Copilot CLI using Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex
eval.yaml:
- Scenario 6: Narrow 'git checkout HEAD' to 'I will run git checkout' (script uses git checkout internally, banning the substring causes false failures)
- Scenario 8: Replace ambiguous TimeoutException with clear assertion failure (TimeoutException could indicate broken test, not bug reproduction)
- Scenario 9: Remove incorrect label automation claim (script doesn't manage PR labels), replace with verification output format scenario
- Copilot inline review: removed conflicting 'no fix files' assertion from s2

SKILL.md:
- Step 1: Fix 'non-test code changes in src/' to 'detected by script from git diff'
- Step 4: Remove false claim that script auto-manages PR labels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All review comments addressed ✅
From kubaflo's multimodel review:
🔴 High #1 - Obsolete PR label references: ✅ Fixed
🔴 High #2 - "Determine Mode" wording: ✅ Fixed
🟡 Medium #3 - PR description "0 output_contains" is wrong: ✅ Fixed in v3
🟡 Medium #4 - Scenario 8 TimeoutException ambiguity: ✅ Fixed
From Copilot inline review:
Scenario 6 git checkout assertion: ✅ Fixed
Scenario 2 "no fix files" ban: Already removed in v2
Current stats: 10 scenarios, 4
Follow-up Code Review: PR #34815 (v4)
Reviewing changes since my earlier multimodel review (commit
All prior review feedback addressed ✅
v4 Changes Look Correct
SKILL.md:
eval.yaml:
Verdict: LGTM
Confidence: high
Follow-up review by Copilot CLI
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [Microsoft.Maui.Controls](https://github.com/dotnet/maui) | nuget | patch | `10.0.51` -> `10.0.60` |

---

### Release Notes

<details>
<summary>dotnet/maui (Microsoft.Maui.Controls)</summary>

### [`v10.0.60`](https://github.com/dotnet/maui/releases/tag/10.0.60): .NET 10 SR6 10.0.60

[Compare Source](dotnet/maui@10.0.51...10.0.60)

#### What's Changed

.NET MAUI 10.0.60 introduces significant improvements across all platforms with focus on quality, performance, and developer experience. This release includes 242 commits with various improvements, bug fixes, and enhancements.

#### Ai Agents

- Add eval.yaml for verify-tests-fail-without-fix skill by [@PureWeen](https://github.com/PureWeen) in dotnet/maui#34815

#### Blazor

- \[blazorwebview] align `SupportedOSPlatform` attribute with templates by [@jonathanpeppers](https://github.com/jonathanpeppers) in dotnet/maui#25073

#### Border

- \[Testing] Refactoring Feature Matrix UITest Cases for Border Control by [@HarishKumarSF4517](https://github.com/HarishKumarSF4517) in dotnet/maui#34349
- Fix LayoutCycleException from nested Borders on Windows by [@Oxymoron290](https://github.com/Oxymoron290) in dotnet/maui#34337
  <details> <summary>🔧 Fixes</summary>
  - [LayoutCycleException caused by nested Borders in ControlTemplates](dotnet/maui#32406)
  </details>

#### Button

- \[iOS] Button RTL text and image overlap - fix by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#29041
- \[Android] Button with corner radius shadow broken on Android device - fix by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#29339
  <details> <summary>🔧 Fixes</summary>
  - [\[Android\] Button with corner radius shadow broken on Android device](dotnet/maui#20596)
  </details>
- \[iOS] Preserve AlwaysTemplate rendering mode in Button.ResizeImageIfNecessary by [@kubaflo](https://github.com/kubaflo) in dotnet/maui#25107
  <details> <summary>🔧 Fixes</summary>
  - [\[iOS\] TintColor on UIButton image no longer working when button made visible](dotnet/maui#25093)
  </details>
- \[Android] Implemented Material3 support for ImageButton by [@Dhivya-SF4094](https://github.com/Dhivya-SF4094) in dotnet/maui#33649
  <details> <summary>🔧 Fixes</summary>
  - [Implement Material3 support for ImageButton](dotnet/maui#33648)
  </details>
- Fixed CI failure : Restore BackButtonBehavior IsEnabled after CanExecute changes by [@Shalini-Ashokan](https://github.com/Shalini-A...
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.
Context
pr-review to verify tests catch bugs
Eval Design
output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
output_not_contains -- anti-pattern guards for common mistakes
expect_activation: false -- native spec field for negative trigger
Scenarios