Skip to content

Add eval.yaml for verify-tests-fail-without-fix skill#34815

Merged
PureWeen merged 4 commits intomainfrom
skill-eval/verify-tests-fail-without-fix
Apr 7, 2026
Merged

Add eval.yaml for verify-tests-fail-without-fix skill#34815
PureWeen merged 4 commits intomainfrom
skill-eval/verify-tests-fail-without-fix

Conversation

@PureWeen
Copy link
Copy Markdown
Member

@PureWeen PureWeen commented Apr 4, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.

Context

Eval Design

  • 6 scenarios covering both verification modes, negative trigger, edge cases, regressions
  • 0 output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
  • 14 output_not_contains -- anti-pattern guards for common mistakes
  • 1 expect_activation: false -- native spec field for negative trigger
  • Realistic timeouts (60s-900s depending on scenario complexity)

Scenarios

  1. Happy path: full verification -- Tests two-phase workflow (fail without fix, pass with fix)
  2. Happy path: verify failure only -- Tests test-creation mode (no fix needed)
  3. Negative trigger -- Documentation question should not invoke verification
  4. Regression: semantic inversion -- Tests passing without fix = FAILED verification (not success!)
  5. Edge case: no test files -- PR without tests can't be verified
  6. Regression: no manual git commands -- Script handles file revert/restore, not raw git

- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://github.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.

Changes:

  • Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.

Comment on lines +36 to +37
- type: output_not_contains
value: "no fix files"
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).

Suggested change
- type: output_not_contains
value: "no fix files"

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +109
value: "git checkout HEAD"
- type: output_not_contains
value: "git restore"
- type: output_not_contains
value: "git stash"
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.

Suggested change
value: "git checkout HEAD"
- type: output_not_contains
value: "git restore"
- type: output_not_contains
value: "git stash"
value: "run git checkout HEAD"
- type: output_not_contains
value: "run git restore"
- type: output_not_contains
value: "run git stash"

Copilot uses AI. Check for mistakes.
eval.yaml:
- Add positive assertions to scenarios 1, 2, 5
- Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans
- Narrow scenario 6 assertions to action phrases
- Add scenario 7: RequireFullVerification flag usage
- Add scenario 8: inverted semantics during execution

SKILL.md:
- Add Activation Guard section
- Add inverted pass/fail semantics warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md:
- Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report)

eval.yaml:
- Add scenario 9: PR label automation
- Add scenario 10: script auto-detection of test files
- Now 10 scenarios total

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo
Copy link
Copy Markdown
Contributor

kubaflo commented Apr 7, 2026

🔍 Multimodel Code Review — PR #34815

Reviewed with Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex in parallel.

Model Verdict Issues
🟣 Opus 4.6 ✅ Clean 0 — eval well-structured, assertions well-targeted
🟢 GPT 5.2 ⚠️ 4 issues 2 High, 2 Medium
🔵 GPT 5.3 Codex ⚠️ 2 issues 2 Medium

🔴 High (GPT only — not consensus)

1. Possibly obsolete PR label references (s/ai-reproduction-*)
SKILL.md Step 4 and eval scenario 9 reference s/ai-reproduction-confirmed / s/ai-reproduction-failed labels and claim the script auto-manages them. GPT found evidence these may be superseded by s/agent-gate-* per agent-labels.md, and the verification script itself may not contain label logic. Worth verifying current label policy.

2. SKILL.md "Determine Mode" may not match script semantics
SKILL.md says fix files = "non-test code changes in src/". The script may detect "any non-test files since merge-base" (not src/-scoped). -RequireFullVerification is described as selecting mode, but may actually just prevent silent fallback. Worth aligning wording.


🟡 Medium (GPT + Codex consensus)

3. PR description says "0 output_contains" but eval has 5
The eval has output_contains for "verify", "fail", "test", "RequireFullVerification", and "label". Suggest updating the description to reflect the actual assertion mix.

4. Scenario 8: TimeoutException treated as unconditional verification success
A TimeoutException - element not found could indicate a broken test (wrong selector) rather than valid bug reproduction. The assertions forbid saying "test is broken" or "verification failed". SKILL.md troubleshooting itself acknowledges: "Element not found → Wrong AutomationId, app crashed". Consider allowing the agent to caveat ambiguous failures.


ℹ️ Minor Observations

  • Timeout inconsistency: Scenarios 6 and 7 have 900s timeouts but are interpretive (not execution). Could be reduced to 120s.
  • Broad output_contains: "verify", "fail", "test", "label" are trivially satisfiable — rubric items do the real evaluation work, which is fine, but these could be tightened or removed.
  • Opus dissent: Found no issues. Considers the loose output_contains intentional (rubric-compensated) and the SKILL.md additions internally consistent.

Bottom Line

The eval design is solid — good scenario coverage, inverted-semantics regression guards, and a proper negative trigger. Main actionable items:

  1. Fix the PR description (0 → 5 output_contains) — easy fix
  2. Verify label policy — may be stale references
  3. Consider scenario 8 ambiguity — TimeoutException ≠ guaranteed bug reproduction

Review performed by Copilot CLI using Claude Opus 4.6, GPT 5.2, and GPT 5.3 Codex

@kubaflo kubaflo added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label Apr 7, 2026
eval.yaml:
- Scenario 6: Narrow 'git checkout HEAD' to 'I will run git checkout' (script
  uses git checkout internally, banning the substring causes false failures)
- Scenario 8: Replace ambiguous TimeoutException with clear assertion failure
  (TimeoutException could indicate broken test, not bug reproduction)
- Scenario 9: Remove incorrect label automation claim (script doesn't manage
  PR labels), replace with verification output format scenario
- Copilot inline review: removed conflicting 'no fix files' assertion from s2

SKILL.md:
- Step 1: Fix 'non-test code changes in src/' to 'detected by script from git diff'
- Step 4: Remove false claim that script auto-manages PR labels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

PureWeen commented Apr 7, 2026

All review comments addressed ✅

From kubaflo's multimodel review:

🔴 High #1 — Obsolete PR label references: ✅ Fixed

  • Removed false claim in SKILL.md Step 4 that the script auto-manages PR labels (it doesn't — verified by grep of verify-tests-fail.ps1)
  • Replaced scenario 9 (label automation) with a verification output format scenario — no longer asserts about labels

🔴 High #2 — "Determine Mode" wording: ✅ Fixed

  • Changed non-test code changes in src/non-test code changes detected by the script from the git diff (matches actual script behavior)

🟡 Medium #3 — PR description "0 output_contains" is wrong: ✅ Fixed in v3

  • Already corrected in previous round — PR description says 5 output_contains (now 4 after removing the label one)

🟡 Medium #4 — Scenario 8 TimeoutException ambiguity: ✅ Fixed

  • Replaced ambiguous TimeoutException - element not found with a clear assertion failure (Assert.That(rect.Height, Is.GreaterThan(0))) that unambiguously indicates bug reproduction
  • Removed output_not_contains: "test needs to be fixed" — agent should be allowed to caveat truly ambiguous failures

From Copilot inline review:

Scenario 6 git checkout assertion: ✅ Fixed

  • output_not_contains: "git checkout HEAD""I will run git checkout" (the script itself uses git checkout internally on lines 734/764 — banning the substring would cause false failures)

Scenario 2 "no fix files" ban: Already removed in v2

Current stats: 10 scenarios, 4 output_contains, 16 output_not_contains, 10 rubric blocks

@kubaflo
Copy link
Copy Markdown
Contributor

kubaflo commented Apr 7, 2026

Follow-up Code Review — PR #34815 (v4)

Reviewing changes since my earlier multimodel review (commit 028eb16596fc53).

All prior review feedback addressed ✅

Issue Resolution Verified
🔴 Obsolete label references SKILL.md Step 4 removed label claim; scenario 9 replaced with verification output format scenario ✅ No more s/ai-reproduction-* references in SKILL.md
🔴 "Determine Mode" wording Changed to "non-test code changes detected by the script from the git diff" ✅ Accurate
🟡 PR desc "0 output_contains" v4 header says 4 output_contains — matches actual file
🟡 Scenario 8 TimeoutException ambiguity Replaced with clear assertion failure (Assert.That(rect.Height, Is.GreaterThan(0))) + removed overly restrictive output_not_contains for "test needs to be fixed" ✅ No longer rewards accepting broken tests
Scenario 6 git assertion conflict "git checkout HEAD""I will run git checkout" (avoids false failure when script itself uses git checkout)

v4 Changes Look Correct

SKILL.md:

  • Step 4 now just says "Report the result to the invoking orchestrator" — no false label claims ✅
  • Step 1 wording accurately describes script behavior ✅

eval.yaml:

  • Version bumped to v4 with changelog comment ✅
  • Scenario 8 prompt now uses unambiguous assertion failure — much better signal ✅
  • Scenario 9 repurposed to test verification output format explanation — useful scenario ✅
  • Stats: 10 scenarios, 4 output_contains, 16 output_not_contains, 10 rubric blocks ✅

Verdict: LGTM

Confidence: high
All multimodel review feedback addressed. The eval is well-structured with good scenario coverage. Ready for merge.

Follow-up review by Copilot CLI

@PureWeen PureWeen merged commit a38e0bb into main Apr 7, 2026
4 of 5 checks passed
@PureWeen PureWeen deleted the skill-eval/verify-tests-fail-without-fix branch April 7, 2026 20:54
@PureWeen PureWeen added this to the .NET 10 SR6 milestone Apr 29, 2026
evgenygunko pushed a commit to evgenygunko/CopyWordsDA that referenced this pull request Apr 30, 2026
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [Microsoft.Maui.Controls](https://github.com/dotnet/maui) | nuget | patch | `10.0.51` -> `10.0.60` |

---

### Release Notes

<details>
<summary>dotnet/maui (Microsoft.Maui.Controls)</summary>

### [`v10.0.60`](https://github.com/dotnet/maui/releases/tag/10.0.60): .NET 10 SR6 10.0.60

[Compare Source](dotnet/maui@10.0.51...10.0.60)

#### What's Changed

.NET MAUI 10.0.60 introduces significant improvements across all platforms with focus on quality, performance, and developer experience. This release includes 242 commits with various improvements, bug fixes, and enhancements.

#### Ai Agents

-   Add eval.yaml for verify-tests-fail-without-fix skill by [@&#8203;PureWeen](https://github.com/PureWeen) in dotnet/maui#34815

#### Blazor

-   \[blazorwebview] align `SupportedOSPlatform` attribute with templates by [@&#8203;jonathanpeppers](https://github.com/jonathanpeppers) in dotnet/maui#25073

#### Border

-   \[Testing] Refactoring Feature Matrix UITest Cases for Border Control by [@&#8203;HarishKumarSF4517](https://github.com/HarishKumarSF4517) in dotnet/maui#34349

-   Fix LayoutCycleException from nested Borders on Windows by [@&#8203;Oxymoron290](https://github.com/Oxymoron290) in dotnet/maui#34337

    <details>
    <summary>🔧 Fixes</summary>

    -   [LayoutCycleException caused by nested Borders in ControlTemplates](dotnet/maui#32406)

    </details>

#### Button

-   \[iOS] Button RTL text and image overlap - fix by [@&#8203;kubaflo](https://github.com/kubaflo) in dotnet/maui#29041

-   \[Android] Button with corner radius shadow broken on Android device - fix by [@&#8203;kubaflo](https://github.com/kubaflo) in dotnet/maui#29339

    <details>
    <summary>🔧 Fixes</summary>

    -   [\[Android\] Button with corner radius shadow broken on Android device](dotnet/maui#20596)

    </details>

-   \[iOS] Preserve AlwaysTemplate rendering mode in Button.ResizeImageIfNecessary by [@&#8203;kubaflo](https://github.com/kubaflo) in dotnet/maui#25107

    <details>
    <summary>🔧 Fixes</summary>

    -   [\[iOS\] TintColor on UIButton image no longer working when button made visible](dotnet/maui#25093)

    </details>

-   \[Android] Implemented Material3 support for  ImageButton by [@&#8203;Dhivya-SF4094](https://github.com/Dhivya-SF4094) in dotnet/maui#33649

    <details>
    <summary>🔧 Fixes</summary>

    -   [Implement Material3 support for ImageButton](dotnet/maui#33648)

    </details>

-   Fixed CI failure : Restore BackButtonBehavior IsEnabled after CanExecute changes by [@&#8203;Shalini-Ashokan](https://github.com/Shalini-A...
TheCodeTraveler pushed a commit to TheCodeTraveler/MAUIChatGPTClone that referenced this pull request Apr 30, 2026
Updated [Microsoft.Maui.Controls](https://github.com/dotnet/maui) from
10.0.51 to 10.0.60.

<details>
<summary>Release notes</summary>

_Sourced from [Microsoft.Maui.Controls's
releases](https://github.com/dotnet/maui/releases)._

## 10.0.60

## What's Changed

.NET MAUI 10.0.60 introduces significant improvements across all
platforms with focus on quality, performance, and developer experience.
This release includes 242 commits with various improvements, bug fixes,
and enhancements.


## Ai Agents
- Add eval.yaml for verify-tests-fail-without-fix skill by @​PureWeen in
dotnet/maui#34815

## Blazor
- [blazorwebview] align `SupportedOSPlatform` attribute with templates
by @​jonathanpeppers in dotnet/maui#25073

## Border
- [Testing] Refactoring Feature Matrix UITest Cases for Border Control
by @​HarishKumarSF4517 in dotnet/maui#34349

- Fix LayoutCycleException from nested Borders on Windows by
@​Oxymoron290 in dotnet/maui#34337
  <details>
  <summary>🔧 Fixes</summary>

- [LayoutCycleException caused by nested Borders in
ControlTemplates](dotnet/maui#32406)
  </details>

## Button
- [iOS] Button RTL text and image overlap - fix by @​kubaflo in
dotnet/maui#29041

- [Android] Button with corner radius shadow broken on Android device -
fix by @​kubaflo in dotnet/maui#29339
  <details>
  <summary>🔧 Fixes</summary>

- [[Android] Button with corner radius shadow broken on Android
device](dotnet/maui#20596)
  </details>

- [iOS] Preserve AlwaysTemplate rendering mode in
Button.ResizeImageIfNecessary by @​kubaflo in
dotnet/maui#25107
  <details>
  <summary>🔧 Fixes</summary>

- [[iOS] TintColor on UIButton image no longer working when button made
visible](dotnet/maui#25093)
  </details>

- [Android] Implemented Material3 support for ImageButton by
@​Dhivya-SF4094 in dotnet/maui#33649
  <details>
  <summary>🔧 Fixes</summary>

- [Implement Material3 support for
ImageButton](dotnet/maui#33648)
  </details>

- Fixed CI failure : Restore BackButtonBehavior IsEnabled after
CanExecute changes by @​Shalini-Ashokan in
dotnet/maui#34668

## Checkbox
- [iOS/MacCatalyst] Fix CheckBox foreground color not resetting when set
to null by @​Ahamed-Ali in dotnet/maui#34284
  <details>
 ... (truncated)

Commits viewable in [compare
view](dotnet/maui@10.0.51...10.0.60).
</details>

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Microsoft.Maui.Controls&package-manager=nuget&previous-version=10.0.51&new-version=10.0.60)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-ai-agents Copilot CLI agents, agent skills, AI-assisted development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants