
Add continuous-eval observe ref for Foundry agent Monitoring #1733

Open
jugonzales wants to merge 5 commits into microsoft:main from jugonzales:jugonzales/continuous-evals-skill

Member

@jugonzales jugonzales commented Apr 6, 2026

Description

Adds a new continuous-eval sub-skill that documents the continuous_eval_create, continuous_eval_get, and
continuous_eval_delete MCP tools. These tools enable ongoing evaluation of agent responses — auto-detecting
agent kind and routing to the appropriate backend (evaluation rules for prompt/workflow agents, scheduled
evaluations for hosted agents).
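
As a rough illustration of the routing described above, here is a minimal sketch. It is not the tool's implementation: `select_backend`, the kind strings, and the backend labels are hypothetical names chosen for this example, and the real detection happens inside the MCP tool.

```python
# Hypothetical sketch of the routing in the description; not the real tool code.
# The kind strings and backend labels are assumptions made for illustration.
BACKEND_BY_KIND = {
    "prompt": "evaluation_rule",        # prompt agents -> evaluation rules
    "workflow": "evaluation_rule",      # workflow agents -> evaluation rules
    "hosted": "scheduled_evaluation",   # hosted agents -> scheduled evaluations
}


def select_backend(agent_kind: str) -> str:
    """Return the assumed backend label for a given agent kind."""
    try:
        return BACKEND_BY_KIND[agent_kind.lower()]
    except KeyError:
        # Fail loudly rather than guessing a backend for an unknown kind.
        raise ValueError(f"unknown agent kind: {agent_kind!r}") from None
```

A caller would never pick the backend itself; the sketch only shows why a single create call can serve all three agent kinds.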

Changes:

- New foundry-agent/continuous-eval/continuous-eval.md: skill doc with entry points, behavioral rules, operations, response format, and evaluator guidance
- SKILL.md: registered continuous-eval sub-skill in the sub-skills table and lifecycle table
- observe/observe.md: added cross-reference to continuous-eval in Related Skills
- Updated trigger snapshots for the new keywords

Checklist

  • Tests pass locally (cd tests && npm test)
  • If modifying skill descriptions: verified routing correctness with integration tests (npm run test:skills:integration -- <skill>)
  • If modifying skill USE FOR / DO NOT USE FOR / PREFER OVER clauses: confirmed no routing regressions for competing skills
  • Version bumped in skill frontmatter (if skill files changed)

@jugonzales jugonzales force-pushed the jugonzales/continuous-evals-skill branch from aa1cdb5 to be05bf7 April 7, 2026 19:41
jugonzales and others added 2 commits April 9, 2026 14:17
- New continuous-eval skill doc with entry points, behavioral rules, operations, and response format
- Register continuous-eval in SKILL.md sub-skills table and lifecycle table
- Add continuous-eval cross-reference to observe.md Related Skills
- Trim SKILL.md description to fit 1024 char limit
- Update trigger snapshots for new keywords

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jugonzales jugonzales force-pushed the jugonzales/continuous-evals-skill branch from be05bf7 to 474a51d April 9, 2026 21:37
@jugonzales jugonzales marked this pull request as ready for review April 9, 2026 21:42
Copilot AI review requested due to automatic review settings April 9, 2026 21:42
Contributor

Copilot AI left a comment

Pull request overview

Adds documentation and routing keywords for Foundry “continuous evaluation” monitoring, integrating it into the existing observe workflow and updating trigger keyword snapshots accordingly.

Changes:

  • Added a new continuous evaluation reference doc covering continuous_eval_create/get/delete and how to act on monitoring results.
  • Updated observe step-6 monitoring guidance and expanded observe entry points/keywords to include production monitoring scenarios.
  • Bumped microsoft-foundry skill version and refreshed trigger keyword snapshots to reflect the updated description/keywords.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| plugin/skills/microsoft-foundry/SKILL.md | Updates skill description/keywords, bumps version, and clarifies observe lifecycle entry for continuous monitoring. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | Adds continuous monitoring intents, tools, and entry point to the observe skill. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md | Refocuses Step 6 on CI/CD eval gates plus continuous production monitoring and links to continuous eval reference. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | New reference describing continuous evaluation configuration, operations, and remediation loop. |
| tests/microsoft-foundry/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords (continuous/enable/disable, etc.). |
| tests/microsoft-foundry/resource/create/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/models/deploy/deploy-model/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/models/deploy/deploy-model-optimal-region/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/models/deploy/customize-deployment/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/models/deploy/capacity/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/create/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/deploy/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/invoke/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/observe/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/trace/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/troubleshoot/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |
| tests/microsoft-foundry/foundry-agent/eval-datasets/snapshots/triggers.test.ts.snap | Snapshot update for changed skill description keywords. |

Comments suppressed due to low confidence (1)

plugin/skills/microsoft-foundry/SKILL.md:28

  • PR description says a new doc was added at foundry-agent/continuous-eval/continuous-eval.md, but the change actually adds foundry-agent/observe/references/continuous-eval.md and no new foundry-agent/continuous-eval/ sub-skill entry appears in the Sub-Skills table. Either update the PR description to match, or add/register the intended standalone sub-skill to avoid confusion for maintainers.
| Sub-Skill | When to Use | Reference |
|-----------|-------------|-----------|
| **deploy** | Containerize, build, push to ACR, create/update/start/stop/clone agent deployments | [deploy](foundry-agent/deploy/deploy.md) |
| **invoke** | Send messages to an agent, single or multi-turn conversations | [invoke](foundry-agent/invoke/invoke.md) |
| **observe** | Evaluate agent quality, run batch evals, analyze failures, optimize prompts, improve agent instructions, compare versions, set up CI/CD monitoring, and enable continuous production evaluation | [observe](foundry-agent/observe/observe.md) |
| **trace** | Query traces, analyze latency/failures, correlate eval results to specific responses via App Insights `customEvents` | [trace](foundry-agent/trace/trace.md) |
| **troubleshoot** | View container logs, query telemetry, diagnose failures | [troubleshoot](foundry-agent/troubleshoot/troubleshoot.md) |
| **create** | Create new hosted agent applications. Supports Microsoft Agent Framework, LangGraph, or custom frameworks in Python or C#. Downloads starter samples from foundry-samples repo. | [create](foundry-agent/create/create.md) |
| **eval-datasets** | Harvest production traces into evaluation datasets, manage dataset versions and splits, track evaluation metrics over time, detect regressions, and maintain full lineage from trace to deployment. Use for: create dataset from traces, dataset versioning, evaluation trending, regression detection, dataset comparison, eval lineage. | [eval-datasets](foundry-agent/eval-datasets/eval-datasets.md) |

Collaborator

@jongio jongio left a comment

The new continuous-eval.md is well-structured - clear entry points, behavioral rules, and operations with proper cross-references to trace/deploy/observe. Separating pre-deploy (CI/CD pipeline) from post-deploy (continuous monitoring) in cicd-monitoring.md is a solid design choice. Version bump and snapshot updates look correct.

Two things to address before merging:

  1. cicd-monitoring.md duplicates roughly 70% of continuous-eval.md's remediation content (score reading, triage steps, routing table, verification). The two docs already have small wording drift between them. Since cicd-monitoring.md already links to continuous-eval.md for setup, it should also defer to it for the "acting on results" workflow instead of duplicating it inline.

  2. observe.md Quick Reference lists continuous_eval_create and continuous_eval_get but not continuous_eval_delete. The linked continuous-eval.md documents delete as a full operation with its own entry point - the parent should surface it too.

3. **Enable** — call `continuous_eval_create` with the selected evaluators. The tool auto-detects agent kind and configures the appropriate backend (real-time for prompt agents, scheduled for hosted agents).
4. **Confirm** — present the returned configuration to the user.

### Acting on Monitoring Results
Collaborator

This entire section (reading scores, triage, remediation routing, verification) is nearly identical to continuous-eval.md's "Acting on Results" - already with small wording drift (e.g., "and timestamps" here but not there, "Route To" vs "Action" column names). Since you're already linking to continuous-eval.md for the setup workflow, consider replacing this with:
> For how to read evaluation scores, triage regressions, and verify fixes, see Acting on Results.
That keeps continuous-eval.md as the single source of truth for the remediation loop.

|----------|-------|
| MCP server | `azure` |
| Key MCP tools | `evaluator_catalog_get`, `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, `prompt_optimize`, `agent_update` |
| Key MCP tools | `evaluator_catalog_get`, `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, `prompt_optimize`, `agent_update`, `continuous_eval_create`, `continuous_eval_get` |
Collaborator

continuous_eval_delete is missing from this list but is documented as a full operation in the linked continuous-eval.md reference (with its own entry point for "Delete continuous eval"). Consider adding it here for consistency.
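
A check like the one below could catch this kind of gap automatically. It is only a sketch under assumptions: `missing_tools` is a hypothetical helper, and the two lists are typed in by hand here rather than parsed from observe.md and continuous-eval.md.

```python
def missing_tools(parent_tools, reference_tools):
    """Return tools documented in the reference but absent from the parent list."""
    return sorted(set(reference_tools) - set(parent_tools))


# Continuous-eval tools as listed in the observe.md Quick Reference in this PR.
quick_reference = ["continuous_eval_create", "continuous_eval_get"]
# Continuous-eval tools documented as operations in continuous-eval.md.
reference_doc = [
    "continuous_eval_create",
    "continuous_eval_get",
    "continuous_eval_delete",
]

print(missing_tools(quick_reference, reference_doc))  # ['continuous_eval_delete']
```

Comparing as sets keeps the check independent of the order in which either doc lists its tools.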

Copilot AI review requested due to automatic review settings April 9, 2026 22:23
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

@jugonzales jugonzales changed the title from "Add continuous-eval sub skill for Foundry agent Monitoring" to "Add continuous-eval observe ref for Foundry agent Monitoring" Apr 9, 2026
Collaborator

@jongio jongio left a comment

The new continuous-eval.md is solid - entry points, behavioral rules, and operations are well-structured with proper cross-references. Four additional items beyond what's already flagged:

  1. The Disable operation likely destroys evaluator config if the tool upserts (see inline comment).
  2. evaluation_get isn't in the Quick Reference but is used in the remediation workflow.
  3. The "DO NOT manually call" guardrail in observe.md doesn't cover continuous_eval_create.
  4. Evaluator examples differ across three files with no explanation of why.
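
To make the first item concrete, here is a minimal sketch of the failure mode, assuming the tool behaves as a replace-style upsert. Everything in it is hypothetical: the in-memory `store`, the `enabled` and `evaluators` field names, and the evaluator names are stand-ins, not the tool's actual schema.

```python
def upsert_replace(store, agent, payload):
    """Assumed upsert semantics: the payload replaces the whole stored config."""
    store[agent] = dict(payload)


def disable_preserving(store, agent):
    """Read the current config, flip only `enabled`, and write it all back."""
    current = dict(store.get(agent, {}))
    current["enabled"] = False
    upsert_replace(store, agent, current)


# Hypothetical starting config; field and evaluator names are illustrative.
original = {"my-agent": {"enabled": True, "evaluators": ["coherence", "relevance"]}}

naive = {name: dict(cfg) for name, cfg in original.items()}
upsert_replace(naive, "my-agent", {"enabled": False})  # evaluator list is dropped

safe = {name: dict(cfg) for name, cfg in original.items()}
disable_preserving(safe, "my-agent")  # evaluator list survives the disable
```

If the tool merges rather than replaces, the naive call is harmless, which is why the doc should state the semantics either way.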

Collaborator

@jongio jongio left a comment

Three additional items beyond what's already flagged:

  1. SKILL.md description dropped the "prompt optimizer workflows" keyword that was in the previous version. The new wording has "prompt optimizer" but not "prompt optimizer workflows" - this could regress routing for that phrase. Either restore it or confirm the removal was intentional.

  2. The scenario parameter in continuous-eval.md's optional parameters table lists standard and business values but doesn't explain what each mode does or when to choose one. An agent can't make a useful recommendation without this context.

  3. continuous_eval_get returns a list (per the Response Format section), but the Disable and Delete workflows assume a single config. If multiple configs exist, there's no guidance on which to target - worth a note on expected cardinality or how to disambiguate.
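
One way to handle the cardinality question in the third item is to refuse to act unless exactly one config is selected. The sketch below assumes each returned config is a dict with an `id` field; that field name and the `pick_config` helper are hypothetical, not taken from the tool's response format.

```python
def pick_config(configs, config_id=None):
    """Return exactly one config, or raise if the choice is missing or ambiguous."""
    if config_id is not None:
        matches = [c for c in configs if c.get("id") == config_id]
    else:
        matches = list(configs)
    if len(matches) != 1:
        # Zero or several candidates: ask the user instead of guessing a target.
        raise LookupError(f"expected exactly one config, found {len(matches)}")
    return matches[0]


# Illustrative data only; a real list would come from continuous_eval_get.
configs = [{"id": "cfg-a"}, {"id": "cfg-b"}]
target = pick_config(configs, config_id="cfg-b")
```

Raising on ambiguity keeps Disable and Delete from silently acting on the wrong config when more than one exists.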
