agent-studio/backend/openjiuwen_studio/evaluation · openJiuwen/agent-studio - AtomGit

ZYQ5333 docs: remove unreleased Evaluation Agent and Workflow documentation

File	Last commit	Last updated
Evaluation Agent and Workflow	docs: remove unreleased Evaluation Agent and Workflow documentation	15 days ago

Co-authored-by: michaelhuawei <michael.atamuk@huawei.com>
# message auto-generated for no-merge-commit merge: !1065 merge feature/evaluation-flag-ui-toggle into develop
Created-by: michaelhuawei
Commit-by: michaelhuawei
Merged-by: ZYQ5333

Description:

What type of PR is this?
/kind documentation
/kind cleanup

What does this PR do / why do we need it:
This PR removes documentation for the Evaluation Agent and Workflow feature, which is not released and not available in the current product version. Keeping this documentation visible misleads users by suggesting functionality that does not exist yet.

### ✔ Changes
- Deleted the entire Evaluation Agent and Workflow documentation directories:
  - EN docs
  - ZH docs
- Verified that no other documentation pages link to these removed paths.
- Ensured the documentation set reflects only released and supported features.

### ✔ Result
Documentation is now aligned with the actual product capabilities and avoids user confusion.

Which issue(s) this PR fixes:
Fixes #865

Code review checklist:
+ - [ ] Whether to verify the function's return value
+ - [ ] Whether to comply with SOLID principle / Demeter's law
+ - [ ] Whether there is UT test case && the test case is valid (if no test case, explain why)
+ - [ ] Whether the API change is involved
+ - [ ] Whether official document modification is involved

See merge request: openJiuwen/agent-studio!1065
cli	feat(evaluation): introduce full evaluation system for agents & workflows (suites, tasks, graders, metrics, benchmarks, UI)	27 days ago

Co-authored-by: Michael <michael.atamuk@huawei.com>
Co-authored-by: adi_amir <adi.amir1@huawei.com>
Co-authored-by: nizzan <nizzan.kimhi@huawei.com>
Co-authored-by: @aharonamir1 <amir.aharon@huawei.com>
# message auto-generated for no-merge-commit merge: !1023 merge evaluation into develop
Created-by: michaelhuawei
Commit-by: Michael;michaelhuawei;aharonamir1;@aharonamir1;nikita-mee;nizzan;adi_amir
Merged-by: ZYQ5333

Description:

What type of PR is this?
/kind feature
/kind refactor

---

## What does this PR do / why do we need it

This PR introduces the Evaluation System for Agents and Workflows, a major new module that provides first-class, systematic evaluation capabilities across all OpenJiuwen workflow patterns. It enables teams to measure correctness, reliability, semantic quality, latency, token usage, and regression behavior for any workflow or agent.

The system closes three long-standing gaps:

1. No regression detection: there was previously no structured way to verify that workflow changes preserved correctness.
2. No comparative measurement: there were no shared metrics for comparing versions of agents or workflows.
3. No sampling support: LLM nondeterminism calls for multi-trial evaluation, which did not exist.

This feature adds a complete backend and frontend evaluation pipeline, including suites, tasks, graders, metrics, benchmark loading, and a full results UI.

---

## Which issue(s) this PR fixes

Fixes #<issue-number>

---

## What scenarios were tested, and what were the verification results

Functional verification
- Created evaluation suites, added tasks, updated tasks, deleted tasks.
- Loaded all seven benchmark YAML files; validated correct task creation.
- Ran evaluation against workflows and agents with deterministic, model-based, and code-based graders.
- Verified pattern detection across all six structural patterns (Routing, Chaining, Parallelization, Orchestrator-Worker, Evaluator-Optimizer, Memory Usage).
- Confirmed correct grader behavior: deterministic checks, LLM judge calls, code-based execution, weight aggregation.
- Confirmed metrics engine correctness: pass/fail, pass@k, pass^k, score distribution, latency stats, token usage, reliability, per-grader breakdown.
- Verified custom aggregate metrics execution, including error handling.
- Confirmed run lifecycle: RUNNING → COMPLETED/FAILED, immutability of completed runs.
- Verified large-suite behavior (50 tasks × 5 trials) and UI rendering of large trace sets.

Performance verification
- Execution engine calls scale linearly with tasks × trials.
- Model-based graders correctly issue LLM judge calls per trial.
- No regressions to existing workflow/agent execution performance.

Reliability verification
- Flakiness metric validated using multi-trial runs.
- Pattern detection validated with synthetic traces and real workflows.
- Code-based grader error paths tested (exceptions, invalid returns).

Frontend verification
- Full CRUD for suites and tasks.
- Run dialog correctly configures workflow/agent target and trial count.
- Results UI renders Overview, Metrics, Graders, and Traces tabs with correct visibility rules.
- Zustand store state transitions validated.

---

## Self-checklist

+ - [x] Design: Reviewed with maintainers; all comments addressed.
+ - [x] Test: Full UT/ST coverage for harness, graders, metrics, pattern validator, API, and frontend store.
+ - [x] Verification: PR description includes detailed verification results for feature, refactor, and bugfix aspects.
+ - [x] Interface: Adds new external API endpoints under `/evaluation`; no breaking changes to existing interfaces.
+ - [x] Document: Benchmark usage, suite/task schema, and evaluation workflow documented; docs PR prepared separately.

---

## Special notes for reviewers

- This module is fully additive: no existing endpoints or execution logic are modified.
- Code-based graders and custom metrics use `exec()`; this is an accepted constraint for v1 and will be hardened later.
- Large evaluation runs can produce multi-MB result payloads; pagination is planned for a future release.
- Pattern detection is heuristic; tasks may override `pattern_type` explicitly.

See merge request: openJiuwen/agent-studio!1023
sdk	feat(evaluation): introduce full evaluation system for agents & workflows (suites, tasks, graders, metrics, benchmarks, UI). See merge request: openJiuwen/agent-studio!1023	27 days ago
__init__.py	feat(evaluation): introduce full evaluation system for agents & workflows (suites, tasks, graders, metrics, benchmarks, UI). See merge request: openJiuwen/agent-studio!1023	27 days ago
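
The verification list in merge request !1023 names pass@k and pass^k among the metrics but does not define them. The block below is a minimal sketch, assuming the commonly used definitions: pass@k is the probability that at least one of k trials drawn from the n recorded trials passes, and pass^k is the probability that all k drawn trials pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical and are not taken from the `openjiuwen_studio` code; the actual metrics engine may compute these differently.

```python
from math import comb


def _check_args(n: int, c: int, k: int) -> None:
    """Validate trial counts: n trials, c passes, k drawn without replacement."""
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from n trials with c passes.

    Assumed definition: 1 - C(n - c, k) / C(n, k), i.e. one minus the chance
    that all k drawn trials come from the failing ones.
    """
    _check_args(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from n trials with c passes.

    Assumed definition: C(c, k) / C(n, k), i.e. the chance that all k drawn
    trials come from the passing ones.
    """
    _check_args(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 5 trials, 3 of which passed.
    print(f"pass@2 = {pass_at_k(5, 3, 2):.2f}")
    print(f"pass^2 = {pass_hat_k(5, 3, 2):.2f}")
```

For a task with 5 trials and 3 passes, this gives pass@2 = 0.90 and pass^2 = 0.30. The gap between the two values is one way to read flakiness: a task that sometimes passes scores high on pass@k but low on pass^k.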