docs: remove unreleased Evaluation Agent and Workflow documentation
Co-authored-by: michaelhuawei<michael.atamuk@huawei.com>
# message auto-generated for no-merge-commit merge:
!1065 merge feature/evaluation-flag-ui-toggle into develop
docs: remove unreleased Evaluation Agent and Workflow documentation
Created-by: michaelhuawei
Commit-by: michaelhuawei
Merged-by: ZYQ5333
Description: **What type of PR is this?**
/kind documentation /kind cleanup
**What does this PR do / why do we need it**:
This PR removes documentation for the **Evaluation Agent and Workflow** feature, which is not released and not available in the current product version.
Keeping this documentation visible misleads users and suggests functionality that does not exist yet.
### ✔ Changes
- Deleted the entire Evaluation Agent and Workflow documentation directories:
  - **EN docs**
  - **ZH docs**
- Verified that no other documentation pages link to these removed paths.
- Ensured that the documentation set reflects only released and supported features.
### ✔ Result
Documentation is now aligned with the actual product capabilities and avoids user confusion.
**Which issue(s) this PR fixes**:
Fixes #865
**Code review checklist**:
+ - [ ] Whether the function's return value is verified
+ - [ ] Whether the code complies with the **SOLID principles / Law of Demeter**
+ - [ ] Whether UT test cases exist and are valid (if there are none, explain why)
+ - [ ] Whether an API change is involved
+ - [ ] Whether an official documentation change is involved
See merge request: openJiuwen/agent-studio!1065
feat(evaluation): introduce full evaluation system for agents & workflows (suites, tasks, graders, metrics, benchmarks, UI)
Co-authored-by: Michael<michael.atamuk@huawei.com>
Co-authored-by: adi_amir<adi.amir1@huawei.com>
Co-authored-by: nizzan<nizzan.kimhi@huawei.com>
Co-authored-by: @aharonamir1<amir.aharon@huawei.com>
# message auto-generated for no-merge-commit merge:
!1023 merge evaluation into develop
feat(evaluation): introduce full evaluation system for agents & workflows (suites, tasks, graders, metrics, benchmarks, UI)
Created-by: michaelhuawei
Commit-by: Michael;michaelhuawei;aharonamir1;@aharonamir1;nikita-mee;nizzan;adi_amir
Merged-by: ZYQ5333
Description:
**What type of PR is this?**
/kind feature /kind refactor
---
## **What does this PR do / why do we need it**
This PR introduces the **Evaluation System for Agents and Workflows**, a major new module that provides first‑class, systematic evaluation capabilities across all OpenJiuwen workflow patterns. It enables teams to measure correctness, reliability, semantic quality, latency, token usage, and regression behavior for any workflow or agent.
The system solves three long‑standing gaps:
1. **No regression detection** — previously no structured way to verify that workflow changes preserved correctness.
2. **No comparative measurement** — no shared metrics to compare versions of agents/workflows.
3. **No sampling support** — LLM nondeterminism required multi‑trial evaluation, which did not exist.
This feature adds a complete backend + frontend evaluation pipeline, including suites, tasks, graders, metrics, benchmark loading, and a full results UI.
---
## **Which issue(s) this PR fixes**
Fixes #<issue-number>
---
## **What scenarios were tested, and what were the verification results**
**Functional verification**
- Created evaluation suites, added tasks, updated tasks, deleted tasks.
- Loaded all seven benchmark YAML files; validated correct task creation.
- Ran evaluation against workflows and agents with deterministic, model-based, and code-based graders.
- Verified pattern detection across all six structural patterns (Routing, Chaining, Parallelization, Orchestrator‑Worker, Evaluator‑Optimizer, Memory Usage).
- Confirmed correct grader behavior: deterministic checks, LLM judge calls, code-based execution, weight aggregation.
- Confirmed metrics engine correctness: pass/fail, pass@k, pass^k, score distribution, latency stats, token usage, reliability, per-grader breakdown.
- Verified custom aggregate metrics execution, including error handling.
- Confirmed run lifecycle: RUNNING → COMPLETED/FAILED, immutability of completed runs.
- Verified large-suite behavior (50 tasks × 5 trials) and UI rendering of large trace sets.
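
The PR text names pass@k and pass^k but does not give their formulas. A minimal sketch, assuming the commonly used definitions (pass@k: at least one of k sampled trials passes; pass^k: all k sampled trials pass), with trials sampled without replacement from `n` recorded trials of which `c` passed. The function names are hypothetical and are not taken from the evaluation module.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials recorded, c = trials that passed, k = sample size
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the draws that contain only failing trials;
    # it is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k trials drawn from n pass (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts the draws that contain only passing trials.
    return comb(c, k) / comb(n, k)
```

With 5 trials and 2 passes, `pass_at_k(5, 2, 2)` is 0.7 while `pass_hat_k(5, 2, 2)` is 0.1, which is why the two metrics together separate "can succeed" from "succeeds reliably".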
**Performance verification**
- Execution engine calls scale linearly with tasks × trials.
- Model-based graders correctly issue LLM judge calls per trial.
- No regressions to existing workflow/agent execution performance.
**Reliability verification**
- Flakiness metric validated using multi-trial runs.
- Pattern detection validated with synthetic traces and real workflows.
- Code-based grader error paths tested (exceptions, invalid returns).
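
The two error paths named above (exceptions and invalid returns) could be handled roughly as follows. This is an illustrative sketch only: the `grade(output)` contract, the 0 to 1 score range, the result dictionary, and the `threshold` parameter are assumptions, not the module's actual interface.

```python
from typing import Any, Dict


def run_code_grader(source: str, output: Any, threshold: float = 0.5) -> Dict[str, Any]:
    """Execute user-supplied grader code and normalise the outcome.

    Assumption: `source` defines `grade(output)` returning a number in [0, 1].
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(source, namespace)  # not sandboxed; only run trusted code
        score = namespace["grade"](output)
    except Exception as exc:  # error path 1: the grader is missing or raised
        return {"passed": False, "score": 0.0, "error": f"{type(exc).__name__}: {exc}"}

    # error path 2: the grader returned a non-number or an out-of-range value
    # (NaN also fails the range comparison)
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        return {"passed": False, "score": 0.0, "error": f"invalid return value: {score!r}"}

    return {"passed": score >= threshold, "score": float(score), "error": None}
```

Converting both failure modes into a failed result with an `error` string, rather than letting the exception propagate, keeps one faulty grader from aborting the rest of a run.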
**Frontend verification**
- Full CRUD for suites and tasks.
- Run dialog correctly configures workflow/agent target and trial count.
- Results UI renders Overview, Metrics, Graders, and Traces tabs with correct visibility rules.
- Zustand store state transitions validated.
---
## **Self-checklist**
+ - [x] **Design**: Reviewed with maintainers; all comments addressed.
+ - [x] **Test**: Full UT/ST coverage for harness, graders, metrics, pattern validator, API, and frontend store.
+ - [x] **Verification**: PR description includes detailed verification results for feature, refactor, and bugfix aspects.
+ - [x] **Interface**: Adds new external API endpoints under /evaluation; no breaking changes to existing interfaces.
+ - [x] **Document**: Benchmark usage, suite/task schema, and evaluation workflow documented; docs PR prepared separately.
---
## **Special notes for reviewers**
- This module is **fully additive** — no existing endpoints or execution logic are modified.
- Code-based graders and custom metrics use `exec()`; this is an accepted constraint for v1 and will be hardened later.
- Large evaluation runs can produce multi‑MB result payloads; pagination is planned for a future release.
- Pattern detection is heuristic; tasks may override `pattern_type` explicitly.
See merge request: openJiuwen/agent-studio!1023
fix: change db_type to optional field
Co-authored-by: chen-hui-zhe-xi<chenhui280@h-partners.com>
# message auto-generated for no-merge-commit merge:
!1052 merge set-db-type-optional into develop
fix: change db_type to optional field
Created-by: chen-hui-zhe-xi
Commit-by: chen-hui-zhe-xi
Merged-by: ZYQ5333
Description:
**What type of PR is this?**
/kind bug
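**Self-checklist**: (**Please self-check and mark an x in the [ ]. We will review your completion status; otherwise the PR cannot be merged.**)
+ - [x] **Design**: Has the solution for this PR been reviewed by the Maintainer, and have all review comments been answered and the solution revised accordingly?
+ - [x] **Test**: Is the code in this PR fully covered by UT/ST test cases, and have the new test cases been committed with this PR or already committed?
+ - [x] **Verification**: Does the PR description include a detailed description of the verification results showing that the expected goals of the corresponding Feature, Refactor, or Bugfix were achieved?
+ - [x] **Interface**: Does it involve external interface changes? The changes have been approved by the interface review organization, and the API comments have been updated correctly.
+ - [x] **Document**: Does it involve changes to the official website documentation? If so, please submit the materials to the Doc repository promptly.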
See merge request: openJiuwen/agent-studio!1052
code check fix
Co-authored-by: @aharonamir1<amir.aharon@huawei.com>
# message auto-generated for no-merge-commit merge:
!1066 merge feature/executions into develop
code check fix
Created-by: aharonamir1
Commit-by: @aharonamir1
Merged-by: ZYQ5333
Description:
**What type of PR is this?**
/kind bug
**What does this PR do / why do we need it**:
---
Summary
- Show version (except 'draft') for execution
- Fix running execution elapsed time accuracy by computing elapsed_ms server-side and sending server_time_ms for client-server clock offset correction
- Fix stale waterfall showing previous execution's data during a new run
Changes
Backend
- workflow_runner.py — Re-enabled incremental save_trace_details() per span so running nodes appear in the waterfall as they complete
- agent_trace_utils.py — Same incremental write pattern for agent executions
- trace_summary_repository.py — get_running_traces_by_space now returns server-computed elapsed_ms and server_time_ms; get_trace_summary_by_trace_id falls back to live TraceDetailDB data for in-progress traces
- execution.py — Removed response_model from /get_running_traces to allow server_time_ms and elapsed_ms through without Pydantic stripping
Frontend
- ExecutionsPage.tsx — Calculates timeOffset = Date.now() - server_time_ms on each poll; clears trace state on tab switch; auto-selects running > active > completed
- ExecutionList.tsx — Uses server elapsed_ms for running entries, applies timeOffset to correct clock skew for active entries, deduplicates active/running entries
- executionPanelService.ts / types.ts — Updated response types for new server_time_ms and elapsed_ms fields
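
The clock-offset correction described above can be sketched as follows. The real frontend code is TypeScript; this Python version only illustrates the arithmetic. The function names are hypothetical, and it assumes that start timestamps are stamped by the server clock and that network latency is negligible.

```python
def compute_time_offset(client_now_ms: int, server_time_ms: int) -> int:
    """Offset between client and server clocks, refreshed on each poll.

    Mirrors timeOffset = Date.now() - server_time_ms from the PR description.
    """
    return client_now_ms - server_time_ms


def corrected_elapsed_ms(client_now_ms: int, time_offset_ms: int, started_at_ms: int) -> int:
    """Elapsed time of an active entry, expressed in server time."""
    # Translate the client's "now" onto the server clock before subtracting
    # the server-stamped start time.
    server_now_ms = client_now_ms - time_offset_ms
    # Clamp at zero so residual skew never shows a negative duration.
    return max(0, server_now_ms - started_at_ms)
```

For example, a client clock running 500 ms ahead of the server yields an offset of 500; three seconds later, an execution that started 1 s before the poll reports 4000 ms elapsed rather than 4500 ms.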
Test plan
- Start a workflow execution — verify it appears as "Running" in the Workflows tab
- Start an agent execution — verify it appears as "Running" in the Agents tab
- While running, verify elapsed time is accurate (not 2x or skewed by clock difference)
- After completion, verify finished duration displays correctly (not a negative number)
- Switch tabs — verify no stale data from previous tab
- Completed executions (from old WorkflowExecutionDB) still display correctly
**Which issue(s) this PR fixes**:
Fixes [#868](https://gitcode.com/openJiuwen/agent-studio/issues/868)
**Self-checklist**: (**Please check carefully and mark an x in the [ ] brackets. We will review your completion status.**)
+ - [ ] **Design**: Has the solution corresponding to the PR been reviewed by the Maintainer, and have all review comments been replied to and revised
+ - [x] **Test**: Has the code in the PR been fully covered by UT/ST test cases, and have the newly added test cases been uploaded to the repository along with this PR or already uploaded.
+ - [x] **Verification**: Does the PR description contain a detailed description of the verification results regarding the achievement of the expected goals for the Feature, Refactor, or Bugfix in this PR?
+ - [ ] **Interface**: Does it involve changes to external interfaces? The corresponding changes have been approved by the interface review organization, and the annotation information for the API has been correctly refreshed.
+ - [ ] **Document**: Does it involve modifications to the official website documentation? If so, please submit the materials to the Doc repository in a timely manner.
See merge request: openJiuwen/agent-studio!1066
fix: creating a weblink knowledge base causes the knowledge base list page to go blank
Co-authored-by: Mmmmroy<le.zhang1@h-partners.com>
# message auto-generated for no-merge-commit merge:
!1046 merge fix/weblink_table into develop
fix: creating a weblink knowledge base causes the knowledge base list page to go blank
Created-by: Mmmmroy
Commit-by: Mmmmroy
Merged-by: ZYQ5333
Description:
**What type of PR is this?**
/kind bug
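**Self-checklist**: (**Please self-check and mark an x in the [ ]. We will review your completion status; otherwise the PR cannot be merged.**)
+ - [ ] **Design**: Has the solution for this PR been reviewed by the Maintainer, and have all review comments been answered and the solution revised accordingly?
+ - [ ] **Test**: Is the code in this PR fully covered by UT/ST test cases, and have the new test cases been committed with this PR or already committed?
+ - [ ] **Verification**: Does the PR description include a detailed description of the verification results showing that the expected goals of the corresponding Feature, Refactor, or Bugfix were achieved?
+ - [ ] **Interface**: Does it involve external interface changes? The changes have been approved by the interface review organization, and the API comments have been updated correctly.
+ - [ ] **Document**: Does it involve changes to the official website documentation? If so, please submit the materials to the Doc repository promptly.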
See merge request: openJiuwen/agent-studio!1046