| test(core): session fixture invariant tests (benchmark automation Stage 1)
Commits two clean hermes session JSONL files to tests/fixtures/:
- session_p0_sprint_clean.jsonl (2026-04-22 21-11-34, 7 turns, full
post-P0 behavior: #5 auto-stop nudge fired, multi-step task completed,
markdown summary, no continuation placeholders)
- session_404_recovery.jsonl (2026-04-22 20-12-44, 4 turns, #4 path
ranking + #14b read_file preferred-over-bash-cat exercised)
New tests/session_fixture_invariants_test.rs runs 6 assertions against
each fixture that is present:
- no (continuing...) / (completed) / Continue. placeholders
(would indicate the continuation recovery mechanism was reintroduced)
- no "summarize and stop instead of continuing" directive (old #5
nudge wording that caused weak models to skip user-requested steps)
- no sed -i / perl -pi / awk -i inplace shell-workaround tool
calls (P0 #2 anti-bypass regression check)
- every bash ToolResult output carries exit: N / killed: marker
(P0 #3 exit-code-in-marker regression check)
- meta: collector sees Assistant content (catches jsonl schema drift
that would silently make all other asserts trivially pass)
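The invariants above can be sketched as plain string checks. This is a minimal illustration, not the code in tests/session_fixture_invariants_test.rs. All function names are hypothetical, the JSONL parsing step is omitted, and the matching rules (exact match for placeholders, substring match for shell workarounds, `exit:` followed by a digit) are assumptions about how the markers appear.

```rust
// Hypothetical helpers: the real test file's names and the session JSONL
// schema are not shown in this commit message.

/// Placeholder strings whose presence would indicate continuation recovery returned.
const PLACEHOLDERS: [&str; 3] = ["(continuing...)", "(completed)", "Continue."];

/// True if the trimmed assistant text is exactly one of the known placeholders
/// (exact matching is an assumption; the real check may be looser).
pub fn is_continuation_placeholder(text: &str) -> bool {
    PLACEHOLDERS.contains(&text.trim())
}

/// True if a bash command contains an in-place edit workaround.
pub fn uses_inplace_edit(cmd: &str) -> bool {
    ["sed -i", "perl -pi", "awk -i inplace"]
        .iter()
        .any(|pattern| cmd.contains(pattern))
}

/// True if bash tool output carries a `killed:` marker or an `exit:` marker
/// followed by a numeric code (the exact marker layout is assumed).
pub fn has_exit_marker(output: &str) -> bool {
    if output.contains("killed:") {
        return true;
    }
    output.match_indices("exit:").any(|(idx, marker)| {
        output[idx + marker.len()..]
            .trim_start()
            .starts_with(|c: char| c.is_ascii_digit())
    })
}

/// Meta check plus placeholder check over the collected assistant texts.
/// An empty collection is an error, so schema drift cannot make the
/// remaining checks pass trivially.
pub fn check_assistant_texts(texts: &[&str]) -> Result<(), String> {
    if texts.is_empty() {
        return Err("collector saw no Assistant content (schema drift?)".to_string());
    }
    match texts.iter().find(|t| is_continuation_placeholder(t)) {
        Some(t) => Err(format!("placeholder found: {t}")),
        None => Ok(()),
    }
}
```

Failing on an empty collection mirrors the meta assertion: if the collector stops matching the fixture schema, every "no X found" check would otherwise succeed without inspecting anything.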
Explicit limits: these tests do NOT catch regressions whose fixture is never
refreshed, and do NOT run a real replay harness. Stage 2 (a real
ReplayProvider + minimal AgentLoop test harness driving recorded
responses back through the framework) is tracked as P1 #14d in
project_095x_roadmap.md, ~1.5 days of infra work, deferred until
regression evidence warrants the investment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| 1 month ago |