Staged work estimate — kernel red-green (RFC #25)
Local-only estimate, paired with `tracking-issue-draft.md`. Numbers are wall-clock days of focused work, not "ideal" hours.
Total
~4–5 days of actual coding across all three stages (the per-stage wall times below sum to 3.5–4.5 days). Then ~2 minor releases of soak before flipping default-on.
| Stage | Code (LoC) | Tests (LoC) | Wall time | Risk |
|---|---|---|---|---|
| 1. events + writer | ~300 | ~150 | 0.5 day | low |
| 2. dispatcher gate | ~600 | ~400 | 2–3 days | high |
| 3. plan + UI | ~250 | ~120 | 1 day | medium |
Stage 1 — events + writer (0.5 day)
Almost entirely additive, no behavior change.
Changes:
- `src/core/events.ts:190`: extend `Event` union with `TestRunEvent` + `EditClaimEvent`.
- `src/core/test-id.ts` (new, ~50 LoC): `extractTestId(file, fullName, source)` per `test-id-spec.md`.
- `src/core/reducers/red-green.ts` (new, ~30 LoC): `pairRedGreen(events)` reducer.
- `src/cli/commands/events.ts`: add `red-green` subcommand listing pairs.
- `src/adapters/event-sink-jsonl.ts`: already generic over `Event`, no edits required.
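A minimal sketch of the two new event shapes and the pairing reducer, to make the stage 1 surface concrete. Every field name here (`status`, `ts`, `file`) and the `KernelEvent` alias are assumptions for illustration; the real union is `Event` in `src/core/events.ts`, and the pairing rule (first unresolved red matched with the next green for the same `test_id`) is one plausible reading, not a confirmed spec.

```typescript
// Hypothetical event shapes; the real definitions live in src/core/events.ts.
interface TestRunEvent {
  type: "test_run";
  test_id: string;
  status: "pass" | "fail";
  ts: number; // assumed ordering key (timestamp or sequence number)
}

interface EditClaimEvent {
  type: "edit_claim";
  test_id: string;
  file: string;
  ts: number;
}

// Stand-in for the existing Event union, extended with the two new members.
type KernelEvent = TestRunEvent | EditClaimEvent;

interface RedGreenPair {
  test_id: string;
  red: TestRunEvent;
  green: TestRunEvent;
}

// Pair each unresolved failing run with the next passing run of the same test_id.
function pairRedGreen(events: readonly KernelEvent[]): RedGreenPair[] {
  const pendingRed = new Map<string, TestRunEvent>();
  const pairs: RedGreenPair[] = [];
  for (const ev of events) {
    if (ev.type !== "test_run") continue;
    if (ev.status === "fail") {
      // Keep the earliest red that has not been resolved yet.
      if (!pendingRed.has(ev.test_id)) pendingRed.set(ev.test_id, ev);
    } else {
      const red = pendingRed.get(ev.test_id);
      if (red) {
        pairs.push({ test_id: ev.test_id, red, green: ev });
        pendingRed.delete(ev.test_id);
      }
    }
  }
  return pairs;
}
```

A green run with no preceding red produces no pair in this sketch, which is the case the `red-green` subcommand would presumably want to surface separately.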
Tests:
- Round-trip: append a `test_run` event, replay through reducer.
- `extractTestId` matrix: 8 cases (rename, move, parametrise, annotation override, etc.).
Risk: low. Pattern matches existing event additions in v0.14.
Stage 2 — dispatcher gate (2–3 days, the load-bearing one)
This is where most of the actual integration risk lives.
Changes:
- `src/tools/filesystem.ts:518`: `edit_file` registration wraps in a gate. When `REASONIX_STRICT_TDD=1`:
  - Look up most recent `test_run` for `test_id` from in-memory event list (cheaper than re-reading jsonl).
  - Verify a matching `edit_claim` followed it.
  - On dispatch refusal, throw a structured error the model can read.
- `src/loop.ts`: per-turn coalescing buffer.
  - When `edit_file` succeeds, push `{test_id, test_file_path}` to a turn-scoped Set.
  - At end-of-turn (just before the next assistant call), spawn one `vitest --run -t a -t b -t c` covering all collected ids.
  - Parse `--reporter=json` output, emit one `test_run` event per id.
  - On any red, revert the offending edits via the existing checkpoint mechanism (`src/checkpoints.ts`), emit a `repair` event so the storm-breaker engages.
- `/refactor` mode: session flag in `LoopState`. When true, gate is bypassed; on session exit, run `npm run verify` (or `reasonix.config.ts`'s `verify_command`).
- `reasonix.config.ts` schema: add `verify_command` and `test_command_for(test_id)`.
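The gate check described for `edit_file` can be sketched as a pure function over the in-memory event list. This is a rough illustration only: `GateRefusal`, its `code` and `reason` fields, `assertEditAllowed`, and the `KernelEvent` alias are hypothetical names, and the sketch implements just the two stated checks (a `test_run` exists for the id, and a matching `edit_claim` follows it). Whether the latest run must also be red is not stated here, so it is left out.

```typescript
interface TestRunEvent { type: "test_run"; test_id: string; status: "pass" | "fail"; }
interface EditClaimEvent { type: "edit_claim"; test_id: string; }
type KernelEvent = TestRunEvent | EditClaimEvent; // stand-in for the real Event union

type RefusalReason = "no_test_run" | "no_edit_claim";

// Hypothetical structured refusal: machine-readable fields plus a readable message.
class GateRefusal extends Error {
  readonly code = "STRICT_TDD_REFUSED";
  readonly test_id: string;
  readonly reason: RefusalReason;
  constructor(test_id: string, reason: RefusalReason) {
    super(`edit_file refused for ${test_id}: ${reason}`);
    this.name = "GateRefusal";
    this.test_id = test_id;
    this.reason = reason;
  }
}

function assertEditAllowed(events: readonly KernelEvent[], testId: string): void {
  // Find the index of the most recent test_run for this id.
  let runIndex = -1;
  events.forEach((ev, i) => {
    if (ev.type === "test_run" && ev.test_id === testId) runIndex = i;
  });
  if (runIndex === -1) throw new GateRefusal(testId, "no_test_run");

  // A matching edit_claim must appear after that run.
  const claimed = events
    .slice(runIndex + 1)
    .some((ev) => ev.type === "edit_claim" && ev.test_id === testId);
  if (!claimed) throw new GateRefusal(testId, "no_edit_claim");
}
```

Keeping the check free of I/O means the integration tests can drive it with a plain array of events, and the dispatcher wrapper only has to translate a `GateRefusal` into whatever tool-error shape the model already receives.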
Tests:
- Integration on a synthetic session fixture: green path, red revert, multi-edit batch, `/refactor` bypass, edit before any test_id (refused).
- Mock vitest spawner so tests don't depend on actual vitest runs.
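One way to make the spawner mockable is to have the per-turn buffer take the runner as a constructor argument. The sketch below is an assumption-heavy illustration: `TurnBuffer`, `TestRunner`, `recordEdit` and `flush` are hypothetical names, and the runner is synchronous here only to keep the example short (a real vitest spawn would be asynchronous). It keys the buffer by `test_id` with a `Map`, since a `Set` of object literals would not de-duplicate in JavaScript.

```typescript
interface PendingTest { test_id: string; test_file_path: string; }
interface TestResult { test_id: string; status: "pass" | "fail"; }

// Hypothetical seam: production would spawn vitest, tests pass a fake.
type TestRunner = (batch: readonly PendingTest[]) => TestResult[];

class TurnBuffer {
  // Keyed by test_id so repeated edits under the same id collapse to one entry.
  private pending = new Map<string, PendingTest>();
  private readonly runner: TestRunner;

  constructor(runner: TestRunner) {
    this.runner = runner;
  }

  // Called after each successful edit_file.
  recordEdit(entry: PendingTest): void {
    this.pending.set(entry.test_id, entry);
  }

  // Called once at end-of-turn: one runner call for the whole batch, then reset.
  flush(): TestResult[] {
    if (this.pending.size === 0) return [];
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    return this.runner(batch);
  }
}

// Usage with a fake runner that reports every id as passing.
const calls: number[] = [];
const buffer = new TurnBuffer((batch) => {
  calls.push(batch.length);
  return batch.map((p) => ({ test_id: p.test_id, status: "pass" as const }));
});
buffer.recordEdit({ test_id: "a", test_file_path: "tests/a.test.ts" });
buffer.recordEdit({ test_id: "b", test_file_path: "tests/b.test.ts" });
buffer.recordEdit({ test_id: "a", test_file_path: "tests/a.test.ts" });
const results = buffer.flush();
```

Clearing the buffer before invoking the runner means an exception from the runner cannot leave stale ids behind for the next turn; whether that is the desired behaviour under abort is exactly the loop-coordination question this stage has to settle.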
Risk: high. Specific concerns:
- Loop coordination. End-of-turn flush has to play nice with: abort controller (`_turnAbort`), /pro escalation (mid-turn model swap), storm-breaker (`src/repair/storm.ts`), thinking-mode round-trip (reasoning_content preservation). Any one of these can desynchronise the buffer.
- Vitest spawn hang. Need timeout + kill + emit a `test_run` event with `status='fail'` and a tagged failure reason. Otherwise a stuck test hangs the whole agent.
- Cross-platform paths. Vitest's `fullName` should be POSIX-normalised before becoming part of `test_id`; spike runs were on Windows but didn't stress this.
- Revert semantics. If batch had 3 edits and 1 went red, only that file reverts; others stay. Existing `Checkpoint` is per-file, but the index (`src/checkpoints.ts`) needs a partial-restore code path.
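The revert-semantics concern comes down to selecting which files to hand to the checkpoint restore. A minimal sketch of that selection step follows; `EditRecord`, `TestResult` and `filesToRevert` are hypothetical names, and nothing here reflects the actual `Checkpoint` API in `src/checkpoints.ts`.

```typescript
interface EditRecord { test_id: string; file: string; }
interface TestResult { test_id: string; status: "pass" | "fail"; }

// Return the files whose associated test went red, each listed once,
// in the order the edits were made. Files tied only to green tests are kept.
function filesToRevert(
  edits: readonly EditRecord[],
  results: readonly TestResult[],
): string[] {
  const redIds = new Set<string>();
  results.forEach((r) => {
    if (r.status === "fail") redIds.add(r.test_id);
  });

  const files: string[] = [];
  edits.forEach((e) => {
    if (redIds.has(e.test_id) && files.indexOf(e.file) === -1) files.push(e.file);
  });
  return files;
}

const toRevert = filesToRevert(
  [
    { test_id: "a", file: "src/a.ts" },
    { test_id: "b", file: "src/b.ts" },
    { test_id: "c", file: "src/c.ts" },
  ],
  [
    { test_id: "a", status: "pass" },
    { test_id: "b", status: "fail" },
    { test_id: "c", status: "pass" },
  ],
);
// toRevert is ["src/b.ts"]
```

In this sketch a file edited under both a red and a green `test_id` is still reverted, which is the conservative choice; the partial-restore path would need to decide that case explicitly.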
Mitigation: land stage 2 in two PRs — first the gate + buffer behind a new flag (no auto-run), then the auto-run + revert. Validates the synchronisation before adding the spawner.
Stage 3 — plan + UI (1 day)
Changes:
- `src/tools/plan-types.ts:3`: `PlanStep` gains `test_id?` + `test_file_path?`.
- `src/tools/plan-core.ts`: `submit_plan` validation (any step with `test_id` must have `test_file_path`).
- `src/cli/commands/doctor.ts`: warn when plan has `test_id` but missing `test_file_path`; warn on first session in an untested codebase, suggest `/refactor` default.
- TUI plan card: render red/green dots per step (need to inspect `src/cli/ui/cards/PlanCard*` to see how steps render today).
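The `submit_plan` rule is small enough to sketch directly. The `title` field and the `validatePlanSteps` name are assumptions for illustration; only `test_id?` and `test_file_path?` come from the change list above.

```typescript
interface PlanStep {
  title: string; // hypothetical existing field
  test_id?: string;
  test_file_path?: string;
}

// Collect one message per step that names a test_id without a test_file_path.
function validatePlanSteps(steps: readonly PlanStep[]): string[] {
  const errors: string[] = [];
  steps.forEach((step, i) => {
    if (step.test_id !== undefined && !step.test_file_path) {
      errors.push(`step ${i + 1}: test_id "${step.test_id}" requires test_file_path`);
    }
  });
  return errors;
}

const planErrors = validatePlanSteps([
  { title: "write failing test", test_id: "sum-adds", test_file_path: "tests/sum.test.ts" },
  { title: "implement sum", test_id: "sum-adds" },
  { title: "update docs" },
]);
// planErrors has one entry, for step 2
```

Returning a list of messages rather than throwing on the first failure lets the same function back both the `submit_plan` rejection and the `doctor` warning lines.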
Tests:
- Plan validation: rejects step with `test_id` missing `test_file_path`.
- Doctor output: snapshot of warning lines.
- TUI snapshot for a 3-step plan with mixed red/green/pending dots.
Risk: medium. TUI rendering is the unknown — depends on whether the current plan card has slots for status badges, or if the layout needs widening.
Default-on rollout (calendar, not work)
- After stage 3 lands: minor release with flag off by default.
- Two minor releases of soak — collect any hangs / false-refusals via telemetry, fix in patches.
- Flip default-on; keep `REASONIX_STRICT_TDD=0` opt-out for two more minor releases.
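The flag-off, then default-on, then opt-out sequence can be served by a single resolver whose default changes per release. A minimal sketch, assuming a hypothetical `strictTddEnabled` helper; treating unset or unrecognised values as "use the release default" is also an assumption.

```typescript
// Resolve strict mode from the environment, falling back to the release default.
function strictTddEnabled(
  env: Record<string, string | undefined>,
  defaultOn: boolean,
): boolean {
  const raw = env["REASONIX_STRICT_TDD"];
  if (raw === "1") return true;  // explicit opt-in
  if (raw === "0") return false; // explicit opt-out
  return defaultOn;              // unset or unrecognised: release default
}

// During soak the default is off; after the flip only the default changes.
const duringSoak = strictTddEnabled({}, false);
const afterFlipOptOut = strictTddEnabled({ REASONIX_STRICT_TDD: "0" }, true);
```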
Cross-cutting risks not pinned to a stage
- Untested codebases. `reasonix doctor` should detect (no `tests/` dir, no `vitest.config.*`) and refuse to enable strict mode at all on first run. Otherwise the flag is unusable.
- Greenfield test-file location. Spike Exp 3 showed the model picks reasonable but inconsistent paths when none is specified. The plan-step `test_file_path` field is the fix, but a user editing a single file with no plan still has the gap. Stage 2 should refuse `edit_file` when strict + no `test_file_path` is in scope.
- MCP-served edit tools. Reasonix supports MCP-hosted tools (`src/mcp.ts`). If an MCP server exposes its own write/edit tool, the kernel gate doesn't apply. Stage 2 should at minimum log a warning; longer-term, MCP write tools could opt into the same gate via a hook.
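The untested-codebase check could start as a pure predicate over the names in the project root, with the directory read left to the caller. This is a sketch under assumptions: `looksUntested` is a hypothetical name, and the heuristic is exactly the two signals named above (no `tests` entry, no `vitest.config.*` file) and nothing more.

```typescript
// rootEntries: file and directory names directly under the project root.
function looksUntested(rootEntries: readonly string[]): boolean {
  const hasTestsDir = rootEntries.some((name) => name === "tests");
  const hasVitestConfig = rootEntries.some((name) => /^vitest\.config\.[^.]+$/.test(name));
  return !hasTestsDir && !hasVitestConfig;
}

const greenfield = looksUntested(["src", "package.json", "README.md"]);
const configured = looksUntested(["src", "package.json", "vitest.config.ts"]);
// greenfield is true, configured is false
```

Projects that colocate tests next to sources and configure vitest inside `vite.config.*` would be misclassified by this heuristic, so `doctor` would likely need an override before refusing strict mode outright.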