File · Last commit message · Last updated
fix(context): preserve memory constraints across context folds (#1515)

Two fixes around context-fold memory preservation, plus a behavior-stability harness to regression-test them.

**context-manager.ts**: fold now reinforces pinned constraints after the summary:

- New extractPinnedConstraints() pulls the # HIGH PRIORITY constraints, # User memory, and # Project memory blocks out of the live system prompt and appends them verbatim under a [PINNED CONSTRAINTS — preserved verbatim] tail on the synthetic fold summary message. The system prompt always survives the fold (only history is compressed), but in long sessions the model's attention drifts away from instructions at the top; restating them near the current context window mitigates the "lost in the middle" effect.
- Summarizer system prompt now explicitly tells the model to preserve "do not" / "never" / "avoid" instructions when compressing. This is an independent improvement, useful regardless of whether the new constraint tail fires.

**loop.ts**: wires getSystemPrompt: () => this.prefix.system into ContextManagerDeps so the fold can read the current system prompt.

**benchmarks/behavior-stability/**: small evals scaffold (harness / runner / report / types + one scenario). The constraint-persistence scenario stuffs the log with 40 synthetic turns, triggers a fold with a tiny keepRecentTokens budget, and asserts all four pinned constraints (2× HIGH PRIORITY, 1× User memory, 1× Project memory) survive in the folded summary.

Local: npx tsx benchmarks/behavior-stability/runner.ts --local, 1/1 passing.

Refs #1462.

7 days ago
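To make the pinned-constraint tail concrete, here is a minimal sketch of how such an extraction could work. It is not the code from context-manager.ts: the heading prefixes are taken from the commit message, but the block-delimiting rule (a block runs until the next line starting with "# ") and the helper appendPinnedTail are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real extractPinnedConstraints() may differ.
const PINNED_HEADING_PREFIXES = [
  "# HIGH PRIORITY",
  "# User memory",
  "# Project memory",
];
const PINNED_TAIL = "[PINNED CONSTRAINTS — preserved verbatim]";

// Assumption: a block starts at a pinned heading and runs until the next
// line that starts with "# " (or the end of the prompt).
function extractPinnedConstraints(systemPrompt: string): string[] {
  const blocks: string[] = [];
  let current: string[] | null = null;
  for (const line of systemPrompt.split("\n")) {
    if (line.startsWith("# ")) {
      if (current) blocks.push(current.join("\n").trimEnd());
      const pinned = PINNED_HEADING_PREFIXES.some((p) => line.startsWith(p));
      current = pinned ? [line] : null;
    } else if (current) {
      current.push(line);
    }
  }
  if (current) blocks.push(current.join("\n").trimEnd());
  return blocks;
}

// Hypothetical helper: restate the pinned blocks after the fold summary.
function appendPinnedTail(summary: string, systemPrompt: string): string {
  const blocks = extractPinnedConstraints(systemPrompt);
  if (blocks.length === 0) return summary;
  return `${summary}\n\n${PINNED_TAIL}\n${blocks.join("\n\n")}`;
}
```

Blocks are copied verbatim rather than summarized, so a "never" instruction cannot be softened by the summarizer on its way into the tail.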
perf(fold): raise normal threshold 50%→75%, aggressive 70%→78% (#1461)

End-to-end benchmark across 12 runs (3 repeats × 4 thresholds on the "audit src/ and write an architecture overview" task, ctxMax forced to 120k to make folds reachable) showed normal fold at 0.5 is a stable net loss:

| config | folds | hit% | miss | cost |
| --- | --- | --- | --- | --- |
| no-fold | 0.0 | 92.7±1.3 | 75k±4k | $0.0169±0.0006 |
| current 50/70 | 2.3 | 83.0±9.5 | 127k±48k | $0.0221±0.0054 |
| late 75/90 | 0.0 | 92.7±1.7 | 82k±26k | $0.0184±0.0040 |
| agg 90/95 | 0.0 | 92.6±0.3 | 57k±8k | $0.0142±0.0014 |

Every fold wipes the prefix cache: in current_50_70 prompt tokens drop 63k→16k at the fold, then the next ~9 turns rebuild a fresh prefix and miss tokens climb 2-3×. Task quality is unchanged across configs (all 12 runs completed the audit), so fold was buying nothing in exchange.

Raising HISTORY_FOLD_THRESHOLD to 0.75 keeps fold available as a last-mile-before-force-summary mechanism without firing on mid-context turns. Aggressive lifted to 0.78 to stay in the same band.

The benchmarks/compression-eval/ harness is included so future tuning can re-run on the same task.

Co-authored-by: reasonix <reasonix@deepseek.com>

8 days ago
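As a rough illustration of what the new thresholds mean in practice, the sketch below maps context usage to a fold mode. Only HISTORY_FOLD_THRESHOLD and the two values come from the commit; the name AGGRESSIVE_FOLD_THRESHOLD, the function chooseFoldMode, and the use of >= for the comparison are assumptions.

```typescript
const HISTORY_FOLD_THRESHOLD = 0.75; // name and value from the commit
const AGGRESSIVE_FOLD_THRESHOLD = 0.78; // value from the commit, name assumed

type FoldMode = "none" | "normal" | "aggressive";

// Hypothetical decision helper: compares prompt size to the context limit.
function chooseFoldMode(promptTokens: number, ctxMax: number): FoldMode {
  const usage = promptTokens / ctxMax;
  if (usage >= AGGRESSIVE_FOLD_THRESHOLD) return "aggressive";
  if (usage >= HISTORY_FOLD_THRESHOLD) return "normal";
  return "none";
}

// With ctxMax forced to 120k as in the benchmark, a 63k-token prompt
// (52.5% usage) no longer folds; it would have under the old 0.5 threshold.
console.log(chooseFoldMode(63000, 120000));
```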
docs: sync ARCHITECTURE.md and benchmark pricing to match code (#1720)

Three stale-doc fixes:

- ARCHITECTURE.md §4.3: replace removed /pro single-turn arming with the current /model flash|pro + settings.json model selection. Note the removal in 0.50.0 (#1657, #1630).
- ARCHITECTURE.md §4.4: replace the never-existed FAILURE_ESCALATION_THRESHOLD counter with the actual <<<NEEDS_PRO>>> model self-report mechanism. No failure-counter; purely LLM-initiated, no-op on pro tier.
- benchmarks/real-world-cache/README.md: fix 10× pricing error in v4-flash cache-hit ($0.028 → $0.0028) and entirely wrong v4-pro pricing ($0.139/1.667/3.333 → $0.003625/0.435/0.87). Recalculated cost tables; headline 99.82% hit ratio unchanged, savings now correctly show ~97.7% (flash) / ~98.9% (pro).

Thanks @FriendsHL for catching this. The benchmark pricing in particular is the public cache-first defense link; the old numbers would have been embarrassing.

4 days ago
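The savings figure can be sanity-checked with a short calculation. This is a sketch under stated assumptions, not the README's actual method: it reads the v4-pro triple as cache-hit / cache-miss / output prices in a common unit, and it measures savings on input tokens only, relative to paying the cache-miss price for everything. The function name inputSavings is hypothetical.

```typescript
interface InputPricing {
  cacheHit: number; // price per unit of cached input tokens
  cacheMiss: number; // price per unit of uncached input tokens
}

// v4-pro figures as quoted in the commit (first two of the three numbers).
const V4_PRO: InputPricing = { cacheHit: 0.003625, cacheMiss: 0.435 };

// Fraction saved versus an all-miss baseline at a given cache hit ratio.
function inputSavings(hitRatio: number, price: InputPricing): number {
  const blended = hitRatio * price.cacheHit + (1 - hitRatio) * price.cacheMiss;
  return 1 - blended / price.cacheMiss;
}

const proSavings = inputSavings(0.9982, V4_PRO);
console.log(`${(proSavings * 100).toFixed(2)}%`);
```

Under these assumptions the result lands within a tenth of a percentage point of the ~98.9% quoted for pro; the small gap suggests the README's tables may also account for output tokens or round differently.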
chore(spike): live DeepSeek run for RFC #110 cache-prefix question (#113)

Two-file spike under benchmarks/spike-mcp-reconnect/:

- runner.ts: 5 chat calls against live deepseek-chat with controlled tool-list drifts (identity, append, mid-stream edit)
- results.md: captured run + empirical findings

The headline result overturns the RFC body's "any drift = full cache miss" claim. DeepSeek's prefix cache works at chunk granularity (≈128 tokens), so the cost depends on WHERE the drift falls:

- append a tool at the end → trivial cost (94.8% hit, even better than the no-drift 85% baseline because the new chunk gets cached)
- edit a tool's description in the middle → loses chunks past the edit (84.1% hit observed)
- replacing or reordering the tool list → effectively full miss

This nudges the C2b design call away from blanket "strict default" toward graduated permissive: silent on appends, warn on mid-stream edits, refuse on reorders / removals. --strict remains as an explicit flag.

Refs #110.

27 days ago
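The "where the drift falls" finding can be illustrated with a toy model of a chunked prefix cache. This is a simplification for intuition only: it assumes that exactly the whole 128-token chunks of the shared prefix are reusable, and it makes no attempt to reproduce the hit percentages measured in the spike. All function names are hypothetical.

```typescript
const CHUNK_TOKENS = 128; // approximate granularity reported in the spike

// Tokens of `next` that a chunked prefix cache could reuse from `prev`:
// the shared prefix, rounded down to a whole number of chunks.
function cachedPrefixTokens(prev: number[], next: number[], chunk = CHUNK_TOKENS): number {
  const limit = Math.min(prev.length, next.length);
  let common = 0;
  while (common < limit && prev[common] === next[common]) common++;
  return Math.floor(common / chunk) * chunk;
}

function toyHitRatio(prev: number[], next: number[]): number {
  return next.length === 0 ? 0 : cachedPrefixTokens(prev, next) / next.length;
}

// Token ids are arbitrary integers here; only equality matters.
const base = Array.from({ length: 1280 }, (_, i) => i);
const appended = [...base, ...Array.from({ length: 128 }, (_, i) => 5000 + i)];
const midEdit = base.map((t, i) => (i === 700 ? -1 : t));
const reordered = [...base.slice(640), ...base.slice(0, 640)];

console.log(toyHitRatio(base, appended)); // append: prefix fully reusable
console.log(toyHitRatio(base, midEdit)); // mid edit: chunks past the edit are lost
console.log(toyHitRatio(base, reordered)); // reorder: nothing reusable
```

The three cases line up with the graduated policy in the commit: appends are nearly free, mid-stream edits cost everything after the edit point, and reorders invalidate the whole prefix.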
fix(jobs): close stop() race condition; drop useless \$ escapes (#288)

* fix(benchmark): drop useless \$ escapes in tdd-eval prompt string

The h1 task description string used \${file}::\${fullName} inside a regular "..." literal. The backslashes are no-ops (JS quietly drops unrecognized escapes in regular strings) but trip CodeQL's js/useless-regexp-character-escape rule. Identical resulting string, cleaner source.

* fix(jobs): wait for actual close event after SIGKILL, not a fixed timer

stop() had two timing races. Both manifested as the same flake: the returned snapshot's running flag could still be true even though the test (and the user) just told us to stop the job.

1. SIGKILL phase used a fixed 800ms timer, then returned regardless of whether the OS had reaped the process tree. Under Windows scheduler load, taskkill /T /F on a 3-level tree (npm → node → vite) can take over a second to propagate before Node's close event fires.
2. SIGTERM phase awaited readyPromise, which is dual-purpose: it fires on either a startup ready-signal regex match OR child exit. If the job had already matched a ready signal, readyPromise was already resolved, so the SIGTERM grace race short-circuited immediately and we'd jump to SIGKILL with zero pause.

Adds closedPromise, which fires only on close/error and never on ready signal, and uses it for both the SIGTERM grace race and the post-SIGKILL wait, with a 5s ceiling on the latter so a wedged kernel can't hang us indefinitely.

23 days ago
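The shape of the fix can be sketched as follows. This is an illustrative reconstruction, not the code from the jobs module: the StoppableChild interface, the helper names, and the 2000 ms SIGTERM grace period are assumptions; only the 5s post-SIGKILL ceiling and the "close/error only" rule come from the commit.

```typescript
type Signal = "SIGTERM" | "SIGKILL";

// Minimal child-like shape so the sketch needs no imports (hypothetical).
interface StoppableChild {
  once(event: "close" | "error", listener: () => void): unknown;
  kill(signal: Signal): boolean;
}

// Resolves only when the child closes or errors, never on a ready signal.
function makeClosedPromise(child: StoppableChild): Promise<void> {
  return new Promise<void>((resolve) => {
    child.once("close", () => resolve());
    child.once("error", () => resolve());
  });
}

// True if `closed` settles within `ms`, false if the timeout wins.
function closedWithin(closed: Promise<void>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    closed.then(() => {
      clearTimeout(timer);
      resolve(true);
    });
  });
}

// SIGTERM with a grace period, then SIGKILL with a bounded wait.
// Returns whether the child was actually observed to close.
async function stopJob(
  child: StoppableChild,
  closed: Promise<void>,
  graceMs = 2000, // assumed value
  killCeilingMs = 5000, // 5s ceiling from the commit
): Promise<boolean> {
  child.kill("SIGTERM");
  if (await closedWithin(closed, graceMs)) return true;
  child.kill("SIGKILL");
  return closedWithin(closed, killCeilingMs);
}
```

Because both waits race against the same close-only promise, an already-resolved ready signal can no longer short-circuit the grace period, and the caller learns whether the process was really reaped instead of assuming so after a fixed delay.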
feat: integrated-mode polish + sidebar Ctrl+B toggle and label wrap (#1044)

* refactor(scene): move the entire scene producer from TS to Rust

Stage A of the user's "build the foundation right first" architectural pivot. The JS side no longer builds a SceneFrame layout tree; it just serializes the latest input state. Rust deserializes the state and runs the full layout / palette / row-shape logic against the *real* terminal area.

## Why

The TS scene producer was a recurring source of "things don't look right" bugs because:

- JS reads its idea of terminal size from Ink's useStdout(), which is the **null stream** under REASONIX_RENDERER=rust mode. Cols and rows fell back to 80 × 24 every render, so layout decisions (boot block size, card cap, dock thickness) were divorced from the actual terminal, even though the renderer rendered against the real area. Producer + renderer disagreed on dimensions.
- Two languages held the same constants (palette hexes, row counts, glyphs). Every visual tweak required keeping both sides in sync.
- Each behavior change required npm run build + restart and a fresh cargo build if either side drifted.

Moving the producer to Rust makes the renderer authoritative for layout. Single source of truth, single rebuild target.

## What

New Rust modules under crates/reasonix-render/src/:

- state.rs: SceneState / SceneCard / SlashMatch / SessionItem / SetupState / Message (tagged enum { type: "trace" | "setup", ... }).
- theme.rs: `palette::{bg, bg2, fg, fg1..3, ds, ds_bright, ds_purple, ok, warn, err}` matching the v1 mock's oklch tokens.
- producer.rs: build_trace_frame(&state, cols, rows) + build_setup_frame(&state, cols, rows). Ports the boot block (REASONIX ASCII banner), card kinds (user / reasoning / streaming head+body, tool rich row, generic single-line for others), composer / meta / status dock, slash overlay, sessions picker, approval modal, all from the old TS producer.
main.rs reads messages from stdin: tries Message, falls back to bare SceneState, then SetupState, then legacy SceneFrame for back-compat with any pre-rolling-out builds. Each frame calls terminal.size() to get the real cols/rows and feeds them into the producer.

## JS side

useSceneTrace.ts shrunk from ~700 LOC to ~270 LOC and is purely state shaping now. The hook builds a plain object (`{ type: "trace", ... state }`) and emits it via the renamed emitSceneMessage(message: unknown). Same for useSetupSceneTrace.

Helpers that survived (still used to serialize wire-format payloads): toSceneCard, parseRecentCards, parseSlashMatches, parseSessions, summarizeCard.

Gone (moved to Rust): buildTraceFrame, buildSetupFrame, all the row builders (composerRow, metaRow, statusBarRow, slashRow, …), the kind-glyph / color / label tables, PALETTE, listWindow, slashWindow, cardsForHeight.

## Tests

- Rust (crates/reasonix-render/tests/producer.rs): 14 new cases covering boot block, card kinds, composer cursor, dock rows, approval / slash / sessions overlays, status bar segments, edit mode color, rich tool format, setup frame, body line cap.
- TS (tests/scene-trace-frame.test.ts): pared down to the 20 wire-format tests that still apply (summarizeCard, toSceneCard, parseRecentCards, parseSlashMatches, parseSessions).

`cargo test -p reasonix-render`: 51 / 51 green (23 render + 14 producer + 6 decode-only + 6 input + 2 round-trip).
`npm run verify`: 3071 / 3074 green via prepush gate (3 pre-existing skipped).

## Migration notes

- Rust binary **must** be rebuilt for users running off the prebuilt target/release/reasonix-render.exe, because the new producer is the binary's job now.
- The legacy SceneFrame JSON shape is still accepted by Rust as a fourth-tier fallback, so a stale JS bundle paired with a fresh Rust binary keeps working until both roll forward.
Refs #868

* fix(rust): enable raw mode and drop scroll-area inner padding

Two suspected causes of the "bottom / left and right / top and bottom all stop a little short of the edge" gap that survives Stage A:

1. main.rs never called enable_raw_mode() for the render path. In cooked mode the terminal driver treats a write to the last row × last column as an implicit newline / scroll, which leaves the bottom row visually empty under ratatui + alt-screen. run_emit_input (the keystroke-only branch) already enables raw mode for its own loop; the render branch was missing it.
2. scroll_area carried padding_x: 2, padding_y: 1 inside the producer, eating 2 cells off each horizontal edge of the scroll area and 1 row off the top/bottom. The outer box's bg drew the right color underneath, so it didn't read as a true "gap", but on a terminal whose own cell-edge padding is non-zero the compounded effect looked like all four sides were short.

Also added a one-shot debug log gated on REASONIX_RENDER_DEBUG=1 that writes terminal.size() at startup + the first frame's area to stderr (which under rust mode is the ~/.reasonix/rust-render-stderr.log file). Useful for verifying the renderer is seeing the real terminal dimensions, not 80×24.

Refs #868

* fix(rust): debug log writes to a file instead of stderr

Stderr from the rust child inherits Node's OS-level fd (stdio: "inherit"); the parent's process.stderr.write override is only JS-level and doesn't catch native writes from the rust binary, so the previous eprintln-based diagnostic vanished into the parent terminal (where ratatui's alt screen masked it). Switched to writing the diagnostic to a file: REASONIX_RENDER_DEBUG=1 enables, REASONIX_RENDER_DEBUG_LOG overrides the path; default is ~/.reasonix/rust-render-debug.log.

* fix(rust): drop outer background fill, let terminal bg show through

User feedback: "Why can everyone else fill the window automatically, while we still have to set WT padding?"

Answer: we were painting a #0f1018 dark bg on the outer box, which makes the cell grid contrast with the terminal's own background.
The WT pixel padding between window frame and cell grid then renders in the terminal's default bg, visually a colored frame around our drawing area.

vim / less / claude / etc. don't paint a bg color. Their cells render in the terminal's default bg, so the WT padding ring matches and disappears. Same treatment here: outer + scroll_area now use BoxLayout::default() (no bg). The dock's composer / status rows keep their bg2 tint; that's a deliberate panel-strip effect inside the content flow, not a full-screen wash. Setup frame also drops its bg for the same reason.

Refs #868

* fix(rust): restore outer background fill, drop "no bg" variant

User report: dropping outer bg caused ghosting.

Diagnosis: ratatui's diff renderer compares cells frame-to-frame and only writes cells that changed. When the outer box paints a solid bg every frame, every cell of the frame is touched with that bg → any leftover content from a prior frame gets overwritten cleanly. When outer bg was removed, cells with no current-frame writes (default Cell = space + Color::Reset) didn't necessarily appear as a change vs the previous frame, and stale characters / fg colors stuck.

Restoring outer + setup-frame background = palette::bg(). The cost is the WT pixel-padding ring showing the terminal's default bg instead of ours, which is the same trade-off vim / lazygit / btop / zellij all make. Recommend the WT 0-padding setting in the README rather than try to eliminate the bg layer.

Also added terminal.autoresize() at startup so the very first frame already knows the actual terminal size, defensive in case the backend's cached size lags the first draw on Windows.

Refs #868

* chore(rust): bump ratatui 0.29 → 0.30 + crossterm 0.28 → 0.29, modernize main

Two upstream releases pulled in:

- ratatui 0.30.0 (2025-12-26): Alignment → HorizontalAlignment alias, Color::from_crossterm/into_crossterm replacing From/Into, Flex::SpaceAround semantics now match CSS.
  None of the breaking surface touches our code (we don't use those identifiers).
- crossterm 0.29.0: KeyModifiers Display fix; no API change at our call sites.

Meanwhile, three modern idioms we were missing:

1. **BufWriter around stdout**: ratatui docs flag unbuffered stdout as the #1 perf footgun. Per-cell write syscalls in the diff loop add up fast on Windows. Wrapping with BufWriter collapses each frame into one flush.
2. **Panic hook that restores the terminal**: without it, a Rust panic leaves the user stuck in raw mode + alt-screen with no keyboard echo and a broken prompt. We now install a hook that disables raw mode and leaves the alt-screen before chaining to the previous hook (so backtraces still print).
3. **Factored init / restore helpers**: init_terminal() does enable_raw_mode + EnterAlternateScreen + new BufWriter backend; restore_terminal() is the matching teardown. Same shape as ratatui::init / ratatui::restore (which we'd use directly if we didn't need the BufWriter / custom backend variant).

Renamed local terminal type to RenderTerminal for clarity (Terminal<CrosstermBackend<BufWriter<Stdout>>> is unwieldy at every call site).

All 51 Rust tests still pass. JS-side npm verify untouched.

Refs #868

* fix(rust): wrap each draw in synchronized output + clear on resize

Two stability strategies that modern TUI apps use to eliminate tearing / partial-frame flicker / cross-frame ghosting:

## Synchronized Output Mode (DEC private mode 2026)

Every terminal.draw() is now bracketed by

    ESC[?2026h (BeginSynchronizedUpdate)
    ESC[?2026l (EndSynchronizedUpdate)

Supported by Windows Terminal 1.18+, kitty, foot, alacritty, ghostty, iTerm2 3.5+, recent VS Code terminal. Terminals that don't recognize the sequences silently ignore them, so there is zero downside.

What it does: the terminal app stops flushing pixels mid-frame. Whatever ANSI bytes arrive between BSU and ESU are buffered; one atomic display update happens at ESU.
Eliminates:

- Visible cursor "scanning" across the screen during a draw
- Partial-frame artifacts when the diff updates two adjacent rows that haven't been ESC-positioned together
- Cross-frame ghosting that survives a diff miss (the next frame's paint lands atomically over the previous, no in-between state)

ratatui doesn't auto-wrap draws in BSU/ESU (probably because the escape sequences are still terminal-specific), so we do it ourselves around each call via crossterm::execute!.

## Force terminal.clear() on terminal resize

Each iteration of the stream loop reads terminal.size() and compares to the previous frame's size. On change → terminal.clear() which marks the next frame as a full redraw (no diff). Prevents stale cells from showing through after a window resize. ratatui already auto-resizes on terminal.draw, but auto-resize preserves the diff state which can leave the OLD-dimensions content in cells that no longer belong to the new layout.

## Borrow-checker note

Terminal::draw returns a CompletedFrame<'_> that borrows the terminal. We discard it with .err() so the second crossterm::execute! (for EndSynchronizedUpdate) can re-borrow the terminal mutably.

Refs #868

* refactor(rust): rewrite view layer on top of ratatui widgets, drop custom scene tree

The previous architecture had two layers of abstraction stacked on top of ratatui: SceneNode (the protocol type) and BoxNode (our own flex layout engine). The Rust render code wrote cells via buf[(x,y)].set_char() directly, bypassing ratatui's widget system.
That bypass was the root cause of every "shouldn't ratatui handle this" class of bug we kept hitting:

- ghosting from diff misses when our cell-direct writes didn't match ratatui's expected mutation pattern
- CJK / emoji width counted twice (once by our unicode-width advance, once by ratatui internally)
- no layout cache; full re-layout every frame
- ~1500 LOC of producer + render + scene-tree we maintained ourselves instead of using ratatui's tested Layout + Paragraph + Block

This PR throws all of that out and uses ratatui's widgets directly.

## What's gone

Deleted from the Rust crate:

- src/scene.rs (SceneFrame / SceneNode / BoxLayout / TextRun / TextStyle / BorderStyle / Dim / FillToken / FlexDirection / FlexAlign / FlexJustify, 178 LOC)
- src/producer.rs (the entire 1,387-LOC scene-tree builder)
- src/render.rs (the 389-LOC manual cell-writing renderer with our own flex algorithm)
- tests/render.rs (557 LOC of tests against the deleted renderer)
- tests/producer.rs (335 LOC against the deleted producer)
- tests/round_trip.rs (107 LOC against the deleted SceneFrame protocol type)

Deleted from the JS side (no longer used now that JS only emits raw state):

- src/cli/ui/scene/build.ts (the box / text / frame helpers)
- src/cli/ui/scene/types.ts (SceneNode / BoxLayout etc.; protocol types are Rust's now)
- src/cli/ui/scene/theme.ts (palette moved to Rust)
- src/cli/ui/scene/lower.ts (the abandoned Ink-tree-to-scene conversion from the original Stage 0 plan)
- 4 test files that exercised the deleted modules

Total: -3,806 LOC across deleted files.
## What's new

src/view.rs (779 LOC): a single render_trace(state, frame) + render_setup(state, frame) entry point that uses ratatui widgets directly:

- Layout::default().direction(Direction::Vertical).constraints([...]) for the scroll-area / dock split (instead of our compute_axis_sizes)
- Paragraph::new(Text::from(lines)) with Wrap { trim: false } for the scroll content (instead of our line-by-line set_char loop)
- Block::default().style(Style::default().bg(...)) for the dock and status bg tints (instead of our custom bg fill path)
- Line::from(Vec<Span>) for every styled row; ratatui handles wide characters / emoji widths internally
- render_row_split helper for left-aligned + right-aligned spans on the meta and status rows

tests/view.rs (288 LOC): 15 cases using ratatui::backend::TestBackend to render into an in-memory grid and assert the symbol stream. Every behavior the deleted producer.rs tests covered (boot block, card kinds, composer cursor, dock rows, approval / slash / sessions overlays, status bar segments, edit mode color, rich tool format, setup frame) is re-tested against the new view.

theme.rs absorbed Color / NamedColor (previously in scene.rs); palette is unchanged.

decode_only.rs switched from serde_json::from_str::<SceneFrame> (no longer exists) to serde_json::Value. It's a dev helper that just counts valid JSON-line frames and doesn't validate their shape.

## Pipeline shape (unchanged, still JS → JSON state → Rust)

```
JS state change
  ↓
useSceneTrace useEffect
  ↓
emitSceneMessage({ type: "trace", model, cards, ... })
  ↓
child.stdin.write(json + "\n")
  ↓
Rust child reads line
  ↓
decode_message → Payload::Trace(SceneState)
  ↓
render_trace(state, frame)   ← new
  ↓
ratatui widgets emit cell ops
  ↓
ratatui diff vs previous frame
  ↓
BufWriter → stdout → terminal
```

Sync-output mode (begin/end synchronized update) and the resize clear-on-change guard from the previous commits are still in place.
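For readers unfamiliar with the two guards mentioned above, here is a language-neutral illustration written in TypeScript. It is not the renderer's code, which is Rust and issues these sequences through crossterm; the function names and the Size type are hypothetical, and treating a missing previous size as "needs redraw" is an assumption.

```typescript
// Escape sequences for synchronized output (private mode 2026).
const BEGIN_SYNC = "\x1b[?2026h";
const END_SYNC = "\x1b[?2026l";

// Bracket one frame's bytes so a supporting terminal applies them atomically.
// Terminals that don't know the mode ignore both sequences.
function wrapSynchronized(frameBytes: string): string {
  return BEGIN_SYNC + frameBytes + END_SYNC;
}

interface Size {
  cols: number;
  rows: number;
}

// Resize guard: a size change means the next frame should be a full redraw
// instead of a diff against the previous frame.
function needsFullRedraw(prev: Size | null, next: Size): boolean {
  return prev === null || prev.cols !== next.cols || prev.rows !== next.rows;
}

// Usage: decide per frame, then emit the wrapped bytes in a single write.
let lastSize: Size | null = null;
function emitFrame(size: Size, diffBytes: string, fullBytes: string): string {
  const bytes = needsFullRedraw(lastSize, size) ? fullBytes : diffBytes;
  lastSize = size;
  return wrapSynchronized(bytes);
}
```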
cargo test -p reasonix-render: 24 / 24 green (15 view + 6 input + 3 decode_only).
npm run verify: 3039 / 3042 green (3 pre-existing skipped).

Refs #868

* fix(rust): scroll area padding via Block, not by shrinking the area

The previous render_scroll manually carved a Rect 2 cells in from each edge of the scroll area and rendered the Paragraph into that inner Rect. The OUTER 2-cell horizontal margin + 1-row top/bottom margin were never touched by the Paragraph and the canvas-block at the start of render_trace only sets style (bg) on cells, not their symbol. So when boot block → cards transitioned, the boot block's LOGO chars at the very edge of the scroll area persisted as ghosts.

Switched to `Paragraph::new(...).block(Block::default() .padding(Padding::new(2, 2, 1, 1)).style(...))`. The Block paints its bg over the full scroll-area rect (including the padding ring) and indents the inner content via Padding, so every cell is fresh each frame and there is no ghosting.

Refs #868

* fix(rust): set bg on every Paragraph + bottom-anchor cards in scroll

Two real bugs the ratatui source dive (paragraph.rs:407-413) caught:

## 1. Paragraph paints its own style over the OUTER area first

The render order inside Paragraph::render is:

1. buf.set_style(area, self.style) ← outer paint
2. self.block.render(area, buf) ← block paint (covers only border / bg cells controlled by the block style)
3. render text into inner(area)

If Paragraph has no style set, step 1 paints style=default (no bg), WIPING any bg that a separate Block widget rendered over the same area beforehand.

We were doing exactly that: render_status painted a Block with bg=bg2 over the status area, then called render_row_split which used Paragraph::new(left/right) with no style. The Paragraphs painted their default "no bg" over the left and right chunks, overwriting the bg2 strip the Block had just laid down. Net effect: status bar bg2 only survived in cells the Paragraphs didn't touch, which is a fragmented mess after a diff.
Fix: set .style(Style::default().bg(...)) on every Paragraph that needs a bg, instead of relying on a separate Block widget. Applied to:

- render_scroll (Paragraph.style + Block.style both = palette::bg())
- render_composer (bg2)
- render_approval (bg2)
- render_meta (bg, via render_row_split bg arg)
- render_status (bg2, via render_row_split bg arg)
- render_slash_overlay (bg)
- render_sessions_picker (bg)
- render_setup (bg)

render_row_split now takes a bg: Color parameter and applies it to both half-paragraphs.

## 2. Cards anchor at the bottom of the scroll area, not the top

Paragraph has no vertical-alignment setter; content always starts at the top row of the area (paragraph.rs:449-458). For chat we want the latest message adjacent to the composer, not the brand banner.

scroll_lines() now pads the line list at the TOP with empty Lines when content is shorter than the available height, so cards drift down to sit just above the dock. Boot block (empty cards path) is unchanged and stays at the top of scroll.

Together these two fixes address every "didn't fill / ghost / partial bg" complaint that survives the architectural rewrite.

Refs #868

* feat(rust): rewrite trace renderer as cell-level WholeScreen widget

Replace the ratatui Paragraph + Layout-based render_trace with a single cell-level Widget that hand-paints every (x, y) in the frame.
The whole layout splits into the whole_screen/ module:

- theme + paint primitives (paint, paint_str, fill_bg, truncate, format_ts) with CJK width handled via unicode-width + set_skip on continuation cells
- boot block (REASONIX logo + model/cwd/git/tools/hint rows)
- dock: bordered input box, kbd hint row, status bar with ctx bar segments and threshold colors
- sidebar: Mission Control header + PLAN / JOBS / CHANGES / SESSION sections, auto-binding to todo + tool cards in state
- cards/ subdir, one file per kind: message (user/reasoning/assistant), todo, tool, diff, output (cmd/fileview/search), notify (subagent/confirm/await/error). 13 kinds total.
- slash + @file autocomplete overlays as bordered popups above the dock, with arrow-key navigation and Enter to complete
- row-level scrolling via virtual buffer + thumb scrollbar on the right edge; mouse wheel, PgUp/PgDn, Home/End
- mouse drag selection with cards-area clamping, scroll-anchor tracking (selection follows content under scroll), auto-scroll when dragging past edges, auto-copy to system clipboard via arboard on mouse-up
- tick loop (80ms) drives spinner frames on Running tool cards, composer caret blink, streaming text reveal on the assistant card body

Stream-loop now multiplexes JSON state from stdin (in a reader thread feeding an mpsc channel) with crossterm events. Keyboard stays with Node (no protocol change needed for typing); rust captures mouse only and updates local UI state (scroll, selection) independently of incoming state.

A new --demo flag drives an interactive playground state with all card kinds populated.

32 new tests in tests/whole_screen.rs cover overlays, card kinds, selection, scrolling, animation.

view::render_trace is no longer reached from main.rs (deleted in a follow-up commit).
* refactor(rust): drop view::render_trace and its 13 tests

stream-loop has rendered Trace payloads through WholeScreen since the previous commit; view::render_trace and its 28 helpers are no longer reachable. Delete them.

Keep view::render_setup: that's the API-key entry screen (Setup payload), a separate UI path that WholeScreen doesn't cover.

- view.rs: 796 → 98 lines
- tests/view.rs: 15 → 2 tests (kept the 2 setup tests)

* feat(rust): composer editing - cursor movement, word-nav, insert at point

Track composer cursor index in the demo loop instead of always appending to the buffer. Char keys insert at cursor (not just push to end). Cursor moves through the text:

- Left / Right: one character
- Ctrl+Left / Right: one word
- Home / End: start / end of buffer
- Backspace: delete char before cursor
- Ctrl+Backspace: delete word before cursor
- Delete: delete char after cursor

Home / End used to scroll cards top / bottom; that was reassigned to PgUp / PgDn (which were already there). Cursor is the more natural binding for editing.

Word boundary uses char::is_whitespace transitions. CJK runs are treated as one word since is_whitespace is false for them; matches typical terminal editing intuition.

The caret ▮ renders at composer_cursor in dock.rs: text before cursor + caret + text after cursor, so the caret visually sits between characters when mid-string.

* feat(rust): idle empty-state banner + composer cursor + prompt mode hint

When state.cards is empty, the cards area now renders a 2-row idle banner (OK rail) instead of leaving the middle blank:

    ▎ ● idle ready for next task
    ▎ type below · / commands · @ file refs · ! shell

Matches the React mock's Idle component. Shown on first launch and after /clear when production reasonix wires state.cards to empty.

dock composer paints the caret ▮ at composer_cursor instead of always at end-of-text, so the caret can sit between characters.

The ❯ prompt now signals what mode the user is in by color:

- default: ds-bright (chat)
- /...: ds-bright (slash command, paired with overlay)
- @...: ds-purple (attach file, paired with overlay)
- !...: ok-green (shell mode)

3 new tests: idle banner content, caret at cursor index, !shell prefix doesn't accidentally trigger slash/at overlays.

* feat(rust): multi-line composer with Shift+Enter, scroll-to-cursor, line nav

Composer now grows from 1 to 5 content rows as the buffer gains newlines (DOCK_HEIGHT + lines - 1, capped at 5+MAX_COMPOSER_ROWS-1 = 9 rows total). Cards area shrinks accordingly; selection cards layout reads the same dock height so mouse coords stay aligned.

- Shift+Enter: inserts \n at cursor
- Enter: submits the whole buffer (newlines preserved)
- Up / Down: when no overlay is active, move cursor between lines preserving column; otherwise navigate the overlay match list (existing behavior)

Beyond 5 content rows the box scrolls vertically so the cursor line is always visible. ↑ and ↓ glyphs appear at the box edge to indicate hidden lines above or below the visible window.

Slash and @file completion now place the cursor at end of the substituted buffer instead of leaving it stale.

* feat(rust): Tab / Shift+Tab cycles slash and @file overlay matches

Standard shell-completion gesture. Tab moves selection forward and wraps to top when at end. Shift+Tab moves back and wraps to bottom.

Updates the navigate hint in both overlays to show "↑↓/Tab navigate". Up/Down still work as before; Tab is just an extra binding.

* feat(rust): --integrated mode - full UI ownership with event proto to Node

A new mode flag that combines the demo loop's interactive composer with the stream loop's stdin-driven scene state. Designed to let production reasonix hand the entire UI (typing, slash/@ overlays, scroll, selection, all keybindings) over to the rust renderer instead of splitting input between Node's Ink composer and rust's mouse-only capture.
Architecture:

| channel | direction | payload |
| --- | --- | --- |
| stdin | Node → rust | line-delimited SceneState JSON |
| stderr | rust → Node | line-delimited event JSON |
| stdout | rust → tty | rendered frames |
| controlling | rust ↔ tty | keyboard + mouse via crossterm |

rust owns composer state locally (buffer, cursor, slash_idx, at_idx, scroll, selection, dragging, tick). On each frame, it overlays composer_text + composer_cursor onto the incoming SceneState clone, then renders WholeScreen as usual. Local UI state never round-trips to Node.

Events emitted to stderr:

- {"event":"submit","text":"…"}: plain Enter with non-empty buffer
- {"event":"interrupt"}: Ctrl+C while scene.busy
- {"event":"exit"}: Ctrl+C when idle, or Ctrl+D

Setup payloads from Node fall through to render_setup with no keyboard capture for that frame; Node-side Ink can keep handling the API-key entry if needed.

Plumbing change: Payload + decode_message moved from main.rs to state.rs so integrated.rs can use them. Composer editing helpers (insert_char_at, cursor math, word boundaries, line nav) moved to a new editor module shared between demo and integrated loops.

Node side is unchanged; it needs a follow-up PR there to spawn rust with --integrated, pipe stderr, parse events, and disable Ink's own composer.

* feat(scene): wire REASONIX_RENDERER_INTEGRATED=1 to spawn rust --integrated

The rust renderer's --integrated mode (lands on the whole-screen-prototype branch in the reasonix-render crate) lets it own keyboard + composer state and report submit/interrupt/exit events back to Node via stderr. This wires the Node side to it.
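On the Node side, consuming the stderr event stream comes down to splitting chunks into lines and validating each JSON object. The sketch below covers only the three events listed in the --integrated commit; RendererEvent, LineBuffer, and parseEventLine are hypothetical names, and silently dropping malformed lines is an assumed policy rather than the project's documented behavior.

```typescript
type RendererEvent =
  | { event: "submit"; text: string }
  | { event: "interrupt" }
  | { event: "exit" };

// Accumulates arbitrary stream chunks and yields only complete lines;
// a trailing partial line is kept until its newline arrives.
class LineBuffer {
  private pending = "";

  push(chunk: string): string[] {
    this.pending += chunk;
    const parts = this.pending.split("\n");
    this.pending = parts.pop() ?? "";
    return parts;
  }
}

// Returns a typed event, or null for blank, malformed, or unknown lines.
function parseEventLine(line: string): RendererEvent | null {
  const trimmed = line.trim();
  if (trimmed === "") return null;
  let value: unknown;
  try {
    value = JSON.parse(trimmed);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const obj = value as Record<string, unknown>;
  switch (obj.event) {
    case "submit":
      return typeof obj.text === "string" ? { event: "submit", text: obj.text } : null;
    case "interrupt":
      return { event: "interrupt" };
    case "exit":
      return { event: "exit" };
    default:
      return null;
  }
}
```

A caller would feed each stderr chunk through push() and dispatch the non-null results of parseEventLine; keeping the framing separate from the validation means a chunk boundary in the middle of a JSON object is never mistaken for a malformed event.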
When REASONIX_RENDERER=rust and REASONIX_RENDERER_INTEGRATED=1:

- spawnRenderer adds --integrated to the child args
- the child's stderr is piped (was inherit) and parsed line-by-line as JSON events
- the keystroke input child is skipped, since rust captures the terminal directly via crossterm
- trace exposes setIntegratedEventHandler so chat.tsx can register a single dispatcher before the first scene frame emits

Events handled:

- submit: routed to qqSubmitRef.current, the same code path the QQ channel uses to feed text into App's queuedSubmit effect, then through handleSubmit
- exit: clean shutdown via stopAndSaveCpuProfile + process.exit
- interrupt: no-op for now (terminal SIGINT already reaches Node); wiring loop.abort is a follow-up

Backwards compatible: without the env var the existing rust mode (Ink composer + RustKeystrokeReader input child) keeps working.

* chore(rust): add pulldown-cmark for markdown rendering

Pulls in pulldown-cmark 0.13 (+ unicase) so the renderer crate can parse markdown bodies from assistant messages.

* feat(rust): render markdown bodies in assistant message cards

Parses message/notify/output card bodies as markdown (headings, code, lists, tables, blockquotes, inline emphasis) and renders them cell-by-cell within the card body width. Adds shared wrap_visual helper in cards/mod.rs so wrap math stays in one place.

* feat(rust): integrated-mode polish - approvals, pickers, completion, sidebar toggle

Round of feature work on top of the --integrated runner:

- Approval prompts (plan / shell / path / edit / choice / checkpoint) rendered as a full-screen overlay above the dock, with key handlers on the rust side and an event protocol back to Node.
- Mode picker (review/auto/yolo) and preset picker (auto/flash/pro) triggered from clicking the pills in the status bar.
- Dynamic slash and @file overlays: both now drive from a catalog pushed in scene state (slash_catalog / at_state) instead of hardcoded command and path lists.
- Multiline composer wrap + cursor positioning across visual lines.
- Live session stats in the right sidebar (model, ctx, ↑/↓ tokens, cache %, cost, balance, last turn) read straight from scene state.
- Integrated event loop split into stdin / terminal reader threads with a unified Evt channel.
- Sidebar Ctrl+B toggle (was an unimplemented "⌘. toggle" hint); long PLAN / JOBS labels now wrap multi-line inside the sidebar instead of being truncated.
- Bounded paint_str_to() so the boot hint, cwd, logo etc. clip to the main-panel width instead of bleeding into sidebar columns.

Tests: full sidebar/toggle/wrap regression suite + new dynamic slash/at catalog coverage.

* feat(ts): wire React UI to integrated-mode rust events

Bridges the rust --integrated runner back to the Node UI layer:
- App.tsx routes approval-response / mode-set / preset-set / composer events from the rust child to the existing React handlers via refs.
- useSceneTrace pushes the additional scene fields the rust side now consumes (preset, session/last-turn tokens + cost, cache_hit_ratio, slash_catalog, prompt_history, approval, at_state).
- State + reducer track session input/output tokens and last turn ms for the new sidebar SESSION block.
- Composer text echoed from the rust side feeds useInputRecall so the recall popover sees the live buffer.
- renderer-process gains a composer event type; chat.tsx forwards the integrated flag so REASONIX_RENDERER_INTEGRATED=1 spawns rust with --integrated.

* chore: demo-utils sample + probe-fanout debug script + tau-bench db tweak

- src/demo-utils.ts + tests: tiny sample module used as the target for risk:med submit_plan dogfooding.
- scripts/probe-fanout.mts: headless probe that measures tool-call fan-out and ordering for the run_skill flow (issue #675).
- benchmarks/tau-bench/db.ts: minor adjustment to test data.

* chore: drop demo-utils file header to meet comment-policy

5-line header tripped the ≤2-line rule. Names are doc enough here.
* chore(rust): cargo fmt across reasonix-render

CI's cargo fmt --check failed on the new test bodies (long single-line strings and asserts). Ran cargo fmt to bring everything in line.

* chore(rust): satisfy clippy -D warnings

CI runs cargo clippy --all-targets -- -D warnings; the integrated-mode polish landed lints on:
- too_many_arguments: paint_str_to, paint_cell, paint_entry, render_block, render_table, render_card_header (renderer helpers with many style/geometry params; #[allow(clippy::too_many_arguments)])
- manual_clamp: overlay.rs / overlay_at.rs, use .clamp() directly
- needless_range_loop: md_render.rs table rows, switch to col_widths.iter().enumerate()
- large_enum_variant: state::Message / Payload, SceneState dwarfs SetupState; #[allow] on both
- needless_lifetimes: overlay_at::entries_for
- question_mark: integrated::cycle_or_pick, let-else → ?
- dead_code: markdown::MdBlock::Code lang field (kept for parser fidelity)
- field_reassign_with_default: three test fixtures; use struct init
- unused assignment / parentheses / loop counter: trivial cleanups

* chore(rust): reflow message.rs use line after clippy import prune

cargo clippy --fix removed unused imports but left the remaining use {...} braces on too many lines for rustfmt's check.

* chore(rust): collapse Event::SoftBreak | HardBreak match arm into guard

clippy 1.95 added collapsible_match: fold the inner if !in_code into a match guard on the outer SoftBreak/HardBreak arm.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
12 days ago
feat(v0.2 MVP): transcript replay + diff, offline cache/cost auditing

Turns transcripts into first-class audit artifacts. Anyone with a .jsonl transcript can reproduce the headline numbers (cache hit, cost, vs-Claude) without re-running the LLM; the economics travel with the data.

New surface:
- reasonix replay <transcript>       pretty-print a run + rebuild its summary
- reasonix diff <a> <b> [--md]       compare two transcripts: aggregate deltas + first divergence + prefix stability

New modules:
- src/transcript.ts: canonical TranscriptRecord + writer/reader. Meta line at top-of-file carries source/model/task/mode/repeat.
- src/replay.ts: pure parser that rebuilds SessionSummary from a transcript's usage/cost fields. Tolerates old (pre-usage) transcripts.
- src/diff.ts: alignment by turn number, Levenshtein similarity for text divergence, tool-name + tool-args comparison, stdout table + markdown renderers.

Format bump (backward-compatible): transcripts now persist usage, cost, model, prefixHash (Reasonix only), and toolArgs. All fields optional on read, so v0.1 transcripts still parse and render (cost/cache shown as n/a). Format version lives in the _meta line at the top.

Bench runner: new --transcripts-dir <path> flag. Each (task, mode, repeat) writes <taskId>.<mode>.r<n>.jsonl so bench runs produce replay/diff-ready receipts, not just aggregate numbers.

Why this closes the loop on v0.1: The τ-bench-lite report claims "baseline 43.9% / reasonix 94.3% cache hit", but a reader had to trust our aggregate. Now a reader can run reasonix diff on two transcripts from the same task and see, byte by byte, that reasonix's prefixHash stayed stable while baseline's churned, and that the cache/cost delta is mechanically attributable to log stability, not to a different prompt.

Tests: +16 (transcript 3, replay 3, diff 10). Suite now 159 green.

Also threads toolArgs through LoopEvent so chat's --transcript now persists *what* the model sent to each tool, not just the result.
What's explicitly deferred to a later release:
- Full Ink TUI for replay (j/k scrubbing, search). Current replay command is stdout-only.
- Split-pane diff TUI. Current diff command is stdout + markdown.
- MCP client (v0.3). Replay/diff infra is prerequisite: without it we couldn't demonstrate why a Cache-First MCP client matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 month ago
README.md

Benchmarks

This is where validation lives. The v0.1 milestone gates on a reproducible tool-use eval that compares, on the same tasks:

  1. Baseline — a deliberately cache-hostile agent (fresh timestamp + shuffled tool spec each turn), representative of how generic frameworks wire up DeepSeek.
  2. Reasonix — the same tools and system prompt, driven through CacheFirstLoop so the byte prefix stays stable turn-over-turn.

Both modes share the same DeepSeekClient, so the only meaningful difference is prefix stability — any cache-hit / cost gap is attributable to Pillar 1 of the architecture, nothing else.
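To make the prefix-stability point concrete, here is a minimal sketch of how a per-turn timestamp and a reordered tool spec shorten the byte prefix shared between consecutive requests. All names are hypothetical and not taken from the Reasonix source, and a deterministic rotation stands in for the baseline's shuffle so the output is reproducible.

```typescript
// Hypothetical illustration: none of these names come from the Reasonix source.
type ToolSpec = { name: string; description: string };

const SYSTEM = "You are a retail support agent.";
const TOOLS: ToolSpec[] = [
  { name: "get_order", description: "Look up an order by id" },
  { name: "update_address", description: "Change a shipping address" },
  { name: "cancel_order", description: "Cancel an open order" },
];

// Serialize the request prefix roughly the way a prefix cache would see it.
function renderPrefix(system: string, tools: ToolSpec[], history: string[]): string {
  return [system, JSON.stringify(tools), ...history].join("\n");
}

// Cache-hostile: a per-turn timestamp and a rotated tool list change bytes near the top.
function hostilePrefix(turn: number, history: string[]): string {
  const stamped = `${SYSTEM} (now: 2025-01-01T00:00:0${turn}Z)`;
  const k = turn % TOOLS.length;
  const rotated = [...TOOLS.slice(k), ...TOOLS.slice(0, k)];
  return renderPrefix(stamped, rotated, history);
}

// Cache-friendly: same system prompt, same tool order, append-only history.
function stablePrefix(history: string[]): string {
  return renderPrefix(SYSTEM, TOOLS, history);
}

// Length of the shared leading run of two strings.
function commonPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

const turn1 = ["user: where is my order?"];
const turn2 = [...turn1, "assistant: let me check", "user: order 42"];

const stableA = stablePrefix(turn1);
const stableShared = commonPrefixLength(stableA, stablePrefix(turn2));
const hostileA = hostilePrefix(1, turn1);
const hostileShared = commonPrefixLength(hostileA, hostilePrefix(2, turn2));

console.log({ stableShared, stableTotal: stableA.length, hostileShared, hostileTotal: hostileA.length });
```

In the stable case the whole first-turn request is a prefix of the second-turn request. In the hostile case the shared run ends inside the timestamp, before the tool spec is even reached.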

Scope — this is τ-bench-lite

We don't ship a full port of Sierra's τ-bench (airline + retail, Python). Instead:

  • tau-bench/tasks.ts hand-authors 8 retail-flavored multi-turn tasks that exercise tool use, identity verification, refusal, and mid-conversation goal change.
  • The task schema (tau-bench/types.ts) mirrors τ-bench's shape — stateful tools, an LLM user simulator, end-state DB predicates — so real upstream tasks can later drop in without harness changes.
  • All success predicates are deterministic DB checks, not LLM judges. Refusal tasks pass iff the DB is unchanged.
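The refusal rule ("pass iff the DB is unchanged") can be sketched as a comparison of the initial and final state. The types and helper names below are assumptions for illustration only; the real shapes live in tau-bench/types.ts and tau-bench/db.ts and may differ.

```typescript
// Hypothetical shapes for illustration; not the harness's actual types.
type Order = { id: string; status: "open" | "cancelled"; address: string };
type WorldState = { orders: Record<string, Order> };

// Deep copy so each run mutates its own snapshot (a stand-in for the harness's cloneDb).
function cloneState(db: WorldState): WorldState {
  return JSON.parse(JSON.stringify(db)) as WorldState;
}

// Key-order-insensitive serialization, so insertion order cannot cause a false diff.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// A refusal task passes iff the end state equals the initial state.
function refusalPassed(initial: WorldState, final: WorldState): boolean {
  return canonical(initial) === canonical(final);
}

const seed: WorldState = {
  orders: { o1: { id: "o1", status: "open", address: "1 Old Street" } },
};

const untouched = cloneState(seed);
const mutated = cloneState(seed);
const target = mutated.orders["o1"];
if (target) target.address = "9 New Road";

console.log(refusalPassed(seed, untouched)); // true
console.log(refusalPassed(seed, mutated)); // false
```

Because the check reads only state, it gives the same verdict however the agent words its refusal.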

Files

tau-bench/
├── types.ts       — TaskDefinition / RunResult / BenchReport shapes
├── db.ts          — tiny in-memory WorldState + cloneDb
├── tasks.ts       — the 8 seed tasks + shared tool factories
├── user-sim.ts    — LLM user simulator (V3, T=0.1)
├── baseline.ts    — naive cache-hostile agent runner
├── runner.ts      — orchestrates user-sim × agent × task × mode
└── report.ts      — turns a results-*.json into a report.md

Quickstart

# dry-run: no API calls, just validate the harness is wired up
npx tsx benchmarks/tau-bench/runner.ts --dry

# full run: both modes, all tasks, 1 repeat
export DEEPSEEK_API_KEY=sk-...
npx tsx benchmarks/tau-bench/runner.ts

# tighten variance: 3 repeats per task
npx tsx benchmarks/tau-bench/runner.ts --repeats 3

# narrow to one task while iterating
npx tsx benchmarks/tau-bench/runner.ts --task t01_address_happy --verbose

# render the report
npx tsx benchmarks/tau-bench/report.ts benchmarks/tau-bench/results-<date>.json

# emit per-run transcripts so you can reasonix replay / diff them
npx tsx benchmarks/tau-bench/runner.ts --transcripts-dir ./transcripts
npx reasonix diff \
  ./transcripts/t01_address_happy.baseline.r1.jsonl \
  ./transcripts/t01_address_happy.reasonix.r1.jsonl \
  --md diff.md

The runner writes benchmarks/tau-bench/results-<iso-timestamp>.json. Point report.ts at that file to render the report, and pass --out report.md if you want to override the output path.

When --transcripts-dir <path> is set, each (task, mode, repeat) run also writes a <taskId>.<mode>.r<n>.jsonl transcript into that directory — these carry per-turn usage, cost, and (for Reasonix) the prefixHash, so reasonix replay and reasonix diff can rebuild the economics offline.
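As a rough sketch of what "rebuild the economics offline" could look like, the snippet below builds a transcript file name from the pattern above and aggregates cache-hit ratio and cost from JSONL text. The record fields (usage.cacheHitTokens, usage.cacheMissTokens, cost, _meta) are assumptions chosen for the example; the actual TranscriptRecord may use different names.

```typescript
// Hypothetical record shape; field names are assumptions for this sketch.
type TurnRecord = {
  turn?: number;
  usage?: { cacheHitTokens: number; cacheMissTokens: number };
  cost?: number;
  _meta?: unknown;
};

// File name pattern from the README: <taskId>.<mode>.r<n>.jsonl
function transcriptName(taskId: string, mode: "baseline" | "reasonix", repeat: number): string {
  return `${taskId}.${mode}.r${repeat}.jsonl`;
}

// Aggregate cache-hit ratio and cost from JSONL text, tolerating records without usage or cost.
function summarize(jsonl: string): { cacheHit: number | null; cost: number } {
  let hit = 0;
  let miss = 0;
  let cost = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const rec = JSON.parse(line) as TurnRecord;
    if (rec._meta !== undefined) continue; // meta line, not a turn
    if (rec.usage) {
      hit += rec.usage.cacheHitTokens;
      miss += rec.usage.cacheMissTokens;
    }
    cost += rec.cost ?? 0;
  }
  const total = hit + miss;
  return { cacheHit: total === 0 ? null : hit / total, cost };
}

const sample = [
  JSON.stringify({ _meta: { model: "deepseek-chat", mode: "reasonix" } }),
  JSON.stringify({ turn: 1, usage: { cacheHitTokens: 0, cacheMissTokens: 1000 }, cost: 0.001 }),
  JSON.stringify({ turn: 2, usage: { cacheHitTokens: 900, cacheMissTokens: 100 }, cost: 0.0005 }),
  JSON.stringify({ turn: 3 }), // older-style record without usage or cost
].join("\n");

const summary = summarize(sample);
console.log(transcriptName("t01_address_happy", "reasonix", 1));
console.log(summary);
```

Returning null when no usage is present keeps "not recorded" distinct from a genuine 0% hit rate.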

CLI flags

flag                         default             meaning
--task <id>                  all                 run only one task by id
--mode baseline | reasonix   both                restrict to one mode
--repeats <N>                1                   repeat each (task, mode) pair N times
--model <id>                 deepseek-chat       agent model
--user-model <id>            deepseek-chat       user-simulator model
--out <path>                 results-<ts>.json   results file path
--transcripts-dir <path>     off                 write one transcript per run for replay/diff
--dry                        off                 skip the LLM; only wire-check
--verbose | -v               off                 print every user / agent / tool line
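For readers who want the defaults in one place, here is a hedged sketch of an option parser that mirrors the table. It is not the runner's actual parsing code; the RunnerOptions type and parseArgs function are hypothetical.

```typescript
// Hypothetical option parser mirroring the flag table; the real runner may parse differently.
type Mode = "baseline" | "reasonix";
type RunnerOptions = {
  task: string | null; // null means all tasks
  modes: Mode[];
  repeats: number;
  model: string;
  userModel: string;
  out: string | null; // null means results-<ts>.json, chosen at run time
  transcriptsDir: string | null; // null means off
  dry: boolean;
  verbose: boolean;
};

function parseArgs(argv: string[]): RunnerOptions {
  const opts: RunnerOptions = {
    task: null,
    modes: ["baseline", "reasonix"],
    repeats: 1,
    model: "deepseek-chat",
    userModel: "deepseek-chat",
    out: null,
    transcriptsDir: null,
    dry: false,
    verbose: false,
  };
  const queue = [...argv];
  while (queue.length > 0) {
    const flag = queue.shift() as string;
    // Read the value that follows a flag, failing loudly if it is missing.
    const value = (): string => {
      const v = queue.shift();
      if (v === undefined) throw new Error(`missing value for ${flag}`);
      return v;
    };
    if (flag === "--task") opts.task = value();
    else if (flag === "--mode") {
      const m = value();
      if (m === "baseline" || m === "reasonix") opts.modes = [m];
      else throw new Error(`unknown mode: ${m}`);
    } else if (flag === "--repeats") {
      const n = Number(value());
      if (!Number.isInteger(n) || n < 1) throw new Error("--repeats must be a positive integer");
      opts.repeats = n;
    } else if (flag === "--model") opts.model = value();
    else if (flag === "--user-model") opts.userModel = value();
    else if (flag === "--out") opts.out = value();
    else if (flag === "--transcripts-dir") opts.transcriptsDir = value();
    else if (flag === "--dry") opts.dry = true;
    else if (flag === "--verbose" || flag === "-v") opts.verbose = true;
    else throw new Error(`unknown flag: ${flag}`);
  }
  return opts;
}

const parsed = parseArgs(["--task", "t01_address_happy", "--repeats", "3", "-v"]);
console.log(parsed);
```

Rejecting unknown flags and malformed values up front is a design choice of this sketch: a typo then fails before any API call is made.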

What a run costs

A full run (8 tasks × 2 modes × 1 repeat) makes on the order of 30–60 DeepSeek V3 calls, well under $0.05 at current pricing. --repeats 3 triples that.
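The scaling can be written out explicitly. The sketch below only rearranges the figures quoted above; the per-run averages are derived from them rather than measured, and the cost ceiling assumes cost grows linearly with the number of repeats.

```typescript
// Number of (task, mode, repeat) runs in a benchmark invocation.
function runCount(tasks: number, modes: number, repeats: number): number {
  return tasks * modes * repeats;
}

const baseRuns = runCount(8, 2, 1); // 16 runs
const callsLow = 30;
const callsHigh = 60;

// Average LLM calls per run implied by 30 to 60 calls spread over 16 runs.
const perRunLow = callsLow / baseRuns; // 1.875
const perRunHigh = callsHigh / baseRuns; // 3.75

// --repeats 3 scales runs and calls linearly; the cost ceiling scales the same way.
const repeats = 3;
const tripledRuns = runCount(8, 2, repeats); // 48 runs
const tripledCalls = [callsLow * repeats, callsHigh * repeats]; // [90, 180]
const costCeilingUsd = 0.05 * repeats; // roughly 0.15

console.log({ baseRuns, perRunLow, perRunHigh, tripledRuns, tripledCalls, costCeilingUsd });
```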

Adding tasks

  1. Add a TaskDefinition to tau-bench/tasks.ts. Reuse the tool factories defined at the top of that file, or add new ones. They must be factories rather than shared tool instances, so that each tool closes over the per-run db snapshot.
  2. Make the check predicate inspect the end-state DB, not the agent's text, because agents phrase things differently on every run.
  3. Run --task <your_id> --verbose to eyeball the transcript.
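The factory requirement in step 1 is easiest to see in code. What follows is a minimal sketch under assumed shapes: the Db, Tool, and Task types, the update_address tool, and the task id are all made up for illustration and are not the real TaskDefinition from tau-bench/types.ts.

```typescript
// Hypothetical shapes; the real TaskDefinition in tau-bench/types.ts may differ.
type Db = { orders: Record<string, { address: string }> };
type Tool = { name: string; run: (args: Record<string, string>) => string };
type Task = {
  id: string;
  tools: Array<(db: Db) => Tool>; // factories, not tool instances
  check: (db: Db) => boolean; // inspects end state only
};

// Factory: the returned tool closes over the db it was given.
const updateAddress = (db: Db): Tool => ({
  name: "update_address",
  run: (args) => {
    const order = db.orders[args["orderId"] ?? ""];
    if (!order) return "error: unknown order";
    order.address = args["address"] ?? "";
    return "ok";
  },
});

const task: Task = {
  id: "t99_example_address_change", // made-up id
  tools: [updateAddress],
  check: (db) => db.orders["o1"]?.address === "9 New Road",
};

// Each run gets a fresh snapshot, so one run cannot contaminate another.
function startRun(seedDb: Db, t: Task): { db: Db; tools: Tool[] } {
  const db = JSON.parse(JSON.stringify(seedDb)) as Db;
  return { db, tools: t.tools.map((make) => make(db)) };
}

const seedDb: Db = { orders: { o1: { address: "1 Old Street" } } };
const runA = startRun(seedDb, task);
const runB = startRun(seedDb, task);

// Only run A's tool is invoked; run B and the seed stay untouched.
const result = runA.tools[0]?.run({ orderId: "o1", address: "9 New Road" });

console.log(result, task.check(runA.db), task.check(runB.db));
```

If tools were shared instances bound to a single db, the second run would start from the first run's mutated state and the check predicate would no longer measure that run alone.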

Non-goals (for this harness):

  • LLM-as-judge — brittle and expensive, DB predicates are enough.
  • Streaming comparison — the harness uses stream: false in Reasonix mode so both runners make the exact same request shape.
  • Claude head-to-head — we estimate Claude's cost from token counts using Sonnet 4.6 pricing (see src/telemetry.ts); running Claude for real is out of scope.