eval
Execute Python or JavaScript code in persistent cell-based runtimes.
Notice: Do not shell out to
python -c/python -e,bun -e, ornode -evia thebashtool for ad-hoc code execution. Use this tool instead — it gives you persistent state across cells, structureddisplay()output, image/JSON capture, and proper cancellation/timeout handling that one-shot-e/-cinvocations cannot provide.
Source
- Entry:
packages/coding-agent/src/tools/eval.ts - Model-facing prompt:
packages/coding-agent/src/prompts/tools/eval.md - Key collaborators:
packages/coding-agent/src/eval/backend.ts— backend execution contractpackages/coding-agent/src/eval/agent-bridge.ts— host-sideagent()bridge into the subagent executorpackages/coding-agent/src/eval/js/executor.ts— JS backend adapterpackages/coding-agent/src/eval/js/worker-core.ts— JS execution, VM context, display/log capturepackages/coding-agent/src/eval/js/shared/prelude.txt— JS global helper installerpackages/coding-agent/src/eval/js/shared/helpers.ts— JS filesystem/text/env helper implementationspackages/coding-agent/src/eval/py/index.ts— Python backend adapterpackages/coding-agent/src/eval/py/executor.ts— kernel session retention, reset, cleanuppackages/coding-agent/src/eval/py/kernel.ts— Jupyter gateway/kernel protocol, display capturepackages/coding-agent/src/eval/py/prelude.py— Python helper functions and status eventspackages/coding-agent/src/session/streaming-output.ts— truncation, artifacts, streamed chunksdocs/python-repl.md— Python kernel/gateway internals
Inputs
Tool parameters are a JSON object with a single cells field — an ordered array of cell objects. Each cell is a structured record; there is no *** Cell header parsing, no language sniffing, and no implicit single-cell fallback. Cells run in array order; state persists within each language across cells and across tool calls.
| Field | Type | Required | Description |
|---|---|---|---|
cells |
EvalCellInput[] |
Yes | Cells executed in order. At least one cell is required (.min(1)). |
Each EvalCellInput (from evalCellSchema in packages/coding-agent/src/tools/eval.ts):
| Field | Type | Required | Description |
|---|---|---|---|
language |
"py" | "js" |
Yes | Backend selector. "py" maps to the IPython/Jupyter kernel (python backend); "js" maps to the persistent JavaScript VM. |
code |
string |
Yes | Cell body, verbatim. JSON-encoded — embed newlines, quotes, and indentation directly; no fences, no headers. |
title |
string |
No | Short label rendered in the transcript (e.g. "imports", "load config"). |
timeout |
integer |
No | Per-cell timeout in seconds, clamped to 1..600. Defaults to 30 when omitted. |
reset |
boolean |
No | Wipe this cell's language kernel before running. Reset is per-language: a py cell's reset does not touch the JS VM and vice versa. Defaults to false. |
Minimal example matching the live schema:
{
"cells": [
{ "language": "py", "title": "imports", "timeout": 10, "code": "import json\nfrom pathlib import Path" },
{ "language": "py", "title": "load config", "code": "data = json.loads(read('package.json'))\ndisplay(data)" },
{ "language": "js", "title": "summary", "reset": true, "code": "const data = JSON.parse(await read('package.json'));\ndisplay(data);\nreturn data.name;" }
]
}
Outputs
Final result from EvalTool.execute() is single-shot, but onUpdate streams partial text and details while cells run.
Returned shape:
content: one text block containing combined cell output,(displayed N image(s); no text output)when only images exist, or(no output)when nothing visible was produced; image outputs are appended as additional image content blocks.details(EvalToolDetailsfrompackages/coding-agent/src/eval/types.ts):cells: per-cell code, status (pending/running/complete/error), output, duration, exit code, status events, markdown flaglanguage: first backend usedlanguages: distinct backends used, in first-use orderjsonOutputs: structured values emitted viadisplay(...)statusEvents: aggregated helper/tool status eventsnotice: backend fallback notice (currently unused; reserved for future per-cell notices)meta: truncation metadataisError: set on cell failure or cancellation
Renderer behavior in packages/coding-agent/src/tools/eval.ts:
- call preview renders each cell's
codewith syntax highlighting based on its declaredlanguage - result view renders each cell separately, including status, duration, and output
- markdown outputs are rendered with the Markdown component instead of plain text
jsonOutputsrender as a tree, collapsed or expanded depending on UI state- timeout / truncation notices render as dim metadata lines
- images are returned as content image blocks; live updates may also carry
details.imageswhile execution is in progress
Side-channel artifacts:
session.allocateOutputArtifact?.("eval")may allocate anartifact://...backing store for spilled output.- Truncated output metadata points at that artifact when available.
Flow
EvalTool.execute()inpackages/coding-agent/src/tools/eval.tsreceivesparams.cellsalready validated by the Zod schema — no string parsing step.- For each cell,
execute()mapscell.languageto anEvalLanguage("py"→"python","js"→"js") and callsresolveBackend(session, language):pythonis gated oneval.py !== falseandpythonBackend.isAvailable(session).jsis gated oneval.js !== false.- A disabled or unavailable requested backend throws
ToolError; there is no auto-fallback or sniffing.
- The tool allocates an
OutputSink, aTailBuffer, per-cell result objects, and asessionAbortController.session.trackEvalExecution?.(...)can wrap the whole run for external cancellation tracking. - It resolves the executor session id from
session.getEvalSessionId?.(), falling back todefaultEvalSessionId(session). Subagents inherit the parent's id so both sides share the same JS VM and Python kernel for each backend. - Cells execute sequentially within one eval tool call. For each cell,
execute():- clamps
cell.timeout ?? 30seconds throughclampTimeout("eval", ...) - builds a combined abort signal from the tool signal, the timeout, and the session abort controller
- marks the cell
runningand emits an update - calls the backend’s
execute()withcwd,sessionId,sessionFile,kernelOwnerId,deadlineMs,reset(defaults tofalse), artifact info, and chunk callback
- clamps
- JS cells dispatch through
packages/coding-agent/src/eval/js/index.tsintoexecuteJs(); Python cells dispatch throughpackages/coding-agent/src/eval/py/index.tsintoexecutePython(). - Backend text chunks stream into the shared
OutputSink; rich outputs are accumulated separately as JSON, images, markdown markers, and status events. - After each cell:
- text output is trimmed and stored on that cell result
- multi-cell runs prefix text with
[i/n]and the optional title - cancellations return early with
isError: trueand a cell-specific abort message - non-zero exit codes return early with
isError: trueand a message naming the failed cell - later cells are skipped after the first error, but earlier cell state persists in the underlying runtime
- On success, the tool joins all cell outputs, synthesizes
(no text output)or(no output)when needed, and attaches truncation metadata fromsummarizeFinal(). - The renderer uses
details.cells,details.jsonOutputs, anddetails.statusEventsto build notebook-style output.mergeCallAndResult = trueandinline = true, so call and result render together in the transcript.
Modes / Variants
Backend selection
Backend choice is explicit per cell — there is no auto-detection.
language: "py"→ Python (IPython/Jupyter) backendlanguage: "js"→ JavaScript VM backend
If the requested backend is disabled or unavailable, the tool throws ToolError for that cell. The caller chooses; the tool does not silently substitute.
JavaScript runtime
Implemented in packages/coding-agent/src/eval/js/worker-core.ts, packages/coding-agent/src/eval/js/shared/prelude.txt, and packages/coding-agent/src/eval/js/shared/helpers.ts.
- Persistent worker-backed VM sessions keyed by
js:${sessionId} reset: truecallsresetVmContext(sessionKey)before the cell executes; reset is destructive for all live runs on that JS session- Top-level
awaitand barereturnare supported by wrapping code in an async IIFE whenwrapCode()seesawaitorreturn - Top-level static
import ... from ...and dynamicimport(...)calls are routed throughrewriteImports(), which sends them via__omp_import__so the specifier resolves against the session cwd - Module cache is busted for local imports between cells so edits to source files are picked up without restarting the runtime.
__omp_import__deletesrequire.cache[absPath]before re-importing whenever the original specifier is a filesystem path: relative (./x,../x,.,..), POSIX-absolute (/...), home-prefixed (~/...), or Windows drive-letter (C:\.../C:/...). Bare specifiers (react,lodash/x) and URL/scheme specifiers (node:fs,file://...,https://...) are left in cache so package identity stays stable across cells. The cache-bust only fires when the resolved target is an absolute path — unresolved bare-package fallbacks (resolveImportSpecifier()returning the original specifier) skip it. - The prelude installs globals:
display,printread,write,append,sort,uniq,counter,diff,tree,env,outputtool.<name>(args)proxy for arbitrary session tool callsllm(prompt, opts?)for oneshot, stateless LLM calls (see Oneshot LLM helper below)agent(prompt, opts?)for a single subagent call, plus JS-onlyparallel()/pipeline()bounded-pool helpers (see Subagent helper below)
- JS helpers that touch the host/runtime boundary are async and
awaitable; pure text helpers (sort,uniq,counter) return synchronously but may still be safely awaited. - JS helper signatures use a trailing options object rather than Python keyword arguments:
await read(path, { offset?, limit? })await tree(path = ".", { maxDepth?, hidden? })sort(text, { reverse?, unique? }),uniq(text, { count? }),counter(items, { limit?, reverse? })await agent(prompt, { agentType?, model?, context?, label?, schema? })await parallel([() => agent("a"), () => agent("b")], { concurrency? })await pipeline(items, stage1, stage2, { concurrency? })
display(value)behavior:- plain objects/arrays become JSON outputs
{ type: "image", data, mimeType }becomes an image output- scalars become text
- The VM exposes a restricted
processsubset plusBuffer,fetch,Blob,File,Headers,Request,Response,fs,require, and browser-style globals - Concurrent runs on the same VM are not queued end-to-end. Synchronous JS still runs on the single event loop; awaited regions can interleave with sibling runs.
Python runtime
Implemented in packages/coding-agent/src/eval/py/executor.ts, packages/coding-agent/src/eval/py/kernel.ts, and packages/coding-agent/src/eval/py/prelude.py. See docs/python-repl.md for gateway and kernel details.
- Default mode is retained
sessionkernels keyed bypython:${sessionId} - Optional
python.kernelMode = "per-call"creates a fresh kernel for each cell and shuts it down afterward reset: truedisposes the retained kernel for that session before the cell runs; later Python cells in the same tool call reuse the fresh kernel- Startup path:
- availability check
- create/connect kernel
- initialize cwd / env /
sys.path - execute
PYTHON_PRELUDE
- Python cells run in the runner's persistent asyncio event loop, so top-level
awaitworks; the prompt warns not to useasyncio.run(...) - The Python prelude defines helpers with the same surface as JS where practical, including
tool.<name>(args),llm(...), andagent(...)through a per-run loopback bridge - Synchronous statement blocks run in the default executor with ContextVar state copied in; the GIL still serializes bytecode execution, but awaited regions can interleave with sibling cells
- Kernel
display_data/execute_resultmessages map to:application/x-omp-status→ status eventimage/png→ image outputapplication/json→ JSON outputtext/markdown→ markdown outputtext/plain→ text outputtext/html→ HTML converted to markdown withhtmlToBasicMarkdown()
- Interactive stdin is rejected:
input_requestsends an empty reply, marksstdinRequested, and the executor returns exit code1
Oneshot LLM helper (llm)
Both runtimes expose llm() — a single stateless completion against a model tier. It is intentionally minimal: no conversation history, no agent-visible tools, pure text in / text (or object) out. Implemented host-side in packages/coding-agent/src/eval/llm-bridge.ts and routed through the existing tool bridge under the reserved name __llm__.
- Signatures:
- JS:
await llm(prompt, { model?, system?, schema? }) - Python:
llm(prompt, *, model="default", system=None, schema=None)
- JS:
modelselects a tier (default"default"):"smol"→pi/smolrole (fast / cheap)"default"→ the session's active model, falling back to thepi/defaultrole"slow"→pi/slowrole; requests high reasoning effort only on reasoning-capable models
system(optional) supplies a system prompt.schema(optional) is a plain JSON-Schema object. When present, the model is forced to call a single syntheticrespondtool with that schema (loose, non-strict), and the helper returns the parsed object. When absent, the helper returns the completion string.- Errors surface as exceptions: unresolved tier, missing API key, an
error/abortedstop reason, or empty output each raise.
Subagent helper (agent)
Both runtimes expose agent() — a single subagent invocation routed through packages/coding-agent/src/eval/agent-bridge.ts into the same runSubprocess(...) path used by the task tool. It uses the current eval session's spawn policy and inherits the parent eval executor id, so parent and subagent code share JS/Python runtime state.
- Signatures:
- JS:
await agent(prompt, { agentType?, model?, context?, label?, schema? }) - Python:
agent(prompt, *, agent_type="task", model=None, context=None, label=None, schema=None)
- JS:
agentType/agent_typedefaults to the bundledtaskagent and resolves through normal agent discovery, so project and user agents work.modeloverrides the selected agent's model. Without it, normal per-agent settings and the agent frontmatter model apply.contextsupplies shared background;labelcontrols theagent://<id>output label prefix.schemapasses a JSON Schema to the subagent structured-output path. When present, the helper parses the final JSON text and returns an object.- Spawn restrictions use
session.getSessionSpawns()exactly like thetasktool. Eval-driven subagent recursion is capped at depth 3. - JS also exposes
parallel(thunks, { concurrency })andpipeline(items, ...stages, { concurrency }); both use a bounded async pool with default concurrency 4, max 16, preserve item order, and propagate rejections. - Errors surface as exceptions: unknown or disabled agent, disallowed spawn, recursion cap, subagent failure, or invalid structured output all fail the eval cell.
Multi-language call behavior
A single tool call can mix Python and JS cells. Persistence is per language runtime:
reset: trueon a Python cell does not touch JS statereset: trueon a JS cell does not touch Python state- each backend keeps its own retained session keyed from the same session-derived ID
Side Effects
- Filesystem
- JS/Python prelude helpers can read, write, append, diff, and traverse filesystem paths under the session cwd or absolute paths.
- JS helper
read()rejects protocol URIs (://) and directory paths; usetool.read(...)for internal URLs or reader-mode behavior. - Output may spill to an artifact file via
OutputSink.
- Network
- Python backend speaks NDJSON to a local
python3subprocess over stdin/stdout (no network). - JS runtime exposes
fetchandtool.<name>(); those tools may perform additional network I/O.
- Python backend speaks NDJSON to a local
- Subprocesses / native bindings
- Python availability check runs
<python> -c .... - Python backend spawns one
python -u runner.pysubprocess per kernel; cancellation sendsSIGINT. Details indocs/python-repl.md. agent()runs one in-process subagent via the task executor; that subagent may use its configured tools.
- Python availability check runs
- Session state
session.assertEvalExecutionAllowed?.()can block execution.session.trackEvalExecution?.(...)can register cancellable eval work.session.getSessionFile?.(),session.getEvalSessionId?.(), andsession.getEvalKernelOwnerId?.()influence VM/kernel reuse and artifact lookup.- JS VM contexts persist across eval calls until reset/disposal.
- Python retained kernels persist until reset, owner cleanup, or process exit.
agent()allocatesagent://<id>output artifacts and reuses the parent's eval executor id.
- User-visible prompts / interactive UI
- none; stdin requests are rejected programmatically
- Background work / cancellation
- Python retained kernels have heartbeat and idle cleanup timers.
- Cancellation hard-kills/resets the shared executor for that backend: JS terminates the worker, Python sends SIGINT and may escalate to subprocess shutdown.
Limits & Caps
- Per-cell timeout default: 30s (applied when
timeoutis omitted inEvalTool.execute(); clamped throughTOOL_TIMEOUTS.eval.defaultinpackages/coding-agent/src/tools/tool-timeouts.ts) - Schema-level
timeoutrange: integer1..600seconds (enforced by Zod on the cell schema) - Timeout clamp at runtime: 1s minimum, 600s maximum (
TOOL_TIMEOUTS.evalinpackages/coding-agent/src/tools/tool-timeouts.ts) - Transcript code/output preview: 10 lines by default (
EVAL_DEFAULT_PREVIEW_LINESinpackages/coding-agent/src/tools/eval.ts) - Output truncation window: 50KB default (
DEFAULT_MAX_BYTESinpackages/coding-agent/src/session/streaming-output.ts) - Output line cap inside truncation helpers: 3000 lines (
DEFAULT_MAX_LINESinpackages/coding-agent/src/session/streaming-output.ts) - Streaming tail buffer for live updates:
DEFAULT_MAX_BYTES * 2= 100KB (packages/coding-agent/src/tools/eval.ts) - JS
parallel()/pipeline()helper concurrency default: 4; maximum: 16 - Eval-driven
agent()recursion cap: task depth 3 (EVAL_AGENT_MAX_DEPTH) - Python retained kernel idle timeout: 5 minutes (
IDLE_TIMEOUT_MSinpackages/coding-agent/src/eval/py/executor.ts) - Python retained kernel cap: 4 sessions (
MAX_KERNEL_SESSIONSinpackages/coding-agent/src/eval/py/executor.ts) - Python retained kernel cleanup sweep: every 30s (
CLEANUP_INTERVAL_MSinpackages/coding-agent/src/eval/py/executor.ts) - Python owner-cleanup shutdown wait: 2000ms (
OWNER_CLEANUP_KERNEL_SHUTDOWN_TIMEOUT_MSinpackages/coding-agent/src/eval/py/executor.ts) - Python heartbeat interval: 5s (
ensureKernelHeartbeat()inpackages/coding-agent/src/eval/py/executor.ts) - Python external gateway availability check timeout: 5s (
AbortSignal.timeout(5000)inpackages/coding-agent/src/eval/py/kernel.ts) - Python auto-restart budget: one restart per retained session before hard failure (
restartCount > 1inpackages/coding-agent/src/eval/py/executor.ts)
Errors
- Zod validation rejects malformed
cellsarrays beforeexecute()runs (missinglanguage/code, out-of-rangetimeout, emptycells). - Missing session without proxy executor throws
ToolError("Eval tool requires a session when not using proxy executor"). - Disabled/unavailable backends throw
ToolErrorfromresolveBackend():eval.py = falseand apycell is requestedeval.js = falseand ajscell is requested- Python kernel unavailable and a
pycell is requested
- JS runtime exceptions are converted into text output plus
exitCode: 1; cancellations returncancelled: trueand may appendCommand timed out. - Python execution errors from the kernel become text output and
exitCode: 1; later cells are skipped. - Python stdin requests are treated as errors with the message
Kernel requested stdin; interactive input is not supported. - Cancellation is returned, not thrown, once backend execution has started. The tool formats it as a cell failure and sets
details.isError = true. - If output truncates, the tool still succeeds; truncation is surfaced through
details.metaand artifact-backed full output when available.
Shared executor trade-offs
- Parent agents and subagents share eval state bidirectionally when a subagent inherits the parent's executor id. Mutations in either direction are visible to the other participant.
- Async regions of concurrent runs can interleave. Synchronous JS still blocks the VM event loop; synchronous Python still contends on the GIL.
- Cancelling one run is destructive to the shared backend executor. This is intentional: JS worker termination and Python SIGINT/subprocess shutdown are the only reliable way to interrupt arbitrary user code.
reset: trueis destructive for every live run on that backend session id. New starts on that backend are rejected while reset is in flight.
Notes
- Backend selection is strictly explicit per cell:
languagemust be"py"or"js". The previous*** Cellheader parser, theeval.larkconstrained grammar, and the sniffer-based fallback have all been removed. EvalTool.customFormatno longer exists. Tool calls flow through the standard JSON schema; there is no Lark-constrained sampling path.tool.<name>()exists in both JS and Python. Python calls route through a per-run loopback bridge keyed by the current cell id.- JS helper paths reject protocol URIs (
://) inresolveRegularFile()forread(), and resolve other paths against the session cwd or absolute filesystem path. Usetool.read(...)or another tool explicitly for internal URLs. - Python helper
output(...)depends onPI_ARTIFACTS_DIRorPI_SESSION_FILE; it fails outside a session-backed run. display()can produce text and structured outputs from the same value; the renderer prefers markdown overtext/plainwhen both exist.- JS static imports are rewritten only at top level. Nested imports stay invalid and surface normal JS syntax/runtime errors.
EvalToolisconcurrency = "exclusive"within one agent session, but parent and subagent sessions can run eval concurrently when they share an inherited executor id.- The tool description shown to the model is templated by backend availability (
getEvalToolDescription()); if Python is unavailable, the prompt omits Python-specific instructions.