Non-compaction auto-retry policy

This document describes the standard API-error retry path in AgentSession.

It explicitly excludes context-overflow recovery via auto-compaction. Overflow is handled by compaction logic and is documented separately in compaction.md.

Implementation files

Scope boundary vs compaction

Retry and compaction are checked from the same agent_end path, but they are intentionally separated:

agent_end inspects the last assistant message.
#isRetryableError(...) runs first.
If retry is initiated, compaction checks are skipped for that turn.
Context-overflow errors are hard-excluded from retry classification (isContextOverflow(...) short-circuits retry).
Overflow therefore falls through to #checkCompaction(...) instead of standard retry.

So: overload/rate/server/network-style failures use this retry policy; context-window overflow uses compaction recovery.

Retry classification

#isRetryableError(...) requires all of the following:

assistant stopReason === "error"
errorMessage exists
message is not context overflow
errorMessage matches transient transport/envelope patterns or isUsageLimitError(...)

Current retryable inputs are regex/string-classified:

transient transport/envelope failures, including Anthropic stream-envelope failures before message_start
overloaded/provider-returned-error wording
rate limit / usage limit / too many requests
HTTP-like server classes: 429, 500, 502, 503, 504
service unavailable / server/internal error
provider-suggested retry wording, including OpenAI retry your request failures
network/connection/socket failures, refused/closed connections, upstream connect/reset-before-headers, socket hang up, timeout/timed out, fetch failed, terminated, retry delay wording, and unexpected socket close messages

This is string-pattern classification, not typed provider error codes.

Retry lifecycle and state transitions

Session state used by retry:

#retryAttempt: number (0 means idle)
#retryPromise: Promise<void> | undefined (tracks in-progress retry lifecycle)
#retryResolve: (() => void) | undefined (resolves #retryPromise)
#retryAbortController: AbortController | undefined (cancels backoff sleep)

Flow (#handleRetryableError):

Read retry settings group.
If retry.enabled === false, stop immediately (false, no retry started).
Increment #retryAttempt.
Create #retryPromise once (first attempt in a chain).
If attempt exceeded retry.maxRetries, emit final failure event and stop.
Compute base delay: retry.baseDelayMs * 2^(attempt-1).
For usage-limit errors, parse retry hints and call auth storage (markUsageLimitReached(...)); if credential switching succeeds, force delay to 0, otherwise use a larger retry-after/backoff hint when present.
If no credential switch occurred, suppress the current model selector for cooldown, try configured retry model fallback chains, and force delay to 0 on model switch.
If the final delay exceeds retry.maxDelayMs and no credential/model switch happened, emit final failure and do not sleep.
Emit auto_retry_start.
Remove the trailing assistant error message from agent runtime state (kept in persisted session history).
Sleep with abort support.
Schedule agent.continue() through the post-prompt task scheduler (delayMs: 1) for the same prompt generation.

What resets retry counters

#retryAttempt resets to 0 in these cases:

first successful non-error, non-aborted assistant message after retries started (emits auto_retry_end { success: true })
retry cancellation during backoff sleep
max retries exceeded path
max delay exceeded path

#retryPromise resolves/clears when retry chain ends (success, cancellation, max-exceeded, or max-delay failure), via #resolveRetry().

Backoff and max-attempt semantics

Settings:

retry.enabled (default true)
retry.maxRetries (default 3)
retry.baseDelayMs (default 2000)
retry.maxDelayMs (default 300000, 5 minutes; <= 0 disables the fail-fast cap)

Attempt numbering:

attempt counter is incremented before max-check
start events use current attempt (1-based)
max-exceeded end event reports attempt: this.#retryAttempt - 1 (last attempted retry count)

Backoff sequence with default settings:

attempt 1: 2000 ms
attempt 2: 4000 ms
attempt 3: 8000 ms

Delay override inputs can come from parsed retry headers (retry-after-ms, retry-after, x-ratelimit-reset-ms, x-ratelimit-reset) or usage-limit backoff. Credential/model fallback switches set delay to 0; otherwise parsed hints can extend the exponential local delay. If the computed delay is greater than retry.maxDelayMs and no switch succeeded, retry ends immediately with a final error instead of sleeping.

Abort mechanics

Explicit retry abort

abortRetry():

aborts #retryAbortController (if present)
resolves retry promise (#resolveRetry()) so awaiters are unblocked

If abort hits while sleeping, catch path emits:

auto_retry_end { success: false, finalError: "Retry cancelled" }
resets attempt/controller

Global operation abort interaction

abort() calls abortRetry() before aborting the active agent stream. This guarantees retry backoff is cancelled when user issues a general abort.

TUI interaction

On auto_retry_start, EventController:

swaps Esc handler to session.abortRetry()
renders loader text: Retrying (attempt/maxAttempts) in Ns… (esc to cancel)

On auto_retry_end, it restores prior Esc handler and clears loader state.

Streaming and prompt completion behavior

prompt() ultimately waits on #waitForRetry() after agent.prompt(...) returns.

Effect:

a prompt call does not fully resolve until any started retry chain finishes (success/failure/cancel)
retry lifecycle is part of one logical prompt execution boundary

This prevents callers from treating a retrying turn as complete too early.

Controls: settings and RPC

Configuration knobs

Defined in settings schema under retry group:

retry.enabled
retry.maxRetries
retry.baseDelayMs
retry.maxDelayMs
retry.fallbackChains
retry.fallbackRevertPolicy ("cooldown-expiry" by default; "never" disables automatic restoration)

Programmatic toggles in session:

setAutoRetryEnabled(enabled) writes retry.enabled
autoRetryEnabled reads retry.enabled
isRetrying reports whether retry lifecycle promise is active

RPC controls

RPC command surface:

set_auto_retry → session.setAutoRetryEnabled(command.enabled)
abort_retry → session.abortRetry()

Client helpers:

RpcClient.setAutoRetry(enabled)
RpcClient.abortRetry()

Both commands return success responses; retry progress/failure details come from streamed session events, not command response payloads.

Event emission and failure surfacing

Session-level retry events:

auto_retry_start { attempt, maxAttempts, delayMs, errorMessage }
auto_retry_end { success, attempt, finalError? }
retry_fallback_applied { from, to, role }
retry_fallback_succeeded { model, role }

Propagation:

emitted through AgentSession.subscribe(...)
forwarded to extension runner as extension events
in RPC mode, forwarded directly as JSON event objects (session.subscribe(event => output(event)))
in TUI, consumed by EventController for loader/error UI

Final failure surfacing:

On max-exceeded, max-delay failure, or cancellation, auto_retry_end.success === false
TUI shows: Retry failed after N attempts: <finalError>
Extensions/hooks receive auto_retry_end with same fields
RPC consumers receive same event object on stdout stream

Permanent stop conditions

Retry stops and will not auto-continue when any of these occur:

retry.enabled is false
error is not retry-classified
error is context overflow (delegated to compaction path)
max retries exceeded
provider-requested delay exceeds retry.maxDelayMs and no credential/model switch is available
user cancels retry (abort_retry or Esc during retry loader)
global abort (abort) cancels retry first

A new retry chain can still start later on a future retryable error after counters reset.

Operational caveats

Classification is regex text matching; provider-specific structured errors are not used here.
Retry strips the failing assistant error from runtime context before re-continue, but session history still keeps that error entry.
RpcSessionState currently exposes autoCompactionEnabled but not an autoRetryEnabled field; RPC callers must track their own toggle state or query settings through other APIs.
Model fallback changes append temporary model_change entries and may later restore the primary model when its cooldown expires, depending on retry.fallbackRevertPolicy.