1dba122c创建于 1 天前历史提交

Non-compaction auto-retry policy

This document describes the standard API-error retry path in AgentSession.

It explicitly excludes context-overflow recovery via auto-compaction. Overflow is handled by compaction logic and is documented separately in compaction.md.

Implementation files

Scope boundary vs compaction

Retry and compaction are checked from the same agent_end path, but they are intentionally separated:

  1. agent_end inspects the last assistant message.
  2. #isRetryableError(...) runs first.
  3. If retry is initiated, compaction checks are skipped for that turn.
  4. Context-overflow errors are hard-excluded from retry classification (isContextOverflow(...) short-circuits retry).
  5. Overflow therefore falls through to #checkCompaction(...) instead of standard retry.

So: overload/rate/server/network-style failures use this retry policy; context-window overflow uses compaction recovery.

Retry classification

#isRetryableError(...) requires all of the following:

  • assistant stopReason === "error"
  • errorMessage exists
  • message is not context overflow
  • errorMessage matches transient transport/envelope patterns or isUsageLimitError(...)

Current retryable inputs are regex/string-classified:

  • transient transport/envelope failures, including Anthropic stream-envelope failures before message_start
  • overloaded/provider-returned-error wording
  • rate limit / usage limit / too many requests
  • HTTP-like server classes: 429, 500, 502, 503, 504
  • service unavailable / server/internal error
  • provider-suggested retry wording, including OpenAI retry your request failures
  • network/connection/socket failures, refused/closed connections, upstream connect/reset-before-headers, socket hang up, timeout/timed out, fetch failed, terminated, retry delay wording, and unexpected socket close messages

This is string-pattern classification, not typed provider error codes.

Retry lifecycle and state transitions

Session state used by retry:

  • #retryAttempt: number (0 means idle)
  • #retryPromise: Promise<void> | undefined (tracks in-progress retry lifecycle)
  • #retryResolve: (() => void) | undefined (resolves #retryPromise)
  • #retryAbortController: AbortController | undefined (cancels backoff sleep)

Flow (#handleRetryableError):

  1. Read retry settings group.
  2. If retry.enabled === false, stop immediately (false, no retry started).
  3. Increment #retryAttempt.
  4. Create #retryPromise once (first attempt in a chain).
  5. If attempt exceeded retry.maxRetries, emit final failure event and stop.
  6. Compute base delay: retry.baseDelayMs * 2^(attempt-1).
  7. For usage-limit errors, parse retry hints and call auth storage (markUsageLimitReached(...)); if credential switching succeeds, force delay to 0, otherwise use a larger retry-after/backoff hint when present.
  8. If no credential switch occurred, suppress the current model selector for cooldown, try configured retry model fallback chains, and force delay to 0 on model switch.
  9. If the final delay exceeds retry.maxDelayMs and no credential/model switch happened, emit final failure and do not sleep.
  10. Emit auto_retry_start.
  11. Remove the trailing assistant error message from agent runtime state (kept in persisted session history).
  12. Sleep with abort support.
  13. Schedule agent.continue() through the post-prompt task scheduler (delayMs: 1) for the same prompt generation.

What resets retry counters

#retryAttempt resets to 0 in these cases:

  • first successful non-error, non-aborted assistant message after retries started (emits auto_retry_end { success: true })
  • retry cancellation during backoff sleep
  • max retries exceeded path
  • max delay exceeded path

#retryPromise resolves/clears when retry chain ends (success, cancellation, max-exceeded, or max-delay failure), via #resolveRetry().

Backoff and max-attempt semantics

Settings:

  • retry.enabled (default true)
  • retry.maxRetries (default 3)
  • retry.baseDelayMs (default 2000)
  • retry.maxDelayMs (default 300000, 5 minutes; <= 0 disables the fail-fast cap)

Attempt numbering:

  • attempt counter is incremented before max-check
  • start events use current attempt (1-based)
  • max-exceeded end event reports attempt: this.#retryAttempt - 1 (last attempted retry count)

Backoff sequence with default settings:

  • attempt 1: 2000 ms
  • attempt 2: 4000 ms
  • attempt 3: 8000 ms

Delay override inputs can come from parsed retry headers (retry-after-ms, retry-after, x-ratelimit-reset-ms, x-ratelimit-reset) or usage-limit backoff. Credential/model fallback switches set delay to 0; otherwise parsed hints can extend the exponential local delay. If the computed delay is greater than retry.maxDelayMs and no switch succeeded, retry ends immediately with a final error instead of sleeping.

Abort mechanics

Explicit retry abort

abortRetry():

  • aborts #retryAbortController (if present)
  • resolves retry promise (#resolveRetry()) so awaiters are unblocked

If abort hits while sleeping, catch path emits:

  • auto_retry_end { success: false, finalError: "Retry cancelled" }
  • resets attempt/controller

Global operation abort interaction

abort() calls abortRetry() before aborting the active agent stream. This guarantees retry backoff is cancelled when user issues a general abort.

TUI interaction

On auto_retry_start, EventController:

  • swaps Esc handler to session.abortRetry()
  • renders loader text: Retrying (attempt/maxAttempts) in Ns… (esc to cancel)

On auto_retry_end, it restores prior Esc handler and clears loader state.

Streaming and prompt completion behavior

prompt() ultimately waits on #waitForRetry() after agent.prompt(...) returns.

Effect:

  • a prompt call does not fully resolve until any started retry chain finishes (success/failure/cancel)
  • retry lifecycle is part of one logical prompt execution boundary

This prevents callers from treating a retrying turn as complete too early.

Controls: settings and RPC

Configuration knobs

Defined in settings schema under retry group:

  • retry.enabled
  • retry.maxRetries
  • retry.baseDelayMs
  • retry.maxDelayMs
  • retry.fallbackChains
  • retry.fallbackRevertPolicy ("cooldown-expiry" by default; "never" disables automatic restoration)

Programmatic toggles in session:

  • setAutoRetryEnabled(enabled) writes retry.enabled
  • autoRetryEnabled reads retry.enabled
  • isRetrying reports whether retry lifecycle promise is active

RPC controls

RPC command surface:

  • set_auto_retrysession.setAutoRetryEnabled(command.enabled)
  • abort_retrysession.abortRetry()

Client helpers:

  • RpcClient.setAutoRetry(enabled)
  • RpcClient.abortRetry()

Both commands return success responses; retry progress/failure details come from streamed session events, not command response payloads.

Event emission and failure surfacing

Session-level retry events:

  • auto_retry_start { attempt, maxAttempts, delayMs, errorMessage }
  • auto_retry_end { success, attempt, finalError? }
  • retry_fallback_applied { from, to, role }
  • retry_fallback_succeeded { model, role }

Propagation:

  • emitted through AgentSession.subscribe(...)
  • forwarded to extension runner as extension events
  • in RPC mode, forwarded directly as JSON event objects (session.subscribe(event => output(event)))
  • in TUI, consumed by EventController for loader/error UI

Final failure surfacing:

  • On max-exceeded, max-delay failure, or cancellation, auto_retry_end.success === false
  • TUI shows: Retry failed after N attempts: <finalError>
  • Extensions/hooks receive auto_retry_end with same fields
  • RPC consumers receive same event object on stdout stream

Permanent stop conditions

Retry stops and will not auto-continue when any of these occur:

  • retry.enabled is false
  • error is not retry-classified
  • error is context overflow (delegated to compaction path)
  • max retries exceeded
  • provider-requested delay exceeds retry.maxDelayMs and no credential/model switch is available
  • user cancels retry (abort_retry or Esc during retry loader)
  • global abort (abort) cancels retry first

A new retry chain can still start later on a future retryable error after counters reset.

Operational caveats

  • Classification is regex text matching; provider-specific structured errors are not used here.
  • Retry strips the failing assistant error from runtime context before re-continue, but session history still keeps that error entry.
  • RpcSessionState currently exposes autoCompactionEnabled but not an autoRetryEnabled field; RPC callers must track their own toggle state or query settings through other APIs.
  • Model fallback changes append temporary model_change entries and may later restore the primary model when its cooldown expires, depending on retry.fallbackRevertPolicy.