id: gateway-startup-diagnostics title: Gateway Startup Diagnostics type: runtime-diagnostics ownedPaths:
- electron/gateway/**
- electron/utils/openclaw-auth.ts
- electron/utils/paths.ts
- src/stores/gateway.ts
- src/pages/Dreams/** requiredProfiles:
- fast
- comms requiredRules:
- gateway-readiness-policy
- renderer-main-boundary
- backend-communication-boundary
- api-client-transport-policy
- comms-regression
- docs-sync
Use this spec when ClawX shows the Gateway as starting/running but UI data does not refresh, Dreams cannot load, or Gateway RPC calls time out after a restart.
ClawX should prefer OpenClaw-native signals over stderr string matching:
system-presenceproves the core RPC router is serving requests.healthprovides the Gateway health snapshot; use cachedprobe:falsefirst.statusprovides presence, health, stateVersion, uptime, and session defaults.channels.statusis the channel capability signal.doctor.memory.statusis the memory/dreams capability signal.gateway.ready,health, andpresenceevents should update ClawX's main-process capability cache.
stderr is supporting evidence only. It should not be the primary source for deciding whether the Gateway is ready, blocked, or should be restarted.
Failure Shape
Treat these as the same incident family until proven otherwise:
Gateway ready fallback triggered; probing RPC router before marking readyGateway ready fallback RPC router probe failed: RPC timeout: system-presence[gateway:rpc] doctor.memory.status failed[gateway:rpc] doctor.memory.dreamDiary failedchat.history unavailable during gateway startup- Port
18789is listening, but Gateway HTTP or WebSocket RPC does not return.
Important distinction:
- Port ready only means the process is listening.
- Handshake ready only means ClawX connected to the Gateway socket.
- RPC ready means a cheap call such as
system-presencesucceeds.
UI features that depend on Gateway runtime data must prefer RPC-ready evidence over port-ready evidence.
Capability failures are not Gateway core failures:
doctor.memory.statustimeout means memory capability degraded untilsystem-presencealso fails.channels.statustimeout means channel capability degraded untilsystem-presencealso fails.- dreams cron unavailable, missing memory files, stale session keys, or provider credential errors do not trigger Gateway restart by themselves.
Fast Triage
- Confirm the process and ports:
lsof -nP -iTCP:18789 -sTCP:LISTEN || true
lsof -nP -iTCP:5173 -sTCP:LISTEN || true
- Read recent ClawX logs:
tail -n 160 "$HOME/Library/Application Support/clawx/logs/clawx-$(date +%F).log"
- Probe OpenClaw-native signals in this order. Redirect output for memory-related calls because successful responses may contain user data:
pnpm exec openclaw gateway call system-presence >/tmp/clawx-system-presence.json
pnpm exec openclaw gateway call health --params '{"probe":false}' >/tmp/clawx-health.json
pnpm exec openclaw gateway call status >/tmp/clawx-status.json
pnpm exec openclaw gateway call channels.status --params '{"probe":false}' >/tmp/clawx-channels-status.json
pnpm exec openclaw gateway call doctor.memory.status >/tmp/clawx-memory-status.json
pnpm exec openclaw gateway call doctor.memory.dreamDiary >/tmp/clawx-dream-diary.json
- Only if port is listening and the core RPC probe (
system-presence) times out, agree on the sampling scope, then sample the Gateway process on macOS:
sample <gateway-pid> 3 -mayDie >/tmp/clawx-gateway.sample.txt
Look for heavy main-thread stacks around uv_fs_open, uv_fs_scandir, open, read, write, mkdir, or repeated plugin/skill initialization frames. This usually means the Gateway event loop is busy with synchronous file work and cannot service RPCs yet.
Sampling Coordination Protocol
Sampling is diagnostic work against a live local process. Coordinate it explicitly when the user is depending on the current Gateway for active work.
Before sampling, state:
- Which process will be sampled, including PID and why it is believed to be the Gateway.
- The sampling command and duration. Default to
sample <pid> 3 -mayDie; increase duration only after explaining why. - Expected impact. A short macOS sample is read-only and usually low impact, but it can produce a large file and may briefly add system load.
- Where the artifact will be written, usually
/tmp/clawx-gateway.sample.txt. - What will be inspected and what will not be shared verbatim.
Do not proceed without explicit user agreement when:
- Sampling a process that is not clearly the ClawX-owned Gateway child.
- Increasing sample duration above 5 seconds or repeating samples many times.
- Collecting process environment, open files, memory dumps, trace archives, or any artifact likely to contain secrets.
- Killing, restarting, or force-cleaning Gateway while active tasks, cron jobs, or user-visible work may be running.
Safe-by-default commands:
lsof -nP -iTCP:18789 -sTCP:LISTEN || true
ps -axo pid,ppid,etime,command | rg -i "openclaw-gateway|Electron|vite" | rg -v "rg -i"
sample <gateway-pid> 3 -mayDie >/tmp/clawx-gateway.sample.txt
Avoid by default:
env,/proc/<pid>/environ, or full process command dumps that include tokens.- Full memory dumps.
- Pasting raw
doctor.memory.*output. - Pasting complete sample files when only top stack signatures are needed.
When analyzing a sample, report a compact summary:
- Main-thread state: idle, synchronous fs I/O, network connect, CPU-bound JS, or unknown.
- Dominant stack signature, such as
uv_fs_openunder plugin runtime setup. - Whether the finding points to ClawX-owned prelaunch cleanup, OpenClaw runtime startup cost, active user work, or inconclusive data.
- Recommended next action and whether it requires another user approval.
Known Causes
Stale Runtime Dependency Cache
Symptoms:
- Sample shows many synchronous
opencalls while plugin runtime setup is running. ~/.openclaw/plugin-runtime-deps/openclaw-*contains symlink trees pointing at an old worktree or oldnode_modules/openclaw.- Startup takes much longer than expected before RPC router becomes responsive.
Expected mitigation:
cleanupStalePluginRuntimeDeps()runs before Gateway launch.- It removes only immediate
openclaw-*cache roots when symlinks inside point at an OpenClaw package path outside the current bundled package. - It must not remove arbitrary third-party plugin caches.
Over-Broad Plugin Allowlist
Symptoms:
plugins.allowcontains many provider or media plugins that are not configured or active.- Gateway mirrors or loads more runtime plugin roots than the current user setup needs.
Expected mitigation:
- Preserve external plugins that are installed, configured in
plugins.entries, or loaded throughplugins.load/plugins.load.paths. - Preserve configured bundled plugins, active provider plugins, and core runtime plugins such as
browser,acpx,device-pair, andmemory-core. - Do not re-add optional provider-like bundled plugins such as
alibaba,deepgram,elevenlabs,groq,microsoft,phone-control,runway,talk-voice, orvoyageunless configured or active.
Escaped Skill Symlinks
Symptoms:
- Logs repeatedly show
Skipping escaped skill path outside its configured root. - A managed root such as
~/.openclaw/skillscontains symlinks whose realpath points outside that root.
Expected mitigation:
cleanupAgentsSymlinkedSkills()removes symlinks under OpenClaw managed skill roots whose realpath escapes the same root.- Real directories and symlinks that stay inside the managed skills root must be preserved.
- This cleanup is safe because the hardened OpenClaw loader would reject those escaped entries anyway.
Startup Work Competing With RPC
Symptoms:
- Gateway handshake completes, but
system-presence,chat.history, ordoctor.memory.*times out during the first minutes. - Logs mention cron repair, channel account checks, session lock cleanup, memory-core cron reconciliation, or active embedded/task runs.
Expected behavior:
- Do not mark Gateway fully ready from a pure timer fallback.
- The fallback must probe
system-presencebefore emitting ready. - Heartbeat recovery may defer restart during the initial grace window, but it should not loop restart while the Gateway is still performing startup work.
Capability Degraded But Core Alive
Symptoms:
system-presence,health, orstatussucceeds.doctor.memory.status,doctor.memory.dreamDiary, orchannels.statustimes out.- stderr may mention dreams cron unavailable, missing memory files, stale session keys, or credentials provider errors.
Expected behavior:
- Keep global Gateway state based on process, transport, and core RPC readiness.
- Mark only the relevant capability as degraded.
- Do not restart Gateway automatically.
- Let the user retry the capability probe or fix provider/channel credentials.
Restart Deferral By Active Work
Symptoms:
- Logs mention restart deferral because operations, embedded runs, or task runs are still active.
- A restart takes minutes even though the process is otherwise alive.
Expected handling:
- Explain to users that restart cost is dominated by active Gateway work, not by ClawX UI rendering.
- Avoid triggering full Gateway restart for feature toggles when a narrower config reload or plugin RPC is available.
Remediation Order
- Avoid renderer-side transport workarounds. Renderer code must continue to use
host-api/api-client. - Preserve
~/.openclaw/openclaw.json; never replace it with a skeleton on parse errors. - Run the startup sanitizers and cleanup hooks locally:
pnpm exec tsx -e "import { sanitizeOpenClawConfig } from './electron/utils/openclaw-auth.ts'; import { cleanupAgentsSymlinkedSkills, cleanupStalePluginRuntimeDeps } from './electron/gateway/skills-symlink-cleanup.ts'; sanitizeOpenClawConfig().then(() => { console.log(cleanupAgentsSymlinkedSkills()); console.log(cleanupStalePluginRuntimeDeps()); });"
- Restart the app or Gateway and watch for the startup metric:
[metric] gateway.startup {
"configSyncMs": ...,
"spawnToReadyMs": ...,
"readyToConnectMs": ...,
"totalMs": ...
}
- Confirm core RPC readiness:
pnpm exec openclaw gateway call system-presence >/tmp/clawx-system-presence.json
- Confirm cached OpenClaw health before deeper probes:
pnpm exec openclaw gateway call health --params '{"probe":false}' >/tmp/clawx-health.json
pnpm exec openclaw gateway call status >/tmp/clawx-status.json
- Only after
system-presencesucceeds, verify feature-specific RPCs such as Dreams, memory doctor calls, or channel probes.
Acceptance Criteria
- Gateway starts without restart loops.
configSyncMsstays small relative to total startup time.system-presencesucceeds after startup settles.healthandstatusare captured in Gateway diagnostics when available.- Dreams page can refresh once the Gateway process is running and RPC-ready.
doctor.memory.statusanddoctor.memory.dreamDiaryreturn when Dreams is enabled.doctor.memory.*andchannels.statusfailures degrade their capability only and do not trigger Gateway restart.- Logs no longer repeat stale runtime cache or escaped managed-skill symlink warnings for entries ClawX can safely clean.
Required Regression Coverage
For fixes in this area, run:
pnpm run typecheck
pnpm run lint:check
pnpm exec vitest run tests/unit/openclaw-auth.test.ts tests/unit/skills-symlink-cleanup.test.ts tests/unit/gateway-manager-heartbeat.test.ts tests/unit/gateway-ready-fallback.test.ts
pnpm exec playwright test tests/e2e/openclaw-dreams.spec.ts
pnpm run build:vite
If the change touches Gateway send/receive, fallback, readiness, or chat history, also run:
pnpm run comms:replay
pnpm run comms:compare
Reporting Notes
When sharing findings:
- Quote log patterns and timing metrics, not full memory doctor output.
- Redact tokens, account identifiers, device IDs, and channel recipients.
- State whether the failure is port readiness, handshake readiness, or RPC readiness.
- Separate ClawX-owned cleanup issues from OpenClaw runtime initialization cost.