perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790)
gateway/platforms/__init__.py eagerly imported QQAdapter and
YuanbaoAdapter at package-init time, which transitively pulled in
qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's
websocket stack. About 84 ms wall and 23 MB RSS on every fresh process
that touched anything under gateway.platforms — including `hermes
chat` (via run_agent → cli's plugin discovery transitive import).
Nothing in the codebase actually consumes these symbols from the
package root; every real call site uses the long-form path
(from gateway.platforms.qqbot import QQAdapter,
from gateway.platforms.yuanbao import YuanbaoAdapter in gateway/run.py).
The eager re-export was only there for convenience.
Replace with a PEP 562 module-level __getattr__ that lazily imports
on first attribute access. Public API stays identical:
from gateway.platforms import QQAdapter keeps working but only
pays the import cost when the symbol is actually touched. __dir__
preserves help() / autocomplete behavior.
Measured impact (7-run medians, 9950X3D):
import gateway.platforms 127 → 43 ms (-66%)
50 → 27 MB (-46%)
import gateway.platforms.base 127 → 44 ms (-65%)
50 → 27 MB (-46%)
import cli (full chat path) 745 → 710 ms ( -5%)
96 → 90 MB ( -6%)
hermes chat -q (cold) -5 MB
The per-import win is biggest because qqbot/yuanbao deps don't overlap
with anything on the gateway-platforms path — full import cli
already loads aiohttp/websockets transitively from other places, so
the marginal CLI win is smaller than the isolated import benchmark.
The gateway.platforms.base win is what matters most for long-lived
gateway processes: every gateway boot saves 23 MB resident.
All 144 qqbot tests pass; broader gateway suite (5132 tests) passes
modulo 4 pre-existing flakes also failing on main without this change.
fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451)
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.
1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
Every long-lived platform adapter now constructs httpx.AsyncClient
with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
compound to the 256-fd ulimit. Tunable via
HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.
2. whatsapp.send_typing aiohttp leak. The call was
'await self._http_session.post(...)' with no 'async with' and no
variable capture — the ClientResponse went out of scope unclosed,
holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
'async with'. This was the only bare-await aiohttp leak in the
gateway/tools/plugins tree per audit; all other aiohttp sites use
the context-manager pattern correctly.
The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(dingtalk): transcribe native voice notes
Sibling fix to PR #28918 (Discord voice notes). DingTalk's rich-text
"voice" item type is its native voice-message format, but the adapter
was routing it to MessageType.AUDIO — which gateway/run.py:7605 skips
for STT. The docs claim every voice-capable platform auto-transcribes,
so this brings DingTalk in line.
Generic audio uploads (mapped to "file" by DINGTALK_TYPE_MAPPING) are
unchanged — they were already classified as DOCUMENT, not AUDIO.
Adds tests/gateway/test_dingtalk.py::TestExtractMedia covering both the
voice path and the audio-passthrough invariant.
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(async): close unscheduled coroutines in all threadsafe bridges (#26584)
Wraps every sync->async coroutine-scheduling site in the codebase with a
new agent.async_utils.safe_schedule_threadsafe() helper that closes the
coroutine on scheduling failure (closed loop, shutdown race, etc.)
instead of leaking it as 'coroutine was never awaited' RuntimeWarnings
plus reference leaks.
22 production call sites migrated across the codebase:
- acp_adapter/events.py, acp_adapter/permissions.py
- agent/lsp/manager.py
- cron/scheduler.py (media + text delivery paths)
- gateway/platforms/feishu.py (5 sites, via existing _submit_on_loop helper
which now delegates to safe_schedule_threadsafe)
- gateway/run.py (10 sites: telegram rename, agent:step hook, status
callback, interim+bg-review, clarify send, exec-approval button+text,
temp-bubble cleanup, channel-directory refresh)
- plugins/memory/hindsight, plugins/platforms/google_chat
- tools/browser_supervisor.py (3), browser_cdp_tool.py,
computer_use/cua_backend.py, slash_confirm.py
- tools/environments/modal.py (_AsyncWorker)
- tools/mcp_tool.py (2 + 8 _run_on_mcp_loop callers converted to
factory-style so the coroutine is never constructed on a dead loop)
- tui_gateway/ws.py
Tests: new tests/agent/test_async_utils.py covers helper behavior under
live loop, dead loop, None loop, and scheduling exceptions. Regression
tests added at three PR-original sites (acp events, acp permissions,
mcp loop runner) mirroring contributor's intent.
Live-tested end-to-end:
- Helper stress test: 1500 schedules across live/dead/race scenarios,
zero leaked coroutines
- Race exercised: 5000 schedules with loop killed mid-flight, 100 ok /
4900 None returns, zero leaks
- hermes chat -q with terminal tool call (exercises step_callback bridge)
- MCP probe against failing subprocess servers + factory path
- Real gateway daemon boot + SIGINT shutdown across multiple platform
adapter inits
- WSTransport 100 live + 50 dead-loop writes
- Cron delivery path live + dead loop
Salvages PR #2657 — adopts contributor's intent over a much wider site
list and a single centralized helper instead of inline try/except at
each site. 3 of the original PR's 6 sites no longer exist on main
(environments/patches.py deleted, DingTalk refactored to native async);
the equivalent fix lives in tools/environments/modal.py instead.
Co-authored-by: JithendraNara <jithendranaidunara@gmail.com>
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451)
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.
1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
Every long-lived platform adapter now constructs httpx.AsyncClient
with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
compound to the 256-fd ulimit. Tunable via
HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.
2. whatsapp.send_typing aiohttp leak. The call was
'await self._http_session.post(...)' with no 'async with' and no
variable capture — the ClientResponse went out of scope unclosed,
holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
'async with'. This was the only bare-await aiohttp leak in the
gateway/tools/plugins tree per audit; all other aiohttp sites use
the context-manager pattern correctly.
The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)
Stop the gateway from exiting (or systemd-restart-looping) when a single
messaging adapter fails at startup or runtime. A misconfigured WhatsApp
(npm install timeout, unpaired bridge, missing creds.json) used to take
the entire gateway down, killing cron jobs and any other connected
platforms with it.
Changes:
• Startup (gateway/run.py): when connected_count==0 but the only
errors are retryable, log a degraded-state warning and keep the
gateway alive instead of returning False. Reconnect watcher then
recovers platforms as their underlying problem clears.
• Runtime (gateway/run.py _handle_adapter_fatal_error): when the last
adapter goes down with a retryable error and is queued for
reconnection, stay alive instead of exit-with-failure. Previously
this triggered systemd Restart=on-failure, which created infinite
restart loops on persistent retryable failures (proxy outage,
repeated bridge crashes).
• Reconnect watcher (gateway/run.py _platform_reconnect_watcher):
replace the 20-attempt hard drop with a circuit-breaker pause.
After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the
platform stays in _failed_platforms with paused=True so the watcher
skips it but the operator can still see and resume it. Non-retryable
errors still drop out of the queue immediately. Resolves #17063
(gateway giving up on Telegram after 20 attempts).
• WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start
the Node bridge when creds.json is missing. Sets a non-retryable
whatsapp_not_paired fatal error so the watcher drops it cleanly
with a single 'run hermes whatsapp' log line instead of paying the
30s bridge bootstrap timeout on every gateway start.
• WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set
WHATSAPP_ENABLED=true once pairing actually succeeds. Previously
the wizard wrote the env var at step 2 (before npm install and QR
pairing), so any Ctrl+C left .env claiming WhatsApp was ready when
the bridge had no creds.json. Also propagate the env var when the
user keeps an existing pairing on a re-run.
• /platform slash command (hermes_cli/commands.py + gateway/run.py):
new gateway-only command for manual circuit-breaker control.
/platform list — show connected + failed/paused platforms
/platform pause <name> — silence a known-broken platform
/platform resume <name> — re-queue a paused platform
Tests:
• New: pause/resume helpers, /platform list|pause|resume command,
WhatsApp creds.json preflight, WhatsApp setup ordering.
• Updated: stale assertions that codified the old 'exit and let
systemd restart' behavior in test_runner_fatal_adapter.py,
test_runner_startup_failures.py, and test_platform_reconnect.py
(the 20-attempt give-up test became a circuit-breaker pause test).
5488 tests pass in tests/gateway/.
feat(state.db): persist platform_message_id; restore yuanbao exact-id recall
PR #29211 dropped JSONL gateway transcripts and noted that the platform's
own message_id field (used by Yuanbao's recall guard to redact a
message by exact platform id) was no longer preserved — falling back to
content-match. That fallback works for the common case but redacts the
wrong row when two messages share text (or fails to match when content
is post-processed).
Restore exact-id matching by giving state.db a column for it:
- New platform_message_id TEXT column on the messages table
(SCHEMA_VERSION bump 11 → 12; column added via declarative reconciler
on existing DBs, no version-gated migration block needed)
- Partial index idx_messages_platform_msg_id on
(session_id, platform_message_id) to keep recall's point-lookup cheap
even on large sessions
- append_message() and replace_messages() accept the new value:
the gateway-facing append_to_transcript in gateway/session.py
forwards either message["platform_message_id"] or the legacy
message["message_id"] key (yuanbao's existing convention)
- get_messages_as_conversation() surfaces the column back on the
message dict as message_id so platform code reads the same shape
it used to read from JSONL
- Yuanbao _patch_transcript: restore branch A1 (exact id match)
ahead of A2 (content match) ahead of B (system-note). Both branches
log which one fired so operators can tell from gateway.log whether
recall hit the canonical path or had to fall back.
Tests:
- New low-level round-trip tests in test_hermes_state.py for both
append_message and replace_messages paths
- The PR's test_yuanbao_recall_db_only.py was rewritten to assert
the new contract: branch A1 (id match) works against DB-only
transcripts, and branch A2 (content match) still recovers rows that
were observed without a platform id (e.g. agent-processed @bot
messages where run.py doesn't carry msg_id through)
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).