perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790)
gateway/platforms/__init__.py eagerly imported QQAdapter and
YuanbaoAdapter at package-init time, which transitively pulled in
qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's
websocket stack. About 84 ms wall and 23 MB RSS on every fresh process
that touched anything under gateway.platforms — including `hermes
chat` (via run_agent → cli's plugin discovery transitive import).
Nothing in the codebase actually consumes these symbols from the
package root; every real call site uses the long-form path
(from gateway.platforms.qqbot import QQAdapter,
from gateway.platforms.yuanbao import YuanbaoAdapter in gateway/run.py).
The eager re-export was only there for convenience.
Replace with a PEP 562 module-level __getattr__ that lazily imports
on first attribute access. Public API stays identical:
from gateway.platforms import QQAdapter keeps working but only
pays the import cost when the symbol is actually touched. __dir__
preserves help() / autocomplete behavior.
Measured impact (7-run medians, 9950X3D):
import gateway.platforms 127 → 43 ms (-66%)
50 → 27 MB (-46%)
import gateway.platforms.base 127 → 44 ms (-65%)
50 → 27 MB (-46%)
import cli (full chat path) 745 → 710 ms ( -5%)
96 → 90 MB ( -6%)
hermes chat -q (cold) -5 MB
The per-import win is biggest because qqbot/yuanbao deps don't overlap
with anything on the gateway-platforms path — full import cli
already loads aiohttp/websockets transitively from other places, so
the marginal CLI win is smaller than the isolated import benchmark.
The gateway.platforms.base win is what matters most for long-lived
gateway processes: every gateway boot saves 23 MB resident.
All 144 qqbot tests pass; broader gateway suite (5132 tests) passes
modulo 4 pre-existing flakes also failing on main without this change.
fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451)
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.
1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
Every long-lived platform adapter now constructs httpx.AsyncClient
with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
compound to the 256-fd ulimit. Tunable via
HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.
2. whatsapp.send_typing aiohttp leak. The call was
'await self._http_session.post(...)' with no 'async with' and no
variable capture — the ClientResponse went out of scope unclosed,
holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
'async with'. This was the only bare-await aiohttp leak in the
gateway/tools/plugins tree per audit; all other aiohttp sites use
the context-manager pattern correctly.
The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(matrix,gateway): Matrix E2EE installs full dep set; plugins respect is_connected
Fixes #31116 — two distinct bugs in fresh-install Matrix gateway:
1. Matrix E2EE setup installed only mautrix[encryption], leaving asyncpg
/ aiosqlite / Markdown / aiohttp-socks uninstalled. The first encrypted
connect failed with 'No module named asyncpg' deep inside
MatrixAdapter.connect(). Root cause: the setup wizard hand-rolled a
pip install of one package instead of using lazy_deps.ensure(
'platform.matrix'), and check_matrix_requirements() short-circuited the
runtime installer on 'import mautrix' alone — so the other 4 packages
were never pulled in.
2. Discord auto-enabled itself on every gateway start, even when the user
never selected Discord and had no DISCORD_BOT_TOKEN. Root cause:
gateway/config.py plugin-enablement loop gated enablement on
entry.check_fn() (just 'is the SDK importable?') and ignored
entry.is_connected (the 'did the user configure credentials?' probe).
Same bug class as commit 7849a3d73 fixed for _platform_status in the
setup wizard; this is the runtime counterpart. Affects Discord, Teams,
and Google Chat.
Changes:
- hermes_cli/setup.py::_setup_matrix — install via
lazy_deps.ensure('platform.matrix') to pull the full feature group.
- gateway/platforms/matrix.py::_check_e2ee_deps — verify asyncpg +
aiosqlite + PgCryptoStore in addition to OlmMachine, so E2EE failures
surface at startup instead of at first encrypted-room connect.
- gateway/platforms/matrix.py::check_matrix_requirements — use
feature_missing('platform.matrix') as the install gate instead of a
single 'import mautrix' check, so partial installs trigger the lazy
installer correctly.
- gateway/config.py plugin-enablement loop — consult entry.is_connected
before flipping enabled=True. Explicit YAML enabled=true still wins.
Tests: 3 new in tests/gateway/test_matrix.py (asyncpg-required,
aiosqlite-required, partial-install lazy-runs), 5 new in
tests/gateway/test_platform_registry.py (is_connected=False blocks,
is_connected=True enables, is_connected=None falls back to check_fn,
raising probe doesn't enable, explicit YAML wins).
Validation: 310 tests across affected test modules pass.
fix(telegram): gate send() on send-path health after reconnect storms (#31165)
After sustained Bad Gateway / TimedOut reconnect cycles, the PTB httpx
client can enter a state where bot.send_message() returns a valid
Message (real message_id) but the message never reaches the recipient.
TelegramAdapter.send returns SendResult(success=True) and cron's
live-adapter branch marks the run delivered while the message is
silently dropped.
Add a _send_path_degraded flag. _handle_polling_network_error sets it
on reconnect storms; the existing _verify_polling_after_reconnect
heartbeat probe clears it once getMe() confirms the Bot client is
healthy. While the flag is set, send() short-circuits with
SendResult(success=False, retryable=True) so cron falls through to
the standalone delivery path (fresh HTTP session).
Closes #31165.
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)
Stop the gateway from exiting (or systemd-restart-looping) when a single
messaging adapter fails at startup or runtime. A misconfigured WhatsApp
(npm install timeout, unpaired bridge, missing creds.json) used to take
the entire gateway down, killing cron jobs and any other connected
platforms with it.
Changes:
• Startup (gateway/run.py): when connected_count==0 but the only
errors are retryable, log a degraded-state warning and keep the
gateway alive instead of returning False. Reconnect watcher then
recovers platforms as their underlying problem clears.
• Runtime (gateway/run.py _handle_adapter_fatal_error): when the last
adapter goes down with a retryable error and is queued for
reconnection, stay alive instead of exit-with-failure. Previously
this triggered systemd Restart=on-failure, which created infinite
restart loops on persistent retryable failures (proxy outage,
repeated bridge crashes).
• Reconnect watcher (gateway/run.py _platform_reconnect_watcher):
replace the 20-attempt hard drop with a circuit-breaker pause.
After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the
platform stays in _failed_platforms with paused=True so the watcher
skips it but the operator can still see and resume it. Non-retryable
errors still drop out of the queue immediately. Resolves #17063
(gateway giving up on Telegram after 20 attempts).
• WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start
the Node bridge when creds.json is missing. Sets a non-retryable
whatsapp_not_paired fatal error so the watcher drops it cleanly
with a single 'run hermes whatsapp' log line instead of paying the
30s bridge bootstrap timeout on every gateway start.
• WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set
WHATSAPP_ENABLED=true once pairing actually succeeds. Previously
the wizard wrote the env var at step 2 (before npm install and QR
pairing), so any Ctrl+C left .env claiming WhatsApp was ready when
the bridge had no creds.json. Also propagate the env var when the
user keeps an existing pairing on a re-run.
• /platform slash command (hermes_cli/commands.py + gateway/run.py):
new gateway-only command for manual circuit-breaker control.
/platform list — show connected + failed/paused platforms
/platform pause <name> — silence a known-broken platform
/platform resume <name> — re-queue a paused platform
Tests:
• New: pause/resume helpers, /platform list|pause|resume command,
WhatsApp creds.json preflight, WhatsApp setup ordering.
• Updated: stale assertions that codified the old 'exit and let
systemd restart' behavior in test_runner_fatal_adapter.py,
test_runner_startup_failures.py, and test_platform_reconnect.py
(the 20-attempt give-up test became a circuit-breaker pause test).
5488 tests pass in tests/gateway/.
feat(state.db): persist platform_message_id; restore yuanbao exact-id recall
PR #29211 dropped JSONL gateway transcripts and noted that the platform's
own message_id field (used by Yuanbao's recall guard to redact a
message by exact platform id) was no longer preserved — falling back to
content-match. That fallback works for the common case but redacts the
wrong row when two messages share text (or fails to match when content
is post-processed).
Restore exact-id matching by giving state.db a column for it:
- New platform_message_id TEXT column on the messages table
(SCHEMA_VERSION bump 11 → 12; column added via declarative reconciler
on existing DBs, no version-gated migration block needed)
- Partial index idx_messages_platform_msg_id on
(session_id, platform_message_id) to keep recall's point-lookup cheap
even on large sessions
- append_message() and replace_messages() accept the new value:
the gateway-facing append_to_transcript in gateway/session.py
forwards either message["platform_message_id"] or the legacy
message["message_id"] key (yuanbao's existing convention)
- get_messages_as_conversation() surfaces the column back on the
message dict as message_id so platform code reads the same shape
it used to read from JSONL
- Yuanbao _patch_transcript: restore branch A1 (exact id match)
ahead of A2 (content match) ahead of B (system-note). Both branches
log which one fired so operators can tell from gateway.log whether
recall hit the canonical path or had to fall back.
Tests:
- New low-level round-trip tests in test_hermes_state.py for both
append_message and replace_messages paths
- The PR's test_yuanbao_recall_db_only.py was rewritten to assert
the new contract: branch A1 (id match) works against DB-only
transcripts, and branch A2 (content match) still recovers rows that
were observed without a platform id (e.g. agent-processed @bot
messages where run.py doesn't carry msg_id through)
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).