文件最后提交记录最后更新时间
fix(qqbot): authorize approval button interactions by session owner (#30737)12 天前
refactor(plugins): add apply_yaml_config_fn registry hook Lets platform plugins own their YAML→env config bridge instead of forcing core gateway/config.py to know every platform's schema. The hook receives the full parsed config.yaml and the platform's own sub-dict, may mutate os.environ (env > YAML precedence preserved via the standard not os.getenv(...) guards), and may return a dict to merge into PlatformConfig.extra. It runs during load_gateway_config() after the existing generic shared-key loop and before _apply_env_overrides(), mirroring the env_enablement_fn dispatch pattern (#21306, #21331). Pure addition — no behavior change for existing platforms. Each of the eight platforms with hardcoded YAML→env blocks today (discord, telegram, whatsapp, slack, dingtalk, mattermost, matrix, feishu, ~252 LOC in gateway/config.py) can migrate in independent follow-up PRs; the hardcoded blocks remain functional in the meantime, and their not os.getenv(...) guards make them no-ops for any env var the hook already set. Test coverage: 10 new tests in tests/gateway/test_platform_registry.py covering field default, callable acceptance, env mutation, extras merge, both signature args, exception swallowing, missing/non-dict sections, and env > YAML precedence. Refs #3823, #24356. Closes #24836. 22 天前
perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790) gateway/platforms/__init__.py eagerly imported QQAdapter and YuanbaoAdapter at package-init time, which transitively pulled in qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's websocket stack. About 84 ms wall and 23 MB RSS on every fresh process that touched anything under gateway.platforms — including `hermes chat` (via run_agent → cli's plugin discovery transitive import). Nothing in the codebase actually consumes these symbols from the package root; every real call site uses the long-form path (from gateway.platforms.qqbot import QQAdapter, from gateway.platforms.yuanbao import YuanbaoAdapter in gateway/run.py). The eager re-export was only there for convenience. Replace with a PEP 562 module-level __getattr__ that lazily imports on first attribute access. Public API stays identical: from gateway.platforms import QQAdapter keeps working but only pays the import cost when the symbol is actually touched. __dir__ preserves help() / autocomplete behavior. Measured impact (7-run medians, 9950X3D): import gateway.platforms 127 → 43 ms (-66%) 50 → 27 MB (-46%) import gateway.platforms.base 127 → 44 ms (-65%) 50 → 27 MB (-46%) import cli (full chat path) 745 → 710 ms ( -5%) 96 → 90 MB ( -6%) hermes chat -q (cold) -5 MB The per-import win is biggest because qqbot/yuanbao deps don't overlap with anything on the gateway-platforms path — full import cli already loads aiohttp/websockets transitively from other places, so the marginal CLI win is smaller than the isolated import benchmark. The gateway.platforms.base win is what matters most for long-lived gateway processes: every gateway boot saves 23 MB resident. All 144 qqbot tests pass; broader gateway suite (5132 tests) passes modulo 4 pre-existing flakes also failing on main without this change.27 天前
fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp. 1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars. 2. whatsapp.send_typing aiohttp leak. The call was 'await self._http_session.post(...)' with no 'async with' and no variable capture — the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in 'async with'. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly. The underlying reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT — those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks. 1 个月前
fix(state): restrict sensitive store file permissions response_store.db (api server) holds conversation history including tool payloads, prompts, and results. webhook_subscriptions.json holds per-route HMAC secrets. Under a permissive umask (e.g. 0o022, default on most distros) both files were created mode 0o644 — readable by other local users on shared boxes. - gateway/platforms/api_server.py: ResponseStore tightens itself + WAL/SHM sidecars to 0o600 after __init__, then trusts the inode. (Original contributor patch chmod'd after every _commit() — wasteful on a hot api_server path; chmod-on-create is sufficient since SQLite preserves mode bits across writes.) - hermes_cli/webhook.py: _save_subscriptions writes via tempfile.mkstemp (which itself creates the file with 0o600), chmods the temp before the atomic rename, and re-asserts 0o600 on the destination so an existing permissive file from before this fix gets narrowed. Tests cover (a) creation under permissive umask leaves 0o600 and (b) an existing 0o644 webhook_subscriptions.json gets narrowed on next save. Tests guarded with skipif os.name=='nt' since POSIX mode bits don't apply on Windows. Salvaged from PR #30917 by @Hinotoi-agent. Reworked the api_server.py side from chmod-on-every-commit to chmod-on-create. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com> 12 天前
fix(gateway): drop text snippet from debounce debug log (CodeQL) CodeQL py/clear-text-logging-sensitive-data flagged the candidate-accept debug log including event.text[:60]. Log text_len instead — sufficient for debugging burst behavior without surfacing message contents. Co-authored-by: Paulo Nascimento <pnascimento9596@gmail.com> 12 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
fix(dingtalk): finalize open streaming cards before disconnect AI Card "tool progress" cards created with finalize=False were left in streaming state on DingTalk's UI after a gateway restart because disconnect() called _streaming_cards.clear() without first closing them via _close_streaming_siblings. Move the finalization loop before self._http_client.aclose() so the HTTP client is still available when the finalize requests are sent. Adds a regression test that asserts the HTTP client is alive during finalization. 12 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
fix(feishu): validate verification token before reflecting url_verification challenge When FEISHU_VERIFICATION_TOKEN is configured, an unauthenticated remote could previously prove endpoint control by sending a url_verification payload with any attacker-controlled challenge string — the handler reflected the challenge BEFORE running the token check. Move the verification_token check ahead of the url_verification echo so the challenge response is gated on a valid token. Add a regression test covering the wrong-token case. Also fix the stale test_connect_webhook_mode_starts_local_server fixture to set FEISHU_VERIFICATION_TOKEN (post #30746 webhook mode requires a secret). Salvaged from PR #29663 by @m0n3r0 — kept the url_verification reorder and its regression test; dropped the host-conditional weakening of the #30746 secret guard (we want webhook secrets required regardless of bind host, not only on 0.0.0.0/::). Docs updated to call out the gating. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com> 12 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
chore: ruff auto-fix C401, C416, C408, PLR1722 (#23940) C401: set(x for x in y) -> {x for x in y} (set comprehension) C416: [(k,v) for k,v in d] -> list(d.items()) (unnecessary listcomp) C408: tuple()/dict() -> ()/{} (unnecessary collection call) PLR1722: exit() -> sys.exit() (adds import sys where needed) 21 instances fixed, 0 remaining. 19 files, +40/-36.25 天前
fix(gateway): preserve underscores in plain-text identifiers 19 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
fix(matrix,gateway): Matrix E2EE installs full dep set; plugins respect is_connected Fixes #31116 — two distinct bugs in fresh-install Matrix gateway: 1. Matrix E2EE setup installed only mautrix[encryption], leaving asyncpg / aiosqlite / Markdown / aiohttp-socks uninstalled. The first encrypted connect failed with 'No module named asyncpg' deep inside MatrixAdapter.connect(). Root cause: the setup wizard hand-rolled a pip install of one package instead of using lazy_deps.ensure( 'platform.matrix'), and check_matrix_requirements() short-circuited the runtime installer on 'import mautrix' alone — so the other 4 packages were never pulled in. 2. Discord auto-enabled itself on every gateway start, even when the user never selected Discord and had no DISCORD_BOT_TOKEN. Root cause: gateway/config.py plugin-enablement loop gated enablement on entry.check_fn() (just 'is the SDK importable?') and ignored entry.is_connected (the 'did the user configure credentials?' probe). Same bug class as commit 7849a3d73 fixed for _platform_status in the setup wizard; this is the runtime counterpart. Affects Discord, Teams, and Google Chat. Changes: - hermes_cli/setup.py::_setup_matrix — install via lazy_deps.ensure('platform.matrix') to pull the full feature group. - gateway/platforms/matrix.py::_check_e2ee_deps — verify asyncpg + aiosqlite + PgCryptoStore in addition to OlmMachine, so E2EE failures surface at startup instead of at first encrypted-room connect. - gateway/platforms/matrix.py::check_matrix_requirements — use feature_missing('platform.matrix') as the install gate instead of a single 'import mautrix' check, so partial installs trigger the lazy installer correctly. - gateway/config.py plugin-enablement loop — consult entry.is_connected before flipping enabled=True. Explicit YAML enabled=true still wins. Tests: 3 new in tests/gateway/test_matrix.py (asyncpg-required, aiosqlite-required, partial-install lazy-runs), 5 new in tests/gateway/test_platform_registry.py (is_connected=False blocks, is_connected=True enables, is_connected=None falls back to check_fn, raising probe doesn't enable, explicit YAML wins). Validation: 310 tests across affected test modules pass. 12 天前
fix(mattermost): resolve thread root_id and route progress to threads Two Mattermost thread-related bugs: 1. _resolve_root_id() — Mattermost CRT requires root_id to be the thread root post. Using any reply's own ID as root_id causes '400 Invalid RootId'. Add _resolve_root_id() that walks up the post chain via API to find the actual root, and apply it in send(), _send_url_as_file(), and _send_local_file(). 2. _progress_reply_to — The condition in run.py only checked Platform.FEISHU, missing Mattermost entirely. This caused tool progress messages to always land in the main channel instead of the thread. Add Platform.MATTERMOST to the condition so progress messages are routed to threads when reply_mode=thread. Impact: Tool progress messages now appear in Mattermost threads instead of flooding the main channel; thread replies no longer fail with Invalid RootId when the reply target is itself a reply. 17 天前
Harden msgraph webhook auth requirements (#30169)12 天前
feat(signal): add require_mention filter for group chats Add a configurable mention filter to the Signal adapter so the bot only responds in groups when it is explicitly @mentioned. Changes: - gateway/platforms/signal.py: read require_mention from adapter extra config or SIGNAL_REQUIRE_MENTION env var; skip group messages that don't mention the bot account (checked in rendered text and raw mention metadata) - gateway/config.py: map signal.require_mention YAML key to the SIGNAL_REQUIRE_MENTION env var (env var takes precedence) Config example: signal: require_mention: true Or via env var: SIGNAL_REQUIRE_MENTION=true 17 天前
feat(gateway/signal): add support for multiple images sending Adds a new send_multiple_images method to the BasePlatformAdapter that implements the default "One image per message" loop and allows for platform-specific overriding. Implements such an override for the Signal adapter, batching images and trying (best-effort) to work around rate-limits for voluminous batches using a specific scheduler. Also implements batching + rate-limit handling in the send_message tool. New tests added for the Signal adapter, its rate-limit scheduler and the send_message tool 1 个月前
fix(gateway): add trust_env=True to aiohttp sessions in SMS, Slack, Teams, Google Chat adapters aiohttp.ClientSession defaults to trust_env=False, which silently ignores HTTP_PROXY, HTTPS_PROXY, and ALL_PROXY environment variables. Users behind a corporate or network proxy cannot reach external APIs on any of these platforms — all outbound requests fail with connection errors. Symmetric with wecom.py (line 276), weixin.py (lines 1055/1268/1274), and matrix.py (no-proxy path) which already set this flag. Complements the open LINE fix (#26635) with the remaining gateway and plugin adapters. Changed: - gateway/platforms/sms.py: persistent Twilio session (connect) + fallback session (send) — both hit https://api.twilio.com - gateway/platforms/slack.py: ephemeral response_url POST session — hits https://hooks.slack.com/... callback URLs - plugins/platforms/teams/adapter.py: standalone send session — hits login.microsoftonline.com (token) + Bot Framework service URL - plugins/platforms/google_chat/adapter.py: standalone send session — hits https://chat.googleapis.com/v1/... WhatsApp sessions are excluded: they connect to http://127.0.0.1:{port} (local bridge) and must not be routed through a system proxy. 19 天前
fix(gateway): add trust_env=True to aiohttp sessions in SMS, Slack, Teams, Google Chat adapters aiohttp.ClientSession defaults to trust_env=False, which silently ignores HTTP_PROXY, HTTPS_PROXY, and ALL_PROXY environment variables. Users behind a corporate or network proxy cannot reach external APIs on any of these platforms — all outbound requests fail with connection errors. Symmetric with wecom.py (line 276), weixin.py (lines 1055/1268/1274), and matrix.py (no-proxy path) which already set this flag. Complements the open LINE fix (#26635) with the remaining gateway and plugin adapters. Changed: - gateway/platforms/sms.py: persistent Twilio session (connect) + fallback session (send) — both hit https://api.twilio.com - gateway/platforms/slack.py: ephemeral response_url POST session — hits https://hooks.slack.com/... callback URLs - plugins/platforms/teams/adapter.py: standalone send session — hits login.microsoftonline.com (token) + Bot Framework service URL - plugins/platforms/google_chat/adapter.py: standalone send session — hits https://chat.googleapis.com/v1/... WhatsApp sessions are excluded: they connect to http://127.0.0.1:{port} (local bridge) and must not be routed through a system proxy. 19 天前
fix(telegram): gate send() on send-path health after reconnect storms (#31165) After sustained Bad Gateway / TimedOut reconnect cycles, the PTB httpx client can enter a state where bot.send_message() returns a valid Message (real message_id) but the message never reaches the recipient. TelegramAdapter.send returns SendResult(success=True) and cron's live-adapter branch marks the run delivered while the message is silently dropped. Add a _send_path_degraded flag. _handle_polling_network_error sets it on reconnect storms; the existing _verify_polling_after_reconnect heartbeat probe clears it once getMe() confirms the Bot client is healthy. While the flag is set, send() short-circuits with SendResult(success=False, retryable=True) so cron falls through to the standalone delivery path (fresh HTTP session). Closes #31165. Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com> 12 天前
fix(telegram): reset sticky fallback IP on connect failure, retry primary DNS When a sticky fallback IP (from DoH discovery) becomes unreachable, the transport previously got stuck in an attempt_order that only tried the dead IP. This prevented the gateway from recovering until the service was restarted. Changes: - Always include primary DNS path (None) after the sticky IP in the attempt_order so that a primary-path retry happens on sticky failure. - Reset self._sticky_ip to None when the currently sticky IP hits a connect timeout / connect error, allowing the next request to retry from scratch. Fixes silent Telegram disconnection when discovered fallback IPs are transiently or permanently unreachable. 17 天前
fix(webhook): use 403 not 500 for missing-secret rejection Operator misconfiguration is a client/setup error, not an internal server exception. 403 "forbidden" more accurately reflects "this route refuses to authenticate" than 500 "internal server error" — the latter triggers incident alerting on operator monitoring and conflates real bugs with config drift. Follow-up tweak to PR #29629 by @m0n3r0. 12 天前
fix(wecom): guard flush task against cancel-delivery race to prevent message loss When asyncio.sleep() fires just before Task.cancel() is called, CPython sets _must_cancel=True but cannot cancel the already-completed sleep future, so CancelledError is delivered at the next await (handle_message) rather than at the sleep. By that point the superseded task has already popped the merged event from _pending_text_batches, so the superseding task sees an empty batch and silently drops the message. Fix: add a synchronous task-registry check between the sleep and the pop. No await between the check and the pop means no other coroutine can interleave, so the guard is race-free. 12 天前
fix(wecom-callback): retry send with fresh token on errcode 40001/42001 When WeCom returns errcode=40001 (invalid credential) or 42001 (token expired), send() was returning a failure without evicting the bad token from _access_tokens. All subsequent sends then kept using the same invalid cached token until its TTL naturally expired (~7200s). Fix: on the first token-rejection errcode, evict the cache entry and retry once with a freshly fetched token. Non-token errcodes fail immediately as before. If the refreshed token also fails, the error is returned without looping further. Adds four regression tests covering: successful retry on 40001, successful retry on 42001, no retry on unrelated errcode, and clean failure when the refresh does not help. 12 天前
feat(gateway): add WeCom callback-mode adapter for self-built apps Add a second WeCom integration mode for regular enterprise self-built applications. Unlike the existing bot/websocket adapter (wecom.py), this handles WeCom's standard callback flow: WeCom POSTs encrypted XML to an HTTP endpoint, the adapter decrypts, queues for the agent, and immediately acknowledges. The agent's reply is delivered proactively via the message/send API. Key design choice: always acknowledge immediately and use proactive send — agent sessions take 3-30 minutes, so the 5-second inline reply window is never useful. The original PR's Future/pending-reply machinery was removed in favour of this simpler architecture. Features: - AES-CBC encrypt/decrypt (BizMsgCrypt-compatible) - Multi-app routing scoped by corp_id:user_id - Legacy bare user_id fallback for backward compat - Access-token management with auto-refresh - WECOM_CALLBACK_* env var overrides - Port-in-use pre-check before binding - Health endpoint at /health Salvaged from PR #7774 by @chqchshj. Simplified by removing the inline reply Future system and fixing: secrets.choice for nonce generation, immediate plain-text acknowledgment (not encrypted XML containing 'success'), and initial token refresh error handling. 1 个月前
Fix unsafe gateway media path delivery 13 天前
fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600) Stop the gateway from exiting (or systemd-restart-looping) when a single messaging adapter fails at startup or runtime. A misconfigured WhatsApp (npm install timeout, unpaired bridge, missing creds.json) used to take the entire gateway down, killing cron jobs and any other connected platforms with it. Changes: • Startup (gateway/run.py): when connected_count==0 but the only errors are retryable, log a degraded-state warning and keep the gateway alive instead of returning False. Reconnect watcher then recovers platforms as their underlying problem clears. • Runtime (gateway/run.py _handle_adapter_fatal_error): when the last adapter goes down with a retryable error and is queued for reconnection, stay alive instead of exit-with-failure. Previously this triggered systemd Restart=on-failure, which created infinite restart loops on persistent retryable failures (proxy outage, repeated bridge crashes). • Reconnect watcher (gateway/run.py _platform_reconnect_watcher): replace the 20-attempt hard drop with a circuit-breaker pause. After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the platform stays in _failed_platforms with paused=True so the watcher skips it but the operator can still see and resume it. Non-retryable errors still drop out of the queue immediately. Resolves #17063 (gateway giving up on Telegram after 20 attempts). • WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start the Node bridge when creds.json is missing. Sets a non-retryable whatsapp_not_paired fatal error so the watcher drops it cleanly with a single 'run hermes whatsapp' log line instead of paying the 30s bridge bootstrap timeout on every gateway start. • WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set WHATSAPP_ENABLED=true once pairing actually succeeds. Previously the wizard wrote the env var at step 2 (before npm install and QR pairing), so any Ctrl+C left .env claiming WhatsApp was ready when the bridge had no creds.json. Also propagate the env var when the user keeps an existing pairing on a re-run. • /platform slash command (hermes_cli/commands.py + gateway/run.py): new gateway-only command for manual circuit-breaker control. /platform list — show connected + failed/paused platforms /platform pause <name> — silence a known-broken platform /platform resume <name> — re-queue a paused platform Tests: • New: pause/resume helpers, /platform list|pause|resume command, WhatsApp creds.json preflight, WhatsApp setup ordering. • Updated: stale assertions that codified the old 'exit and let systemd restart' behavior in test_runner_fatal_adapter.py, test_runner_startup_failures.py, and test_platform_reconnect.py (the 20-attempt give-up test became a circuit-breaker pause test). 5488 tests pass in tests/gateway/.21 天前
feat(state.db): persist platform_message_id; restore yuanbao exact-id recall PR #29211 dropped JSONL gateway transcripts and noted that the platform's own message_id field (used by Yuanbao's recall guard to redact a message by exact platform id) was no longer preserved — falling back to content-match. That fallback works for the common case but redacts the wrong row when two messages share text (or fails to match when content is post-processed). Restore exact-id matching by giving state.db a column for it: - New platform_message_id TEXT column on the messages table (SCHEMA_VERSION bump 11 → 12; column added via declarative reconciler on existing DBs, no version-gated migration block needed) - Partial index idx_messages_platform_msg_id on (session_id, platform_message_id) to keep recall's point-lookup cheap even on large sessions - append_message() and replace_messages() accept the new value: the gateway-facing append_to_transcript in gateway/session.py forwards either message["platform_message_id"] or the legacy message["message_id"] key (yuanbao's existing convention) - get_messages_as_conversation() surfaces the column back on the message dict as message_id so platform code reads the same shape it used to read from JSONL - Yuanbao _patch_transcript: restore branch A1 (exact id match) ahead of A2 (content match) ahead of B (system-note). Both branches log which one fired so operators can tell from gateway.log whether recall hit the canonical path or had to fall back. Tests: - New low-level round-trip tests in test_hermes_state.py for both append_message and replace_messages paths - The PR's test_yuanbao_recall_db_only.py was rewritten to assert the new contract: branch A1 (id match) works against DB-only transcripts, and branch A2 (content match) still recovers rows that were observed without a platform id (e.g. agent-processed @bot messages where run.py doesn't carry msg_id through) 16 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining. 133 files, +626/-626 (net zero).25 天前
yuanbao platform (#16298) Co-authored-by: loongzhao <loongzhao@tencent.com>1 个月前