chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via ruff --fix --unsafe-fixes, 0 remaining.
133 files, +626/-626 (net zero).
fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races (#17118)
* fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races
tui_gateway reports tui_gateway_crash.log traces where the main
thread sits in sys.stdin while a worker holds _stdout_lock mid-
flush, and SIGTERM then calls sys.exit(0) while the lock is still
held — the interpreter shutdown stalls behind the wedged write.
Two narrowly scoped hardenings:
**tui_gateway/transport.py**
* Move JSON serialisation outside the lock — long messages no longer
block sibling writers while we serialise.
* Treat BrokenPipeError, ValueError ("I/O on closed file") and
generic OSError from both write and flush as "peer is gone":
return False instead of bubbling, matching what write_json's
callers in entry.py already expect.
* Split flush into its own try block so a stuck flush never strands
a partial write or holds the lock indefinitely on its way out.
* Optional HERMES_TUI_GATEWAY_NO_FLUSH=1 env knob to skip explicit
flush() entirely on environments where a half-closed read pipe
produces an indefinite kernel-level block. Default unchanged.
**tui_gateway/entry.py**
* _log_signal now spawns a 1-second daemon timer that calls
os._exit(0) if the orderly sys.exit(0) path is itself stuck
behind a wedged worker. Atexit handlers run inside the grace
window when they can; the timer is the safety net so a deadlocked
flush no longer strands the gateway process.
Tests:
* test_write_json_closed_stream_returns_false — ValueError path.
* test_write_json_oserror_on_flush_returns_false — OSError on flush
must not strand the lock; the write portion still landed before the
flush failure.
* test_write_json_no_flush_env_skips_flush — env knob bypass.
Validation: scripts/run_tests.sh tests/tui_gateway/test_protocol.py
(42/42 pass; one pre-existing failure on
test_session_resume_returns_hydrated_messages is unrelated to this
change — same include_ancestors mock kwarg issue tracked elsewhere).
scripts/run_tests.sh tests/test_tui_gateway_server.py 90/90 pass.
* review(copilot): tighten transport hardening comments + test cleanup
* review(copilot): narrow exception capture, configurable grace, simpler no-flush test
* fix(tui-gateway): narrow ValueError to closed-stream; surface UnicodeEncodeError
Copilot review on PR #17118: UnicodeEncodeError is a ValueError
subclass, so a non-UTF-8 stdout (mismatched PYTHONIOENCODING / locale)
would have been silently swallowed as 'peer gone' under
except ValueError. That hides a real environment bug.
Now:
- UnicodeEncodeError → log with exc_info (warning) and drop the frame
- ValueError where str(e) contains 'closed file' → peer gone, return False
- Any other ValueError → log loudly, drop frame (defensive, but visible)
Same shape applied to flush. Adds two regression tests.
* fix(tui-gateway): reserve write() False for peer-gone; re-raise programming errors
Round 2 Copilot review on PR #17118: Transport.write() returning
False is documented as 'peer is gone', and entry.py reacts by
calling sys.exit(0). But the implementation also returned False
for non-IO conditions (non-JSON-safe payloads, UnicodeEncodeError,
unrelated ValueErrors), so a programming error or local env bug would
present as a clean disconnect — exactly the diagnosis pain we wanted
to eliminate.
Now:
- json.dumps failure → re-raises (TypeError/ValueError surfaces in crash log)
- BrokenPipeError → False (peer gone)
- ValueError('...closed file...') → False (peer gone)
- UnicodeEncodeError and any other ValueError → re-raise
- OSError → False (existing IO-failure semantics, debug-logged)
Tests updated to assert the re-raise behaviour and added a
non-serializable-payload regression test.
* fix(tui-gateway): narrow OSError to peer-gone errnos; honest test naming
Round 3 Copilot review on PR #17118:
- Docstring claimed False = peer gone, but generic OSError on write/flush
also returned False — meaning ENOSPC/EACCES/EIO would silently exit.
Added _PEER_GONE_ERRNOS = {EPIPE, ECONNRESET, EBADF, ESHUTDOWN, +WSA}
and narrowed the OSError handlers; non-peer-gone errnos re-raise.
Docstring now lists OSError as peer-gone branch with the errno set.
- The _DISABLE_FLUSH test was named after the env var but actually
patched the module constant. Renamed it to reflect the contract being
tested (skips flush when constant is true) AND added a real
end-to-end test that sets the env var, reloads transport.py, and
asserts the constant flips. Cleanup reload restores defaults so
parallel tests stay isolated.
Self-review (avoid round 4):
- Verified TeeTransport's secondary-swallow stays intentional.
- _log_signal grace path already covered by separate tests.
fix(async): close unscheduled coroutines in all threadsafe bridges (#26584)
Wraps every sync->async coroutine-scheduling site in the codebase with a
new agent.async_utils.safe_schedule_threadsafe() helper that closes the
coroutine on scheduling failure (closed loop, shutdown race, etc.)
instead of leaking it as 'coroutine was never awaited' RuntimeWarnings
plus reference leaks.
22 production call sites migrated across the codebase:
- acp_adapter/events.py, acp_adapter/permissions.py
- agent/lsp/manager.py
- cron/scheduler.py (media + text delivery paths)
- gateway/platforms/feishu.py (5 sites, via existing _submit_on_loop helper
which now delegates to safe_schedule_threadsafe)
- gateway/run.py (10 sites: telegram rename, agent:step hook, status
callback, interim+bg-review, clarify send, exec-approval button+text,
temp-bubble cleanup, channel-directory refresh)
- plugins/memory/hindsight, plugins/platforms/google_chat
- tools/browser_supervisor.py (3), browser_cdp_tool.py,
computer_use/cua_backend.py, slash_confirm.py
- tools/environments/modal.py (_AsyncWorker)
- tools/mcp_tool.py (2 + 8 _run_on_mcp_loop callers converted to
factory-style so the coroutine is never constructed on a dead loop)
- tui_gateway/ws.py
Tests: new tests/agent/test_async_utils.py covers helper behavior under
live loop, dead loop, None loop, and scheduling exceptions. Regression
tests added at three PR-original sites (acp events, acp permissions,
mcp loop runner) mirroring contributor's intent.
Live-tested end-to-end:
- Helper stress test: 1500 schedules across live/dead/race scenarios,
zero leaked coroutines
- Race exercised: 5000 schedules with loop killed mid-flight, 100 ok /
4900 None returns, zero leaks
- hermes chat -q with terminal tool call (exercises step_callback bridge)
- MCP probe against failing subprocess servers + factory path
- Real gateway daemon boot + SIGINT shutdown across multiple platform
adapter inits
- WSTransport 100 live + 50 dead-loop writes
- Cron delivery path live + dead loop
Salvages PR #2657 — adopts contributor's intent over a much wider site
list and a single centralized helper instead of inline try/except at
each site. 3 of the original PR's 6 sites no longer exist on main
(environments/patches.py deleted, DingTalk refactored to native async);
the equivalent fix lives in tools/environments/modal.py instead.
Co-authored-by: JithendraNara <jithendranaidunara@gmail.com>