| File | Last commit | Last updated |
|---|---|---|
upload source code | 8 months ago | |
feat(sandbox): sync luo/develop - sandbox runtime, traefik, snapshot and instance enhancements This commit consolidates 148 commits from luo/develop into feature/sandbox, covering a major refactor of the sandbox runtime execution model plus several new capabilities across traefik routing, snapshot management, exec service, instance management, and observability. --- Complete rewrite of the sandbox container execution layer: - Split CommandBuilder into per-language strategy classes: CppStrategy, PythonStrategy, JavaStrategy, NodejsStrategy, GoStrategy - New SandboxExecutor replacing ContainerExecutor, supporting Normal / WarmUp / Snapshot start modes - CheckpointOrchestrator extracted for checkpoint lifecycle + ref-count safety - RuntimeStateManager encapsulating all 6 executor maps atomically - SandboxStartGuard RAII to prevent map leaks on failed start paths - Unified Restore into Start interface: ckpt_dir on StartRequest replaces the dedicated RestoreRequest / Restore RPC - Inject YR_LANGUAGE env so entryfile.sh selects the correct Python version (fixes SIGSEGV with Python 3.12/3.13 bytecode on 3.11 runtime) - System-level environment variable blacklist (SYSTEM_ENV_BLACKLIST) to strip host OS env vars from container processes - Resource limit (CPU_LIMIT / MEMORY_LIMIT) read from scheduling extension and applied as cgroup effective value - Disable LookPath in CommandBuilder (container-based launch, not local exec) - Warmup routing fixed to check request.warmuptype() instead of stateManager - shared_from_this capture in CheckpointOrchestrator lambdas to prevent UAF - Port Python 3.12 and 3.13 into service.json runtime version list - 39 new unit tests: LanguageStrategyTest, RuntimeStateManagerTest, SandboxRequestBuilderTest, SandboxExecutorTest Dynamic routing configuration via pull-based HTTP polling: - TraefikRouteCache: in-memory route table built from instance portForward extensions; sorted-key JSON guarantees byte-identical output across polls (prevents 
spurious Traefik reloads from FNV hash drift); conditionally omits empty routers/services keys incompatible with Traefik Go parser - TraefikApiRouterRegister: GET /traefik/config handler; leader serves from local cache, standby transparently forwards to leader; returns 503 on forward failure so Traefik retains last-known-good config - TraefikLeaderContext: thread-safe leader state (atomic<bool> + shared_mutex) updated by Explorer's leader election callback - InstanceManagerActor integration: MetaStore watch events drive route table (RUNNING adds, FATAL/EVICTED/EXITED/DELETE removes) - Flags: enable_traefik_provider, traefik_http_entry_point, traefik_enable_tls, traefik_servers_transport, traefik_forward_timeout_ms - 44+ unit tests including concurrent stress and Traefik compatibility regression - List and delete snapshots filtered by function key and tenant - SnapshotCache maintains a function-key index for O(1) lookup - New proto messages: ListSnapshotsByFunctionKeyRequest / Response, DeleteSnapshotsByFunctionKeyRequest / Response - HTTP handlers in SnapManagerDriver for the new endpoints - LocalSchedSrv and its actor updated to dispatch the new request types - SnapCtrlActor enhanced to propagate functionType from FunctionKey so that list_checkpoints by function key returns correct results - Implement Docker and Podman container runtime backends - Warmup mount support for container instances - Custom rootfs: local filesystem path handling + RootfsSpecMeta in function metadata structures; validation and logging for rootfs specs - Custom mount support for user-defined volume bindings - Port forwarding: inject YR_INTERNAL_HOST_IP and YR_PORT_FORWARDINGS env vars so sandbox instances can expose internal cluster URLs via get_internal_urls() - Docker Exec gRPC bidirectional streaming service - Multi-session exec support with bugfixes - Epoll-based exec I/O loop replacing blocking read - PTY resize: fix TIOCSWINSZ to use slaveFd (ensures SIGWINCH delivery to foreground 
process group); also explicitly send kill(pid, SIGWINCH) as fallback when setsid/TIOCSCTTY is not established - Router info extended with gRPC address - Configurable system tenant ID (--system_tenant_id flag, default "0"): when request tenant_id matches, returns all tenants' instances - Enhanced tenant ID handling and authorization across instance APIs - Named instance registration API (/instance-manager/named-ins) - Vertical scaling support when node has insufficient resources - Fix for GroupCache not updating groupId→InstanceInfo mapping on put event - Integrate OpenTelemetry C++ SDK with ABI compatibility shim - Unify trace component name to "yuanrong-kernel" across all subsystems - Add libcurl dependency for OTLP HTTP exporter - Fix trace ID padding: use append instead of insert - Resource view change processing: async batching with unordered_map merge to reduce redundant updates under high change rate - JWT token support with HMAC-SHA256 signature and never-expire option - Configurable token expiration time span per function/tenant - fc-agent retry parameters exposed as configurable flags - Cache-based download to skip redundant package fetches - Tenant ID propagated from DeployInstanceRequest through DSAccessor to KVClient for correct DataSystem KV context - bypass_datasystem flag propagated through InvokeRequestToCallRequest - Datasystem router .so added to install directory - Data affinity support in schedule decisions - Fix request ID conflict in strictly-packed group scheduling - NPU health callback retry mechanism in metrics subsystem - Rolling compression of user logs - Proxy group suspend: checkpoint retry on idle instances; fix proxy restart caused by master failover; fix index out-of-bounds on empty NPU IP list - State machine: fix instance unable to update info when resuming to a previously-suspended node; fix inconsistent lifecycles of parent-child instances; release state machine owner correctly on suspend - IAM client: deduplicate concurrent 
token requests for same tenant - MAX_GRPC_SIZE default value unified across function-proxy - Static function deadlock fix on fault injection - ERR_INSTANCE_BUSY error code for concurrent request handling Signed-off-by: luowenyu <luowenyu4@huawei.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Lwy_Robb <luowenyu4@huawei.com> | 2 months ago | |
chore: consolidate sandbox follow-up fixes Signed-off-by: mhsong2 <songminhui2@huawei.com> | 9 days ago | |