| perf: bench ffn using Criterion (#80) * perf: bench ffn using Criterion * chore: clippy & format * better ffn benchmarking * chore: explain why we use cuda to initialize weights | 10 months ago |
| Add dockerfile and related docker-compose for quick testing (#23) * feat: add dockerfile for controller and worker * fix: add migration tool diesel feat: add brief docker-compose file for quick deployment * feat[version]: add rust-toolchains.toml for rust version config feat[worker]: worker read ctrl addr from env feat[computation]: change [::1] to 0.0.0.0 for cross network usage feat[docker]: install specific rust-toolchain from rust-toolchain.toml feat[docker]: update torch version to 2.7.0 for RTX50 support fix[torch_test]: fix module path import error * feat: get config from get_config_key method feat: update transformers version for qwen3 support * feat: format dockerfile * feat[dockerfile]: extract common envs to independent stage feat[compose]: update compose file according to dockerfile * add ek-cli into docker and fix weight server proxy issue feat: add ek-cli into docker image feat: change test config according to new config.rs fix: remove proxy settings in ek-runtime environments * fix: add max_conn_size settings for * feat: refactor configuration structure * refactor: migrate config to new format * feat: Add dockerfile for kunpeng arm and x86 * fix: redundant space in dockerfile * feat: add diesel cli tool in runtime refactor: only leave ek-cli as entry of ek * feat: using compose.override to extract user define variables feat: add a simple readme | 1 year ago |
| feat: inject detailed activation to clickhouse | 1 year ago |
| update: add release tags (#72) | 10 months ago |
| feat: RDMA Queue impl (#102) * perf: local shared memory for controller-worker communication * chore: clippy & format * feat: raii for shm queue * chore: enlarge queue * chore: remove verbose log * chore: clippy & format * perf: tuning * chore: gracefully terminate worker (shm version) * fix: ExpertRegistry route mistake after rebalance action (#1) * docs: add some comments for dispatcher logic * feat: add new config for Commuicate Backend (grpc && shm) feat: add uniform get_registry for both backend * feat: set node to deactivate in db when node exit, avoiding wrong info used by schedule * chore: clippy & format * chore: reorganize shmq module * feat: rdma implementation * feat: can successfully establish connection * test: example for rdmaQueue * feat: impl rdma queue into registry and state service * Merge branch 'testing' into perf/rdma * feat: rdma runable * refactor: change write logic from "read remote" to "write remote" feat: add sleep for controller side after rdma connection established. * feat: add some debug info * feat: add interface to RdmaBytes for real lenth feat: rdma will only send real data to the remote * clippy & format --------- Co-authored-by: Yip Coekjan <cn_yzr@qq.com> | 9 months ago |
| perf: use ggml operators to optimize cpu ffn forwarding (#94) * perf: use ggml operators to optimize cpu ffn forwarding * perf: supports bf16 on ggml backend * chore: make clippy happy * chore: align the types * chore: tuning & fix serialization * fix: fix padding and context size * feat: allow dropping cache after loading expert backend * chore: statically link ggml * feat: allocating tensor data from rust side * feat: allow specifying computation backend * chore: clippy & format * chore: tuning * chore: delete unused feature flags * chore: remove ggml-cuda * fix: ggml-cpu.h includes ggml.h * perf: single thread for better throughput | 10 months ago |
| chore: set tonic blocking threads via env | 9 months ago |
| fix: Make Rdma connection establishment stable (#107) * feat: change is_connect judge logic * feat: change connection establish logic * feat: simplify rdma connection establish procedure * feat: RdmaEndpointServer shutdown after connection established. * chore: make compiler happy * feat: add retry logic for rdma connection establish * chore: using url:Url for robust url parsing * feat: add proper disconnect logic for connection rebuild feat: RdmaEndpointServer looped for connection rebuild feat: add proper close logic when disconnect from controller feat: graceful shutdown for rdmaEndpointServer * chore: cargo fmt --all * chore: detailed prepared_qp build error * chore: remove unnecessary _ for some fields in RdmaQueue * chore: remove unused variables | 7 months ago |
| feat: add transformer mixtral (#92) * feat: add transformer mixtral example * feat: fully integrates mixtral models --------- Co-authored-by: Yip Coekjan <cn_yzr@qq.com> | 9 months ago |
| fix: add lib64 search path for ek-ggml static linking on kylin Linux Add cargo rustc-link-search for $dst/lib64 in ek-ggml/build.rs so static ggml archives can be found on systems that install libraries under lib64. Refs #1 | 2 months ago |
| feat: add transformer mixtral (#92) * feat: add transformer mixtral example * feat: fully integrates mixtral models --------- Co-authored-by: Yip Coekjan <cn_yzr@qq.com> | 9 months ago |
| fix: Make Rdma connection establishment stable (#107) * feat: change is_connect judge logic * feat: change connection establish logic * feat: simplify rdma connection establish procedure * feat: RdmaEndpointServer shutdown after connection established. * chore: make compiler happy * feat: add retry logic for rdma connection establish * chore: using url:Url for robust url parsing * feat: add proper disconnect logic for connection rebuild feat: RdmaEndpointServer looped for connection rebuild feat: add proper close logic when disconnect from controller feat: graceful shutdown for rdmaEndpointServer * chore: cargo fmt --all * chore: detailed prepared_qp build error * chore: remove unnecessary _ for some fields in RdmaQueue * chore: remove unused variables | 7 months ago |
| Minor fix for running deepseek-v3 671B (#25) * feat: add torch install role * feat: convert safetensor to torch tensor without copy * dev: update ansible settings * chore: tweak rust grpc settings and coding style * dev: tweak python settings * fix: remove absolute path * feat: auto reconnect to controller * feat: use rwlock in worker * dev: load worker on start Signed-off-by: Liu Hancheng <cn_lhc@qq.com> * dev: auto reconnect * doc: update readme.md --------- Signed-off-by: Liu Hancheng <cn_lhc@qq.com> | 1 year ago |
| build: add init docker file | 1 year ago |
| test: add test resources | 1 year ago |
| fix: typo in torch integration and small enhancement | 1 year ago |
| perf: use ggml operators to optimize cpu ffn forwarding (#94) * perf: use ggml operators to optimize cpu ffn forwarding * perf: supports bf16 on ggml backend * chore: make clippy happy * chore: align the types * chore: tuning & fix serialization * fix: fix padding and context size * feat: allow dropping cache after loading expert backend * chore: statically link ggml * feat: allocating tensor data from rust side * feat: allow specifying computation backend * chore: clippy & format * chore: tuning * chore: delete unused feature flags * chore: remove ggml-cuda * fix: ggml-cpu.h includes ggml.h * perf: single thread for better throughput | 10 months ago |
| fix: final try of .lfsconfig | 1 year ago |
| chore: optimize dependencies (#76) * chore: optimize deps * chore: remove unused deps * chore: mark ek-cli as default member | 10 months ago |
| fix: Make Rdma connection establishment stable (#107) * feat: change is_connect judge logic * feat: change connection establish logic * feat: simplify rdma connection establish procedure * feat: RdmaEndpointServer shutdown after connection established. * chore: make compiler happy * feat: add retry logic for rdma connection establish * chore: using url:Url for robust url parsing * feat: add proper disconnect logic for connection rebuild feat: RdmaEndpointServer looped for connection rebuild feat: add proper close logic when disconnect from controller feat: graceful shutdown for rdmaEndpointServer * chore: cargo fmt --all * chore: detailed prepared_qp build error * chore: remove unnecessary _ for some fields in RdmaQueue * chore: remove unused variables | 7 months ago |
| fix: Make Rdma connection establishment stable (#107) * feat: change is_connect judge logic * feat: change connection establish logic * feat: simplify rdma connection establish procedure * feat: RdmaEndpointServer shutdown after connection established. * chore: make compiler happy * feat: add retry logic for rdma connection establish * chore: using url:Url for robust url parsing * feat: add proper disconnect logic for connection rebuild feat: RdmaEndpointServer looped for connection rebuild feat: add proper close logic when disconnect from controller feat: graceful shutdown for rdmaEndpointServer * chore: cargo fmt --all * chore: detailed prepared_qp build error * chore: remove unnecessary _ for some fields in RdmaQueue * chore: remove unused variables | 7 months ago |
| doc: update readme | 1 year ago |
| Fix typo in README.md (#97) | 7 months ago |
| feat: support onnxruntime #30 (#38) * feat: onnx export from rust * dev: add onnx supporting * fix: benchmark script for onnxruntime * feat: ek-cli for exporting onnx file * feat: support onnxruntime backend in benchmark * dev: fix merge conflicts * fix: typo | 1 year ago |
| chore: fix uv deps & enhance qwen3 integration (device & response) (#78) * chore: fix uv deps & enhance qwen3 integration (device & response) * Update ek-integration/expertkit_torch/expertkit_torch/models/qwen3_moe.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: fix copilot suggestion --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> | 10 months ago |
| tool: introduce ruff to standardize py style | 1 year ago |
| chore: fix uv deps & enhance qwen3 integration (device & response) (#78) * chore: fix uv deps & enhance qwen3 integration (device & response) * Update ek-integration/expertkit_torch/expertkit_torch/models/qwen3_moe.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: fix copilot suggestion --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> | 10 months ago |