
PTO Communication ISA Reference

This directory contains the per-instruction reference for the PTO Communication ISA.

  • Source of truth (C++ intrinsics): include/pto/comm/pto_comm_inst.hpp
  • Type definitions: include/pto/comm/comm_types.hpp

Point-to-Point Communication (Synchronous)

  • TPUT: Remote write (local GM → UB → remote GM)
  • TGET: Remote read (remote GM → UB → local GM)

Point-to-Point Communication (Asynchronous)

  • TPUT_ASYNC: Asynchronous remote write (local GM → DMA engine → remote GM)
  • TGET_ASYNC: Asynchronous remote read (remote GM → DMA engine → local GM)

Signal-Based Synchronization

  • TNOTIFY: Send notification to remote NPU
  • TWAIT: Blocking wait until a signal condition holds
  • TTEST: Non-blocking test of a signal condition

Collective Communication

  • TGATHER: Gather data from all ranks
  • TSCATTER: Scatter data to all ranks
  • TREDUCE: Reduce data from all ranks into the local NPU
  • TBROADCAST: Broadcast from current NPU to all ranks

Type Definitions

NotifyOp

Operation type for TNOTIFY:

| Value | Description |
| --- | --- |
| NotifyOp::AtomicAdd | Atomic add (signal += value) |
| NotifyOp::Set | Direct set (signal = value) |

WaitCmp

Comparison operators for TWAIT and TTEST:

| Value | Description |
| --- | --- |
| WaitCmp::EQ | Equal (==) |
| WaitCmp::NE | Not equal (!=) |
| WaitCmp::GT | Greater than (>) |
| WaitCmp::GE | Greater or equal (>=) |
| WaitCmp::LT | Less than (<) |
| WaitCmp::LE | Less or equal (<=) |

// Usage (unified runtime parameter style):
comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
comm::TTEST(signal, 1, comm::WaitCmp::GE);
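
The two tables above can be read as a small state machine on a signal word. The following is a host-side sketch of that reading, not the PTO implementation: `ModelNotify` and `ModelCompare` are hypothetical helper names, the enums are local mirrors of the documented values, and the 32-bit signed signal type is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Local mirrors of the documented enum values (assumption: not the real headers).
enum class NotifyOp { AtomicAdd, Set };
enum class WaitCmp { EQ, NE, GT, GE, LT, LE };

// Hypothetical model of what TNOTIFY does to the target signal word.
inline void ModelNotify(std::int32_t &signal, std::int32_t value, NotifyOp op) {
    if (op == NotifyOp::AtomicAdd) {
        signal += value;  // NotifyOp::AtomicAdd: signal += value
    } else {
        signal = value;   // NotifyOp::Set: signal = value
    }
}

// Hypothetical model of the condition that TTEST checks once and TWAIT blocks on.
inline bool ModelCompare(std::int32_t signal, std::int32_t value, WaitCmp cmp) {
    switch (cmp) {
        case WaitCmp::EQ: return signal == value;
        case WaitCmp::NE: return signal != value;
        case WaitCmp::GT: return signal > value;
        case WaitCmp::GE: return signal >= value;
        case WaitCmp::LT: return signal < value;
        case WaitCmp::LE: return signal <= value;
    }
    assert(false && "unknown WaitCmp value");  // unreachable for valid enum values
    return false;
}
```

Under this model, a `Set` to 1 followed by an `AtomicAdd` of 2 leaves the signal at 3, so a `GE` comparison against 1 succeeds while an `EQ` comparison against 1 no longer does.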

ReduceOp

Reduction operators for TREDUCE:

| Value | Description |
| --- | --- |
| ReduceOp::Sum | Element-wise sum |
| ReduceOp::Max | Element-wise maximum |
| ReduceOp::Min | Element-wise minimum |
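
To make "element-wise" concrete, here is a minimal host-side sketch of the three reduction operators applied across per-rank buffers. It only models the arithmetic: `ModelReduce` is a hypothetical name, the enum is a local mirror, and the use of `std::vector<float>` in place of GM tensors is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Local mirror of the documented enum values (assumption: not the real header).
enum class ReduceOp { Sum, Max, Min };

// inputs[r] is the buffer contributed by rank r; all buffers must have equal length.
inline std::vector<float> ModelReduce(const std::vector<std::vector<float>> &inputs, ReduceOp op) {
    assert(!inputs.empty());
    std::vector<float> out = inputs[0];  // start from rank 0's contribution
    for (std::size_t r = 1; r < inputs.size(); ++r) {
        assert(inputs[r].size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            switch (op) {
                case ReduceOp::Sum: out[i] += inputs[r][i]; break;
                case ReduceOp::Max: out[i] = std::max(out[i], inputs[r][i]); break;
                case ReduceOp::Min: out[i] = std::min(out[i], inputs[r][i]); break;
            }
        }
    }
    return out;
}
```

With three ranks contributing `{1, 5}`, `{4, 2}` and `{3, 3}`, this model yields `{8, 10}` for `Sum`, `{4, 5}` for `Max` and `{1, 2}` for `Min`.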

AtomicType

Atomic operation type for TPUT (defined in include/pto/common/constants.hpp):

| Value | Description |
| --- | --- |
| AtomicType::AtomicNone | No atomic operation (default) |
| AtomicType::AtomicAdd | Atomic add operation |
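
The difference between the two modes can be sketched on the host as overwrite versus accumulate. This is an illustrative model only, assuming that `AtomicNone` means a plain store and `AtomicAdd` means the destination is incremented by the source; `ModelPut` is a hypothetical name and does not model atomicity between concurrent writers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local mirror of the documented enum values (assumption: not the real header).
enum class AtomicType { AtomicNone, AtomicAdd };

// Hypothetical model of the destination update performed by a remote write.
inline void ModelPut(std::vector<float> &dst, const std::vector<float> &src,
                     AtomicType atomic = AtomicType::AtomicNone) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        if (atomic == AtomicType::AtomicAdd) {
            dst[i] += src[i];  // accumulate into the existing destination value
        } else {
            dst[i] = src[i];   // plain overwrite (default)
        }
    }
}
```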

DmaEngine

DMA backend selection for TPUT_ASYNC and TGET_ASYNC:

| Value | Description |
| --- | --- |
| DmaEngine::SDMA | SDMA engine (supports 1D transfer) |
| DmaEngine::URMA | URMA engine (supports 1D transfer; Ascend950 / NPU_ARCH 3510 only) |

AsyncEvent

Returned by TPUT_ASYNC / TGET_ASYNC. Use it to wait for, or poll for, completion of the transfer:

struct AsyncEvent {
    uint64_t handle;
    DmaEngine engine;

    bool valid() const;                        // true if handle != 0
    bool Wait(const AsyncSession &session) const; // block until transfer completes
    bool Test(const AsyncSession &session) const; // non-blocking completion check
};
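
One common way to use a `Test`/`Wait` pair is to overlap local work with the in-flight transfer. The sketch below shows that pattern against mock types so it compiles on a host: `MockAsyncEvent`, its `polls_left` field, and `OverlapUntilDone` are all hypothetical, and it assumes `Test` returns `true` once the transfer has completed, which the listing above does not state explicitly.

```cpp
#include <cstdint>

// Mock stand-ins so the sketch builds without the PTO headers (assumption).
struct AsyncSession {};

struct MockAsyncEvent {
    std::uint64_t handle;
    mutable int polls_left;  // hypothetical: transfer "completes" after this many Test() calls

    bool valid() const { return handle != 0; }
    bool Test(const AsyncSession &) const {
        if (polls_left > 0) {
            --polls_left;
        }
        return polls_left == 0;
    }
    bool Wait(const AsyncSession &session) const {
        while (!Test(session)) {
        }
        return true;
    }
};

// Runs up to maxUnits pieces of local work while the transfer is in flight,
// then blocks until completion. Returns the number of work units executed.
template <typename Event, typename Work>
int OverlapUntilDone(const Event &ev, const AsyncSession &session, Work doLocalWork, int maxUnits) {
    if (!ev.valid()) {
        return 0;  // no transfer was issued, nothing to wait for
    }
    int done = 0;
    while (done < maxUnits && !ev.Test(session)) {
        doLocalWork();
        ++done;
    }
    ev.Wait(session);  // make sure the transfer is complete before using its data
    return done;
}
```

The final `Wait` is kept even after a successful `Test` so that the function has a single, unconditional completion point before the caller touches the destination buffer.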

AsyncSession

Engine-agnostic session for async DMA operations. Build once, pass to all async calls:

comm::AsyncSession session;
comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);

Defined in include/pto/comm/async_common/async_types.hpp. See TPUT_ASYNC for construction details and parameters.

ParallelGroup

Wrapper for collective communication across multiple NPUs:

template <typename GlobalData>
struct ParallelGroup {
    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
    // depending on the collective instruction.
    GlobalData *tensors;
    int nranks;   // Number of ranks
    int rootIdx;  // Root NPU's rank index
    
    // Factory function (recommended): build from an existing tensor array.
    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
};
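
As an illustration of how the fields and the factory might relate, the following host-side mirror fills in `Create` under the assumption that `size` becomes `nranks` and `rank_id` becomes `rootIdx`. Both the mapping and the names `ParallelGroupModel` and `MockGlobalData` are hypothetical; consult the headers for the actual behavior.

```cpp
#include <cassert>

// Hypothetical stand-in for a GlobalData object that wraps a GM address.
struct MockGlobalData {
    float *addr;
};

// Host-side mirror of the struct above, for illustration only.
template <typename GlobalData>
struct ParallelGroupModel {
    GlobalData *tensors;  // local array of per-rank tensor descriptors
    int nranks;           // number of ranks
    int rootIdx;          // root NPU's rank index

    // Assumed mapping: size -> nranks, rank_id -> rootIdx.
    static ParallelGroupModel Create(GlobalData *tensorArray, int size, int rank_id) {
        assert(tensorArray != nullptr && size > 0);
        assert(rank_id >= 0 && rank_id < size);
        return ParallelGroupModel{tensorArray, size, rank_id};
    }
};
```

In this model the descriptor array stays owned by the caller, so it must outlive the group object that points at it.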