cann-robotRefactor A2A3 syncall ST to dav-c220 chevron patterns and update SYNCALL docs.

SYNCALL

Instruction Diagram

No SYNCALL.svg is provided (unlike most vector ops). SYNCALL is a cross-core control-plane primitive rather than a per-element Tile transform; semantically it means "all selected participants rendezvous at this point before any may proceed."

The following conceptual diagram distinguishes the hardware (FFTS) and software (GM polling) paths:

flowchart TB
  subgraph hard [Hard Mode / FFTS]
    H1[Each participant reaches call site] --> H2[ffts_cross_core_sync etc.]
    H2 --> H3[wait_flag_dev etc.]
    H3 --> H4[Barrier complete]
  end
  subgraph soft [Soft Mode / GM Polling]
    S1[Write local GM slot counter] --> S2[Poll all slots reach current generation]
    S2 --> S3[Barrier complete]
  end

Summary

SYNCALL is a cross-core synchronization barrier supporting A2/A3 and A5 NPU backends. The template parameter SyncCoreType selects the core-type mode:

AIV-only (default): SYNCALL() synchronizes all AIV cores.
AIC-only: SYNCALL<SyncCoreType::AICOnly>() synchronizes all AIC cores (A2/A3 supports both hardware and software modes; A5 supports hardware mode only).
MIX (AIC+AIV): SYNCALL<SyncCoreType::Mix>() synchronizes mixed AIC and AIV cores.

SyncAllMode (specified explicitly in workspace-bearing overloads) selects hardware mode (FFTS) or software mode (GM polling). The workspace-free overload corresponds to the hardware path.

Mathematical Semantics

Not applicable as an elementwise arithmetic operation. SYNCALL expresses a barrier arrival relation:

At a given dynamic program point, every core in the participant set defined by the current SyncCoreType must execute past the SYNCALL call before any participant may proceed beyond that point.
Hardware mode: cross-core visibility is guaranteed by FFTS flags and device-side wait_flag_dev primitives.
Software mode: each participant owns a monotonically-increasing counter slot in GM; dcci/dsb coherency primitives and polling determine "all participants have reached the current generation."

This semantic does not provide additional guarantees on GM or other buffer contents after the barrier; cross-core data visibility must be maintained by the caller — see "Cross-Core GM Communication Notes".

C++ Built-in Interface

Declared in include/pto/common/pto_instr.hpp. Software-mode interfaces use type-safe GlobalTensor and Tile parameters (constrained via SFINAE):

// Hardware mode (all CoreType variants)
template <SyncCoreType CoreType = SyncCoreType::AIVOnly>
PTO_INST void SYNCALL();

// Software mode — AIV-only (GlobalTensor + Vec Tile)
template <SyncAllMode Mode, SyncCoreType CoreType = SyncCoreType::AIVOnly,
          typename GlobalData, typename TileData,
          std::enable_if_t<is_global_data_v<GlobalData> &&
                           is_tile_data_v<TileData> && TileData::Loc == TileType::Vec, int> = 0>
PTO_INST void SYNCALL(GlobalData &gmWorkspace, TileData &ubWorkspace, int32_t usedCores = 0);

// Software mode — AIC-only (GlobalTensor + Mat Tile)
template <SyncAllMode Mode, SyncCoreType CoreType = SyncCoreType::AICOnly,
          typename GlobalData, typename TileData,
          std::enable_if_t<is_global_data_v<GlobalData> &&
                           is_tile_data_v<TileData> && TileData::Loc == TileType::Mat, int> = 0>
PTO_INST void SYNCALL(GlobalData &gmWorkspace, TileData &l1Workspace, int32_t usedCores = 0);

// Software mode — MIX (GlobalTensor + Vec Tile + Mat Tile)
template <SyncAllMode Mode, SyncCoreType CoreType = SyncCoreType::Mix,
          typename GlobalData, typename UbTileData, typename L1TileData,
          std::enable_if_t<is_global_data_v<GlobalData> &&
                           is_tile_data_v<UbTileData> && UbTileData::Loc == TileType::Vec &&
                           is_tile_data_v<L1TileData> && L1TileData::Loc == TileType::Mat, int> = 0>
PTO_INST void SYNCALL(GlobalData &gmWorkspace, UbTileData &ubWorkspace, L1TileData &l1Workspace,
                       int32_t usedCores = 0);

Parameters

gmWorkspace: GlobalTensor<int32_t, pto::Shape<>, pto::Stride<>> (when using namespace pto coexists with Ascend C headers, qualify with pto:: to avoid name collision with the compiler-intrinsic Stride enum). GM workspace for software mode; must be zero-initialized before the call. Each participating core occupies 8 int32_t values (cache-line-isolated sync counter).
ubWorkspace: Tile<TileType::Vec, int32_t, 1, SYNCALL_SOFT_SLOT_INT32>. UB scratch for AIV-only and MIX software mode; capacity must be at least usedCores * 8 * sizeof(int32_t).
l1Workspace: Tile<TileType::Mat, int32_t, 1, SYNCALL_SOFT_SLOT_INT32>. L1 (cbuf) scratch for AIC-only and MIX software mode; used by create_cbuf_matrix to fill a sync value then DMA-transfer to GM.
usedCores: Number of cores participating in the software barrier. When 0, automatically inferred — AIV-only / AIC-only use get_block_num(), MIX uses SYNCALL_GET_MIX_PARTICIPANT_COUNT() (i.e. AIC blocks × (1 + AIV ratio)).

Kernel Meta Macros

The following scenarios require hand-written .ascend.meta in the ELF for correct runtime scheduling: hard AIV-only, soft AIC-only, and register-ELF MIX (e.g. 1:1 hard). For dav-c220 auto-split builds, Bisheng generates meta automatically — see the note at the end of this section. Macros are defined in include/pto/common/kernel_meta.hpp:

kernelName must exactly match the __global__ entry symbol (stored in section .ascend.meta.<kernelName>).

// AIV-side kernel (ktype=MIX_AIV_MAIN, AIC:AIV ratio fixed 0:1)
PTO_SYNCALL_AIV_KERNEL_META(kernelName);

// AIC-only kernel (ktype=AIC_ONLY, ratio fixed 1:0)
PTO_SYNCALL_AIC_KERNEL_META(kernelName);

// AIC-side MIX kernel (ktype=MIX_AIC_MAIN, specify AIC:AIV ratio)
PTO_SYNCALL_MIX_AIC_KERNEL_META(kernelName, aicRatio, aivRatio);

Usage examples

Hard AIV-only (single kernel, chevron launch):

PTO_SYNCALL_AIV_KERNEL_META(MyKernel_mix_aiv);
extern "C" __global__ AICORE void MyKernel_mix_aiv(...) { SYNCALL(); }

Soft AIC-only (single kernel, chevron launch):

PTO_SYNCALL_AIC_KERNEL_META(MyKernel);
extern "C" __global__ AICORE void MyKernel(...) { SYNCALL<SyncAllMode::Soft, SyncCoreType::AICOnly>(...); }

register-ELF generic pairing (AIC side sets the ratio + AIV side). Note: the MIX 1:2 case in the current syncall ST now uses dav-c220 auto-split and needs no hand-written meta; the example below only demonstrates the register-ELF macro pairing:

PTO_SYNCALL_MIX_AIC_KERNEL_META(MyKernel_mix_aic, 1, 2);
PTO_SYNCALL_AIV_KERNEL_META(MyKernel_mix_aiv);

register-ELF MIX 1:1 hard (both AIC and AIV sides use PTO_SYNCALL_MIX_AIC_KERNEL_META(..., 1, 1) — do not use PTO_SYNCALL_AIV_KERNEL_META on the AIV side):

PTO_SYNCALL_MIX_AIC_KERNEL_META(MyKernel_mix_aic, 1, 1);
PTO_SYNCALL_MIX_AIC_KERNEL_META(MyKernel_mix_aiv, 1, 1);

Common scenarios that do not require hand-written meta (full mapping in the "Compilation and Launch Guide" quick-reference table below):

AIV-only soft (dav-c220-vec)
MIX 1:2 hard/soft, hard AIC-only on A2/A3 (dav-c220 auto-split)
MIX 1:1 soft (dual-stream chevron)

dav-c220 auto-split: When compiled with --cce-aicore-arch=dav-c220, Bisheng automatically generates AIC/AIV sub-kernels and corresponding .ascend.meta at a fixed physical ratio of 1:2 (two AIV subblocks per AIC block). In this case you do not need to hand-write PTO_SYNCALL_MIX_AIC_KERNEL_META, and you cannot override the ratio to 1:1 via meta alone (see "MIX 1:1" below).

Compilation and Launch Guide (A2/A3)

This section uses the ST suite tests/npu/a2a3/src/st/testcase/syncall/ as the reference for which compile arch, kernel meta, and host launch pattern to use for each SyncCoreType / mode / AIC:AIV ratio. The host decides the launch grid at runtime via syncall_core_config.hpp (910B1: 24 AIC + 48 AIV; 910B4: 20 AIC + 40 AIV), so the same kernel binaries work across chips.

Scenario Quick Reference

Scenario	Sync Mode	Participants	Compile `--cce-aicore-arch`	Kernel Meta	Host Launch	Reference Source
AIV-only	Hard	`aiv`	`dav-c220-vec`	`PTO_SYNCALL_AIV_KERNEL_META`	chevron `<<<aiv>>>`	`syncall_kernel.cpp`
AIV-only	Soft	`aiv`	`dav-c220-vec`	none	chevron `<<<aiv>>>`	`syncall_soft_kernel.cpp`
AIC-only	Hard	`aic`	`dav-c220` (MIX auto-split, empty AIV stub)	auto-generated by Bisheng	chevron `<<<aic>>>`	`syncall_aic_hard_kernel.cpp`
AIC-only	Soft	`aic`	`dav-c220-cube`	`PTO_SYNCALL_AIC_KERNEL_META`	chevron `<<<aic>>>`	`syncall_aic_kernel.cpp`
MIX 1:2	Hard / Soft	`aic×3`	`dav-c220`	auto-generated by Bisheng	chevron `<<<aic>>>` (hard + soft share one `.so`)	`syncall_mix_1_2_kernel.cpp`
MIX 1:1	Soft	`aic×2`	cube + vec `.o` each	none	dual-stream chevron: AIC `<<<aic>>>` + AIV `<<<aic>>>`	`syncall_mix_1_1_soft_kernel.cpp`
MIX 1:1	Hard	`aic×2`	cube + vec `.o` each	`PTO_SYNCALL_MIX_AIC_KERNEL_META(..., 1, 1)`	register ELF + `rtKernelLaunchWithHandleV2`	`syncall_mix_1_1_kernel.cpp`

Hard and soft kernels must not share the same .so in AIV-only / AIC-only cases (soft corrupts hard FFTS config and hangs). MIX 1:2 hard and soft may live in the same dav-c220 auto-split source and .so.

Path Details

1. Chevron single-arch (AIV-only / AIC-only soft)

Compile: one source file + matching arch (dav-c220-vec or dav-c220-cube) → separate .so.
Launch: kernel<<<blockDim, nullptr, stream>>>(..., totalBlocks); blockDim and totalBlocks come from the host at runtime (ST: syncall_cfg::GetCoreConfig()).
Hard AIV-only: declare PTO_SYNCALL_AIV_KERNEL_META.

2. Chevron MIX auto-split (MIX 1:2, hard AIC-only)

Compile: --cce-aicore-arch=dav-c220; CMake helper pto_syncall_chevron_kernel(<target> <source>).
Launch: single chevron <<<aic>>>; runtime brings up all MIX participants at physical 1:2.
Kernel args: aicBlocks and totalParticipants passed as scalars from the host (read on both AIC and AIV sides) for 910B1/910B4 portability.
Hard AIC-only exception: pure dav-c220-cube cannot establish the FFTS context for AIC-only hard sync. Use dav-c220 MIX compile: AIC runs SYNCALL<AICOnly>(), AIV is an empty stub; totalBlocks passed from the host.

3. Dual-arch dual-stream (MIX 1:1 soft)

Why: on the ccec/bisheng path GetTaskRation() is always 2; dav-c220 auto-split is physically fixed at 1:2 and cannot produce true 1:1.
Compile: same source compiled once as dav-c220-cube (-DSYNCALL_MIX_BUILD_AIC) and once as dav-c220-vec (-DSYNCALL_MIX_BUILD_AIV), linked into one .so; CMake helper pto_syncall_mix11_soft_kernel.
Launch: AIC and AIV chevron <<<aic>>> on separate aclrtStreams; aicBlocks / totalParticipants from the host at runtime.

4. Register ELF (MIX 1:1 hard)

Why: hard MIX sync needs a single MIX FFTS context; chevron auto-split cannot produce true 1:1 on ccec.
Compile: cube / vec objects with PTO_SYNCALL_MIX_AIC_KERNEL_META(name, 1, 1), register-only objects with -DSYNCALL_MIX_REGISTER_BUILD, merged by make_mix_register_elf.py; CMake helper pto_syncall_mix_kernel.
Launch: rtRegisterAllKernel + rtKernelLaunchWithHandleV2(handle, tilingKey, aicBlocks, ...); device derives participant count via get_block_num() (register path passes only ffts/out/flags).

Mode Support Matrix

A2/A3

Core Type	Hardware Mode	Software Mode
AIV-only	Supported	Supported
AIC-only	Supported	Supported
MIX	Supported	Supported

A5

Core Type	Hardware Mode	Software Mode
AIV-only	Supported	Supported
AIC-only	Supported	Not supported
MIX	Not supported	Supported

Constraints

Software-mode GM write paths per platform:
- A2/A3 (AIC-only and the AIC side of MIX): AIC writes GM slots via copy_cbuf_to_gm (L1→GM DMA); the AIV side of MIX writes via UB workspace.
- A5 MIX: A5 AIC (dav-c310-cube) lacks copy_cbuf_to_gm; instead it delegates UB→GM writes to AIV subblock 0 of the same block via intra_block signaling.
A5 platform limitations (cf. "Mode Support Matrix"):
- AIC-only software unavailable: A5 AIC lacks an independent GM DMA write path (no copy_cbuf_to_gm), so GM-polling sync is infeasible.
- Hardware MIX unavailable: rtGetC2cCtrlAddr returns RT_ERROR_FEATURE_NOT_SUPPORT (207000) on A5 (CHIP_DAVID), preventing FFTS base address retrieval.
- AIC-only hardware: implemented via ffts_cross_core_sync + wait_flag_dev, without needing set_ffts_base_addr.
Software mode requires all participating cores to enter the same barrier group in the same order (based on monotonic generation counting; mismatched entry count/order causes misalignment or deadlock).
SYNCALL does not participate in PTO's automatic Event dependency scheduling: it neither accepts WaitEvents nor returns a RecordEvent that later instructions can wait on. Consequently it does not automatically wait for preceding data instructions (e.g. TSTORE) to complete; the ordering and visibility between SYNCALL and surrounding data instructions must be ensured by the caller (see "Cross-Core GM Communication Notes").
In the auto build path (__PTO_AUTO__), SYNCALL is a no-op and emits no cross-core hardware synchronization (consistent with TSYNC etc.); actual synchronization happens only in manual kernels.

Cross-Core GM Communication Notes

SYNCALL only provides barrier arrival semantics (for both hard and soft), and does not guarantee cross-core cache visibility of business data around the barrier. When an operator writes GM per core before the barrier and reads other cores' GM after the barrier (e.g. cross-core histogram / prefix-sum), the caller must satisfy the two points below, otherwise stale reads or lost writes will occur.

1. Cache coherency: explicit `dcci` / `dsb` is required

Writer: after copy_ubuf_to_gm / copy_cbuf_to_gm, issue dcci(addr, SINGLE_CACHE_LINE) + dsb(DSB_DDR) to flush data out to DDR.
Reader: before reading, issue dcci(addr, SINGLE_CACHE_LINE) (invalidate) + dsb to ensure the latest DDR value is read instead of a stale local cache.
set_flag / wait_flag alone (intra-core pipeline sync) is not sufficient for cross-core visibility.
This is independent of the barrier mode: the hardware FFTS barrier also does not flush cache; it only guarantees the "all arrived" control-plane ordering.
SYNCALL internally applies full dcci + dsb(DDR) to its own sync slots, but does not flush the caller's business data.

dcci / DMA operate at 32B cache-line granularity; if adjacent cores' slots share one cache line, cross-core flushes overwrite / lose each other's writes.
Each core's slot should be 32B-aligned and own one full cache line (for int32, that means stride = 8, not 4).
SYNCALL's own sync slots follow this design: SYNCALL_SOFT_SLOT_INT32 = 8 (see include/pto/common/type.hpp); the caller's business workspace should follow the same isolation principle.

Examples

Manual — Hardware Mode

#include <pto/pto-inst.hpp>

using namespace pto;

// AIV-only: FFTS barrier across all AIV cores (requires correct kernel meta / ELF)
void example_hard_aiv() {
  SYNCALL();
}

// AIC-only: on A2/A3 hard paths use dav-c220 MIX compile (empty AIV stub); on A5 pure cube hard mode is verified
void example_hard_aic() {
  SYNCALL<SyncCoreType::AICOnly>();
}

// MIX: for compile/launch details see "Kernel Meta Macros" and "Compilation and Launch Guide" above
void example_hard_mix() {
  SYNCALL<SyncCoreType::Mix>();
}

Manual — Software Mode

Software mode requires a zero-initialized GM workspace and a correctly-sized UB/L1 Tile. Mode must be SyncAllMode::Soft (Hard ignores the workspace and behaves like the workspace-free SYNCALL_IMPL).

#include <pto/pto-inst.hpp>

using namespace pto;

// The AIV software barrier reads all participating cores' slots into UB,
// so UB capacity must be >= usedCores * SYNCALL_SOFT_SLOT_INT32 (each core owns one cache line);
// here we declare it with kMaxAivCores (the target chip's max AIV core count) as a compile-time upper bound.
constexpr int32_t kMaxAivCores = 48;  // e.g. 48 on 910B1
void example_soft_aiv(__gm__ int32_t *gmPtr) {
  GlobalTensor<int32_t, pto::Shape<>, pto::Stride<>> gmWs(gmPtr);
  Tile<TileType::Vec, int32_t, 1, kMaxAivCores * SYNCALL_SOFT_SLOT_INT32> ub;
  SYNCALL<SyncAllMode::Soft, SyncCoreType::AIVOnly>(gmWs, ub, 0);  // usedCores=0 -> get_block_num()
}

MIX software mode requires both UB and L1 (Mat) Tiles; on A5 the AIC side delegates GM writes via a proxy path — see the "Constraints" section.