Agent Task Workflow Guide

This document defines the standard task workflows for AI coding agents working on MindSpeed-MM. Use it to determine the correct sequence of actions, files to modify, and validation steps for common development tasks.

Always read architecture.md first to identify the target backend before following any workflow below.


Quick Decision: Which Backend?

User request mentions:
├── "FSDP2" / "YAML config" / "plugin" / "ModelHub" / "trainer.py"
│   → Use FSDP2 workflows (Section 1)
│
├── "Megatron" / "pretrain_" / "provider" / "training.py"
│   → Use Megatron workflows (Section 2)
│
├── "checkpoint" / "convert" / "hf2mg" / "mg2hf" / "DCP"
│   → Use Checkpoint workflows (Section 3)
│
├── "profile" / "FLOPs" / "MFU" / "memory"
│   → Use Tooling workflows (Section 4)
│
└── "test" / "CI" / "unit test"
    → Use Testing workflows (Section 5)

1. FSDP2 Task Workflows

1.1 Onboard a New VLM (Vision-Language Model)

This is the most common FSDP2 task. Follow the detailed skill at skills/mindspeed-mm-fsdp2-model-only-vlm-migration/SKILL.md.

Quick checklist:

Step Action Key Files
1 Read upstream model card, config, processor, tokenizer config.json, modeling_*.py, processing_*.py
2 Record forward kwargs and confirm labels.loss Model forward() signature
3 Choose weight loading strategy (HF direct / DCP meta-init) modelhub.py
4 Create model plugin at mindspeed_mm/fsdp/models/<name>/ modeling_<name>.py, npu_patch.py
5 Register with @model_register.register("<model_id>") register.py
6 Decide data branch (A: reuse / B: partial / C: custom) qwen2vl_dataset.py, template.py, mm_plugin.py
7 Align template, mm_plugin, collator, batch keys template.py, mm_plugin.py, data_collator.py
8 Create examples/<name>/ with YAML + launch script examples/<name>/<name>_config.yaml
9 Fill fsdp_plan.apply_modules from model.named_modules() YAML parallel.fsdp_plan
10 Verify: 1-step train runs without error Launch script

Decision: Data Branch Selection

Can raw samples map to ShareGPT/Alpaca via DatasetAttr?
├── YES → Can existing template + mm_plugin + collator handle it?
│   ├── YES → Branch A: Only write YAML
│   └── NO  → Branch B: Add template/mm_plugin/collator, keep dataset_type=huggingface
└── NO  → Branch C: Write custom dataset plugin

1.2 Add a New Dataset to FSDP2

Step Action Key Files
1 Determine if Branch A/B/C applies See decision tree above
2 Branch A: Configure data.dataset_param.attr in YAML <model>_config.yaml
3 Branch B: Add template in template.py or mm_plugin in mm_plugin.py template.py, mm_plugin.py
4 Branch B: Register new collator key in data_collator.py if needed data_collator.py
5 Branch C: Create mindspeed_mm/fsdp/data/datasets/<name>/ New *_dataset.py
6 Branch C: Register with @data_register.register("<dataset_type>") register.py
7 Update YAML: dataset_type, attr, template, collate_param.model_name <model>_config.yaml
8 Verify: DataLoader produces correct batch keys Print batch keys in 1-step run

1.3 Write or Modify an FSDP2 YAML Config

Required YAML sections:

model:           # model_id, model_name_or_path, parallel (TP/EP/CP plan)
data:            # dataset_param (type, attr, preprocess), dataloader_param (collate)
parallel:        # fsdp_plan (apply_modules, hook_modules, ignored_modules)
training:        # plugin list, train_iters, batch_size, lr, load/save paths
features:        # recompute, activation_offload, chunk_loss (optional)
tools:           # profiler, memory_profile (optional)

Validation checklist:

Check How to Verify
model.model_id matches @model_register grep for register("<model_id>")
data.dataset_type matches @data_register grep for register("<dataset_type>")
training.plugin imports all model/data packages Check import paths exist
parallel.fsdp_plan.apply_modules uses real module names Print model.named_modules()
parallel.tensor_parallel_size is 1 (FSDP2 convention) YAML value
All paths use local directories (not HF Hub at runtime) Check model_name_or_path, dataset

1.4 Debug an FSDP2 Training Failure

Systematic debugging order:

1. Plugin import errors
   → Check training.plugin paths exist and are importable
   → Verify @register decorators executed (add print in register function)

2. Model build errors
   → Check model_id matches registration
   → Verify model_name_or_path contains valid config.json
   → For custom models: check _from_config / from_pretrained implementation

3. Data pipeline errors
   → Check dataset_type is registered
   → Verify attr mapping matches raw JSON field names
   → Check template name is registered or tokenizer has chat_template
   → Verify collate_param.model_name is a registered DATA_COLLATOR key

4. Parallel apply errors
   → Check apply_modules match real module names (case-sensitive)
   → Verify EP plan modules are subset of FSDP plan modules
   → Check TP size is 1 (FSDP2 convention)

5. Forward errors
   → Print batch keys from dataloader
   → Check model forward(**batch) accepts all keys
   → Verify labels produce loss when loss_type=raw

6. OOM errors
   → Reduce batch_size or enable activation_offload
   → Check recompute plan covers large modules
   → Enable chunk_loss for large vocab models

2. Megatron Task Workflows

2.1 Onboard a New Model (Megatron Backend)

Step Action Key Files
1 Create model implementation in mindspeed_mm/models/<name>/ modeling_<name>.py
2 Register model provider mindspeed_mm/pretrain_*.py
3 Register data provider mindspeed_mm/data/
4 Create example scripts in examples/<name>/ *.sh
5 Add Megatron argument parsing if needed mindspeed_mm/arguments.py

2.2 Modify Megatron Training Flow

Change Type Primary Files
Forward step logic mindspeed_mm/training.py
Pretrain entry point mindspeed_mm/pretrain_*.py
Model provider mindspeed_mm/models/
Data provider mindspeed_mm/data/
Loss function mindspeed_mm/loss/

3. Checkpoint Conversion Workflows

3.1 Determine Conversion Path

Source format → Target format:
├── HuggingFace → DCP (FSDP2 training)
│   → checkpoint/fsdp/generic_dcp_converter.py
│
├── DCP → HuggingFace (inference / release)
│   → checkpoint/fsdp/generic_dcp_converter.py (dcp_to_hf)
│
├── HuggingFace → Megatron (PTD training)
│   → checkpoint/ptd/
│
├── Megatron → HuggingFace
│   → checkpoint/ptd/
│
└── Custom format
    → checkpoint/common/converter.py (base class)

3.2 Add a New Checkpoint Converter

Step Action
1 Identify source and target formats
2 Extend checkpoint/common/converter.py::Converter
3 Implement key mapping (source → target parameter names)
4 Handle special cases: tied weights, MoE expert reshape, DTensor
5 Add conversion script in examples/<model>/

4. Tooling Workflows

4.1 Profile Training Performance

Step Action Key Files
1 Enable profiler in YAML: tools.profiler <model>_config.yaml
2 Set level: 0 (light) / 1 (medium) / 2 (heavy) tools_args.py
3 Run training, collect trace profiler.py
4 Analyze: python -m mindspeed_mm.fsdp.tools.profiler analyse <trace> profiler.py
5 Identify bottleneck: DataLoader / H2D / Compute / Communication Chrome trace

4.2 Calculate FLOPs / MFU

Step Action Key Files
1 Identify model config: layers, hidden_size, num_heads, vocab_size config.json
2 Use existing FLOPs tool or extend for new model tools/flops_tool/
3 Calculate theoretical FLOPs per step Formula in tool
4 Measure actual step time from training log Training log
5 MFU = theoretical_FLOPs / (step_time × device_peak_FLOPs) -

4.3 Memory Profiling

Step Action
1 Enable in YAML: tools.memory_profile
2 Set record_peak_memory: true
3 Run training, check memory report
4 Identify peak memory modules
5 Apply recompute / offload to peak modules

5. Testing Workflows

5.1 Add Unit Tests

Step Action Key Files
1 Identify test scope: model / data / parallel / checkpoint tests/
2 Create test file: tests/fsdp/models/test_<name>.py tests/
3 Follow existing test patterns (see tests/fsdp/models/test_patch_mimo_v2.py) tests/
4 Test minimal: model build + 1 forward pass -
5 Test data: dataset build + 1 batch iteration -
6 Run: python -m pytest tests/fsdp/models/test_<name>.py -xvs -

5.2 Run CI Locally

# FSDP2 unit tests
python -m pytest tests/fsdp/ -xvs -k "<test_name>"

# All tests
python -m pytest tests/ -xvs

6. Cross-Cutting Rules

6.1 File Naming Conventions

Component Pattern Example
Model implementation modeling_<name>.py modeling_qwen3_vl_moe.py
NPU patches npu_patch.py npu_patch.py
Model config YAML <name>_<size>_config.yaml qwen3vl_30B_config_v1.yaml
Launch script finetune_<name>_<size>.sh finetune_qwen3vl_30B_v1_A5.sh
Test file test_<name>.py test_patch_mimo_v2.py
Dataset plugin <name>_dataset.py qwen2vl_dataset.py

6.2 Import Conventions

# FSDP2 imports
from mindspeed_mm.fsdp.utils.register import model_register, data_register
from mindspeed_mm.fsdp.models.modelhub import ModelHub
from mindspeed_mm.fsdp.train.trainer import Trainer
from mindspeed_mm.fsdp.train.train_engine import TrainEngine
from mindspeed_mm.fsdp.distributed.torch_parallelize import ParallelApplier
from mindspeed_mm.fsdp.features.apply_features import FeaturesApplier
from mindspeed_mm.fsdp.data import build_mm_dataset, build_mm_dataloader

# Megatron imports
from mindspeed_mm.training import pretrain
from mindspeed_mm.arguments import parse_args

# Common imports
from mindspeed_mm.fsdp.utils.device import get_device_type, get_torch_device
from mindspeed_mm.fsdp.distributed.parallel_state import get_parallel_state

6.3 When to Create New Files vs Modify Existing

Scenario Action
New model (FSDP2) Create mindspeed_mm/fsdp/models/<name>/
New model (Megatron) Create mindspeed_mm/models/<name>/
New dataset type Create mindspeed_mm/fsdp/data/datasets/<name>/
New template Add to template.py
New mm_plugin Add to mm_plugin.py
New collator Add to data_collator.py
New parallel strategy Modify torch_parallelize.py
New feature (recompute/offload) Create in features/ + register in apply_features.py
Bug fix in existing model Modify existing file
Configuration change Modify YAML or *_args.py

6.4 Common Pitfalls

Pitfall Prevention
Modifying Megatron code for FSDP2 task Check architecture.md backend decision tree first
Using HF Hub paths in YAML Always use local paths for model_name_or_path and dataset
Guessing apply_modules names Print model.named_modules() and copy exact names
Forgetting to register Every model needs @model_register, every dataset needs @data_register
Missing training.plugin entry YAML must list all packages with @register decorators
Enabling EP/prefetch/offload too early First bring-up should use minimal features
Ignoring batch key compatibility Print batch keys and verify model forward accepts them all

7. Task-to-File Mapping (Quick Reference)

User Request Start Here
"Add model X to FSDP2" skills/.../SKILL.mdmindspeed_mm/fsdp/models/examples/
"Add dataset Y" mindspeed_mm/fsdp/data/datasets/data_collator.py → YAML
"Fix training OOM" features/memory/ → YAML features section
"Convert checkpoint A→B" checkpoint/converter.py
"Profile training" tools/profiler.py → YAML tools section
"Add parallel strategy" distributed/torch_parallelize.pyparallel_args.py
"Fix forward/loss" mindspeed_mm/fsdp/models/<name>/modeling_*.py
"Fix data loading" data/datasets/data_collator.pyconvert.py
"Add unit test" tests/fsdp/models/
"Update docs" docs/zh/features/ or examples/<name>/README.md

8. Validation Checklist Before Submitting PR

Check Command / Method
Model builds 1-step training run
Data loads Print batch keys from dataloader
Forward produces loss Check output.loss is scalar
Backward completes 1 optimizer step without error
Config is valid YAML parses without error
Plugin imports No ModuleNotFoundError for plugin packages
No hardcoded paths All paths from YAML config
NPU compatible No CUDA-only API calls (use device.py wrappers)
Code style Follow existing patterns in same directory