Enabling Tools for Common Frameworks

Overview

This document describes how to enable debugging tools in common frameworks, including the dump tool and monitor tool.

Tool Enablement

Reference

Data Dump for PyTorch

Method for Locating the Position to Add Tools

To locate the target position, print the call stack of any API that the model invokes, then find the position in the stack frames.

For example, you can print the call stack of `linear` in either of the following ways:

Method 1: Printing the call stack in torch

Method 2: Replacing the original `linear` in the startup script

import torch
import torch.nn.functional as F
import traceback

# Save the original function.
original_linear = F.linear

# Replace linear in functional and use *args and **kwargs to accommodate all parameters.
def custom_linear(*args, **kwargs):
    print("=" * 50)
    print("Call F.linear. The call stack is as follows:")
    traceback.print_stack()
    # Pass the received parameters to the original function without any modification.
    return original_linear(*args, **kwargs)

F.linear = custom_linear
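
The replacement in Method 2 stays in effect for the whole process, so every later call prints a stack. If only a few calls need to be inspected, the same idea can be wrapped in a context manager that restores the original function afterwards. The following is a minimal sketch under that assumption: `trace_calls` is a hypothetical helper name, not part of PyTorch or of any tool described here, and `fake_functional` is a stand-in for a module such as `torch.nn.functional` so that the example runs without PyTorch installed.

```python
import contextlib
import functools
import traceback
import types


@contextlib.contextmanager
def trace_calls(owner, name, limit=None):
    """Temporarily replace owner.<name> with a wrapper that prints the call stack."""
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        print("=" * 50)
        print(f"Call {name}. The call stack is as follows:")
        traceback.print_stack(limit=limit)
        # Pass the received parameters to the original function without any modification.
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    try:
        yield wrapper
    finally:
        # Restore the original function so that later code is unaffected.
        setattr(owner, name, original)


def fake_linear(x, w):
    # Hypothetical stand-in for F.linear; it only multiplies two numbers.
    return x * w


# Stand-in for a module such as torch.nn.functional.
fake_functional = types.SimpleNamespace(linear=fake_linear)

with trace_calls(fake_functional, "linear"):
    result = fake_functional.linear(2, 3)
```

With PyTorch, the equivalent usage would presumably be `with trace_calls(F, "linear"):` around one forward pass, where `F` is `torch.nn.functional` as imported in the script above.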

Note: Once the call stack is obtained, finding the position to add tools becomes easy.
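
A full call stack can be long, and most frames belong to library internals. One way to narrow it down, sketched here as an assumption rather than as a feature of the tools, is to keep only the frames whose file path contains a keyword such as the framework's package name. `frames_matching` and `training_step` are hypothetical names used only for illustration.

```python
import traceback


def frames_matching(keyword, stack=None):
    """Return (filename, lineno, function) for stack frames whose path contains keyword."""
    if stack is None:
        # Drop the last entry, which is this helper itself.
        stack = traceback.extract_stack()[:-1]
    return [
        (frame.filename, frame.lineno, frame.name)
        for frame in stack
        if keyword in frame.filename
    ]


def training_step():
    # Hypothetical stand-in for a framework function that ends up calling the API.
    # The keyword here is this file's own path, so the frame of training_step matches.
    return frames_matching(training_step.__code__.co_filename)


hits = training_step()
for filename, lineno, function in hits:
    print(f"{filename}:{lineno} in {function}")
```

In practice the keyword would be part of the framework's installation path, and the innermost matching frame is usually a good candidate position for adding the tool.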

Tool Adding Position in Common Frameworks

MindSpeed-LLM

MindSpeed-MM

LLaMA-Factory

accelerate + DeepSpeed

TorchTitan (FSDP2)

verl (FSDP)

Positions where deterministic computing is enabled:

generate_sequences
Note: The preceding enabling mode applies only to the vLLM eager backend. The position and enabling mode may vary depending on the configuration.

update_actor

compute_log_prob

compute_ref_log_prob

Note: The preceding enabling modes are only applicable to the vLLM eager backend. Positions and enabling modes may vary depending on configurations.