Enabling Tools for Common Frameworks

Overview

This document describes how to enable debugging tools in common frameworks, including the dump tool and monitor tool.

Tool Enablement

Reference

Locating the Position to Add Tools

To locate the position where a tool should be added, print the call stack of any API that is invoked during the run and read the position off that call stack.

For example, you can print the call stack of linear in either of the following ways:

Method 1: Printing the call stack in torch

image.png

Method 2: Replacing the original linear in the startup script

import torch
import torch.nn.functional as F
import traceback

# Save the original function so the wrapper can still call it.
original_linear = F.linear

# Wrapper that replaces F.linear; *args and **kwargs accept any call signature.
def custom_linear(*args, **kwargs):
    print("=" * 50)
    print("F.linear called. The call stack is as follows:")
    traceback.print_stack()
    # Forward the received arguments to the original function unchanged.
    return original_linear(*args, **kwargs)

F.linear = custom_linear
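
Because linear is called many times per step, the replacement above prints a stack on every call. The following is a minimal sketch of a reusable variant that prints only the first few stacks and can undo the replacement afterwards. The names trace_calls, max_prints, and restore are hypothetical helpers introduced for illustration; they are not part of torch or of any tool described here.

```python
import functools
import traceback


def trace_calls(owner, name, max_prints=1):
    """Wrap owner.<name> so its call stack is printed for the first max_prints calls.

    Hypothetical helper: returns a zero-argument function that restores the original.
    """
    original = getattr(owner, name)
    remaining = max_prints

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            print("=" * 50)
            print(f"{name} called. The call stack is as follows:")
            traceback.print_stack()
        # Forward all arguments to the original callable unchanged.
        return original(*args, **kwargs)

    def restore():
        # Put the original callable back on the owner.
        setattr(owner, name, original)

    setattr(owner, name, wrapper)
    return restore
```

In the startup script this would be used as restore = trace_calls(F, "linear") after import torch.nn.functional as F, with restore() called once the stack has been captured. As with the direct replacement, only callers that look up linear through the patched module at call time are affected.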

Note: Once the call stack is obtained, finding the position to add tools becomes easy.

image.png
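
The printed stack also contains frames from inside torch itself, which can make the framework-level frames harder to spot. One possible way to narrow it down is to filter the stack by file path, sketched below. The function name filtered_stack and the default path fragments are assumptions for illustration; adjust the fragments to match the local installation.

```python
import traceback


def filtered_stack(skip=("/site-packages/torch/", "/dist-packages/torch/")):
    """Return the caller's stack frames, outermost first, without frames from skipped paths.

    Hypothetical helper: a frame is dropped when its file path contains any entry of skip.
    """
    frames = traceback.extract_stack()[:-1]  # drop this helper's own frame
    kept = []
    for frame in frames:
        path = frame.filename.replace("\\", "/")  # normalize Windows separators
        if not any(fragment in path for fragment in skip):
            kept.append(frame)
    return kept


def format_frames(frames):
    """Render frames as 'file:line in function' lines for printing."""
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in frames]
```

Inside a replacement function such as custom_linear, print("\n".join(format_frames(filtered_stack()))) could be used in place of traceback.print_stack(); the remaining frames then belong to the training framework and the startup script.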

Positions for Adding Tools in Common Frameworks

MindSpeed-LLM

image.png

MindSpeed-MM

image.png

LLaMA-Factory

image.png

accelerate + DeepSpeed

image.png

TorchTitan (FSDP2)

image.png

verl (FSDP)

Positions where deterministic computing is enabled: image.png

generate_sequences image.png
This enabling mode applies only to the vLLM eager backend, and the position may vary depending on the configuration.

update_actor image.png

compute_log_prob image.png

compute_ref_log_prob image.png

Note: The enabling modes above apply only to the vLLM eager backend. Positions and enabling modes may vary depending on the configuration.