
Trace System

1. Overview

The Trace System provides tree-based inference tracing for AKG Agents. It records the entire reasoning and execution process of agents, supporting:

Tree structure: Single tree with multi-fork exploration
Node switching: Navigate to any node in the trace tree
Incremental history: Action records accumulated per node
Checkpoint resume: Persist state and resume from any point
Visualization: Text and Rich terminal visualization
Fork & merge: Fork at ask_user nodes, merge branches, parallel exploration
Blame: Track which node introduced each line of generated code

2. Core Concepts

Trace Tree

The entire trace is a single tree; there is no separate "branch" concept. A node can have multiple children (forks). Performing a new action on a node that already has children automatically creates an additional child node, which starts a new fork at that point.
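The fork rule can be illustrated with a toy model. This is a minimal sketch, not the AKG Agents implementation: `MiniTree`, `Node`, and the sequential ID scheme are hypothetical. Every action becomes a child of the current node, so switching back to a node that already has a child and acting again produces a fork.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    parent: Optional[str]
    action: Optional[dict] = None
    children: List[str] = field(default_factory=list)


class MiniTree:
    """Hypothetical single-tree model: each action becomes a child of the current node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {"root": Node("root", None)}
        self.current = "root"
        self._counter = 0

    def add_node(self, action: dict) -> str:
        # Sequential IDs mimic the node_001, node_002, ... naming used in these docs.
        self._counter += 1
        node_id = f"node_{self._counter:03d}"
        self.nodes[node_id] = Node(node_id, self.current, action)
        # If the current node already has children, this append creates a fork.
        self.nodes[self.current].children.append(node_id)
        self.current = node_id
        return node_id

    def switch_node(self, node_id: str) -> None:
        if node_id not in self.nodes:
            raise KeyError(node_id)
        self.current = node_id


tree = MiniTree()
a = tree.add_node({"type": "design"})
b = tree.add_node({"type": "verify_kernel"})
tree.switch_node(a)                            # go back to node_001
c = tree.add_node({"type": "verify_kernel"})   # node_001 now has two children
print(tree.nodes[a].children)  # ['node_002', 'node_003']
```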

Data Models

Model	Description
`TraceTree`	Root data structure containing all nodes and metadata
`TraceNode`	A single node in the tree (agent info, task info, state, children)
`NodeState`	Per-node state snapshot
`ActionRecord`	A single action record (tool call + result)
`ActionHistoryFact`	Factual action history accumulated per node
`ActionHistoryCompressed`	Compressed action history for LLM context
`AgentInfo`	Agent metadata (name, id)
`TaskInfo`	Task metadata (task_id, input, domain, custom metadata)
`ExecutionInfo`	Execution counters (tool calls, turns)
`FileInfo`	File metadata for code snapshots
`Metrics`	Performance and quality metrics for a node
`ThinkingState`	Plan and decision state (thinking.json)
`PlanState`	Structured plan state within ThinkingState
`DecisionRecord`	A single thinking/decision record
`PendingTool`	A pending tool call awaiting completion
`PendingToolsState`	Collection of pending tools (pending_tools.json)

3. TraceSystem

TraceSystem is the main entry point for trace management.

Constructor

trace = TraceSystem(task_id="my_task", base_dir="~/.akg")

Initialization

# Initialize (creates trace.json and root node)
trace.initialize(task_input="Generate a relu kernel")

# Force re-initialization (overwrites existing trace)
trace.initialize(task_input="Generate a relu kernel", force=True)

# Resume existing trace (auto-loads if trace.json exists)
trace.initialize(force=False)

Node Operations

# Add a new node (returns node_id)
node_id = trace.add_node(
    action={"type": "verify_kernel", "arguments": {"op_name": "relu"}},
    result={"status": "success", "output": "..."},
    metrics={"performance": 1.5},          # optional
    state_snapshot={"iteration": 3},       # optional
)

# Get current node ID
current = trace.get_current_node()

# Get a node object
node = trace.get_node(node_id)

# Switch to a different node
trace.switch_node(node_id)

# Get path from root to a node
path = trace.get_path_to_node(node_id)

# Get node depth in the tree
depth = trace.get_node_depth(node_id)
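Path and depth queries only need parent links. A minimal sketch, assuming a plain parent map and a depth convention where the root has depth 0 (both are assumptions; `PARENTS` and these helper functions are hypothetical, not the `TraceSystem` methods themselves):

```python
from typing import Dict, List, Optional

# Hypothetical parent map: node_id -> parent node_id (the root has no parent).
PARENTS: Dict[str, Optional[str]] = {
    "root": None,
    "node_001": "root",
    "node_002": "node_001",
    "node_003": "node_001",
    "node_004": "node_003",
}


def get_path_to_node(node_id: str, parents: Dict[str, Optional[str]] = PARENTS) -> List[str]:
    """Walk parent links up to the root, then reverse to get root -> node."""
    if node_id not in parents:
        raise KeyError(f"unknown node: {node_id}")
    path = []
    current: Optional[str] = node_id
    while current is not None:
        path.append(current)
        current = parents[current]
    return list(reversed(path))


def get_node_depth(node_id: str, parents: Dict[str, Optional[str]] = PARENTS) -> int:
    # Depth counts edges from the root, so the root itself has depth 0 (assumed convention).
    return len(get_path_to_node(node_id, parents)) - 1


print(get_path_to_node("node_004"))  # ['root', 'node_001', 'node_003', 'node_004']
print(get_node_depth("node_004"))    # 3
```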

Action History

# Get full action history for a node (includes all ancestors)
history = trace.get_full_action_history(node_id)

# Get compressed history for LLM context (async, requires LLM client)
compressed = await trace.get_compressed_history_for_llm(
    llm_client, node_id, max_tokens=2000
)
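Because history is incremental, each node only has to store its own actions; the full history for a node can be rebuilt by concatenating the per-node records along the root-to-node path. A minimal sketch under that assumption (`PARENTS`, `LOCAL_ACTIONS`, and the `"tool"` key are hypothetical):

```python
from typing import Dict, List, Optional

# Hypothetical data: each node stores only the actions recorded on that node.
PARENTS: Dict[str, Optional[str]] = {"root": None, "node_001": "root", "node_002": "node_001"}
LOCAL_ACTIONS: Dict[str, List[dict]] = {
    "root": [],
    "node_001": [{"tool": "write_code", "result": "ok"}],
    "node_002": [{"tool": "verify_kernel", "result": "success"}],
}


def full_action_history(node_id: str) -> List[dict]:
    """Concatenate per-node actions along the root -> node path (oldest first)."""
    chain = []
    current: Optional[str] = node_id
    while current is not None:
        chain.append(current)
        current = PARENTS[current]
    history: List[dict] = []
    for nid in reversed(chain):
        history.extend(LOCAL_ACTIONS[nid])
    return history


print([a["tool"] for a in full_action_history("node_002")])
# ['write_code', 'verify_kernel']
```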

Fork & Merge

# Fork at an ask_user node (creates new child with same question, empty answer)
new_node_id = trace.fork_ask_user(node_id)

# Create parallel exploration forks
fork_ids = trace.create_parallel_forks(n=3, action_template={"type": "explore"})

# Complete a fork with results
trace.complete_fork(fork_id, result={"status": "success"}, metrics={...})

# Merge two branches (three-way merge of code files)
merged_node_id = trace.merge_nodes(target_node_id, source_node_id)
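A three-way merge compares both branches against a common base; a natural base for two trace nodes is their lowest common ancestor. The sketch below is a simplified, file-level version under that assumption: it decides per file rather than per line, so it is coarser than what `merge_nodes` may actually do, and `three_way_merge` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Files = Dict[str, str]  # filename -> file content


def three_way_merge(base: Files, ours: Files, theirs: Files) -> Tuple[Files, List[str]]:
    """File-level three-way merge; returns (merged files, conflicting filenames)."""
    merged: Files = {}
    conflicts: List[str] = []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:            # both sides agree (including both deleted)
            result = o
        elif o == b:          # only "theirs" changed the file
            result = t
        elif t == b:          # only "ours" changed the file
            result = o
        else:                 # both changed it differently: keep ours and flag a conflict
            result = o
            conflicts.append(name)
        if result is not None:
            merged[name] = result
    return merged, conflicts


base = {"kernel.py": "v1", "test.py": "t1"}
ours = {"kernel.py": "v2", "test.py": "t1"}
theirs = {"kernel.py": "v1", "test.py": "t2", "bench.py": "b1"}
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # {'bench.py': 'b1', 'kernel.py': 'v2', 'test.py': 't2'}
print(conflicts)  # []
```

Each side's change survives because the other side left that file identical to the base; only files edited on both sides become conflicts.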

Node Status

# Mark a node as completed
trace.mark_node_completed(node_id, metrics={"performance": 1.5})

# Mark a node as failed
trace.mark_node_failed(node_id, error="Verification timeout")

# Update a node's result
trace.update_node_result(node_id, result={...}, metrics={...})

Tree Queries

# Get all leaf nodes
leaves = trace.get_all_leaf_nodes()

# Get the best leaf node by a metric
best = trace.get_best_leaf_node(metric="performance")

# Compare two nodes
diff = trace.compare_nodes(node_id1, node_id2)

# Find lowest common ancestor
lca = trace.find_lca(node_id1, node_id2)
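The tree queries can be sketched over a parent map plus per-node metrics. The docs do not say whether a higher or lower `performance` value is better, so `higher_is_better` below is a hypothetical parameter; `PARENTS`, `METRICS`, and the helper functions are likewise illustrative, not the `TraceSystem` API.

```python
from typing import Dict, List, Optional

PARENTS: Dict[str, Optional[str]] = {
    "root": None,
    "node_001": "root",
    "node_002": "node_001",
    "node_003": "node_001",
    "node_004": "root",
}
METRICS: Dict[str, Dict[str, float]] = {
    "node_002": {"performance": 1.5},
    "node_003": {"performance": 2.1},
    "node_004": {},  # e.g. a failed leaf with no metrics
}


def path_to(node_id: str) -> List[str]:
    path = []
    cur: Optional[str] = node_id
    while cur is not None:
        path.append(cur)
        cur = PARENTS[cur]
    return path[::-1]


def leaf_nodes() -> List[str]:
    # A leaf is any node that never appears as someone's parent.
    has_children = {p for p in PARENTS.values() if p is not None}
    return sorted(n for n in PARENTS if n not in has_children)


def best_leaf(metric: str, higher_is_better: bool = True) -> Optional[str]:
    # Leaves without the metric are skipped rather than treated as zero.
    scored = [(METRICS[n][metric], n) for n in leaf_nodes() if metric in METRICS.get(n, {})]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored)[1]


def find_lca(a: str, b: str) -> str:
    # The LCA is the last node shared by both root -> node paths.
    lca = "root"
    for x, y in zip(path_to(a), path_to(b)):
        if x != y:
            break
        lca = x
    return lca


print(leaf_nodes())                        # ['node_002', 'node_003', 'node_004']
print(best_leaf("performance"))            # node_003
print(find_lca("node_002", "node_003"))    # node_001
```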

Code Blame

# Blame: track which node introduced each line
blame_info = trace.blame_file(node_id, "kernel.py")

# Blame all files at a node
all_blame = trace.blame_all_files(node_id)
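One plausible way to compute blame is to walk the root-to-node path and diff each snapshot of a file against the previous one: unchanged lines keep their earlier origin, and inserted or replaced lines are attributed to the current node. The sketch below assumes that approach and uses the standard-library `difflib`; the snapshot data and the `(node_id, line)` output shape are hypothetical, not the format `blame_file` returns.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical snapshots of kernel.py along one root -> node path.
SNAPSHOTS: List[Tuple[str, Optional[str]]] = [
    ("root", None),  # file does not exist yet
    ("node_001", "def relu(x):\n    return x\n"),
    ("node_002", "import numpy as np\ndef relu(x):\n    return np.maximum(x, 0)\n"),
]


def blame(snapshots: List[Tuple[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Return (node_id, line) pairs for the last snapshot on the path."""
    blamed: List[Tuple[str, str]] = []
    for node_id, content in snapshots:
        if content is None:
            blamed = []  # file absent (or deleted) at this node
            continue
        new_lines = content.splitlines()
        old_lines = [line for _, line in blamed]
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        updated: List[Tuple[str, str]] = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                updated.extend(blamed[i1:i2])  # unchanged lines keep their origin
            else:
                # Inserted or replaced lines are attributed to this node;
                # "delete" opcodes have j1 == j2, so they add nothing.
                updated.extend((node_id, line) for line in new_lines[j1:j2])
        blamed = updated
    return blamed


for origin, line in blame(SNAPSHOTS):
    print(f"{origin:10} {line}")
```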

Resume

# Get resume info for checkpoint recovery
resume_info = trace.get_resume_info()

4. FileSystemState

FileSystemState handles persistence of trace data to the filesystem.

~/.akg/tasks/{task_id}/
├── trace.json              # Trace tree structure
├── .traceconfig            # Trace configuration
└── nodes/
    ├── root/
    │   ├── state.json      # Node state
    │   ├── result.json     # Action results
    │   ├── action_history_fact.json
    │   ├── thinking.json   # Plan/decision state
    │   ├── pending_tools.json
    │   ├── code/           # Code snapshots (CoW)
    │   ├── logs/           # Verification logs, etc.
    │   └── system_prompts/ # Per-turn system prompts
    ├── node_001/
    │   ├── state.json
    │   ├── result.json
    │   └── ...
    └── ...
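The layout above maps each node to its own directory under `tasks/{task_id}/nodes/`. The following sketch mirrors that layout for `state.json` only; the standalone `save_node_state`/`load_node_state` functions and their extra `base_dir`/`task_id` parameters are hypothetical stand-ins, not the `FileSystemState` method signatures.

```python
import json
import tempfile
from pathlib import Path


def task_dir(base_dir: str, task_id: str) -> Path:
    # Mirrors the documented layout: <base_dir>/tasks/{task_id}/
    return Path(base_dir).expanduser() / "tasks" / task_id


def save_node_state(base_dir: str, task_id: str, node_id: str, state: dict) -> Path:
    path = task_dir(base_dir, task_id) / "nodes" / node_id / "state.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def load_node_state(base_dir: str, task_id: str, node_id: str) -> dict:
    path = task_dir(base_dir, task_id) / "nodes" / node_id / "state.json"
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    saved = save_node_state(tmp, "my_task", "node_001", {"iteration": 3})
    print(saved.relative_to(tmp).as_posix())          # tasks/my_task/nodes/node_001/state.json
    print(load_node_state(tmp, "my_task", "node_001"))  # {'iteration': 3}
```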

Key Methods

Method	Description
`save_node_state(node_id, state)`	Persist node state to `state.json`
`load_node_state(node_id)`	Load node state from `state.json`
`append_action(node_id, action)`	Append an action record to the node's history
`save_code_file(node_id, filename, content)`	Save a code file snapshot (CoW)
`load_code_file(node_id, filename)`	Load a code file from a node's snapshot
`diff_nodes(node_a, node_b)`	Generate a diff between two nodes' code
`copy_node_state(from_node, to_node)`	Copy state between nodes
`save_system_prompt(node_id, turn, prompt)`	Save per-turn system prompt
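The layout marks `code/` as copy-on-write (CoW) but does not spell out the mechanics. A minimal sketch, assuming CoW means a node stores only the files it changes and reads fall back to the nearest ancestor that has a copy; the in-memory `CODE` dict and parent map are hypothetical stand-ins for the per-node `code/` directories.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for nodes/<id>/code/ directories.
PARENTS: Dict[str, Optional[str]] = {"root": None, "node_001": "root", "node_002": "node_001"}
CODE: Dict[str, Dict[str, str]] = {nid: {} for nid in PARENTS}


def save_code_file(node_id: str, filename: str, content: str) -> None:
    # Copy-on-write: a node stores only the files it actually changes.
    CODE[node_id][filename] = content


def load_code_file(node_id: str, filename: str) -> Optional[str]:
    # Walk toward the root until some ancestor has a copy of the file.
    current: Optional[str] = node_id
    while current is not None:
        if filename in CODE[current]:
            return CODE[current][filename]
        current = PARENTS[current]
    return None


save_code_file("node_001", "kernel.py", "v1")
save_code_file("node_002", "test.py", "t1")
print(load_code_file("node_002", "kernel.py"))  # v1 (inherited from node_001)
print(CODE["node_002"])                          # {'test.py': 't1'} (no copy of kernel.py)
```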

5. ActionCompressor

ActionCompressor compresses action history to fit within LLM context windows. It summarizes older actions while preserving recent ones in full detail.

from akg_agents.core_v2.filesystem import ActionCompressor

compressor = ActionCompressor(llm_client)
compressed = await compressor.compress_history(action_history, max_tokens=4000)

Note: ActionCompressor requires an LLMClient instance for summarization.
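The "keep recent actions in full, summarize older ones" strategy can be sketched without an LLM. In this hypothetical version the summarizer is a plain callable standing in for the LLM call, and a word count approximates token count; neither reflects how `ActionCompressor` actually counts tokens or summarizes.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compress_history(
    actions: List[str],
    max_tokens: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Keep the newest actions verbatim; fold everything older into one summary.

    The summary itself is not counted against max_tokens in this sketch.
    """
    kept: List[str] = []
    used = 0
    for action in reversed(actions):           # newest first
        cost = approx_tokens(action)
        if used + cost > max_tokens:
            break
        kept.append(action)
        used += cost
    kept.reverse()                             # restore chronological order
    older = actions[: len(actions) - len(kept)]
    if not older:
        return kept
    return [summarize(older)] + kept


def fake_summarize(old: List[str]) -> str:
    # Placeholder for an LLM summarization call.
    return f"[summary of {len(old)} earlier actions]"


history = [
    "write_code kernel.py v1",
    "verify_kernel failed shape mismatch",
    "write_code kernel.py v2",
    "verify_kernel success",
]
print(compress_history(history, max_tokens=6, summarize=fake_summarize))
# ['[summary of 2 earlier actions]', 'write_code kernel.py v2', 'verify_kernel success']
```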

6. Visualization

Text Visualization

from akg_agents.core_v2.filesystem.trace_visualizer import visualize_text

text = visualize_text(trace, focus_node="node_005", depth=4)
print(text)

Rich Terminal Visualization

from akg_agents.core_v2.filesystem.trace_visualizer import visualize_rich

rich_text = visualize_rich(trace, focus_node="node_005", depth=4)

Node Detail

from akg_agents.core_v2.filesystem.trace_visualizer import format_node_detail_rich

detail = format_node_detail_rich(trace, "node_003")
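For intuition about depth-limited tree rendering, here is a small top-down text renderer. It is a hypothetical sketch, not `visualize_text`: it ignores path-centered views around `focus_node`, marks the current node with `*`, and its output format is illustrative only.

```python
from typing import Dict, List

# Hypothetical child map for a small trace tree.
CHILDREN: Dict[str, List[str]] = {
    "root": ["node_001", "node_004"],
    "node_001": ["node_002", "node_003"],
    "node_002": [],
    "node_003": [],
    "node_004": [],
}


def render(node: str, current: str, depth: int, prefix: str = "") -> List[str]:
    """Render up to `depth` levels below `node`, marking the current node with '*'."""
    lines = [f"{node}{' *' if node == current else ''}"]
    if depth == 0:
        return lines
    kids = CHILDREN.get(node, [])
    for i, child in enumerate(kids):
        last = i == len(kids) - 1
        # Deeper lines get a vertical bar unless this child is the last sibling.
        sub = render(child, current, depth - 1, prefix + ("    " if last else "│   "))
        lines.append(prefix + ("└── " if last else "├── ") + sub[0])
        lines.extend(sub[1:])
    return lines


print("\n".join(render("root", current="node_003", depth=2)))
```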

7. CLI: /trace Command

In akg_cli interactive mode, the /trace slash command (alias: /t) provides trace tree inspection and forking capabilities.

Usage

/trace [<id>|root|show <id>|node <id>|history|fork <id>] [-n depth]

Subcommands

Subcommand	Example	Description
(none)	`/trace`	Show path view centered on current node (default depth: 4)
`root`	`/trace root`	Show tree from root node downward
`<id>`	`/trace 005`	Show path view centered on the specified node (`005` is auto-expanded to `node_005`)
`-n <N>`	`/trace -n 8`	Set display depth to N
`show <id>`	`/trace show 003`	Show detailed info for a node (action, arguments, result)
`node <id>`	`/trace node 003`	Same as `show`
`history`	`/trace history`	Show full action history of the current node
`fork <id>`	`/trace fork 005`	Fork at an `ask_user` node to provide a different answer
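The argument rules in the table can be expressed as a small parser. This is a hypothetical sketch, not akg_cli's implementation. It assumes that bare digit IDs are only prefixed with `node_` (no zero-padding), that any non-numeric ID passes through unchanged, and that `-n` may appear anywhere in the argument list.

```python
from typing import List, Optional, Tuple

DEFAULT_DEPTH = 4  # default depth stated for the /trace path view


def normalize_node_id(raw: str) -> str:
    # Bare digits such as "005" become "node_005" (assumed: prefix only, no padding).
    return f"node_{raw}" if raw.isdigit() else raw


def parse_trace_args(args: List[str]) -> Tuple[str, Optional[str], int]:
    """Return (subcommand, node_id, depth) for a /trace argument list."""
    depth = DEFAULT_DEPTH
    rest: List[str] = []
    i = 0
    while i < len(args):
        if args[i] == "-n" and i + 1 < len(args):
            depth = int(args[i + 1])
            i += 2
        else:
            rest.append(args[i])
            i += 1
    if not rest:
        return ("path", None, depth)          # centered on the current node
    head = rest[0]
    if head == "root":
        return ("root", "root", depth)
    if head == "history":
        return ("history", None, depth)
    if head in ("show", "node", "fork"):
        if len(rest) < 2:
            raise ValueError(f"/trace {head} requires a node id")
        sub = "show" if head == "node" else head  # `node` is an alias of `show`
        return (sub, normalize_node_id(rest[1]), depth)
    return ("path", normalize_node_id(head), depth)


print(parse_trace_args("root -n 8".split()))  # ('root', 'root', 8)
print(parse_trace_args("show 003".split()))   # ('show', 'node_003', 4)
```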

Fork Behavior

/trace fork <id> only works on nodes of type ask_user. When executed, it:

Copies the original ask_user node's action (the question) and clears the user response
Creates a new child node under the same parent
Updates the agent's current node and action history
Re-displays the original question and waits for a new answer
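The first two fork steps (copy the question, clear the answer, attach a sibling under the same parent) can be sketched as follows. The `Node` dataclass, the `result` field holding the answer, and the caller-supplied `new_id` are hypothetical; updating the agent's current node and re-prompting the user are not modeled.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    parent: Optional[str]
    action: dict
    result: dict = field(default_factory=dict)
    children: List[str] = field(default_factory=list)


def fork_ask_user(nodes: Dict[str, Node], node_id: str, new_id: str) -> Node:
    """Create a sibling of an ask_user node with the same question and no answer."""
    original = nodes[node_id]
    if original.action.get("type") != "ask_user":
        raise ValueError(f"{node_id} is not an ask_user node")
    if original.parent is None:
        raise ValueError("cannot fork the root node")
    fork = Node(
        node_id=new_id,
        parent=original.parent,                 # same parent -> sibling branch
        action=copy.deepcopy(original.action),  # reuse the question
        result={},                              # user response cleared
    )
    nodes[new_id] = fork
    nodes[original.parent].children.append(new_id)
    return fork


nodes = {
    "root": Node("root", None, {"type": "start"}, children=["node_005"]),
    "node_005": Node("node_005", "root",
                     {"type": "ask_user", "arguments": {"question": "Use fp16?"}},
                     {"answer": "yes"}),
}
fork = fork_ask_user(nodes, "node_005", "node_006")
print(nodes["root"].children)                            # ['node_005', 'node_006']
print(fork.action["arguments"]["question"], fork.result)  # Use fp16? {}
```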

Examples

# View the trace tree
/trace

# View from root with depth 8
/trace root -n 8

# Inspect node 003
/trace show 003

# Fork at node 005 to try a different answer
/trace fork 005

8. Exceptions

Exception	Description
`FileSystemStateError`	Base exception for filesystem state errors
`NodeNotFoundError`	Node does not exist in the trace
`TraceSystemError`	General trace system error
`InvalidNodeStateError`	Node state is invalid
`TraceNotInitializedError`	Trace has not been initialized
`TraceAlreadyExistsError`	Trace already exists (when force=False)