AgentToLeaP

Evaluation Framework for LLM Agents

A comprehensive and extensible evaluation framework designed to measure the performance
of LLM agents across a wide range of complex, long-horizon benchmarks.

✨ Features

📊 Multi-Benchmark Support - Out-of-the-box support for GAIA, HLE, BrowseComp, Frames, WebWalkerQA, and more.
🚀 Parallel Execution - High-performance multi-process engine for concurrent task evaluation.
📈 Automated Reporting - Detailed success/fail analysis with reasoning trajectory and automated scoring.
🛠️ MCP Integration - Seamlessly connects with AgentDock for secure tool use and environment interaction.
🧩 Extensible Design - Easily add new datasets and custom evaluation logic with minimal configuration.

🏗️ Architecture

AgentToLeaP/
├── benchmarks/                 # Benchmark-specific configurations and scripts
│   ├── gaia/                  
│   ├── hle/                    
│   └── ...                     # 8+ other integrated benchmarks
├── context/                    # Agent context management
├── run_evaluation.py           # Main entry point for parallel execution
├── evaluate_and_report.py      # Core scoring and report generation logic
├── gaia_report_generator.py    # Specialized report generator for GAIA
├── data_test_copy.py           # Evaluation task implementation
└── browser_processor.py        # Web content purification and processing

🚀 Quick Start

1. Environment Setup (Recommended)

The easiest way to run evaluations is to use our pre-built Docker image, which contains all necessary dependencies:

docker pull yuyangfu/agenttoleap-eval
docker run -dit --name agenttoleap --gpus all --network host -v "$(pwd)":/workspace yuyangfu/agenttoleap-eval
docker exec -it agenttoleap /bin/bash
cd /workspace

2. Run a Benchmark

Navigate to a benchmark directory and execute the provided run.sh script:

cd AgentToLeaP/benchmarks/gaia
# Edit run.sh to configure your API_KEY and MODEL_NAME
bash run.sh

⚙️ Configuration

Evaluations are primarily configured via environment variables in the run.sh scripts.

1. Primary Model Configuration

| Variable | Example | Description |
| --- | --- | --- |
| `MODEL_NAME` | `"Qwen3-4B"` | Name of the model under evaluation (API `model` field) |
| `BASE_URL` | `"https://api.openai.com/v1"` | Primary model API base URL |
| `API_KEY` | `"sk-..."` | Primary model API key |
| `RESULT_DIR_NAME` | `"Qwen3-4B-test-0109"` | Result identifier used to generate the output directory name |

2. Auxiliary Model Configuration

| Variable | Example | Description |
| --- | --- | --- |
| `PROCESSOR_MODEL_NAME` | `"Qwen3-14B"` | Auxiliary model for summarization and long-context processing |
| `PROCESSOR_BASE_URL` | `"..."` | Auxiliary model API base URL |

3. Evaluation Environment

| Variable | Example | Description |
| --- | --- | --- |
| `MANAGER_URL` | `"http://localhost:8000/mcpapi"` | Address of the AgentDock (MCP Manager) service |
| `EVALUATION_ROOT_DIR` | `"/path/to/outputs"` | Root directory for evaluation outputs |
| `FILES_DIR` | `"/path/to/files"` | Directory for benchmark attachments/files |

4. Control & Sampling Parameters

| Variable | Default | Description |
| --- | --- | --- |
| `NUM_PROCESSES` | `10` | Number of concurrent evaluation workers |
| `MAX_INTERACTIONS` | `50` | Maximum interaction turns per task |
| `USE_LLM_JUDGE` | `"true"` | Whether to use an LLM as the judge (recommended) |
| `PASS_K` | `8` | Number of sampling runs per task for Pass@k |
| `TEMPERATURE` | `1.0` | Sampling temperature |
| `TOP_P` | `1.0` | Top-p (nucleus) sampling threshold |
| `MAX_TOKENS` | `16384` | Maximum tokens per generation |
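
For reference, the variables above could be read from the environment roughly as follows. This is a minimal sketch, not the framework's actual loading code: the `EvalConfig` class and `load_config` function are hypothetical names, and the string-to-boolean handling of `USE_LLM_JUDGE` is an assumption. Only the variable names and defaults come from the tables above.

```python
import os
from dataclasses import dataclass


@dataclass
class EvalConfig:
    """Hypothetical container; the framework's real config handling may differ."""
    model_name: str
    base_url: str
    api_key: str
    num_processes: int
    max_interactions: int
    use_llm_judge: bool
    pass_k: int
    temperature: float
    top_p: float
    max_tokens: int


def load_config(env=os.environ) -> EvalConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return EvalConfig(
        # No default is documented for these three, so a missing one raises KeyError.
        model_name=env["MODEL_NAME"],
        base_url=env["BASE_URL"],
        api_key=env["API_KEY"],
        num_processes=int(env.get("NUM_PROCESSES", "10")),
        max_interactions=int(env.get("MAX_INTERACTIONS", "50")),
        # Assumption: any casing of "true" enables the LLM judge.
        use_llm_judge=env.get("USE_LLM_JUDGE", "true").strip().lower() == "true",
        pass_k=int(env.get("PASS_K", "8")),
        temperature=float(env.get("TEMPERATURE", "1.0")),
        top_p=float(env.get("TOP_P", "1.0")),
        max_tokens=int(env.get("MAX_TOKENS", "16384")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before launching a long run.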

📊 Results & Reports

After evaluation, results are saved in the directory specified by EVALUATION_ROOT_DIR.

Directory Structure

evaluation_outputs/ (EVALUATION_ROOT_DIR)
├── _temp_raw_outputs/                      # [All tasks] Raw evaluation logs
│   └── gaia_Qwen3-4B-test-0109/            # Named as ${BENCHMARK}_${RESULT_DIR_NAME}
│       ├── task_id_1/
│       │   ├── dialog.json                 # Model dialogue trajectory
│       │   ├── result.json                 # Complete task result
│       │   └── trace.json                  # Execution trace
│
├── [Benchmark Specific Reports]            # Success/fail analysis and MD reports
│   └── Qwen3-4B-test-0109/
│       ├── success/                        # Details of correctly answered tasks
│       ├── fail/                           # Details of incorrectly answered tasks
│       └── *_main_report.md                # Human-readable summary report

dialog.json: Full model dialogue, including thoughts and tool calls.
result.json: Final output and scoring result for each task.
trace.json: Execution trace for each task.
*_report.md: Detailed success/fail analysis with reasoning trajectory.
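
To inspect raw outputs programmatically, something like the following could gather every result.json for one run. It is a sketch that assumes only the directory layout shown above (`_temp_raw_outputs/${BENCHMARK}_${RESULT_DIR_NAME}/<task>/result.json`); the `collect_results` helper is hypothetical, and the contents of result.json are returned as-is because their schema is not documented here.

```python
import json
from pathlib import Path


def collect_results(root_dir, benchmark, result_dir_name):
    """Return {task_dir_name: parsed result.json} for one evaluation run."""
    run_dir = Path(root_dir) / "_temp_raw_outputs" / f"{benchmark}_{result_dir_name}"
    if not run_dir.is_dir():
        return {}  # nothing has been written for this run yet
    results = {}
    for result_file in sorted(run_dir.glob("*/result.json")):
        with result_file.open(encoding="utf-8") as fh:
            results[result_file.parent.name] = json.load(fh)
    return results
```

For the example tree above, `collect_results("evaluation_outputs", "gaia", "Qwen3-4B-test-0109")` would return one entry per task directory.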

➕ Adding a Custom Benchmark

This framework is designed to be easily extensible. To add a new evaluation dataset:

1. Create a directory: Create a new folder under benchmarks/, for example my_custom_bench.
2. Prepare the data: Inside this folder, create a .jsonl file with the same name (e.g., my_custom_bench.jsonl).

Data format: Each line must be a JSON object containing the following fields:

| Field Name | Type | Description |
| --- | --- | --- |
| `task_id` | String / Int | Unique identifier for the task |
| `Question` | String | The complete question or instruction sent to the model |
| `Final answer` | String / Num | The reference answer used for automated evaluation |

Example (my_custom_bench.jsonl):

{"task_id": 1, "Question": "What is 1 + 1?", "Final answer": "2"}

3. Configure the script: Copy the run.sh file from any existing benchmark (like gaia) into the new directory. Adjust the environment variables to point to your new data, and you're ready to run.
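
Before running a new benchmark, it can help to check the .jsonl file against the required fields. The following is a minimal sketch and not part of the framework: `validate_benchmark_lines` is a hypothetical helper, and skipping blank lines and flagging duplicate `task_id` values are assumptions based on the field descriptions above.

```python
import json

REQUIRED_FIELDS = ("task_id", "Question", "Final answer")


def validate_benchmark_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means valid."""
    problems = []
    seen_ids = set()
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # assumption: blank lines are tolerated
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((lineno, f"missing fields: {', '.join(missing)}"))
        if "task_id" in record:
            # Serialize the id so that 1 and "1" stay distinct and any JSON value is hashable.
            key = json.dumps(record["task_id"])
            if key in seen_ids:
                problems.append((lineno, f"duplicate task_id: {key}"))
            seen_ids.add(key)
    return problems
```

Usage: open the file and pass the file object directly, for example `validate_benchmark_lines(open("my_custom_bench.jsonl", encoding="utf-8"))`, then fix any reported lines before calling run.sh.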

📄 License

This module is part of the AgentCPM-Explore project and is released under the Apache-2.0 license.