
AgentToLeaP

Evaluation Framework for LLM Agents

A comprehensive and extensible evaluation framework designed to measure the performance
of LLM agents across a wide range of complex, long-horizon benchmarks.


✨ Features

  • 📊 Multi-Benchmark Support - Out-of-the-box support for GAIA, HLE, BrowseComp, Frames, WebWalkerQA, and more.
  • 🚀 Parallel Execution - High-performance multi-process engine for concurrent task evaluation.
  • 📈 Automated Reporting - Detailed success/fail analysis with reasoning trajectory and automated scoring.
  • 🛠️ MCP Integration - Seamlessly connects with AgentDock for secure tool use and environment interaction.
  • 🧩 Extensible Design - Easily add new datasets and custom evaluation logic with minimal configuration.

🏗️ Architecture

AgentToLeaP/
├── benchmarks/                 # Benchmark-specific configurations and scripts
│   ├── gaia/                  
│   ├── hle/                    
│   └── ...                     # 8+ other integrated benchmarks
├── context/                    # Agent context management
├── run_evaluation.py           # Main entry point for parallel execution
├── evaluate_and_report.py      # Core scoring and report generation logic
├── gaia_report_generator.py    # Specialized report generator for GAIA
├── data_test_copy.py           # Evaluation task implementation
└── browser_processor.py        # Web content purification and processing

🚀 Quick Start

1. Set Up the Environment

The easiest way to run evaluations is to use our pre-built Docker image, which contains all necessary dependencies:

docker pull yuyangfu/agenttoleap-eval
docker run -dit --name agenttoleap --gpus all --network host -v $(pwd):/workspace yuyangfu/agenttoleap-eval
docker exec -it agenttoleap /bin/bash
cd /workspace

2. Run a Benchmark

Navigate to a benchmark directory and execute the provided run.sh script:

cd AgentToLeaP/benchmarks/gaia
# Edit run.sh to configure your API_KEY and MODEL_NAME
bash run.sh

⚙️ Configuration

Evaluations are primarily configured via environment variables in the run.sh scripts.

1. Primary Model Configuration

| Variable | Example | Description |
|---|---|---|
| MODEL_NAME | "Qwen3-4B" | Name of the model under evaluation (API model field) |
| BASE_URL | "https://api.openai.com/v1" | Primary model API base URL |
| API_KEY | "sk-..." | Primary model API key |
| RESULT_DIR_NAME | "Qwen3-4B-test-0109" | Result identifier used to generate output directory name |

2. Auxiliary Model Configuration

| Variable | Example | Description |
|---|---|---|
| PROCESSOR_MODEL_NAME | "Qwen3-14B" | Auxiliary model for summarization and long-context processing |
| PROCESSOR_BASE_URL | "..." | Auxiliary model API base URL |

3. Evaluation Environment

| Variable | Example | Description |
|---|---|---|
| MANAGER_URL | "http://localhost:8000/mcpapi" | Address of the AgentDock (MCP Manager) service |
| EVALUATION_ROOT_DIR | "/path/to/outputs" | Root directory for evaluation outputs |
| FILES_DIR | "/path/to/files" | Directory for benchmark attachments/files |
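
A missing variable may only surface once an evaluation is already running, so it can help to check the environment before launching. The following is a minimal sketch, not part of the framework: `missing_settings` is a hypothetical helper, and treating every variable from sections 1 to 3 as mandatory is an assumption, since this README does not say which ones are optional.

```python
import os

# Variable names come from the configuration sections above. Treating all of
# them as mandatory is an assumption made for this sketch.
REQUIRED_VARS = (
    "MODEL_NAME", "BASE_URL", "API_KEY", "RESULT_DIR_NAME",
    "PROCESSOR_MODEL_NAME", "PROCESSOR_BASE_URL",
    "MANAGER_URL", "EVALUATION_ROOT_DIR", "FILES_DIR",
)


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_settings()` with no arguments inspects the current process environment; an empty list means every listed variable has a non-blank value.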

4. Control & Sampling Parameters

| Variable | Default | Description |
|---|---|---|
| NUM_PROCESSES | 10 | Number of concurrent evaluation workers |
| MAX_INTERACTIONS | 50 | Maximum interaction turns per task |
| USE_LLM_JUDGE | "true" | Whether to use an LLM as the judge (recommended) |
| PASS_K | 8 | Pass@k sampling runs |
| TEMPERATURE | 1.0 | Sampling temperature |
| TOP_P | 1.0 | Top-p sampling |
| MAX_TOKENS | 16384 | Maximum tokens per generation |
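
This README does not describe how the PASS_K runs are aggregated into a score. One common convention, shown here as an assumption rather than as the framework's actual scoring logic, counts a task as solved when at least one of its k runs is judged correct and reports the fraction of solved tasks. The function name `pass_at_k` and the input layout are hypothetical.

```python
def pass_at_k(outcomes):
    """Fraction of tasks with at least one correct run.

    `outcomes` maps a task id to a list of booleans, one per sampling run
    (for example, PASS_K = 8 entries per task).
    """
    if not outcomes:
        raise ValueError("no tasks to score")
    solved = sum(1 for runs in outcomes.values() if any(runs))
    return solved / len(outcomes)


example = {
    "task_1": [False, True, False],   # solved on the second run
    "task_2": [False, False, False],  # never solved
}
print(pass_at_k(example))  # prints 0.5
```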

📊 Results & Reports

After evaluation, results are saved in the directory specified by EVALUATION_ROOT_DIR.

Directory Structure

evaluation_outputs/ (EVALUATION_ROOT_DIR)
├── _temp_raw_outputs/                      # [All tasks] Raw evaluation logs
│   └── gaia_Qwen3-4B-test-0109/            # Named as ${BENCHMARK}_${RESULT_DIR_NAME}
│       ├── task_id_1/
│       │   ├── dialog.json                 # Model dialogue trajectory
│       │   ├── result.json                 # Complete task result
│       │   └── trace.json                  # Execution trace
│
├── [Benchmark Specific Reports]            # Success/fail analysis and MD reports
│   └── Qwen3-4B-test-0109/
│       ├── success/                        # Details of correctly answered tasks
│       ├── fail/                           # Details of incorrectly answered tasks
│       └── *_main_report.md                # Human-readable summary report

  • dialog.json: Full interaction trace including thoughts and tool calls.
  • result.json: Final output and scoring result for each task.
  • *_report.md: Detailed success/fail analysis with reasoning trajectory.
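
As a quick sanity check alongside the generated report, the entries under success/ and fail/ can be counted directly. This is a rough sketch that assumes each folder holds exactly one entry per task, which this README does not confirm; `summarize_run` is a hypothetical helper, not part of the framework.

```python
from pathlib import Path


def summarize_run(report_dir):
    """Count entries under success/ and fail/ and derive an accuracy."""
    report_dir = Path(report_dir)
    counts = {}
    for name in ("success", "fail"):
        folder = report_dir / name
        # A missing folder is treated as zero tasks.
        counts[name] = sum(1 for _ in folder.iterdir()) if folder.is_dir() else 0
    total = counts["success"] + counts["fail"]
    accuracy = counts["success"] / total if total else 0.0
    return {**counts, "total": total, "accuracy": accuracy}
```

Pass the run's report directory (the folder that contains success/ and fail/) to get the counts and the resulting accuracy as a dictionary.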

➕ Adding a Custom Benchmark

This framework is designed to be easily extensible. To add a new evaluation dataset:

  1. Create a directory: Create a new folder under benchmarks/, for example my_custom_bench.

  2. Prepare the data: Inside this folder, create a .jsonl file with the same name (e.g., my_custom_bench.jsonl).

  3. Data format: Each line must be a JSON object containing the following fields:

    | Field Name | Type | Description |
    |---|---|---|
    | task_id | String / Int | Unique identifier for the task |
    | Question | String | The complete question or instruction sent to the model |
    | Final answer | String / Num | The reference answer used for automated evaluation |

    Example (my_custom_bench.jsonl):

    {"task_id": 1, "Question": "What is 1 + 1?", "Final answer": "2"}
    
  4. Configure the script: Copy the run.sh file from any existing benchmark (like gaia) into the new directory. Adjust the environment variables to point to your new data, and you're ready to run.
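
Before running a new benchmark, it may be worth checking the data file against the format from step 3. The sketch below is one possible validator, not part of the framework: `validate_jsonl_lines` is a hypothetical helper, and both the duplicate `task_id` check and the skipping of blank lines are choices made for this example.

```python
import json

# Field names are taken from the data format table in step 3.
REQUIRED_FIELDS = ("task_id", "Question", "Final answer")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, message) problems; empty means valid."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((number, f"missing field: {field}"))
        task_id = record.get("task_id")
        # Only string and integer ids are tracked, matching the table above.
        if isinstance(task_id, (str, int)):
            if task_id in seen_ids:
                problems.append((number, f"duplicate task_id: {task_id}"))
            seen_ids.add(task_id)
    return problems


sample = ['{"task_id": 1, "Question": "What is 1 + 1?", "Final answer": "2"}']
print(validate_jsonl_lines(sample))  # prints []
```

To check a file, open it with `encoding="utf-8"` and pass the file object directly, since iterating over it yields one line at a time.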

📄 License

This module is part of the AgentCPM-Explore project and is released under the Apache-2.0 license.