Local Model Deployment Guide

This document describes how to integrate locally deployed LLMs with AKG Agents.

Configuring Local Models

If you already have a deployed model service, add its base_url, api_key, and model_name to ~/.akg/settings.json:

{
  "models": {
    "standard": {
      "base_url": "http://localhost:8000/v1",
      "api_key": "EMPTY",
      "model_name": "your-model-name"
    }
  },
  "default_model": "standard"
}

Configuration Details:

  • base_url: API address of your local model service (typically ends with /v1, compatible with OpenAI API format)
  • api_key: If your local service doesn't require authentication, use "EMPTY" or any string
  • model_name: The name of your deployed model
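
A malformed settings file is easier to catch before launching anything. The following is a minimal sketch of a consistency check, assuming the file has exactly the shape shown in the example above. The function name check_settings and the rule that all three keys must be non-empty strings are illustrative assumptions, not part of AKG Agents.

```python
import json
from pathlib import Path

# Keys taken from the example settings.json above; treating all three as
# mandatory non-empty strings is an assumption of this sketch.
REQUIRED_KEYS = ("base_url", "api_key", "model_name")


def check_settings(settings: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    models = settings.get("models")
    if not isinstance(models, dict) or not models:
        return ["'models' must be a non-empty object"]

    problems = []
    for name, entry in models.items():
        if not isinstance(entry, dict):
            problems.append(f"models.{name} must be an object")
            continue
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"models.{name}.{key} is missing or empty")

    # default_model should name one of the entries defined under "models".
    default = settings.get("default_model")
    if not isinstance(default, str) or default not in models:
        problems.append(f"default_model {default!r} is not defined under 'models'")
    return problems


if __name__ == "__main__":
    path = Path.home() / ".akg" / "settings.json"
    if path.exists():
        found = check_settings(json.loads(path.read_text(encoding="utf-8")))
        print("\n".join(found) if found else "no problems found")
    else:
        print(f"{path} does not exist")
```

Run as a script, it reads ~/.akg/settings.json and prints one line per problem. The check only looks at the file's structure; it does not contact the model service.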

Local Model Deployment Frameworks

If you haven't deployed a local model service yet, consider these open-source frameworks:

Framework        Official Link
vLLM             https://github.com/vllm-project/vllm
vLLM-Ascend      https://github.com/vllm-project/vllm-ascend
vLLM-MindSpore   https://atomgit.com/mindspore/vllm-mindspore
Ollama           https://ollama.com

Model Selection Recommendations

Kernel code generation demands strong code understanding and generation capabilities. Because the task is difficult, we recommend powerful models such as DeepSeek V3.2, or kernel-specific models designed for such tasks.

Configuration Example

{
  "models": {
    "standard": {
      "base_url": "http://localhost:8000/v1",
      "api_key": "EMPTY",
      "model_name": "deepseek-v3"
    }
  },
  "default_model": "standard"
}

Verifying Configuration

After configuration, verify the model connection by launching akg_cli op with the options that match your hardware:

# Ascend NPU
akg_cli op --framework torch --backend ascend --arch ascend910b2 \
  --dsl triton_ascend --devices 0,1,2,3,4,5,6,7

# CUDA GPU
akg_cli op --framework torch --backend cuda --arch a100 \
  --dsl triton_cuda --devices 0,1,2,3,4,5,6,7

# CPU
akg_cli op --framework torch --backend cpu --arch x86_64 \
  --dsl cpp --devices 0

If the model is configured correctly, the system will start normally and respond to your requests.
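
If akg_cli does not start as expected, it can help to probe the model endpoint on its own first. The sketch below assumes the service follows the OpenAI API convention of listing available models at GET {base_url}/models and accepts the key as a Bearer token; both are assumptions about your server, and the helper names are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL, tolerating a trailing slash in base_url."""
    return base_url.rstrip("/") + "/models"


def build_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Prepare a GET request carrying the API key as a Bearer token."""
    return urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )


def list_model_ids(base_url: str, api_key: str = "EMPTY", timeout: float = 5.0) -> list[str]:
    """Return the model ids the server reports.

    Assumes an OpenAI-style response body: {"data": [{"id": "..."}, ...]}.
    Raises urllib.error.URLError if the server cannot be reached.
    """
    request = build_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [item["id"] for item in payload.get("data", [])]
```

Calling list_model_ids("http://localhost:8000/v1") against a running service should return a list that contains the model_name from settings.json. A connection error points to a wrong base_url or a stopped service, while a missing name points to a model_name mismatch.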