Quick reference

SGLang | openEuler

SGLang is a fast serving framework for large language models (LLMs) and vision language models. It provides high-performance inference through a co-designed backend runtime and frontend language. Learn more at https://github.com/sgl-project/sglang.

Supported tags and respective Dockerfile links

The tag of each SGLang Docker image consists of the SGLang version and the base image version. The details are as follows:

| Tag | Contents | Architectures |
|-----|----------|---------------|
| 0.5.12-oe2403sp3 | SGLang 0.5.12 on openEuler 24.03-LTS-SP3 | amd64, arm64 |
| 0.5.11-24.03-lts-sp3 | SGLang 0.5.11 on openEuler 24.03-LTS-SP3 | amd64, arm64 |
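
Because each tag joins an SGLang version and a base image identifier, a script that selects images can split the tag back into those two parts. The following is a minimal sketch, not part of the image tooling: `ImageTag` and `parse_tag` are hypothetical names, and the sketch assumes the SGLang version never contains a hyphen, so the first hyphen separates the two parts.

```python
from typing import NamedTuple


class ImageTag(NamedTuple):
    """The two components of an image tag (hypothetical helper type)."""
    sglang_version: str
    base_image: str


def parse_tag(tag: str) -> ImageTag:
    """Split a tag at its first hyphen into SGLang version and base image part.

    Assumption: the version part contains no hyphen, so everything after the
    first hyphen identifies the base image.
    """
    version, sep, base = tag.partition("-")
    if not sep or not version or not base:
        raise ValueError(f"expected '<sglang-version>-<base-image>', got {tag!r}")
    return ImageTag(version, base)


# Parse the two tags listed in the table above.
for tag in ("0.5.12-oe2403sp3", "0.5.11-24.03-lts-sp3"):
    print(parse_tag(tag))
```

The base image part is returned as an opaque string, since the two listed tags spell it differently.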

Usage

In the examples below, my-registry/sglang:0.5.11 is a placeholder image reference; replace it with your registry and the {Tag} that matches your requirements. Build artifacts are placed under /opt/sglang inside the image.

Pull the image (example):

docker pull my-registry/sglang:0.5.11

Check SGLang installation:

docker run --rm my-registry/sglang:0.5.11 python3 -c "import sglang; print(sglang.__version__)"

View available parameters:

docker run --rm my-registry/sglang:0.5.11 python3 -m sglang.launch_server --help

Starting SGLang Server

Start the SGLang inference server (example):

docker run --gpus all \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  my-registry/sglang:0.5.11 \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
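
The server may need some time to download and load the model before it accepts requests, so scripts that start the container usually wait for it to respond. The sketch below polls an HTTP endpoint until it answers with status 200. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the use of `GET /v1/models` as a readiness signal is an assumption based on the OpenAI-compatible API; substitute whichever endpoint your deployment exposes.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call (assumed endpoint, requires a running server):
# ready = wait_until_ready(lambda: http_ok("http://localhost:30000/v1/models"))
```

Passing the probe as a callable keeps the retry loop independent of the endpoint, so the same loop can wrap a different check if `/v1/models` is not suitable.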

To run SGLang on Kubernetes with GPU support, create a Deployment and expose it through a Service (example):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-server
  template:
    metadata:
      labels:
        app: sglang-server
    spec:
      containers:
        - name: sglang
          image: my-registry/sglang:0.5.11
          ports:
            - containerPort: 30000
          args:
            - python3
            - -m
            - sglang.launch_server
            - --model-path
            - meta-llama/Llama-3.1-8B-Instruct
            - --host
            - "0.0.0.0"
            - --port
            - "30000"
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: huggingface-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: huggingface-cache
          persistentVolumeClaim:
            claimName: huggingface-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
spec:
  type: LoadBalancer
  ports:
    - port: 30000
      targetPort: 30000
  selector:
    app: sglang-server

Using with OpenAI-Compatible API

After starting the server, you can interact with SGLang using the OpenAI-compatible API:

# Chat completions API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 256
  }'

# Completions API
curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }'
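
The same chat completions request can be issued from Python with only the standard library. This is a sketch under stated assumptions rather than an official client: `build_chat_request` and `extract_reply` are hypothetical helpers, the base URL and model name mirror the curl example above, and the response is assumed to follow the OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Return the assistant text from a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:30000",
    "meta-llama/Llama-3.1-8B-Instruct",
    "Hello, how are you?",
)

# Sending the request requires a running server:
# with urllib.request.urlopen(request, timeout=60) as resp:
#     print(extract_reply(resp.read()))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.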

Questions and answers

If you have any questions or want to use special features, please submit an issue or a pull request on https://atomgit.com/openeuler/openeuler-docker-images.