eagle-eye

Project Introduction

Eagle Eye is an observability system for AI inference scenarios. It provides full-link metric collection, near-real-time transmission, and intelligent diagnosis across the AI gateway, the inference engine, Mooncake, and the infrastructure (Ray, K8s, hardware). The system combines Prometheus's periodic metric collection with the low-latency push mechanism of a distributed message queue, so it supports trend analysis for scaling decisions while also meeting the second-level data freshness required by time-sensitive modules such as intelligent routing. An independent hardware health diagnosis module continuously monitors underlying NPU/GPU metrics such as temperature, power consumption, and error codes, and identifies anomalies. Together these form a closed monitoring loop of "collection-transmission-diagnosis-evaluation" that supplies data for the stability, performance optimization, and resource scheduling of AI inference systems.

Core Functions

  • Multi-layer metric coverage: Covers the AI gateway (such as performance, resource consumption, security and compliance auditing, and governance policy execution tracking), the inference engine (API Server, model input/output, inference process, engine status), Mooncake (master, transfer engine, client), and the infrastructure (Ray, K8s, hardware), providing full-link observability.

  • Near-real-time metric transmission: For time-sensitive modules, pushes metrics at second-level latency through a distributed message queue, so that decisions are based on current metrics.

  • Scaling decision support: Synchronously reports collected system and runtime metrics to Prometheus for periodic calculation and trend evaluation.

  • Hardware health check and diagnosis: An independent hardware health diagnosis module periodically collects underlying NPU/GPU metrics such as temperature, power consumption, and error codes, and reports them in real time through the distributed message queue. The diagnosis module subscribes to the collected data and analyzes it together with device model, driver, and firmware information. It identifies typical failure patterns using threshold rules and abnormal-metric analysis, then outputs diagnosis conclusions and handling recommendations, closing the loop from data collection to health evaluation.

Local Build

Image Build

Build Commands

  • Build and push to a specified OCI repository.

    # Using docker
    docker buildx build . -f <path/to/dockerfile> \
        -o type=image,name=<oci/repository>:<tag>,oci-mediatypes=true,rewrite-timestamp=true,push=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    
    # Using nerdctl
    nerdctl build . -f <path/to/dockerfile> \
        -o type=image,name=<oci/repository>:<tag>,oci-mediatypes=true,rewrite-timestamp=true,push=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    

    Where <path/to/dockerfile> is the Dockerfile path, <oci/repository> is the image address, and <tag> is the image tag.

  • Build and export an OCI layout to a local tarball.

    # Using docker
    docker buildx build . -f <path/to/dockerfile> \
        -o type=oci,name=<oci/repository>:<tag>,dest=<path/to/oci-layout.tar>,rewrite-timestamp=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    
    # Using nerdctl
    nerdctl build . -f <path/to/dockerfile> \
        -o type=oci,name=<oci/repository>:<tag>,dest=<path/to/oci-layout.tar>,rewrite-timestamp=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    

    Where <path/to/dockerfile> is the Dockerfile path, <oci/repository> is the image address, <tag> is the image tag, and <path/to/oci-layout.tar> is the output tarball path.

  • Build and export the image rootfs to a local directory.

    # Using docker
    docker buildx build . -f <path/to/dockerfile> \
        -o type=local,dest=<path/to/output>,platform-split=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    
    # Using nerdctl
    nerdctl build . -f <path/to/dockerfile> \
        -o type=local,dest=<path/to/output>,platform-split=true \
        --platform=linux/amd64,linux/arm64 \
        --provenance=false
    

    Where <path/to/dockerfile> is the Dockerfile path and <path/to/output> is the local directory path.
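The image and OCI exports above set rewrite-timestamp=true. As far as the BuildKit documentation describes it, that option rewrites file timestamps to the value of SOURCE_DATE_EPOCH, so a value has to be supplied for the option to take effect. The following is a minimal sketch of one way to choose that value. The function name resolve_source_date_epoch and the fallback to 0 are assumptions made for illustration and are not part of this project.

```shell
# Hypothetical helper: print a Unix timestamp to use as SOURCE_DATE_EPOCH.
# Order of preference:
#   1. an already exported SOURCE_DATE_EPOCH,
#   2. the commit time of HEAD when run inside a git work tree,
#   3. 0 (assumed fallback, keeps the result deterministic).
resolve_source_date_epoch() {
    if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
        printf '%s\n' "$SOURCE_DATE_EPOCH"
    elif _ts=$(git log -1 --pretty=%ct 2>/dev/null) && [ -n "$_ts" ]; then
        printf '%s\n' "$_ts"
    else
        printf '0\n'
    fi
}

# Example (not executed here): pass the value explicitly as a build argument,
# appended to any of the build commands above.
#   --build-arg SOURCE_DATE_EPOCH="$(resolve_source_date_epoch)"
```

Passing the value with --build-arg avoids relying on whether a given client forwards the environment variable on its own.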

Helm Chart Build

  • Package Helm Chart.

    helm package <path/to/chart> -u \
        --version=0.0.0-latest \
        --app-version=openFuyao-v26.03
    

    Where <path/to/chart> is the Chart folder path.

  • Push the Chart package to a specified OCI repository.

    helm push <path/to/chart.tgz> oci://<oci/repository>
    

    Where <path/to/chart.tgz> is the Chart package path and <oci/repository> is the OCI repository that the Chart package is pushed to. Helm takes the artifact name and tag from the chart's name and version, so no tag is given on the command line.
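Helm names a packaged archive <name>-<version>.tgz, and when pushing to an OCI registry it infers the reference basename from the chart name and the tag from the chart version (a "+" in the version becomes "_" in the tag). The sketch below derives both strings so they can be checked before pushing. The function names, the chart name eagle-eye, and the registry registry.example.com/charts are hypothetical and are used only for illustration.

```shell
# Hypothetical helper: archive file name produced by `helm package`.
#   $1 = chart name, $2 = chart version
chart_archive_name() {
    printf '%s-%s.tgz\n' "$1" "$2"
}

# Hypothetical helper: OCI reference a pushed chart ends up at.
#   $1 = repository (no oci:// prefix), $2 = chart name, $3 = chart version
# OCI tags cannot contain '+', so Helm stores it as '_'.
chart_oci_ref() {
    _tag=$(printf '%s' "$3" | tr '+' '_')
    printf 'oci://%s/%s:%s\n' "$1" "$2" "$_tag"
}

# Example with assumed values matching the --version used above.
chart_archive_name eagle-eye 0.0.0-latest
chart_oci_ref registry.example.com/charts eagle-eye 0.0.0-latest
```

The second string is the form that commands such as helm pull or helm install would use to reference the pushed chart.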