Frequently Asked Questions (FAQ)

This document describes common installation and deployment issues and their solutions.

1. The install-cni container of the calico component keeps restarting, leaving calico-node stuck in the init phase (1/3) and in a CrashLoopBackOff state

  • Root Cause

    Execute ip route on the node to view routing information. This issue is usually caused by one of the following two situations.

    1. The node is missing a default route.
    2. Multiple default routes exist on the node, and the route assigned for cluster deployment is not the highest priority route (the smaller the metric value, the higher the priority).
  • Solution

    For root cause 1, add a default route by executing the following command.

    ip route add default via 192.168.100.1 dev eth0 proto static metric 100
    

    For root cause 2, delete the redundant default route by executing the following command.

    ip route del default via 192.168.90.1 dev eth1
    
  • Extended Explanation

    Calico requires each node to have a default route, and the route used for cluster deployment must have the highest priority. This ensures that traffic inside and outside the cluster is forwarded correctly at layer 3 on the node.

    On systems that use network-scripts (typically RHEL-family distributions), the default route is generally configured in the /etc/sysconfig/network-scripts directory, which holds the persistent configuration files for the system's network interfaces (network cards). When the system reboots or the network service restarts, the system reads these files to set IP addresses, gateways, DNS servers, and other settings. Routes changed only with ip route do not persist across a reboot, so update these files as well.
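
The two root causes described in this item can be told apart from the output of ip route show default. The following is a minimal sketch, not part of openFuyao: classify_default_routes is a hypothetical helper name, and it assumes the usual iproute2 output format, treating a route without an explicit metric as metric 0.

```shell
#!/bin/sh
# Classify the default routes of a node from `ip route show default` output.
# Reads the route lines on stdin and prints one of:
#   missing          - no default route (root cause 1)
#   single <dev>     - exactly one default route
#   multiple <dev>   - several default routes; <dev> has the lowest metric
# A route without an explicit metric is treated as metric 0.
classify_default_routes() {
    awk '
        $1 == "default" {
            metric = 0; dev = ""
            for (i = 2; i < NF; i++) {
                if ($i == "dev")    dev = $(i + 1)
                if ($i == "metric") metric = $(i + 1)
            }
            count++
            if (count == 1 || metric + 0 < best + 0) { best = metric; best_dev = dev }
        }
        END {
            if (count == 0)      print "missing"
            else if (count == 1) print "single " best_dev
            else                 print "multiple " best_dev
        }
    '
}

# On a real node: ip route show default | classify_default_routes
```

If the result is "multiple" and the reported device is not the one assigned for cluster deployment, root cause 2 applies.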

2. calico-node restarts frequently, with the container stuck in a "start, probe fails, restart" loop

  • Root Cause

    This issue is usually caused by an initialDelaySeconds value that is too small. kubelet starts probing the health of calico-node before the Pod has finished initializing, establishing connections, and loading its configuration.

  • Solution

    Set an appropriate delay so that kubelet waits for a period after the container starts before running the first probe. Execute the following command to configure it.

    # Edit the K8s resource yaml, find the livenessProbe and readinessProbe sections (only a failed liveness probe restarts the container), and add or increase the initialDelaySeconds field
    kubectl edit ds -n kube-system calico-node
    

    The recommended initialDelaySeconds values are as follows. Adjust them to your actual environment.

    • Small clusters (fewer than 10 nodes): set to 30, which is sufficient for calico to complete basic initialization.
    • Large clusters (10 nodes or more): set to 60, because more nodes and more resource contention require a longer startup time.
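
As a non-interactive alternative to kubectl edit, the delay can be applied with kubectl patch. The following is a sketch under assumptions: recommended_delay and probe_delay_patch are hypothetical helper names, and the patch path assumes that calico-node is the first container (index 0) in the DaemonSet and that the probe is already defined.

```shell
#!/bin/sh
# Map a node count to the delay suggested above:
# fewer than 10 nodes -> 30 seconds, 10 or more -> 60 seconds.
recommended_delay() {
    if [ "$1" -lt 10 ]; then echo 30; else echo 60; fi
}

# Build a JSON patch that sets initialDelaySeconds on one probe.
# $1: probe name (livenessProbe or readinessProbe), $2: delay in seconds.
# Assumption: calico-node is the first container (index 0) in the DaemonSet.
probe_delay_patch() {
    printf '[{"op":"add","path":"/spec/template/spec/containers/0/%s/initialDelaySeconds","value":%s}]\n' "$1" "$2"
}

# On a real cluster (not executed here):
#   nodes=$(kubectl get nodes --no-headers | wc -l)
#   kubectl patch ds calico-node -n kube-system --type=json \
#     -p "$(probe_delay_patch readinessProbe "$(recommended_delay "$nodes")")"
```

Check the container order with kubectl get ds -n kube-system calico-node -o yaml before relying on index 0.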

3. calico-node is in a Running state with 0/1 readiness

  • Detailed Symptoms

    The symptoms are as follows.

    • Node status is NotReady: This is the most visible symptom. The kubectl get nodes command shows the node's status as NotReady instead of Ready.
    • calico-node Pod probe failure: The calico-node Pod details (kubectl describe pod ...) or logs repeatedly show error messages similar to the following.
      • Readiness probe failed: calico/node is not ready: BIRD is not ready
      • BGP not established with X.X.X.X (X.X.X.X is usually the IP of other nodes)
      • Error querying BIRD: unable to connect to BIRDv4 socket
    • Cross-node Pod communication interruption: Because BGP sessions cannot be established, routing information is not synchronized. As a result, Pods on the current node cannot communicate with Pods on other nodes in the cluster.
  • Root Cause

    The IP_AUTODETECTION_METHOD environment variable of calico-node is set incorrectly, so Calico selects the wrong node IP address and communication between nodes fails.

  • Solution

    Execute the following command to modify the environment variable value.

    # Edit the K8s resource yaml, set the IP_AUTODETECTION_METHOD environment variable
    kubectl edit ds -n kube-system calico-node
    

    Common setting values are as follows.

    • skip-interface=nerdctl*: The default policy set by openFuyao. Skips network cards whose names start with nerdctl and selects the IP of the first valid network card.
    • can-reach=192.168.100.5: Uses the IP of the network card through which the node reaches the specified target address.
    • interface=eth4: Uses the IP on the eth4 network card.
  • Extended Explanation

    The IP_AUTODETECTION_METHOD environment variable determines how Calico automatically selects the correct IP address for establishing BGP neighbors and encapsulating traffic on multi-NIC nodes.
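
The variable can also be changed without opening an editor by using kubectl set env. The following sketch validates the value first; valid_autodetect_method and set_autodetect_method are hypothetical helper names, and the check covers only the method keyword, not the interface pattern or address that follows it.

```shell
#!/bin/sh
# Return 0 if $1 looks like a supported IP_AUTODETECTION_METHOD value.
# Only the method keyword is checked; regexes and addresses are not validated.
valid_autodetect_method() {
    case "$1" in
        first-found|kubernetes-internal-ip) return 0 ;;
        can-reach=?*|interface=?*|skip-interface=?*|cidr=?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Validate the value, then update the DaemonSet without opening an editor.
set_autodetect_method() {
    if ! valid_autodetect_method "$1"; then
        echo "unsupported method: $1" >&2
        return 1
    fi
    kubectl set env daemonset/calico-node -n kube-system "IP_AUTODETECTION_METHOD=$1"
}

# Example (requires cluster access): set_autodetect_method interface=eth4
```

Changing the variable triggers a rolling restart of the calico-node Pods, after which the readiness state can be checked again.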

4. When installing a business cluster from the bootstrap cluster, the bkeagent on the business cluster node cannot connect to the APIServer of the bootstrap cluster, causing cluster installation to fail

  • Root Cause

    When bkeagent starts, the --kube-config parameter specifies the APIServer that it monitors. After startup, bkeagent checks whether the CRD it reconciles has been installed in the monitored APIServer, which requires a connection to the bootstrap cluster's APIServer. When that connection fails, the following error appears in the log.

    # For v25.12 and earlier versions, the log is /var/log/bkeagent.log; for later versions, it's /var/log/openFuyao/bkeagent.log
    The CRD cannot be installed in the target cluster, xxx
    

    Check the configuration file (for v25.12 and earlier versions, it's /etc/bkeagent/config; for later versions, it's /etc/openFuyao/bkeagent/config). The IP address in its server entry is not the expected bootstrap cluster address.

  • Solution

    Execute the following commands to reset the bootstrap node and then initialize it with the specified IP address.

    # Reset the bootstrap node
    bke reset --all --mount
    # Initialize with the specified IP address
    bke init --hostIP=1.2.3.4
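
Before resetting, the mismatch described in the root cause can be confirmed with a small check. This is a sketch under assumptions: kubeconfig_server_host and check_bootstrap_server are hypothetical helper names, and the configuration file is assumed to be a kubeconfig whose entry has the form "server: https://<host>:<port>" with an IPv4 address or hostname.

```shell
#!/bin/sh
# Print the host part of the first "server:" entry in a kubeconfig read from stdin.
# Assumes the form "server: https://<host>:<port>" (IPv4 address or hostname).
kubeconfig_server_host() {
    sed -n 's|^[[:space:]]*server:[[:space:]]*https\{0,1\}://\([^:/]*\).*|\1|p' | head -n 1
}

# Compare the configured host with the expected bootstrap address ($1).
check_bootstrap_server() {
    actual=$(kubeconfig_server_host)
    if [ "$actual" = "$1" ]; then
        echo "ok: $actual"
    else
        echo "mismatch: config has '$actual', expected '$1'"
        return 1
    fi
}

# On a node (path for versions later than v25.12):
#   check_bootstrap_server 1.2.3.4 < /etc/openFuyao/bkeagent/config
```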
    

5. When deploying the openFuyao management plane on the bootstrap cluster, coredns is in a CrashLoopBackOff state

  • Detailed Information

    View the detailed logs of coredns (kubectl logs -n kube-system coredns-xxx). Logs similar to the following appear (the actual IP addresses and port numbers differ).

    [ERROR] plugin/errors: 2 . NS: read udp 100.20.0.15:59690->100.10.0.10:53: i/o timeout
    [FATAL] plugin/loop: Loop(127.0.0.1:34812 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO ***"
    
  • Root Cause

    The above logs indicate that coredns has entered a loop during domain name resolution.

  • Solution

    The output of kubectl get cm -n kube-system coredns -o yaml shows that forward is set to /etc/resolv.conf. When that file provides no other usable DNS server, coredns ends up using its own address as the upstream server and forwards queries back to itself, which produces the infinite loop.

    Execute the following command to set the server.

    # Change the forward line to "forward . 8.8.8.8"; if another DNS server is available, use its IP address instead
    kubectl edit cm -n kube-system coredns
    
  • Extended Explanation

    The forward plugin forwards DNS requests that cannot be resolved within the cluster to specified upstream DNS servers.
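
Whether the node's resolver configuration can cause such a loop can be checked before editing the ConfigMap. The following is a rough sketch: check_resolv_conf is a hypothetical helper name, and it only flags loopback nameservers (127.x.x.x, ::1) and an optional cluster DNS Service IP passed as the first argument; it does not follow upstream servers that forward back into the cluster.

```shell
#!/bin/sh
# Read resolv.conf content on stdin and report whether forwarding to it can loop.
# $1 (optional): the cluster DNS Service IP, treated like a loopback entry.
# Prints:
#   loop-risk    - a nameserver is a loopback address or the cluster DNS IP
#   no-upstream  - no nameserver is listed at all
#   ok           - every listed nameserver is an external address
check_resolv_conf() {
    awk -v cluster_dns="${1:-}" '
        $1 == "nameserver" {
            if ($2 ~ /^127\./ || $2 == "::1" || $2 == cluster_dns) bad++
            else good++
        }
        END {
            if (bad > 0)        print "loop-risk"
            else if (good == 0) print "no-upstream"
            else                print "ok"
        }
    '
}

# On a node: check_resolv_conf 100.10.0.10 < /etc/resolv.conf
```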

6. After nodes are reset with bke reset, redeploying a high-availability cluster with the same virtual IP fails because the virtual IP is still occupied

  • Root Cause

    The keepalived component of a high-availability cluster binds the virtual IP to one of the nodes in the cluster. After bke reset resets the environment on each node, the virtual IP remains bound to that node, so the address conflicts when it is reused and the new cluster cannot start.

  • Solution

    Log in to the management node of the high-availability cluster from the first installation and execute the following command to unbind the virtual IP.

    # View the IP addresses bound to the node's network card
    ip addr
    # If the virtual IP is still bound, execute the following command to unbind it; replace <vip> with the actual virtual IP and <eth0> with the network card it is bound to
    ip addr del <vip> dev <eth0>
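
Finding the binding can be scripted by parsing the one-line output of ip -o addr show. The following is a minimal sketch: find_vip_binding is a hypothetical helper name, and 192.168.100.200 is a placeholder virtual IP used only for illustration.

```shell
#!/bin/sh
# Read `ip -o addr show` output on stdin and print "<dev> <address/prefix>"
# for every interface that still holds the virtual IP given as $1.
find_vip_binding() {
    awk -v vip="$1" '
        ($3 == "inet" || $3 == "inet6") {
            split($4, parts, "/")
            if (parts[1] == vip) print $2, $4
        }
    '
}

# On a node, list the bindings and release each one (not executed here):
#   ip -o addr show | find_vip_binding 192.168.100.200 |
#   while read -r dev addr; do ip addr del "$addr" dev "$dev"; done
```

Passing the address together with its prefix length to ip addr del removes exactly the listed binding.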
    

7. How to delete a cluster installed by openFuyao from the backend instead of from the management plane

The installer-service of the bootstrap cluster or management cluster reads BKECluster data from the APIServer and then displays cluster information. Clusters can be deleted through the front-end management plane, or by logging into the terminal of the bootstrap cluster or management cluster and executing the following commands.

# Query cluster information
kubectl get bc -A
# Replace bke-cluster with the actual namespace and name of the cluster, then edit the bc resource and set the following two parameters to delete the cluster
# Annotation: bke.bocloud.com/ignore-target-cluster-delete: "false"
# spec: add reset: true
kubectl edit bc -n bke-cluster bke-cluster
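
The same two changes can be applied without an interactive editor by using kubectl annotate and kubectl patch. This is a sketch under assumptions: delete_bke_cluster is a hypothetical helper name, and reset is assumed to sit directly under spec, as the edit instructions above suggest. Setting KUBECTL to "echo kubectl" prints the commands instead of running them.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example with "echo kubectl") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Mark a BKECluster for deletion. $1: namespace, $2: cluster name.
# Assumption: "reset" sits directly under spec in the BKECluster resource.
delete_bke_cluster() {
    $KUBECTL annotate bc "$2" -n "$1" --overwrite \
        bke.bocloud.com/ignore-target-cluster-delete=false &&
    $KUBECTL patch bc "$2" -n "$1" --type=merge -p '{"spec":{"reset":true}}'
}

# Example: delete_bke_cluster bke-cluster bke-cluster
```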

8. How to scale openFuyao clusters up and down from the backend

The openFuyao management plane provides cluster lifecycle management capabilities, including cluster scale-up, scale-down, upgrade, installation, and uninstallation. This section describes how to scale a cluster up and down from the backend. Perform these operations only when the cluster is healthy; on an unhealthy cluster, they may fail with errors.

  • Scale-down operation: Remove a node from an existing cluster.

    View existing BKENode resources.

    # Replace bke-cluster with actual cluster information
    kubectl get bn -n bke-cluster
    

    Delete the corresponding BKENode resource.

    # Replace bke-cluster-n1 with the actual node name
    kubectl delete bn -n bke-cluster bke-cluster-n1
    

    Execute the command to view existing BKENode resources again. If the corresponding node is no longer listed, the deletion was successful.

  • Scale-up operation: Add a new node to an existing cluster.

    Create a configuration file for the new node (newNode.yaml).

    apiVersion: bke.bocloud.com/v1beta1
    kind: BKENode
    metadata:
      name: bke-cluster-n1
      namespace: bke-cluster
      labels:
        cluster.x-k8s.io/cluster-name: bke-cluster
    spec:
      hostname: n1
      ip: <node-ip>
      password: '<encrypted>'
      port: "22"     
      role:
      - node
      username: root
    

    Execute the following command to perform the scale-up operation.

    kubectl apply -f newNode.yaml
    

    Execute the command to view existing BKENode resources. If the corresponding node is Ready, the scale-up was successful.
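
When several nodes are added, the manifest can be generated instead of written by hand. The following is a sketch based only on the newNode.yaml example above: render_bkenode is a hypothetical helper name, and it assumes that the namespace equals the cluster name and that the resource name is <cluster>-<hostname>, as in that example. The password must not contain a single quote.

```shell
#!/bin/sh
# Print a BKENode manifest.
# $1: cluster name (also used as namespace), $2: hostname,
# $3: node IP, $4: encrypted password.
render_bkenode() {
    printf '%s\n' \
        'apiVersion: bke.bocloud.com/v1beta1' \
        'kind: BKENode' \
        'metadata:' \
        "  name: $1-$2" \
        "  namespace: $1" \
        '  labels:' \
        "    cluster.x-k8s.io/cluster-name: $1" \
        'spec:' \
        "  hostname: $2" \
        "  ip: $3" \
        "  password: '$4'" \
        '  port: "22"' \
        '  role:' \
        '  - node' \
        '  username: root'
}

# Example: render_bkenode bke-cluster n1 <node-ip> '<encrypted>' > newNode.yaml
```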

Note: For v25.12 and earlier versions, use the following procedure instead.

  • Scale-down operation: Remove a node from an existing cluster.

    Edit the BKECluster resource.

    # Replace bke-cluster with actual cluster information
    kubectl edit bc -n bke-cluster bke-cluster
    

    Set nodes scheduled for deletion.

    metadata:
      annotations:
        # Node deletion is a dangerous action, so this annotation serves as a second confirmation; it is not set by default
        # To delete a node, remove it from spec and also list its IP here; separate multiple IPs with ','
        # If either of the two steps is missing, node deletion is not triggered
        bke.bocloud.com/appointment-deleted-nodes: "172.100.200.10"
    

    Remove the corresponding node entry from spec. The entry shown below is the one to delete.

    spec:
      clusterConfig:
        nodes:                
        - hostname: master-1  
          ip: 172.100.200.10
          username: root      
          password: password0
          port: "22"
          role:             
          - master
          - etcd
    
  • Scale-up operation: Add a new node to an existing cluster.

    Edit the BKECluster resource.

    # Replace bke-cluster with actual cluster information
    kubectl edit bc -n bke-cluster bke-cluster
    

    Add node information to Spec.

    spec:
      clusterConfig:
        nodes:                
        - hostname: master-1  
          ip: 172.100.200.10
          username: root      
          password: password0
          port: "22"
          role:             
          - master
          - etcd