Frequently Asked Questions (FAQ)

This document lists common installation and deployment issues and their solutions.

1. The install-cni container of the calico component keeps restarting, leaving calico-node stuck in the init phase (1/3) and in a CrashLoopBackOff state

Root Cause

Execute ip route on the node to view routing information. This issue is usually caused by one of the following two situations.
1. The node is missing a default route.
2. Multiple default routes exist on the node, and the route assigned for cluster deployment is not the highest priority route (the smaller the metric value, the higher the priority).
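
To tell the two situations apart quickly, the routing table can be filtered down to its default routes. The following is a minimal sketch, not part of the openFuyao tooling: list_default_routes is a hypothetical helper name, and it assumes the usual ip route output format, in which a route printed without a metric has metric 0.

```shell
# Print every default route as "metric device gateway",
# lowest metric (highest priority) first.
# Reads `ip route` output on stdin.
list_default_routes() {
  awk '$1 == "default" {
         metric = 0; dev = "?"; gw = "direct"
         for (i = 2; i < NF; i++) {
           if ($i == "metric") metric = $(i + 1)
           if ($i == "dev")    dev    = $(i + 1)
           if ($i == "via")    gw     = $(i + 1)
         }
         print metric, dev, gw
       }' | sort -n
}
```

Running ip route | list_default_routes prints one line per default route. No output corresponds to situation 1; several lines with the cluster deployment route not listed first corresponds to situation 2.
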
Solution

For root cause 1, add a default route by executing the following command.
```
ip route add default via 192.168.100.1 dev eth0 proto static metric 100
```
For root cause 2, delete the redundant default route by executing the following command.
```
ip route del default via 192.168.90.1 dev eth1
```
Extended Explanation

Calico requires each node to have a default route, and the route used for cluster deployment must have the highest priority among the default routes. This ensures that traffic inside and outside the cluster is forwarded correctly by layer-3 routing on the node.

Routes added with ip route do not survive a reboot. Persistent default route configuration is generally located in the /etc/sysconfig/network-scripts directory, which holds the persistent configuration files for the system's network interfaces (network cards). When the system reboots or the network service restarts, the system reads these files to set IP addresses, gateways, DNS, and other settings.

2. calico-node restarts frequently, with the container trapped in a "start, probe fails, restart" loop

Root Cause

This issue is usually caused by an initialDelaySeconds value that is too short. kubelet starts probing the health of calico-node before the Pod has finished initializing, establishing connections, and loading its configuration.

Solution

Set an appropriate delay so that kubelet waits for a period after the container starts before running the first probe. Execute the following command to configure it.
```
# Edit the K8s resource yaml, find the readinessProbe section, and add the initialDelaySeconds field
kubectl edit ds -n kube-system calico-node
```
Recommended initialDelaySeconds values are as follows. Adjust them to the actual environment.
- Small clusters (fewer than 10 nodes): set to 30, which is sufficient for calico to complete basic initialization.
- Large clusters (10 nodes or more): set to 60, because more nodes and greater resource contention lengthen startup.
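
As a non-interactive alternative to kubectl edit, the recommendation above can be turned into a JSON patch. This is a sketch under assumptions rather than an openFuyao-provided procedure: recommended_initial_delay and print_delay_patch are hypothetical helper names, and the patch path assumes that calico-node is the first container (index 0) of the DaemonSet and already defines a readinessProbe. The command is only printed so that it can be reviewed first.

```shell
# Map a node count to the initialDelaySeconds value recommended above.
recommended_initial_delay() {
  if [ "$1" -lt 10 ]; then echo 30; else echo 60; fi
}

# Print (do not run) a kubectl command that sets the readiness probe delay.
# Assumption: calico-node is container index 0 and has a readinessProbe.
print_delay_patch() {
  delay=$(recommended_initial_delay "$1")
  printf '%s\n' "kubectl -n kube-system patch ds calico-node --type=json -p='[{\"op\":\"add\",\"path\":\"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds\",\"value\":$delay}]'"
}
```

For example, print_delay_patch $(kubectl get nodes --no-headers | wc -l) prints a command sized for the current cluster. Check the container index in the DaemonSet before running it.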

3. calico-node is in a Running state with 0/1 readiness

Detailed Symptoms

The symptoms are as follows.
- Node status is NotReady: This is the most visible symptom. Running kubectl get nodes shows the node's status as NotReady instead of Ready.
- calico-node Pod probe failure: The calico-node Pod details (kubectl describe pod ...) or logs repeatedly show error messages similar to the following.
  - Readiness probe failed: calico/node is not ready: BIRD is not ready
  - BGP not established with X.X.X.X (X.X.X.X is usually the IP of other nodes)
  - Error querying BIRD: unable to connect to BIRDv4 socket
- Cross-node Pod communication interruption: Because BGP sessions cannot be established, routing information is not synchronized. As a direct result, Pods on the current node cannot communicate with Pods on other nodes in the cluster.

Root Cause

The IP_AUTODETECTION_METHOD environment variable of calico-node is set incorrectly, so Calico selects the wrong IP address and network communication between nodes fails.

Solution

Execute the following command to modify the environment variable value.
```
# Edit the K8s resource yaml, set the IP_AUTODETECTION_METHOD environment variable
kubectl edit ds -n kube-system calico-node
```
Common setting values are as follows.
- skip-interface=nerdctl*: The default policy set by openFuyao. Skips network cards whose names start with nerdctl and selects the IP of the first valid network card.
- can-reach=192.168.100.5: Uses the IP of the network card through which the target address 192.168.100.5 is reached.
- interface=eth4: Use the IP on the eth4 network card.
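
The variable can also be set without opening an editor, using kubectl set env. The sketch below is illustrative only: is_listed_method and print_set_method are hypothetical helper names, and the check deliberately accepts only the three value forms listed above, although Calico supports other methods as well. The command is printed rather than executed.

```shell
# Accept only the three value forms listed above.
is_listed_method() {
  case "$1" in
    skip-interface=?*|can-reach=?*|interface=?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print (do not run) the command that sets the variable on the DaemonSet.
# The value is single-quoted so that patterns such as nerdctl* are not
# expanded by the shell.
print_set_method() {
  is_listed_method "$1" || { echo "not one of the listed forms: $1" >&2; return 1; }
  printf '%s\n' "kubectl set env ds/calico-node -n kube-system 'IP_AUTODETECTION_METHOD=$1'"
}
```

For example, print_set_method 'interface=eth4' prints the command for the eth4 case. Changing the variable updates the DaemonSet template, so the calico-node Pods are recreated with the new value.
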
Extended Explanation

The IP_AUTODETECTION_METHOD environment variable determines how Calico, on multi-NIC nodes, automatically selects the IP address used for establishing BGP neighbors and encapsulating traffic.

4. When installing a business cluster from the bootstrap cluster, the bkeagent on the business cluster node cannot connect to the APIServer of the bootstrap cluster, causing cluster installation to fail

Root Cause

bkeagent is started with the --kube-config parameter, which specifies the APIServer it monitors. After starting, bkeagent checks whether the CRD it reconciles is present in the monitored APIServer, so it attempts to connect to the bootstrap cluster's APIServer. When that connection fails, the following error message is logged.
```
# For v25.12 and earlier versions, the log is /var/log/bkeagent.log; for later versions, it's /var/log/openFuyao/bkeagent.log
The CRD cannot be installed in the target cluster, xxx
```
Checking the configuration file (for v25.12 and earlier versions, it's /etc/bkeagent/config; for later versions, it's /etc/openFuyao/bkeagent/config) shows that the IP address configured for server is not the intended bootstrap cluster address.

Solution

Execute the following commands to reset the bootstrap node and then initialize it with the specified IP address.

```
# Reset the bootstrap node
bke reset --all --mount
# Initialize with the specified IP address
bke init --hostIP=1.2.3.4
```

5. When deploying the openFuyao management plane on the bootstrap cluster, coredns is in a CrashLoopBackOff state

Detailed Information

View the detailed logs of coredns (kubectl logs -n kube-system coredns-xxx). Logs similar to the following appear (the actual IP addresses and port numbers differ).

```
[ERROR] plugin/errors: 2 . NS: read udp 100.20.0.15:59690->100.10.0.10:53: i/o timeout
[FATAL] plugin/loop: Loop(127.0.0.1:34812 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO ***"
```

Root Cause

The above logs indicate that coredns has entered a forwarding loop during domain name resolution.

Solution

Running kubectl get cm -n kube-system coredns -o yaml shows that forward is set to /etc/resolv.conf. When no other DNS server is available, the coredns server's own IP ends up being used as the upstream server for domain name resolution, which leads to an infinite loop.

Execute the following command to set the server.
```
# Change the forward line to: forward . 8.8.8.8 (if another DNS server is available, use its IP address instead)
kubectl edit cm -n kube-system coredns
```
Extended Explanation

The forward plugin forwards DNS requests that cannot be resolved within the cluster to specified upstream DNS servers.
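
Before changing forward, it can help to confirm that the resolv.conf that coredns inherits really points back at coredns. The sketch below is a hedged illustration: find_looping_nameservers is a hypothetical helper name, and it assumes the loop comes from a nameserver entry that is either a loopback address or the cluster DNS service IP (100.10.0.10 in the sample log above; the real value differs per cluster).

```shell
# Read a resolv.conf on stdin and print nameservers that would send
# coredns queries back to itself: loopback addresses, or the cluster
# DNS service IP passed as the first argument.
find_looping_nameservers() {
  awk -v self="$1" '
    $1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1" || $2 == self) { print $2 }'
}
```

On the node that runs coredns, find_looping_nameservers 100.10.0.10 < /etc/resolv.conf prints the suspicious entries. Empty output suggests the loop has a different cause.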

6. After nodes are reset with bke reset, redeploying a high-availability cluster with the previously used virtual IP fails

Root Cause

The keepalived component of a high-availability cluster binds the virtual IP to one of the nodes in the cluster. After bke reset is used to reset the environment on each node, the virtual IP remains bound to that node. Reusing the virtual IP then causes errors and ultimately prevents the cluster from starting.

Solution

Log in to the management node of the high-availability cluster from the first installation and execute the following command to unbind the virtual IP.

```
# View the IP addresses bound to the node's network card
ip addr
# If the virtual IP binding is found, execute the following command to unbind it; replace vip with the actual virtual IP used, and eth0 with the actual bound network card
ip addr del <vip> dev <eth0>
```
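
On nodes with many addresses, locating the virtual IP by eye is error-prone. The following is a minimal sketch with a hypothetical helper name, find_vip_interface. It assumes the one-line-per-address format produced by ip -o addr, where the second field is the interface name and the fourth is the address with its prefix length.

```shell
# Read `ip -o addr` output on stdin and print "interface address/prefix"
# for every entry whose address equals the virtual IP given as $1.
find_vip_interface() {
  awk -v vip="$1" '
    ($3 == "inet" || $3 == "inet6") {
      split($4, a, "/")
      if (a[1] == vip) print $2, $4
    }'
}
```

For example, ip -o addr | find_vip_interface 192.168.100.200 prints the interface and address to use in the ip addr del command. The address shown here is a placeholder.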

7. For clusters installed by openFuyao, how can a cluster be deleted from the backend instead of from the management plane

The installer-service of the bootstrap cluster or management cluster reads BKECluster data from the APIServer and then displays cluster information. Clusters can be deleted through the front-end management plane, or by logging into the terminal of the bootstrap cluster or management cluster and executing the following commands.

```
# Query cluster information
kubectl get bc -A
# Replace bke-cluster with actual cluster information, edit bc information, and set the following two parameters to delete the cluster
# Annotation: bke.bocloud.com/ignore-target-cluster-delete: "false"
# spec: add reset: true
kubectl edit bc -n bke-cluster bke-cluster
```
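
The same two changes can in principle be applied without an interactive editor. The sketch below only prints candidate commands and rests on assumptions: print_cluster_delete_commands is a hypothetical helper name, and it assumes that setting the annotation with kubectl annotate and adding reset: true with a merge patch is equivalent to making both changes in kubectl edit. Whether the controller is sensitive to the order of the two changes is not stated in this document.

```shell
# Print (do not run) commands that set the annotation and spec.reset
# on a BKECluster. Arguments: namespace, cluster name.
print_cluster_delete_commands() {
  ns=$1
  name=$2
  printf '%s\n' "kubectl annotate bc -n $ns $name bke.bocloud.com/ignore-target-cluster-delete=false --overwrite"
  printf '%s\n' "kubectl patch bc -n $ns $name --type=merge -p '{\"spec\":{\"reset\":true}}'"
}
```

For example, print_cluster_delete_commands bke-cluster bke-cluster prints both commands for the cluster used in the example above. Review them before running, because deleting a cluster is destructive.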

8. How to implement scale-up and scale-down operations for openFuyao clusters from the backend

The openFuyao management plane provides cluster lifecycle management capabilities, including cluster scale-up, scale-down, upgrade, installation, and uninstallation. This section describes how to scale a cluster up and down from the backend. Perform these operations only when the cluster is healthy; when the cluster is unhealthy, they may result in errors.

Scale-down operation: Remove a node from an existing cluster.

View existing BKENode resources.
```
# Replace bke-cluster with actual cluster information
kubectl get bn -n bke-cluster
```
Delete the corresponding BKENode resource.
```
# Replace bke-cluster-n1 with the actual node name
kubectl delete bn -n bke-cluster bke-cluster-n1
```
Execute the command to view existing BKENode resources again. If the node is no longer listed, the deletion was successful.

Scale-up operation: Add a new node to an existing cluster.

Create a configuration file for the new node (newNode.yaml).

```
apiVersion: bke.bocloud.com/v1beta1
kind: BKENode
metadata:
  name: bke-cluster-n1
  namespace: bke-cluster
  labels:
    cluster.x-k8s.io/cluster-name: bke-cluster
spec:
  hostname: n1
  ip: <node-ip>
  password: '<encrypted>'
  port: "22"
  role:
  - node
  username: root
```

Execute the following command to perform the scale-up operation.

```
kubectl apply -f newNode.yaml
```

Execute the command to view existing BKENode resources. If the corresponding node is Ready, the scale-up was successful.
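
When several nodes are added, writing each file by hand becomes repetitive. The following sketch generates the manifest from parameters. render_bkenode is a hypothetical helper name, and two conventions are assumed from the example above rather than documented: the resource name is the cluster name followed by the hostname, and the namespace equals the cluster name.

```shell
# Print a BKENode manifest with the same fields as newNode.yaml.
# Arguments: cluster name (also used as namespace), hostname,
# node IP, encrypted password.
render_bkenode() {
  cat <<EOF
apiVersion: bke.bocloud.com/v1beta1
kind: BKENode
metadata:
  name: $1-$2
  namespace: $1
  labels:
    cluster.x-k8s.io/cluster-name: $1
spec:
  hostname: $2
  ip: $3
  password: '$4'
  port: "22"
  role:
  - node
  username: root
EOF
}
```

The output can be saved to a file or piped directly, for example render_bkenode bke-cluster n2 <node-ip> '<encrypted>' | kubectl apply -f -.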

Note: For v25.12 and earlier versions, use the following procedure instead.

Scale-down operation: Remove a node from an existing cluster.

Edit the BKECluster resource.

```
# Replace bke-cluster with actual cluster information
kubectl edit bc -n bke-cluster bke-cluster
```

Set nodes scheduled for deletion.

```
metadata:
  annotations:
    # Nodes scheduled for deletion. Deleting a node is a dangerous action, so this annotation acts as a second confirmation; it is not set by default.
    # To delete a node, remove the node from spec and also list its IP here; separate multiple IPs with ','.
    # If either of the two steps is missing, node deletion is not triggered.
    bke.bocloud.com/appointment-deleted-nodes: "172.100.200.10"
```

Remove the node's entry from spec.clusterConfig.nodes. The entry to remove looks like the following.

```
spec:
  clusterConfig:
    nodes:
    - hostname: master-1
      ip: 172.100.200.10
      username: root
      password: password0
      port: "22"
      role:
      - master
      - etcd
```

Scale-up operation: Add a new node to an existing cluster.

Edit the BKECluster resource.

```
# Replace bke-cluster with actual cluster information
kubectl edit bc -n bke-cluster bke-cluster
```

Add the new node's entry to spec.clusterConfig.nodes, in the following format.

```
spec:
  clusterConfig:
    nodes:
    - hostname: master-1
      ip: 172.100.200.10
      username: root
      password: password0
      port: "22"
      role:
      - master
      - etcd
```