Frequently Asked Questions (FAQ)
This document describes common installation and deployment issues and their solutions.
1. The install-cni container of the calico component keeps restarting, leaving calico-node stuck in the init phase (1/3) in a CrashLoopBackOff state
-
Root Cause
Run ip route on the node to view its routing information. This issue is usually caused by one of the following two situations.
- The node is missing a default route.
- Multiple default routes exist on the node, and the route assigned for cluster deployment is not the highest priority route (the smaller the metric value, the higher the priority).
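The two situations above can be checked mechanically. The following is a minimal sketch, not part of the openFuyao tooling: check_default_routes is a hypothetical helper name, and it assumes the usual ip route output format, in which a route shown without a metric field has metric 0.

```shell
# Reads `ip route` output on stdin and summarizes the default routes.
check_default_routes() {
    awk '
        $1 == "default" {
            metric = 0          # no "metric" field means metric 0
            dev = ""
            for (i = 2; i < NF; i++) {
                if ($i == "metric") metric = $(i + 1)
                if ($i == "dev")    dev = $(i + 1)
            }
            count++
            # Keep the route with the smallest metric (highest priority).
            if (count == 1 || metric + 0 < best + 0) { best = metric; best_dev = dev }
        }
        END {
            if (count == 0) print "no default route"
            else printf "%d default route(s); preferred dev=%s metric=%s\n", count, best_dev, best
        }'
}

# Usage on a node:
#   ip route | check_default_routes
```

If the preferred network card reported here is not the one assigned for cluster deployment, the node matches the second situation.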
-
Solution
For root cause 1, add a default route by executing the following command (replace the gateway address and network card with the actual values).
ip route add default via 192.168.100.1 dev eth0 proto static metric 100
For root cause 2, delete the redundant default route by executing the following command.
ip route del default via 192.168.90.1 dev eth1
-
Extended Explanation
Calico requires the node's default route for cluster traffic to have the highest priority. This ensures that traffic inside and outside the cluster is forwarded correctly at layer 3 on the node.
Default route configuration is generally located in the /etc/sysconfig/network-scripts directory, which holds the persistent configuration files for the system's network interfaces (network cards). When the system reboots or the network service restarts, the system reads these files to set IP addresses, gateways, DNS, and other information.
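To keep root cause 2 from reappearing after a reboot, the redundant default route can also be disabled in the persistent configuration. The fragment below is a sketch under assumptions: it presumes a RHEL-style distribution that uses ifcfg files, and the file name, network card, and address are hypothetical examples rather than values from an openFuyao deployment.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1 (hypothetical secondary network card)
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
# Example address, assumed for illustration only
IPADDR=192.168.90.10
PREFIX=24
# Do not install a default route through this network card
DEFROUTE=no
```

The card that carries cluster traffic keeps its GATEWAY setting, so it remains the only source of a default route.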
2. calico-node frequently restarts, with the container trapped in a "start, probe fails, restart" loop
-
Root Cause
This issue is usually caused by setting initialDelaySeconds too short. kubelet starts probing the health of calico-node before the Pod has finished initializing, establishing connections, and loading its configuration.
-
Solution
Set an appropriate delay so that kubelet waits for a period of time after the container starts before running the first probe. Execute the following command to configure it.
# Edit the K8s resource YAML, find the readinessProbe section, and add the initialDelaySeconds field
kubectl edit ds -n kube-system calico-node
Recommended initialDelaySeconds settings are as follows. Adjust them to the actual scenario.
- Small clusters (fewer than 10 nodes): set to 30, which is sufficient for calico to complete basic initialization.
- Large clusters (10 nodes or more): set to 60, because more nodes and resource competition require a longer startup time.
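As a non-interactive alternative to kubectl edit, the same field can be set with a JSON patch. This is a sketch under assumptions: recommended_delay and probe_delay_patch are hypothetical helper names, and the patch path assumes that calico-node is the first container (index 0) in the DaemonSet's Pod template and that a readinessProbe is already defined.

```shell
# Maps a node count to the delay recommended above.
recommended_delay() {
    # $1 = number of nodes in the cluster
    if [ "$1" -lt 10 ]; then echo 30; else echo 60; fi
}

# Prints a JSON patch that sets readinessProbe.initialDelaySeconds.
probe_delay_patch() {
    # $1 = delay in seconds, $2 = container index (assumed 0 if omitted)
    printf '[{"op":"add","path":"/spec/template/spec/containers/%s/readinessProbe/initialDelaySeconds","value":%s}]\n' "${2:-0}" "$1"
}

# Usage (requires cluster access):
#   kubectl -n kube-system patch ds calico-node --type=json \
#       -p "$(probe_delay_patch "$(recommended_delay 12)")"
```

Check the container order with kubectl get ds -n kube-system calico-node -o yaml before applying the patch, because a wrong index would modify a different container.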
3. calico-node is in a Running state with 0/1 readiness
-
Detailed Symptoms
The symptoms are as follows.
- Node status is NotReady: This is the most visible symptom. Running kubectl get nodes shows the node's status as NotReady instead of Ready.
- calico-node Pod probe failure: The calico-node Pod details (kubectl describe pod ...) or logs repeatedly show error messages similar to the following.
  - Readiness probe failed: calico/node is not ready: BIRD is not ready
  - BGP not established with X.X.X.X (X.X.X.X is usually the IP of another node)
  - Error querying BIRD: unable to connect to BIRDv4 socket
- Cross-node Pod communication interruption: Because BGP sessions fail to establish, routing information cannot be synchronized. As a result, Pods on the current node cannot communicate with Pods on other nodes in the cluster.
-
Root Cause
The IP_AUTODETECTION_METHOD field of calico-node is set incorrectly, causing node network communication failure.
-
Solution
Execute the following command to modify the environment variable value.
# Edit the K8s resource YAML and set the IP_AUTODETECTION_METHOD environment variable
kubectl edit ds -n kube-system calico-node
Common setting values are as follows.
- skip-interface=nerdctl*: The default policy set by openFuyao. Skips network cards with the nerdctl prefix and selects the IP of the first valid network card.
- can-reach=192.168.100.5: Uses the IP of the network card through which the target address is reached.
- interface=eth4: Uses the IP on the eth4 network card.
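The variable can also be set without opening an editor by using kubectl set env. The sketch below adds a small guard in front of that command. is_listed_method is a hypothetical helper that only recognizes the three value forms listed above; Calico accepts other forms that this check does not cover.

```shell
# Returns 0 if the value has one of the forms listed in this FAQ.
is_listed_method() {
    case "$1" in
        skip-interface=?*|can-reach=?*|interface=?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (requires cluster access); eth4 is the example card from the list above:
#   method='interface=eth4'
#   is_listed_method "$method" && \
#       kubectl -n kube-system set env ds/calico-node "IP_AUTODETECTION_METHOD=$method"
```

Changing the variable triggers a rolling restart of the calico-node Pods, after which the readiness state can be checked again.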
-
Extended Explanation
The IP_AUTODETECTION_METHOD field determines how Calico automatically selects the correct IP address for establishing BGP neighbors and encapsulating traffic on multi-NIC nodes.
4. When installing a business cluster from the bootstrap cluster, the bkeagent on the business cluster node cannot connect to the APIServer of the bootstrap cluster, causing cluster installation to fail
-
Root Cause
When bkeagent starts, the --kube-config parameter specifies the APIServer it monitors. After startup, bkeagent checks whether the CRD it reconciles exists in the monitored APIServer, which triggers an attempt to connect to the bootstrap cluster's APIServer and results in the following error log message.
# For v25.12 and earlier versions, the log is /var/log/bkeagenbt.log; for later versions, it is /var/log/openFuyao/bkeagent.log
The CRD cannot be installed in the target cluster, xxx
Check the configuration file (for v25.12 and earlier versions, /etc/bkeagent/config; for later versions, /etc/openFuyao/bkeagent/config). The IP address corresponding to server is not the expected bootstrap cluster address.
-
Solution
Execute the following commands to reset the bootstrap node and then initialize it with the specified IP address.
# Reset the bootstrap node
bke reset --all --mount
# Initialize with the specified IP address (replace 1.2.3.4 with the actual address)
bke init --hostIP=1.2.3.4
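Before and after the reset, it can help to confirm which address the bkeagent configuration file points at. The following sketch assumes that the file is a kubeconfig-style YAML file with server: https://host:port entries, as the root cause above suggests. kubeconfig_server_hosts is a hypothetical helper name, and it does not handle IPv6 literals.

```shell
# Prints the host part of every "server:" entry read from stdin.
kubeconfig_server_hosts() {
    sed -n 's|^[[:space:]]*server:[[:space:]]*https\{0,1\}://\([^:/]*\).*$|\1|p'
}

# Usage on the business cluster node (path for versions later than v25.12):
#   kubeconfig_server_hosts < /etc/openFuyao/bkeagent/config
```

If the printed address differs from the bootstrap cluster address, the node matches the situation described in the root cause.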
5. When deploying the openFuyao management plane on the bootstrap cluster, coredns is in a CrashLoopBackOff state
-
Detailed Information
View the detailed logs of coredns (kubectl logs -n kube-system coredns-xxx). Logs similar to the following appear (the actual IP addresses and port numbers differ).
[ERROR] plugin/errors: 2 . NS: read udp 100.20.0.15:59690->100.10.0.10:53: i/o timeout
[FATAL] plugin/loop: Loop(127.0.0.1:34812 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO ***"
-
Root Cause
The above logs indicate that coredns has entered a forwarding loop during domain name resolution.
-
Solution
Running kubectl get cm -n kube-system coredns -o yaml shows that forward is set to /etc/resolv.conf. When no other DNS server is available, the coredns server IP is used as the upstream server for domain name resolution, which leads to an infinite loop.
Execute the following command to set the upstream server.
# Set forward to "forward . 8.8.8.8"; if another DNS server is available, use its IP address instead
kubectl edit cm -n kube-system coredns
-
Extended Explanation
The forward plugin forwards DNS requests that cannot be resolved within the cluster to the specified upstream DNS servers.
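To see whether the resolv.conf that coredns forwards to would send queries back to coredns, its nameserver entries can be inspected. The sketch below assumes the node's /etc/resolv.conf is the file in use. suspicious_nameservers is a hypothetical helper that flags loopback addresses (a common cause named on the CoreDNS loop troubleshooting page linked in the log) and, optionally, the cluster DNS service IP passed as an argument.

```shell
# Prints nameserver entries from stdin that are loopback addresses
# or equal to the optional cluster DNS IP given as $1.
suspicious_nameservers() {
    awk -v cluster_dns="$1" '
        $1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1" || $2 == cluster_dns) { print $2 }'
}

# Usage on the node (100.10.0.10 is the DNS service IP from the log above):
#   suspicious_nameservers 100.10.0.10 < /etc/resolv.conf
```

Any address printed here is a candidate for the loop; an empty result suggests the cause lies elsewhere.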
6. Deploying a high-availability cluster with a virtual IP fails, and after the nodes are reset with bke reset, redeploying with the same, still-occupied virtual IP fails again
-
Root Cause
The keepalived component of a high-availability cluster binds the virtual IP to one of the cluster's nodes. After bke reset resets the environment on each node, the virtual IP remains bound to that node. Reusing the virtual IP then causes errors and ultimately prevents the cluster from starting.
-
Solution
Log in to the management node of the high-availability cluster from the first installation and execute the following commands to unbind the virtual IP.
# View the IP addresses bound to the node's network cards
ip addr
# If the virtual IP is bound, execute the following command to unbind it; replace <vip> with the actual virtual IP and <eth0> with the network card it is bound to
ip addr del <vip> dev <eth0>
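Finding the network card that still holds the virtual IP can be scripted. This is a minimal sketch: vip_device is a hypothetical helper name, it handles IPv4 only, and it relies on the one-line output format of ip -o -4 addr show, where the second field is the network card and the fourth is the address with its prefix length.

```shell
# Prints the network card(s) that carry the given IPv4 address.
vip_device() {
    # $1 = virtual IP; stdin = output of `ip -o -4 addr show`
    awk -v vip="$1" '$3 == "inet" { split($4, a, "/"); if (a[1] == vip) print $2 }'
}

# Usage on the node (replace <vip> with the actual virtual IP):
#   ip -o -4 addr show | vip_device <vip>
```

An empty result means the virtual IP is not bound on this node, so the remaining nodes of the first installation should be checked.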
7. How to delete a cluster installed by openFuyao from the backend instead of from the management plane
The installer-service of the bootstrap cluster or management cluster reads BKECluster data from the APIServer and then displays cluster information. Clusters can be deleted through the front-end management plane, or by logging into the terminal of the bootstrap cluster or management cluster and executing the following commands.
# Query cluster information
kubectl get bc -A
# Replace bke-cluster with actual cluster information, edit bc information, and set the following two parameters to delete the cluster
# Annotation: bke.bocloud.com/ignore-target-cluster-delete: "false"
# spec: add reset: true
kubectl edit bc -n bke-cluster bke-cluster
8. How to implement scale-up and scale-down operations for openFuyao clusters from the backend
The openFuyao management plane provides cluster lifecycle management capabilities, including cluster scale-up, scale-down, upgrade, installation, and uninstallation. This section describes how to scale a cluster up and down from the backend. Perform these operations only when the cluster is healthy; in an unhealthy cluster, they may result in errors.
-
Scale-down operation: Remove a node from an existing cluster.
View existing BKENode resources.
# Replace bke-cluster with actual cluster information
kubectl get bn -n bke-cluster
Delete the corresponding BKENode resource.
# Replace bke-cluster-n1 with the actual node name
kubectl delete bn -n bke-cluster bke-cluster-n1
View the existing BKENode resources again. If the corresponding node is no longer listed, the deletion was successful.
-
Scale-up operation: Add a new node to an existing cluster.
Create a configuration file for the new node (newNode.yaml).
apiVersion: bke.bocloud.com/v1beta1
kind: BKENode
metadata:
  name: bke-cluster-n1
  namespace: bke-cluster
  labels:
    cluster.x-k8s.io/cluster-name: bke-cluster
spec:
  hostname: n1
  ip: <node-ip>
  password: '<encrypted>'
  port: "22"
  role:
    - node
  username: root
Execute the following command to perform the scale-up operation.
kubectl apply -f newNode.yaml
View the existing BKENode resources. If the corresponding node is Ready, the scale-up was successful.
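When several nodes are added, the manifest can be generated instead of written by hand. The sketch below reuses the fields of the example manifest. render_bkenode is a hypothetical helper, and it assumes, based only on that example, that the namespace equals the cluster name and that the resource name is the cluster name followed by the hostname. The password placeholder is left untouched and must be filled in separately.

```shell
# Prints a BKENode manifest modelled on the example above.
render_bkenode() {
    # $1 = cluster name (assumed to be the namespace too), $2 = hostname, $3 = node IP
    cat <<EOF
apiVersion: bke.bocloud.com/v1beta1
kind: BKENode
metadata:
  name: $1-$2
  namespace: $1
  labels:
    cluster.x-k8s.io/cluster-name: $1
spec:
  hostname: $2
  ip: $3
  password: '<encrypted>'
  port: "22"
  role:
    - node
  username: root
EOF
}

# Usage:
#   render_bkenode bke-cluster n2 192.168.100.12 > newNode.yaml
```

Review the generated file before running kubectl apply, since the naming convention is inferred from a single example.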
Note: For v25.12 and earlier versions, refer to the following guide.
-
Scale-down operation: Remove a node from an existing cluster.
Edit the BKECluster resource.
# Replace bke-cluster with actual cluster information
kubectl edit bc -n bke-cluster bke-cluster
Set the nodes scheduled for deletion.
metadata:
  annotations:
    # Node scheduled deletion: node deletion is a dangerous action, so this annotation serves as a secondary confirmation. It is not set by default.
    # When deleting a node, in addition to removing the node from spec, fill in the IP of the node to be deleted here; separate multiple IPs with ','.
    # If either of the two operations is missing, node deletion is not triggered.
    bke.bocloud.com/appointment-deleted-nodes: "172.100.200.10"
Remove the node information from spec.
spec:
  clusterConfig:
    nodes:
      - hostname: master-1
        ip: 172.100.200.10
        username: root
        password: password0
        port: "22"
        role:
          - master
          - etcd
-
Scale-up operation: Add a new node to an existing cluster.
Edit the BKECluster resource.
# Replace bke-cluster with actual cluster information
kubectl edit bc -n bke-cluster bke-cluster
Add the node information to spec.
spec:
  clusterConfig:
    nodes:
      - hostname: master-1
        ip: 172.100.200.10
        username: root
        password: password0
        port: "22"
        role:
          - master
          - etcd
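For the v25.12 scale-down flow above, the deletion annotation takes a comma-separated list of IPs. A small sketch of building that value and setting it with kubectl annotate follows. join_ips is a hypothetical helper, and the second IP in the usage line is an invented example. Setting the annotation is only one of the two required operations, so the nodes still have to be removed from spec.

```shell
# Joins all arguments with ',' as expected by the annotation value.
join_ips() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Usage (requires cluster access; replace bke-cluster with actual cluster information):
#   kubectl annotate bc -n bke-cluster bke-cluster --overwrite \
#       "bke.bocloud.com/appointment-deleted-nodes=$(join_ips 172.100.200.10 172.100.200.11)"
```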