vSAN Issue After 6.7 Update 3
We embarked on patching all of our hosts to 6.7 Update 3, which included our vSAN clusters. The first vSAN cluster to get the update is our four-node management cluster. After patching two hosts to 6.7 Update 3 (build 14320388), we noticed several vSAN alarms. At first I thought this was something similar to my post here, but I soon realized this was a different issue.
Two of the vSAN alarms stood out to me in particular: the vSAN: Basic (unicast) connectivity check, and the vSAN Cluster Partition check. The basic connectivity check does just what its name entails; it performs a ping test with a small packet size from each host to all other hosts. This exercises the vmknic, uplink, VLAN, physical switch, and other associated settings.
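The same test can be run by hand from an ESXi shell with vmkping. A quick sketch, assuming vmk1 is the vSAN vmkernel interface and 10.0.0.2 is a peer host's vSAN IP (both placeholders):

```shell
# Ping a peer's vSAN IP, sourced from the vSAN vmkernel interface
vmkping -I vmk1 10.0.0.2

# If jumbo frames are in use, -d (don't fragment) plus a large -s payload
# also verifies the MTU end to end along the path
vmkping -I vmk1 -d -s 8972 10.0.0.2
```

A reply from every host rules out the vmknic, VLAN, and physical path; a failure narrows the problem to one of those layers.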
The Cluster Partition check validates that the vSAN cluster is in a single partition. In order to ensure proper functionality, all vSAN hosts must be able to communicate over both multicast and unicast. If they cannot, the vSAN cluster will split into two or more partitions. If this occurs, the cluster will be in a degraded state, with objects becoming unavailable until the partition is resolved.
It’s important to note that before patching, there were no issues with the cluster. So why is it broken? And why on a Friday?! Unfortunately, there was not a lot of useful information in the log files. Troubleshooting tests only confirmed the results of the vSAN health checks. Something is going on with that vSAN vmkernel port.
Per the vSAN Cluster Partition check, we have two partitions, each made up of half the hosts: the patched and the unpatched. Hosts within each partition can communicate with one another, but not across partitions. All of the workloads are on the unpatched hosts, so we have a little more freedom in testing.
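Each host's view of the partition can be confirmed from an ESXi shell; hosts in different partitions report different member counts and member UUID lists:

```shell
# Show this host's view of vSAN cluster membership. In a healthy four-node
# cluster the Sub-Cluster Member Count should be 4; here, each half reported 2.
esxcli vsan cluster get
```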
I initially disabled and re-enabled the vSAN service on the vmkernel port, which had no effect. The next step was to remove and re-add the vSAN vmkernel port entirely. Boom! Immediately after adding the vmkernel port back, I had connectivity.
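The equivalent can also be done from an ESXi shell. A minimal sketch, assuming vmk1 is the vSAN vmkernel interface on a standard vSwitch port group named "vSAN" with a static IP of 10.0.0.1/24 (all placeholders for your environment):

```shell
# Untag vSAN traffic from the vmkernel interface, then delete the interface
esxcli vsan network ip remove -i vmk1
esxcli network ip interface remove --interface-name=vmk1

# Recreate the interface, reapply its IP settings, and re-add it to vSAN
esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vSAN
esxcli network ip interface ipv4 set -i vmk1 -I 10.0.0.1 -N 255.255.255.0 -t static
esxcli vsan network ip add -i vmk1
```

Do this one host at a time and re-run the health checks in between, since the host briefly drops out of vSAN networking while the interface is gone.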
Rinse and repeat for the last host, and almost all of the vSAN errors go away. The vSAN Object Health check expectedly takes a little time to clear while data is rebalanced across the cluster. We’re back to a single partition and a happy cluster.
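While the cluster settles, resync progress and object health can be watched from an ESXi shell; these commands exist in 6.7, though the exact output varies by build:

```shell
# Summarize objects still resyncing across the cluster
esxcli vsan debug resync summary get

# Overall object health once the resync completes
esxcli vsan debug object health summary get
```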
So why did this happen? I’m not sure yet. I have an SR open with VMware, so I’m hoping they can give me a root cause.
Same here. Did you get some answer from VMware?
Hello Guilherme. I was not able to get a root cause out of VMware. They requested logs before patching, and prior to fixing the issue. Unfortunately (or fortunately?) there were no additional hosts in our management cluster to patch. We did patch our remaining vSAN clusters, which were of different hardware, and did not experience any issues. May I ask what kind of hardware you have where you ran into the issue?