Host Disconnects Due to NFS
Ah, another lovely Tuesday. The Monday fires have been put out, TPS reports have all been filed (with the new cover letter no less), and now we can get back to focusing on innovating. Not so fast. It looks like one of our hosts has disconnected from vCenter. This should be fun!
None of the usual fixes work: attempting to reconnect in vCenter, restarting hostd and vpxa, and a full restart of all services. There is not much useful information in the vmkernel log file, other than some obscure NFS messages. Time for a ticket to VMware support.
After analyzing the log bundle, the engineer also pointed out the same obscure NFS messages:
2019-10-22T07:36:54.431Z cpu2:34634)WARNING: NFS: 221: Got error 13 from mount call
2019-10-22T07:37:24.432Z cpu19:34310)WARNING: NFS: 221: Got error 13 from mount call
2019-10-22T07:37:54.433Z cpu3:33436)WARNING: NFS: 221: Got error 13 from mount call
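To get a feel for how often these warnings show up, a rough tally per error code can help. The following is only a sketch: the function name nfs_mount_error_summary is made up for illustration, and it assumes each warning sits on its own line in the exact format shown above.

```shell
# Hypothetical helper: count NFS mount-call warnings per error code.
# Assumes one log message per line, formatted like the samples above.
nfs_mount_error_summary() {
    log_file=$1
    # Pull out just the numeric error code from each matching line,
    # then count how many times each code occurs.
    sed -n 's/.*NFS: [0-9]*: Got error \([0-9]*\) from mount call.*/\1/p' "$log_file" |
        sort -n | uniq -c |
        awk '{ print "error " $2 ": " $1 " occurrence(s)" }'
}
```

On the host this would be called with the path to the vmkernel log, for example nfs_mount_error_summary /var/log/vmkernel.log (assuming the default log location).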
According to the engineer, these messages can appear when an NFS share that is being mounted was renamed on the storage side and was previously in use on the ESXi host under its original name.
The process to identify this is to validate that the total number of mounted NFS datastores is the same as reported in the esx.conf configuration file. To check the configuration file, run the following command:
cat /etc/vmware/esx.conf | grep <Volume Service Name>
Replace <Volume Service Name> with your NFS datastore name. The command should output something similar to the following:
/nas/<volume name>/readOnly = "false"
/nas/<volume name>/host = "w.x.y.z"
/nas/<volume name>/share = "/exports/<Volume Service Name>"
/nas/<volume name>/enabled = "true"
Determine the number of NFS mounts on the host with the following esxcli command:
esxcli storage nfs list
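Comparing the two outputs by eye gets tedious with more than a handful of datastores, so here is a minimal sketch of automating it. The function names are hypothetical, and the parsing rests on two assumptions: esx.conf entries look like the /nas/<volume name>/... lines above, and a saved copy of the esxcli storage nfs list output has two header lines followed by one row per volume, with a space-free volume name in the first column.

```shell
# Hypothetical helpers; names and parsing rules are assumptions, not VMware tooling.

# Print the NFS volume names recorded in an esx.conf-style file, one per line.
conf_nfs_volumes() {
    sed -n 's|^/nas/\([^/]*\)/.*|\1|p' "$1" | sort -u
}

# Print volume names from saved `esxcli storage nfs list` output.
# Assumes two header lines, then rows whose first column is the volume name.
mounted_nfs_volumes() {
    awk 'NR > 2 && NF > 0 { print $1 }' "$1" | sort -u
}

# Print names that appear in the config file but not in the mounted list.
stale_nfs_volumes() {
    conf_file=$1
    list_file=$2
    conf_nfs_volumes "$conf_file" | while IFS= read -r name; do
        if ! mounted_nfs_volumes "$list_file" | grep -qxF -- "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example use on the host (not run here):
#   esxcli storage nfs list > /tmp/nfs_list.txt
#   stale_nfs_volumes /etc/vmware/esx.conf /tmp/nfs_list.txt
```

Any name it prints is a candidate for the cleanup step, though it is worth double-checking each one before removing anything.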
Sure enough, the set of datastores mounted on the host differs from what is in the configuration file. The resolution is to remove the datastores in question manually from the esx.conf configuration file. This can be accomplished with the following command:
esxcli storage nfs remove -v <Volume Service Name>
Confirm that the esx.conf file has been updated with the following command:
cat /etc/vmware/esx.conf | grep <Volume Service Name>
This should return no results. Interestingly enough, the VMware KB article with similar log messages was completely unrelated. Ultimately, the issue occurs because a number of services are being killed, which impacts hostd and its communication with vCenter.
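For the final "no results" check, a small wrapper can make the outcome explicit instead of relying on empty grep output. This is a sketch only: nfs_entry_removed is a hypothetical name, and it assumes the entries follow the /nas/<volume name>/ pattern shown earlier.

```shell
# Hypothetical helper: report whether any /nas/<name>/ entries remain in a config file.
# Returns 0 when the entries are gone, 1 when at least one is still present.
nfs_entry_removed() {
    conf_file=$1
    name=$2
    # Match the full "/nas/<name>/" prefix so that ds1 does not match ds10.
    if grep -qF -- "/nas/${name}/" "$conf_file"; then
        echo "still present: ${name}"
        return 1
    fi
    echo "removed: ${name}"
}
```

It would be called as nfs_entry_removed /etc/vmware/esx.conf followed by the volume name.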