A very interesting Microsoft cluster failure

It's been a long time sine last post. I was out of internet due to health issue. Just got recovered and backed to normal work. I have to publish my article by English then translate it to Chinese later since I lost lot of me time after my baby born, but more fun. hopefully it not impact to Google search. :-)

There was a interesting problem happend on Microsoft cluster when I came back from hospital. Our DBA team complaint Microsoft Cluster Service failed intermittently on virtual machine. This situation constantly happend for a week.

At the beginning of the whole troubleshooting, team noticed quorum disks failed with following Windows event:

Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

So we focused on disk performance. vbod.log also show some performance degrading but the time was not match. Microsoft was involved after that, they said the cluster failure actually caused by network connectivity issue according to following Windows event:

Cluster node 'xxx' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

It became interesting since virtual machines share physical network links, it cannot be only single virtual machine had problem if there was network connectivity issue. Then we noticed there was following abnormal Windows event when some failure happend:

Reset to device, DeviceRaidPort0, was issued

It related to ISL driver bug, maybe not related to cluster failure issue but it worth to update to improve cluster stability. You can check Windows virtual machine event log reports the error: Reset to device, DeviceRaidPort0, was issued about this bug.

After involved multiple vendors from OS, virtualization, network and storage team, everybody said it's not their problem. You could see this kind of problem in large datacenter since more and more system installed, it's hard to find out which piece of the system caused the issue. You have to familar with each field of datacenter.

Eventually we figured out the issue related to storage workload. But why vendors cannot figured out this problem? First of all, Windows OS disk is running on shared storage, Windows no responding when the VMFS5 datastore latency of OS disk is high. From Windows perspective, it doesn't know what happend on backend storage, it just know OS is very slow for few seconds, kind of pause the system. So it leads to network packages drop, and no Windows event for that since OS resumed very quickly, Windows takes it as normal behavior. Cluster actually failed at this moment. Secondary, the particular LUNs hosted the virtual machine was not busy, but the LUNs shared same storage pool with other LUNs. Any high workload LUN will impact rest of LUNs in same storage pool.

After understood these points, we figured out lot of virtual machines got high latency or high IO around 7PM every day, and most of the cluster failure happend this time. Since it's impacted to large number of virtual machines, it must be caused by some common components on virtual machines. We eventially figured out it's McAfee DAT updating after captured network packages. All virtual machines did same downloading in same time lead to high workload on shared storage and lead to cluster virtual machiens no responding for few seconds. The issue got fixed after change McAfee DAT updating schedule to random interval.

There are always common things going on on datacenter, it maybe small resource consumer but it can be a signaficant big monster in virtualized datacenter. Such as backup, monitoring, anti-virus or system management agents. It can impacts to shared storage or network links.


Popular posts from this blog

Moving to Blogger.com

How to Use Proxy on WSL 2

Connect-NsxtServer shows "Unable to connect to the remote server"