We had a full power outage at a customer datacenter. Power was restored, but HA did not start any VMs automatically after the hosts came up.
I don't understand why HA failed, and Googling some of the repeated errors turned up troublingly little.
I took a look at fdm.log and saw that it continuously failed to find suitable hosts for the VMs for over an hour. The VMs were eventually powered on manually when the sysadmin arrived at the DC, over an hour after the environment powered up.
Here's a snippet where it decides there are 65 VMs that need to be powered on, but then immediately starts to fault them:
2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::GenerateFailoverRecommendation] 65 vms added to domain config
2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass2: respect host preference but not failover hosts
2012-11-03T15:12:15.573Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms
And then it immediately faults all 65 with the same error:
2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b08b88]
2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none
...and so on for all 65 VMs.
Then it tries another pass:
2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsMultiplePasses] Pass3: use all compatible hosts
2012-11-03T15:12:15.581Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [DrmPE::InvokeDrsAlgorithmForPlacement] Calling mapVm to place 65 Vms
2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b00348]
2012-11-03T15:12:15.584Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none
After that pass it switches to reporting the vim.fault.NoCompatibleHost error:
2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::PlacementUpdateCb] No recommendation is generated
2012-11-03T15:12:15.587Z [FFD85B90 verbose 'Placement' opID=SWI-72429e8f] [PlacementManagerImpl::HandleNotPlacedVms] Reset Vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, vim.fault.NoCompatibleHost
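A side note on the fault names: the bracketed name in the DrmFault lines looks like an Itanium C++ ABI mangled nested name (N, then length-prefixed identifiers, then E). If that reading is right, N3Vim5Fault16NoCompatibleHostE and vim.fault.NoCompatibleHost are the same fault in two spellings rather than two different errors. Below is a minimal Python sketch for decoding such names and tallying them from fdm.log lines. The function names and the regular expression are my own, not part of any VMware tool, and the decoder only handles this simple nested form.

```python
import re
from collections import Counter

# Matches the bracketed fault in a DrmFault line, e.g. "fault [N3Vim...E:0x5b08b88]".
# The pattern is an assumption based on the log lines quoted in this post.
FAULT_RE = re.compile(r"fault \[(\w+):0x[0-9a-fA-F]+\]")


def demangle_nested(symbol):
    """Decode a simple nested name such as N3Vim5Fault16NoCompatibleHostE."""
    if not (symbol.startswith("N") and symbol.endswith("E")):
        raise ValueError("not a simple nested name: %r" % symbol)
    body = symbol[1:-1]
    parts = []
    pos = 0
    while pos < len(body):
        digits = re.match(r"\d+", body[pos:])
        if digits is None:
            raise ValueError("expected a length prefix at offset %d" % pos)
        length = int(digits.group())
        pos += digits.end()
        part = body[pos:pos + length]
        if len(part) != length:
            raise ValueError("truncated identifier at offset %d" % pos)
        parts.append(part)
        pos += length
    return "::".join(parts)


def tally_faults(lines):
    """Count DrmFault entries per demangled fault name."""
    counts = Counter()
    for line in lines:
        if "DrmFault:" not in line:
            continue
        found = FAULT_RE.search(line)
        if found:
            counts[demangle_nested(found.group(1))] += 1
    return counts


sample = [
    "2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] "
    "DrmFault: reason powerOnVm, vm /vmfs/volumes/4ac2492d-8aa9fb29-d449-001517a6c248/SOMEVM/SOMEVM.vmx, "
    "host host-231, fault [N3Vim5Fault16NoCompatibleHostE:0x5b08b88]",
    "2012-11-03T15:12:15.576Z [FFD85B90 verbose 'drmLogger' opID=SWI-72429e8f] FaultArgument: none",
]
fault_counts = tally_faults(sample)
print(fault_counts)
```

Running tally_faults over the whole fdm.log (for example, over the lines of an opened file) would show whether any fault other than NoCompatibleHost ever appears during the placement passes.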
Then it just outputs the following for almost 2 hours:
2012-11-03T15:12:37.824Z [FFE89B90 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65
2012-11-03T15:12:37.824Z [FFE89B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;
....
2012-11-03T16:57:37.931Z [70B13B90 info 'Placement'] [RR::CreatePlacementRequest] 65 total VM with some excluded: 0 VM disabled; 0 VM being placed; 65 VM waiting resources; 0 VM in time delay;
2012-11-03T16:58:37.933Z [FFB9C460 verbose 'Placement'] [RR::ResetVms] Reset 0 Vms. Records = 65
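To put a number on how long the VMs sat in "waiting resources", here is a rough Python sketch that reads the timestamps from the RR::CreatePlacementRequest lines and reports the span between the first and last one with a non-zero waiting count. The regular expression and the function name are assumptions of mine based on the lines quoted above, not anything documented by VMware.

```python
import re
from datetime import datetime

# Timestamp at line start, then the "N VM waiting resources" counter.
# The pattern is an assumption derived from the log lines in this post.
WAITING_RE = re.compile(
    r"^(\S+) .*\[RR::CreatePlacementRequest\].*?(\d+) VM waiting resources"
)


def waiting_span(lines):
    """Return the time between the first and last line reporting waiting VMs."""
    stamps = []
    for line in lines:
        found = WAITING_RE.match(line)
        if found and int(found.group(2)) > 0:
            stamps.append(
                datetime.strptime(found.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
            )
    if not stamps:
        return None
    return max(stamps) - min(stamps)


sample = [
    "2012-11-03T15:12:37.824Z [FFE89B90 info 'Placement'] [RR::CreatePlacementRequest] "
    "65 total VM with some excluded: 0 VM disabled; 0 VM being placed; "
    "65 VM waiting resources; 0 VM in time delay;",
    "2012-11-03T16:57:37.931Z [70B13B90 info 'Placement'] [RR::CreatePlacementRequest] "
    "65 total VM with some excluded: 0 VM disabled; 0 VM being placed; "
    "65 VM waiting resources; 0 VM in time delay;",
]
span = waiting_span(sample)
print(span)
```

On the two lines quoted here the span comes out at a little over 1 hour 45 minutes, which matches the "almost 2 hours" of repeated output.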
At this point one of the sysadmins arrived at the DC and started powering on the VMs manually. They started up without any problem.
So what gives? Why did HA fail so badly? If there were no compatible hosts, why could the sysadmin just turn the VMs on with no problems?
(At first I thought it might be a problem with access to the datastores, but the admin didn't have to do anything; they were all listed when he connected directly to the ESX host.)
All the hosts were up, the HA Cluster config is as follows:
"Enable: Disallow VM power on operations that violate availability constraints"
"Percentage of cluster resources reserved as failover spare capacity: 50% CPU, 50% Memory"
All other options are default.
Since it was a full power-down situation, there were no running VMs, just ESX hosts. (vCenter was down as well.)
So my question is: has anyone else seen this? Do you know why it happened?