Hello,
We had a system with 1 ESXi 5.1 host with local disks.
We are now adding redundancy with a second host (ESXi 5.5 U2) and a vCenter 5.5 appliance.
After installing and adding everything to vCenter, we upgraded the ESXi 5.1 host to ESXi 5.5 U2. The SAN is operating correctly (vMotion is working on a separate NIC).
Now, if I try to enable High Availability, both servers will install the HA Agent, and start "Election".
All datastores (4) on the SAN are chosen for the HA heartbeat, and the isolation response is the default, "leave powered on".
One server always completes this process, while the other keeps "electing" until it reaches 100% and then fails the election with "operation timed out".
I have seen this problem on both servers, so I think the elected "master" does not have the problem, only the "slave".
I have checked these articles and followed their steps, but none worked:
VMware KB: Reconfiguring HA (FDM) on a cluster fails with the error: Operation timed out
- The services were running
VMware KB: Configuring HA in VMware vCenter Server 5.x fails with the error: Operation Timed out
- All MTUs were set to 1500
- The default gateway was not the same on both hosts, but I corrected this. There is no routing involved. The HA setting is "leave powered on". After correcting the gateway and disabling/re-enabling HA, the problem is still the same.
VMware KB: Verifying and reinstalling the correct version of the VMware vCenter Server agents
- I executed "Reinstalling the ESX host management agents and HA agents on ESXi" and verified that the agent was uninstalled and reinstalled when re-enabling HA. I did this for both hosts. This actually fixed the election problem, and I was even able to run an HA test successfully. However, when I then powered down the 2nd server (to test HA in the other direction), HA did not fail over to the 1st and everything remained down. After clicking "reconfigure HA", the election problem appeared again on one of the hosts.
These are some extracts from the logs:
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:03:00 PM 192.27.224.138
-vSphere HA agent is healthy info 11/29/2014 10:02:56 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Master info 11/29/2014 10:02:56 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:01:26 PM 192.27.224.138
-vSphere HA agent is healthy info 11/29/2014 10:01:22 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Master info 11/29/2014 10:01:22 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:03:02 PM 192.27.224.139
-Alarm 'vSphere HA host status' on 192.27.224.139 changed from Green to Red info 11/29/2014 10:02:58 PM 192.27.224.139
-vSphere HA agent for this host has an error: vSphere HA agent cannot be correctly installed or configured warning 11/29/2014 10:02:58 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Initialization Error info 11/29/2014 10:02:58 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:00:52 PM 192.27.224.139
-Datastore DSMD3400DG2VD2 is selected for storage heartbeating monitored by the vSphere HA agent on this host info 11/29/2014 10:00:49 PM 192.27.224.139
-Datastore DSMD3400DG2VD1 is selected for storage heartbeating monitored by the vSphere HA agent on this host info 11/29/2014 10:00:49 PM 192.27.224.139
-Firewall configuration has changed. Operation 'enable' for rule set fdm succeeded. info 11/29/2014 10:00:45 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Uninitialized info 11/29/2014 10:00:40 PM Reconfigure vSphere HA host 192.27.224.139 root
-vSphere HA agent on this host is disabled info 11/29/2014 10:00:40 PM Reconfigure vSphere HA host 192.27.224.139 root
-Reconfigure vSphere HA host 192.27.224.139 Operation timed out. root HOSTSERVER01 11/29/2014 10:00:31 PM 11/29/2014 10:00:31 PM 11/29/2014 10:02:51 PM
-Configuring vSphere HA 192.27.224.139 Operation timed out. System HOSTSERVER01 11/29/2014 9:56:42 PM 11/29/2014 9:56:42 PM 11/29/2014 9:58:55 PM
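The events above are listed newest first and grouped per host, which makes the sequence hard to follow. Below is a minimal Python sketch for merging them into one chronological timeline. It assumes the lines are pasted exactly as shown (message, severity, timestamp, then target/host); the name parse_events and the regular expressions are only my own illustration, not part of any VMware tool. Task rows without a severity column (the last two lines) are skipped.

```python
import re
from datetime import datetime

# Assumed layout: "-<message> <severity> <MM/DD/YYYY H:MM:SS AM|PM> <target/host ...>"
EVENT_RE = re.compile(
    r"^-?\s*(?P<msg>.*?)\s+(?P<level>info|warning|error)\s+"
    r"(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<rest>.*)$"
)
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def parse_events(lines):
    """Return (timestamp, host, message) tuples sorted oldest first."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # e.g. task rows, which have no severity column
        host = IP_RE.search(match.group("rest"))
        events.append((
            datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p"),
            host.group(0) if host else "unknown",
            match.group("msg"),
        ))
    return sorted(events)


if __name__ == "__main__":
    pasted = [
        "-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:03:02 PM 192.27.224.139",
        "-vSphere HA agent is healthy info 11/29/2014 10:02:56 PM 192.27.224.138",
        "-Firewall configuration has changed. Operation 'enable' for rule set fdm succeeded. info 11/29/2014 10:00:45 PM 192.27.224.139",
    ]
    for when, host, message in parse_events(pasted):
        print(when.strftime("%H:%M:%S"), host, message)
```

Sorted this way, it is easier to see what each host was doing around the moment the other one reported "Initialization Error".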
Can someone please provide me with some help here?
Or extra things I can check or provide?
I am running out of options currently.
Best Regards,
Joris
P.S. When I tried to do a "Cold Migration" yesterday (while still running ESXi 5.1 on local storage, migrating from local storage to the SAN), I got the error "unable to connect to host". So I had to use Veeam's 'quick migration' feature to copy the files to the other datastore (on the SAN). Maybe this is a related problem.
P.P.S. I have configured more of these systems from scratch in the past with no problem (though this is an 'upgrade').