We had an isolation issue on our 4.1 production systems many months ago. Since then, we haven't been allowed to use HA to automatically restart VMs on another server. Now that v5.0, with its datastore heartbeats, is in the mix, I'm pushing to be allowed to re-enable it.
During that incident, a switch outage disconnected our vCenter system from our two-host cluster and also isolated one of the hosts in the cluster. The VMs kept running on the isolated host, but when the non-isolated host tried to start the VMs covered under HA, data on the datastore was corrupted because two identical VMs were running against the same data (or rather, both hosts were repeatedly attempting to start and restart the same VM).
In an isolation event, how do the hosts prevent the data corruption that comes from multiple VMs accessing the same data at the same time? By that, I mean: if a host goes into isolation but the VM continues to run (and continues to access the datastore), and the same VM subsequently starts up on another host, what's to prevent the two running VMs from frying my datastore? What is the best way to configure HA in this respect? We have several VMs for which a graceful shutdown would be highly desirable (SQL, Exchange), but others can tolerate a hard shutdown.
I've read a lot of documentation about HA, but none of it seems to cover the data access aspect. If you know of any documentation that does, I would appreciate a link.
Message was edited by: fpineau - Clarified incident to rule out file locks