Good morning, VMware world!
So, long story short: we were bitten hard by HA's lack of protection against APD. We run a stretched cluster configuration running 5.5 with the latest updates. EMC VPLEX replicates storage to two different EMC arrays and keeps the datastores in sync, creating a true virtual write-anywhere configuration.
As far as I understand, there are two formal classifications of storage failure from an FDM (HA) perspective: APD (All Paths Down) and PDL (Permanent Device Loss).
PDL is determined by SCSI sense codes sent from the array to the host. APD is a connection failure between the array and the host, or another catastrophic failure that didn't produce a SCSI sense code.
We have an HP c7000 enclosure with several half-height blades.
About a month ago the FlexFabric cards hit a fault where only storage connectivity from the enclosure to the SAN failed (networking was not impacted). The VMs essentially lost the ability to complete any storage I/O and went into what I can best describe as a 'zombie' state. Many of the blades ramped CPU up to near 100% after 15 minutes or so as the VMs were unable to complete any storage I/O. We lost about 200 VMs, which caused a major outage for the business.
It was then that I learned HA doesn't protect against APD in any way, shape or form. We really wish VMware would solve this issue.
Instead, I have created a few scripts and processes to work around the issue until the developers of FDM can get it resolved.
Here is how I solved it.
I created a shell script that runs on the ESXi host and has the following high-level logic:
-Check whether the storage attached via FC to the SAN is up and accessible.
-If storage is down and all of the defined datastores are inaccessible, reboot the impacted host, which forces HA to restart the VMs on a surviving host.
-Ensure the script can't restart a host if HA isn't running.
-Ensure the script can't be started multiple times on a host.
-Have a way to collect logs from the script.
-Check against multiple datastores to ensure that the paths are really down.
-If paths are down, use esxcli to rescan the adapters several times before killing the host.
-Run the script on the ESXi host itself so that patching / other activities don't stop the script from running.
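As a small illustration of the "check against multiple datastores" idea in the list above, here is a minimal sketch, not the script itself. The function names and the heartbeat.txt file name are made up for this example, and it assumes a dead path makes the write fail with an error rather than hang.

```shell
#!/bin/sh
# Sketch only: the directories passed in stand in for the heartbeat folders
# on each test datastore (hypothetical paths, supply your own).

# Return 0 if a timestamp can be written into the directory, 1 otherwise.
probe_datastore() {
    probe_dir=$1
    [ -d "$probe_dir" ] || return 1
    # The braces keep the shell's own "cannot create" message quiet as well.
    { date > "$probe_dir/heartbeat.txt"; } 2>/dev/null
}

# Return 0 (true) only when every listed datastore fails the probe.
all_datastores_down() {
    # With nothing to test, do not report a failure.
    [ "$#" -gt 0 ] || return 1
    for ds in "$@"; do
        if probe_datastore "$ds"; then
            return 1   # at least one datastore still answers
        fi
    done
    return 0
}
```

Requiring every datastore to fail before acting means a single unmounted or misbehaving LUN does not get a healthy host rebooted.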
Here we go!
Again, the script runs on the ESXi host itself. It is a Linux-style shell script (I am not a Linux engineer, so this was a best effort for me)...
#!/bin/sh
echo "$(date) -- Current date : $(date) @ $(hostname)"
echo "$(date) -- IVIS HA APD Issue Script Start!"
#DEFINE TEST DATASTORES
HBFILE=ivisHB$(hostname).txt
TESTDS1=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_1***/ivisHA
TESTDS2=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_2***/ivisHA
TESTDS3=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_3***/ivisHA
TESTDS4=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_4***/ivisHA
LOCKFILELOC=/tmp/ivisHB.lck
#-p: do not complain if the heartbeat directory already exists
mkdir -p "$TESTDS1"
mkdir -p "$TESTDS2"
mkdir -p "$TESTDS3"
mkdir -p "$TESTDS4"
#DECLARE FUNCTIONS VARS
RUN=1
EchoDates() {
    echo "$(date)" > $TESTDS1/$HBFILE
    echo "$(date)" > $TESTDS2/$HBFILE
    echo "$(date)" > $TESTDS3/$HBFILE
    echo "$(date)" > $TESTDS4/$HBFILE
}
TestDatastores() {
    if [[ -e "$TESTDS1" || -e "$TESTDS2" || -e "$TESTDS3" || -e "$TESTDS4" ]]
    then
        EchoDates
        local return_value=0
    else
        echo "$(date) -- $(hostname) lost connectivity to all datastores at $(date)...."
        local return_value=1
    fi
    return "$return_value"
}
LockScript(){
    lockfile -r 0 "$LOCKFILELOC"
}
CheckLock(){
    lockfile -r 0 "$LOCKFILELOC" || exit 1 # SCRIPT IS ALREADY RUNNING
}
CheckHARunning(){ #FUNCTION TO CHECK IF HA(FDM v5.1+) IS RUNNING
    if [ $(ps -Z | grep fdm | wc -l) -gt 0 ]
    then
        local rt_CheckHARunning=1 #FDM running
    else
        local rt_CheckHARunning=0 #FDM not running
    fi
    return "$rt_CheckHARunning"
}
RescanHBAs(){
    esxcli storage core adapter rescan --all
}
CheckLockFileExists(){
    if [ -e "$LOCKFILELOC" ]
    then
        local rt_ChkFileDel=1 #CHK NOT DELETED
    else
        local rt_ChkFileDel=0 #CHK DELETED
    fi
    return "$rt_ChkFileDel"
}
#MAIN LOOP PROGRAM
CheckLock
LockScript
RescanHBAs
EchoDates
echo "$(date) $(hostname) has started IVISHA!"
while [ "$RUN" -eq 1 ] #A bare [ $RUN ] is true for any non-empty value, including 0
do
    #Check and see if lockfile is deleted. If so EXIT GRACEFULLY.
    CheckLockFileExists
    rt_ChkFileDel=$?
    if [ "$rt_ChkFileDel" -eq 0 ] #0 = LOCK FILE DELETED
    then
        echo "$(date) -- Lock File Deleted Killing Script... "
        RUN=0
        break #Leave the loop now so no further checks run
    fi
    #Check and see if HA is running if it isn't don't do anything.
    CheckHARunning
    rt_CheckHARunning=$?
    if [ "$rt_CheckHARunning" -eq 1 ]
    then
        TestDatastores
        return_value=$?
        if [ "$return_value" -eq 1 ]
        then
            echo "$(date) -- Initial Failure Detected $(date) APD Detected..."
            ContinueLoop=1
            FailureCount=0
            while [ "$ContinueLoop" -eq 1 ]
            do
                sleep 5
                TestDatastores
                return_value=$?
                if [ "$return_value" -eq 1 ]
                then
                    #Only reboot when the latest test failed as well
                    if [ "$FailureCount" -eq 3 ]
                    then
                        echo "$(date) -- IVISHA has detected APD... REBOOTING HOST NOW!!!!"
                        reboot -n -f
                    fi
                    FailureCount=$((FailureCount + 1))
                    RescanHBAs
                    sleep 5
                else
                    #Storage is back: heartbeat and return to the main loop
                    EchoDates
                    ContinueLoop=0
                fi
            done
        else
            EchoDates
        fi
        sleep 20
    else
        echo "$(date) -- IVISHA has detected HA (FDM Agent) is Off-line... Script Sleeping..."
        sleep 120
    fi
done
The script is stored on a datastore that is shared among all of the hosts.
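The script above leans on the lockfile command for its single-instance check. If that command isn't available where you run it, one alternative is a lock directory, since mkdir either creates the directory or fails. This is only a sketch of that alternative, not what the script does; LOCKDIR and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: LOCKDIR is a hypothetical path. The original script uses a
# lock file at /tmp/ivisHB.lck rather than a directory.
LOCKDIR=${LOCKDIR:-/tmp/ivisHB.lockdir}

# mkdir either creates the directory or fails, so only one caller can win.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

# True while the lock is in place; removing it asks a running loop to stop.
lock_present() {
    [ -d "$LOCKDIR" ]
}

# Remove the lock; a missing lock is not treated as an error worth reporting.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}
```

A script would call `acquire_lock || exit 1` once at start-up and test `lock_present` at the top of each loop pass, the same way the lock file is used above.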
SCRIPT LOGIC
1. Define static vars.
2. Make the directories to write heartbeats to.
3. Check whether the lock file the script creates already exists; if it does, exit (the script is already running).
4. Create the lock file.
5. Rescan the HBAs.
6. Echo to the log that the script has started.
7. Start the endless loop.
8. Ensure the lock file exists, else exit.
9. Check that HA is running; if not, sleep.
10. Test the datastores to see if the heartbeat directory exists on at least one of them; if it does, heartbeat a date/time stamp into the heartbeat file.
11. If none exists, go into a sub-loop: check 3 more times to see if any one of the datastores is accessible, rescanning the HBAs between checks. If all 3 checks fail, force a REBOOT.
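Steps 10 and 11 boil down to "retry a few times, and only act when every retry fails". Here is a minimal sketch of that pattern on its own. PROBE_CMD, RECOVER_CMD and FAIL_CMD are hypothetical hooks invented for this example; on a host they might be the datastore test, the HBA rescan and the forced reboot.

```shell
#!/bin/sh
# Sketch only: PROBE_CMD, RECOVER_CMD and FAIL_CMD are hypothetical hooks
# that the caller sets to a function or command name before calling.

# confirm_failure MAX_FAILURES DELAY_SECONDS
# Returns 0 after running FAIL_CMD (every retry failed),
# or 1 as soon as one probe succeeds.
confirm_failure() {
    max_failures=$1
    delay=$2
    failures=0
    while [ "$failures" -lt "$max_failures" ]; do
        sleep "$delay"
        if $PROBE_CMD; then
            return 1                # storage answered again, stand down
        fi
        failures=$((failures + 1))
        $RECOVER_CMD                # recovery step after each failed probe
    done
    $FAIL_CMD                       # no probe succeeded within the budget
    return 0
}
```

Keeping the destructive action behind a hook makes it easy to dry-run the loop with an echo before trusting it with a reboot.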
From the time of pulling the cable from the host to the time of reboot is about 4.5 minutes on my test boxes. It works every time; HA then restarts the VMs on surviving hosts. The script is still very raw, but again, it works. Please post updates/enhancements here.
Now the next problem.... How do you start the script on an ESXi host???
This is where PowerCLI and PLINK come to the rescue.
# IVIS HA FUNCTIONS
Function Start-ESXiHostSSH()
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [string]$HostName
    )
    if(![string]::IsNullOrEmpty($_)){
        $HostName = $_
    }
    $TargetHost = Get-VMHost -Name $HostName
    Start-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"})
}
Function Stop-ESXiHostSSH()
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [string]$HostName
    )
    if(![string]::IsNullOrEmpty($_)){
        $HostName = $_
    }
    $TargetHost = Get-VMHost -Name $HostName
    Stop-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"}) -Confirm:$false
}
#GLOBAL HA VARIABLES
$primaryIvisHADS = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/"
$primaryIvisHADSScript = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/ivisHAv2.sh"
$lockFileLocation = "/tmp/ivisHB.lck"
$localIvisHBLog = "/scratch/log/ivisHA.log"
function Start-ivisHA #-force removes a leftover lock file first; use it only when the script has crashed and the lock file still exists.
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [string]$HostName,
        [string]$pass,
        [bool]$stopSSH = $true,
        [bool]$force = $false
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    #The script is started by absolute path (./tmp only resolves when the working directory is /),
    #and stderr goes to the log along with stdout.
    if($force){
        C:\putty\plink.exe $HostName -l root -pw $pass "rm $lockFileLocation; cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog 2>&1 &"
    }
    else{
        C:\putty\plink.exe $HostName -l root -pw $pass "cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog 2>&1 &"
    }
    if($stopSSH){
        Stop-ESXiHostSSH -HostName $HostName
    }
}
function Stop-ivisHA #Removes the lock file on the ESXi host; the running script exits when it sees the lock file is gone.
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [string]$HostName,
        [string]$pass,
        [bool]$stopSSH = $true,
        [bool]$force = $false
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    #No redirect into the log here: "> $localIvisHBLog" would truncate the log the script has written.
    C:\putty\plink.exe $HostName -l root -pw $pass "rm -f $lockFileLocation"
    if($stopSSH){
        Stop-ESXiHostSSH -HostName $HostName
    }
}
function Collect-ivisHALogs
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [string]$HostName,
        [string]$pass
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    #mkdir -p so repeat runs don't error; cp runs in the foreground so it finishes before SSH is stopped.
    C:\putty\plink.exe $HostName -l root -pw $pass "mkdir -p $primaryIvisHADS/logCollection; cp /scratch/log/ivisHA.log $primaryIvisHADS/logCollection/ivisHA$HostName.log"
    Stop-ESXiHostSSH -HostName $HostName
}
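For anyone who wants to see what the start-up command string boils down to on the host side, here is a rough shell sketch of the same copy / chmod / nohup sequence as one function. start_watchdog and its three arguments are hypothetical names for this example; on a host they would be the script on the shared datastore, a local copy in /tmp, and the log file.

```shell
#!/bin/sh
# Sketch only: start_watchdog and its arguments are hypothetical.
# Pass absolute paths so the copy is run from where it was placed.

# start_watchdog SOURCE_SCRIPT LOCAL_COPY LOG_FILE
# Prints the PID of the background process on success.
start_watchdog() {
    src=$1
    copy=$2
    log=$3
    cp "$src" "$copy" || return 1
    chmod 755 "$copy" || return 1
    # stdout and stderr both go to the log; nohup ignores the hangup signal
    # so the process keeps running after the SSH session closes.
    nohup "$copy" > "$log" 2>&1 &
    echo $!
}
```

Example call, using placeholder paths: `start_watchdog /path/on/shared/datastore/script.sh /tmp/script.sh /scratch/log/script.log`.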
APD isn't very common, but it does happen, and when it strikes it has a serious impact. I am available via email / message to help you set this up in your environment and experiment with it if you would like.
Cheers, VMware!