Channel: VMware Communities : Discussion List - Availability: HA & FT

ESXi 5.5 -- APD not causing HA event. --Scripts to monitor / reboot hosts


Good morning, VMware world!

 

So... long story short, we were bitten hard by HA's lack of protection against APD (All Paths Down). We run a stretched cluster on ESXi 5.5 with the latest updates. EMC VPLEX replicates storage across two different EMC arrays and keeps the datastores in sync, creating a true virtual write-anywhere configuration.

 

As far as I understand, there are two formal classifications of storage failure from an FDM (HA) perspective: APD (All Paths Down) and PDL (Permanent Device Loss).

PDL is determined by SCSI sense codes sent from the array to the host. APD is a loss of connectivity between the array and the host, or another catastrophic failure that didn't produce a SCSI sense code.
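
As a rough illustration of that distinction (not something ESXi exposes in this form), the rule can be written as a tiny shell function. The name classify_storage_failure and its two arguments are hypothetical; on a real host you would have to work out the path count and the sense-code information yourself.

```shell
# Hypothetical helper, for illustration only.
#   $1 = number of working paths to the device
#   $2 = "yes" if the array returned a permanent-loss SCSI sense code
classify_storage_failure() {
  paths_up=$1
  pdl_sense_seen=$2
  if [ "$pdl_sense_seen" = "yes" ]; then
    echo "PDL"   # the array told the host the device is gone for good
  elif [ "$paths_up" -eq 0 ]; then
    echo "APD"   # every path is down and no sense code explains why
  else
    echo "OK"
  fi
}
```

The APD branch is the case HA did not react to in our outage.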

 

We have an HP c7000 enclosure with several half-height blades.

About a month ago the FlexFabric cards failed in a way that cut off only storage traffic from the enclosure to the SAN (networking was not impacted). The VMs essentially lost the ability to complete any storage I/O and went into what I can best describe as a 'zombie' state. Many of the blades ramped up to nearly 100% CPU after 15 minutes or so as the VMs were unable to complete any storage I/O. We lost about 200 VMs, which caused a major outage for the business.

 

It was then I learned HA doesn't protect against APD in any way, shape, or form. We really wish VMware would solve this issue.

Instead I have created a few scripts and processes to work around this issue until the developers of FDM can get it resolved...

Here is how I solved this issue.

 

I have created a shell script that runs on an ESXi host and has the following high-level logic.

 

-Check to see if storage attached via FC to SAN is up and accessible.

-If storage is down and all data stores are inaccessible (defined data stores), reboot the impacted host, which will force HA to restart the VMs on a surviving host.

-Ensure script can't restart a host if HA isn't running.

-Ensure that script can't be started multiple times on a host.

-Have a way to collect logs from the script.

-Check against multiple datastores to ensure that paths are down.

-If paths are down use esxcli to rescan the interface several times prior to killing the host.

-Run the script on the ESXi host itself so that patching / other activities don't keep the script from running.
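
To make the "all defined data stores are inaccessible" condition from the list above concrete, here is a minimal sketch. The function name all_datastores_down is hypothetical, and the paths you pass in are whatever heartbeat directories you choose to test. It simply checks whether each path is visible, which is the same kind of test the script relies on.

```shell
# Hypothetical helper: returns 0 only when NONE of the given paths is
# visible, i.e. every test datastore looks down.
all_datastores_down() {
  # With no paths to test we cannot claim anything is down.
  if [ "$#" -eq 0 ]; then
    return 1
  fi
  for ds in "$@"; do
    if [ -e "$ds" ]; then
      return 1   # at least one datastore is still reachable
    fi
  done
  return 0
}
```

Testing several datastores instead of one keeps a single unmounted or renamed datastore from looking like a host-wide storage failure.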

 

Here we go!

 

The script, again, runs on the ESXi host itself. It is a Linux shell script (I am not a Linux engineer, so this was a best effort for me)....

 

#!/bin/sh

echo "$(date) -- Current date : $(date) @ $(hostname)"
echo "$(date) -- IVIS HA APD Issue Script Start!"

#DEFINE TEST DATASTORES
HBFILE=ivisHB$(hostname).txt
TESTDS1=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_1***/ivisHA
TESTDS2=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_2***/ivisHA
TESTDS3=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_3***/ivisHA
TESTDS4=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_4***/ivisHA
LOCKFILELOC=/tmp/ivisHB.lck

#Create the heartbeat directories (-p: no error if they already exist)
mkdir -p "$TESTDS1"
mkdir -p "$TESTDS2"
mkdir -p "$TESTDS3"
mkdir -p "$TESTDS4"

#DECLARE FUNCTIONS VARS
RUN=1

EchoDates() { #Write a date stamp heartbeat to every test datastore
  echo "$(date)" > "$TESTDS1/$HBFILE"
  echo "$(date)" > "$TESTDS2/$HBFILE"
  echo "$(date)" > "$TESTDS3/$HBFILE"
  echo "$(date)" > "$TESTDS4/$HBFILE"
}

TestDatastores() { #Returns 0 if at least one test datastore is visible, 1 if none are
  if [ -e "$TESTDS1" ] || [ -e "$TESTDS2" ] || [ -e "$TESTDS3" ] || [ -e "$TESTDS4" ]
  then
    EchoDates
    return 0
  else
    echo "$(date) -- $(hostname) lost connectivity to all datastores at $(date)...."
    return 1
  fi
}

CheckLock() { #Creates the lock file, or exits if it already exists (SCRIPT IS ALREADY RUNNING)
  lockfile -r 0 "$LOCKFILELOC" || exit 1
}

CheckHARunning() { #FUNCTION TO CHECK IF HA (FDM v5.1+) IS RUNNING: returns 1 if running, 0 if not
  if [ $(ps -Z | grep fdm | wc -l) -gt 0 ]
  then
    return 1 #FDM running
  else
    return 0 #FDM not running
  fi
}

RescanHBAs() {
  esxcli storage core adapter rescan --all
}

CheckLockFileExists() { #Returns 1 if the lock file still exists, 0 if it was deleted
  if [ -e "$LOCKFILELOC" ]
  then
    return 1 #CHK NOT DELETED
  else
    return 0 #CHK DELETED
  fi
}

#MAIN LOOP PROGRAM
CheckLock #lockfile checks for and creates the lock in one step, so no separate lock call is needed
RescanHBAs
EchoDates

echo "$(date) $(hostname) has started IVISHA!"

while [ "$RUN" -eq 1 ] #a bare [ $RUN ] is true for any non-empty value, including 0
do
  #Check and see if lockfile is deleted. If so EXIT GRACEFULLY.
  CheckLockFileExists
  if [ $? -eq 0 ]
  then
    echo "$(date) -- Lock File Deleted Killing Script... "
    RUN=0
    break
  fi

  #Check and see if HA is running if it isn't don't do anything.
  CheckHARunning
  if [ $? -eq 1 ]
  then
    TestDatastores
    if [ $? -eq 1 ]
    then
      echo "$(date) -- Initial Failure Detected $(date) APD Detected..."
      ContinueLoop=1
      FailureCount=0
      while [ "$ContinueLoop" -eq 1 ]
      do
        sleep 5
        TestDatastores
        return_value=$?

        if [ "$return_value" -eq 0 ]
        then
          #Storage came back: heartbeat and return to normal monitoring.
          EchoDates
          ContinueLoop=0
        elif [ "$FailureCount" -eq 3 ]
        then
          #Three failures already counted and this check failed as well.
          echo "$(date) -- IVISHA has detected APD... REBOOTING HOST NOW!!!!"
          reboot -n -f
        else
          FailureCount=$((FailureCount + 1))
          RescanHBAs
          sleep 5
        fi
      done
    else
      EchoDates
    fi
    sleep 20
  else
    echo "$(date) -- IVISHA has detected HA (FDM Agent) is Off-line... Script Sleeping..."
    sleep 120
  fi
done

 

-The script is stored on a shared datastore that all of the hosts can reach.

 

SCRIPT LOGIC

1. Define static vars.

2. Make directories to write heartbeats to.

3. Check whether the lock file the script creates already exists; if it does, exit the script (as the script is already running).

4. Create the lock file.

5. Scan the HBA's

6. Echo to the log that the script has started.

7. Start endless loop.

8. Ensure lockfile exists, else exit.

9. Check to ensure HA is running if not sleep.

10. Test the datastores to see whether a heartbeat directory exists; if it does, write a date/time stamp into the heartbeat file.

11. If it doesn't exist, go into a sub-loop: check again every few seconds to see if any one of the datastores is accessible, rescanning the HBAs after each failed check. If 3 checks in a row fail and storage has not come back by the next pass, force a REBOOT.
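
The confirm-before-reboot idea in steps 10 and 11 can be sketched on its own, separate from the real script. This is only an illustration: confirm_failure is a hypothetical helper, and the check and rescan commands are passed in by name so the logic can be tried without touching esxcli or rebooting anything.

```shell
# Hypothetical helper: confirm_failure CHECK RESCAN ATTEMPTS DELAY
# Runs CHECK up to ATTEMPTS times. Returns 0 (failure confirmed) only if
# every attempt fails; returns 1 as soon as one attempt succeeds.
# RESCAN is run and DELAY seconds are slept between failed attempts.
# Note: plain sh has no local variables, so these names are global.
confirm_failure() {
  check=$1
  rescan=$2
  attempts=$3
  delay=$4
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$check"; then
      return 1            # storage came back, do not reboot
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then
      "$rescan"           # give the paths a chance to recover
      sleep "$delay"
    fi
  done
  return 0
}
```

With your own check and rescan functions in place, a call such as confirm_failure my_check my_rescan 3 5 would only return success after three consecutive failed checks, and that is the point where a reboot would be triggered.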

 

From the time of pulling the cable from the host to the time of reboot is about 4.5 min on my test boxes. It works every time; HA then restarts the VMs on surviving hosts. The script is still very raw, but again, it works. Please post updates/enhancements here....
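
For anyone wondering where the 4.5 minutes goes, here is a back-of-the-envelope sum of the fixed sleeps in the script. The sleep values come from the script itself. The split of the remainder is an assumption on my part: presumably it is spent in HBA rescans and in file tests that block while the paths are down.

```shell
# Fixed sleeps in the script, in seconds.
outer_sleep=20    # worst-case wait before the next regular datastore test
loop_sleep=5      # sleep at the top of each pass of the sub-loop
rescan_sleep=5    # sleep after each HBA rescan
passes=4          # 3 counted failures plus the pass that triggers the reboot
rescans=3         # one rescan after each counted failure

fixed=$((outer_sleep + passes * loop_sleep + rescans * rescan_sleep))
observed=270      # about 4.5 minutes, as measured on my test boxes
echo "fixed sleeps: ${fixed}s, remainder: $((observed - fixed))s"
```

So shortening the sleeps would only shave off a small part of the total; most of the delay appears to be outside the script's control.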

 

Now the next problem.... How do you start the script on an ESXi host???

This is where PowerCLI and PLINK come to the rescue.

 

# IVIS HA FUNCTIONS


Function Start-ESXiHostSSH()

{

    param(

        [Parameter(

            Mandatory=$true,

            ValueFromPipeline=$true,

            ValueFromPipelineByPropertyName=$true

            )]

        [string]$HostName

    )

 

    if(![string]::IsNullOrEmpty($_)){

        $HostName = $_

    }

  $TargetHost = Get-VMHost -Name $HostName

  Start-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"})

}

 

Function Stop-ESXiHostSSH()

{

    param(

        [Parameter(

            Mandatory=$true,

            ValueFromPipeline=$true,

            ValueFromPipelineByPropertyName=$true

            )]

        [string]$HostName

    )

 

    if(![string]::IsNullOrEmpty($_)){

        $HostName = $_

    }

 

  $TargetHost = Get-VMHost -Name $HostName

  Stop-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"}) -Confirm:$false

}

 

#GLOBAL HA VARIABLES

$primaryIvisHADS = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/"

$primaryIvisHADSScript = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/ivisHAv2.sh"

$lockFileLocation = "/tmp/ivisHB.lck"

$localIvisHBLog = "/scratch/log/ivisHA.log"

 

function Start-ivisHA #-force removes the lock file before starting. Use it only when the script has crashed and left its lock file behind, not while the script is still running.

{

    param(

        [Parameter(

            Mandatory=$true,

            ValueFromPipeline=$true,

            ValueFromPipelineByPropertyName=$true

            )]

        [string]$HostName,

  [string]$pass,

  [bool]$stopSSH = $true,

  [bool]$force = $false

    )

 

  Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"

  Start-ESXiHostSSH -HostName $HostName

 

  if($force){

  C:\putty\plink.exe $HostName -l root -pw $pass "rm -f $lockFileLocation; cp $primaryIvisHADSScript /tmp/; chmod 700 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog 2>&1 &"

  }

  else{

  C:\putty\plink.exe $HostName -l root -pw $pass "cp $primaryIvisHADSScript /tmp/; chmod 700 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog 2>&1 &"

  }

 

  if($stopSSH){

  Stop-ESXiHostSSH -HostName $HostName

  }

}

 

function Stop-ivisHA #Deletes the lock file, which signals the script on the ESXi host to exit on its next loop pass.

{

    param(

        [Parameter(

            Mandatory=$true,

            ValueFromPipeline=$true,

            ValueFromPipelineByPropertyName=$true

            )]

        [string]$HostName,

  [string]$pass,

  [bool]$stopSSH = $true,

  [bool]$force = $false

    )

 

  Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"

  Start-ESXiHostSSH -HostName $HostName

 

  C:\putty\plink.exe $HostName -l root -pw $pass "rm -f $lockFileLocation"

 

  if($stopSSH){

  Stop-ESXiHostSSH -HostName $HostName

  }

}

 

function Collect-ivisHALogs

{

    param(

        [Parameter(

            Mandatory=$true,

            ValueFromPipeline=$true,

            ValueFromPipelineByPropertyName=$true

            )]

        [string]$HostName,

  [string]$pass

    )


  Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"

  Start-ESXiHostSSH -HostName $HostName

  C:\putty\plink.exe $HostName -l root -pw $pass "mkdir -p $primaryIvisHADS/logCollection; cp /scratch/log/ivisHA.log $primaryIvisHADS/logCollection/ivisHA$HostName.log"

  Stop-ESXiHostSSH -HostName $HostName

}
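
For reference, the remote command that Start-ivisHA sends boils down to "copy the script somewhere local, make it executable, and start it detached from the SSH session with its output going to a log". A minimal shell sketch of that pattern follows. The function name start_detached and its arguments are hypothetical, and the working directory defaults to /tmp only to mirror the layout above.

```shell
# Hypothetical helper: start_detached SRC LOG [WORKDIR]
# Copies SRC into WORKDIR (default /tmp), makes it executable for its
# owner only, and starts it in the background so that it keeps running
# after the calling session ends. All output goes to LOG.
start_detached() {
  src=$1
  log=$2
  workdir=${3:-/tmp}
  dest=$workdir/$(basename "$src")
  cp "$src" "$dest" || return 1
  chmod 700 "$dest"
  # Redirect stdin, stdout and stderr so nothing stays tied to the
  # terminal or SSH session that launched the script.
  nohup "$dest" > "$log" 2>&1 < /dev/null &
}
```

Using an absolute path for the copied script avoids depending on whichever directory the SSH session happens to start in.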

 

APD isn't very common, but it does happen, and when it strikes it has a serious impact... I am available via email / message to help you set this up in your environment and experiment with it if you would like.

 

Cheers vmware!

