Thursday 2 February 2012

LUN won't release on ESX4

We found this when we noticed the ESX host's disk latency was through the roof (40,000 milliseconds).
None of the VMs on this host had latency above 24 milliseconds.  Why?  Because the LUN wasn't assigned to any VMs.  It was a rogue LUN that had been removed, but due to a bug in ESX it wouldn't release.  vCenter showed the LUN size as zero with no paths.

Rescanning the LUNs on the hosts didn't fix the problem either (we tried a few times).

It turned out the LUN was in an "All Paths Down" (APD) state:

/var/log/vmkernel
Feb  1 18:01:02 VHB23 vmkernel: 144:09:20:43.989 cpu10:4265)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device "naa.60a9800057396d5a4a6f687531535934" - issuing command 0x41027f1ccc40 
Feb  1 18:01:02 VHB23 vmkernel: 144:09:20:43.989 cpu10:4265)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device "naa.60a9800057396d5a4a6f687531535934" - failed to issue command due to Not found (APD), try again...
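
To see which devices are affected, the device IDs can be pulled out of those warnings. This is only a rough sketch: it assumes the log line format shown above, and the function name apd_devices is made up for illustration.

```shell
#!/bin/sh
# Sketch: list the unique devices named in APD warnings.
# Assumes the vmkernel log line format shown in this post;
# apd_devices is a hypothetical helper name.
apd_devices() {
    # $1: path to the vmkernel log (defaults to the path used in this post)
    log="${1:-/var/log/vmkernel}"
    grep 'APD' "$log" 2>/dev/null |
        sed -n 's/.*device "\([^"]*\)".*/\1/p' |
        sort -u
}

# Example: apd_devices /var/log/vmkernel
```

Only lines containing "APD" are considered, so the first "issuing command" retry line is skipped and each device is listed once.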


Verify the LUN number via vCenter or with "esxcfg-mpath -l".

VMware says this is fixed in ESX5, but our only option was to vMotion the VMs off and reboot the hosts.  That sorted it.

I need a script to send an alert when/if this happens again.

Does anyone want to share a vCLI version of this script?

________________________________________________

--draft only for now--
# alert if any "All Paths Down" (APD) message is in the vmkernel log
if grep -q "APD" /var/log/vmkernel ; then
    sendmail.script mailserver mailaddress -s "investigate All Paths Down error on `uname -n`"
fi
--draft only for now--

schedule via cron
________________________________________________
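
One catch with running a plain grep from cron: once an APD line is in the log, every run will alert again until the log rotates. A possible way around that is to remember how many APD lines were seen last time and only report new ones. This is a sketch, not something we have run in production; the function name apd_new_count and the state file are hypothetical.

```shell
#!/bin/sh
# Sketch: count only the APD lines added since the previous run, so a cron
# job does not re-send the same alert every time it fires.
# apd_new_count and the state-file argument are hypothetical.
apd_new_count() {
    log="$1"      # log file to check, e.g. /var/log/vmkernel
    state="$2"    # file remembering the APD line count from the last run
    current=$(grep -c 'APD' "$log" 2>/dev/null)
    current=${current:-0}
    previous=$(cat "$state" 2>/dev/null)
    previous=${previous:-0}
    echo "$current" > "$state"
    # After log rotation the count drops; treat everything present as new.
    if [ "$current" -lt "$previous" ]; then
        previous=0
    fi
    echo $((current - previous))
}

# Example (state file path is an assumption):
#   new=$(apd_new_count /var/log/vmkernel /var/tmp/apd.count)
#   [ "$new" -gt 0 ] && echo "$new new APD messages on `uname -n`"
```

The count-based approach is crude (it can miss lines if the log rotates and refills between runs), but it keeps the cron job from mailing on every pass.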