Friday 12 December 2014

hostd disconnect & ESXi5 SAN storage issues with PowerPath 5.9.1.00

Symptoms:  Adding more than 2 LUNs caused hostd to disconnect hosts from vCenter.  The underlying issue is storage related.

We learned a lot about the PowerPath Adaptive policy (autostandby and proximity), but the fix for our problem was to upgrade PowerPath on our hosts.  There appears to be a VMware bug that shows up when too many paths are presented for a disk.  Until VMware fixes the issue, taking PowerPath from Version 5.9 SP 1 (build 11) to Version 5.9 SP 1 P 02 (build 54) stops our headaches.

The problem shows up on our cross-connect VPLEX, but the trigger seems to be the number of paths to the storage rather than whether the configuration is cross-connect or not.
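Since the path count per disk looks like the trigger, it helps to see how many paths each device actually has. Here is a minimal Python sketch that counts them from saved `esxcli storage core path list` output. It assumes each path block in that output carries a `Device: <identifier>` line; the function name and the sample text (including the device identifiers) are made up for illustration and are not from our hosts.

```python
from collections import Counter

def paths_per_device(path_list_text):
    """Count paths per device in saved path-list output.

    Assumption: every path block contains one line of the form
    'Device: <identifier>'. Lines such as 'Device Display Name: ...'
    do not start with 'Device: ' and are ignored.
    """
    counts = Counter()
    for line in path_list_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Device: "):
            counts[stripped[len("Device: "):]] += 1
    return counts

# Made-up sample, shaped like the assumed output format.
SAMPLE = """\
fc.EXAMPLE-PATH-1
   Runtime Name: vmhba2:C0:T0:L1
   Device: naa.EXAMPLE0001
   Device Display Name: Example disk 1
fc.EXAMPLE-PATH-2
   Runtime Name: vmhba3:C0:T0:L1
   Device: naa.EXAMPLE0001
   Device Display Name: Example disk 1
fc.EXAMPLE-PATH-3
   Runtime Name: vmhba2:C0:T1:L2
   Device: naa.EXAMPLE0002
   Device Display Name: Example disk 2
"""

if __name__ == "__main__":
    # Print devices with the most paths first.
    for device, n in paths_per_device(SAMPLE).most_common():
        print(device, n)
```

Comparing these counts before and after presenting new LUNs should show whether the hosts that drop out are the ones with the most paths per disk.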

Still need to confirm this is the smoking gun, as their kbase article mentions APD and I haven't seen that.  But our internal testing looks pretty good. We were reproducing the issue on demand in a test cluster with only two ESXi hosts (no VMs) and with fewer than 10 distributed volumes (VPLEX LUNs).

Got this from EMC:

Article Number: 000190540
Version: 7
Key Information

Audience: Level 30 = Customers
Article Type: Break Fix
Last Published: Thu Nov 20 19:18:35 GMT 2014
Validation Status: Final Approved
Summary: ESXi host becomes unresponsive after adding or removing LUNs.

Impact: ESXi host becomes unresponsive after adding or removing LUNs.
ESXi host has to be rebooted.
ESXi host cannot be managed in vSphere.
Issue: Adding or removing LUNs on an ESXi host and rescanning storage can cause the host to stop responding in the management GUI.
The host becomes unmanageable and must be rebooted to restore management abilities.
Virtual machines continue to run normally, even though the ESXi host is not responding.
The hostd daemon is a management daemon, and it stops responding.
All Paths Down messages will sometimes be seen in the host logs just before the host stops responding.
Environment: System: VMware ESXi 5.1
System: VMware ESXi 5.5
EMC SW: PowerPath/VE for VMware 5.9.1
Cause: A bug in NMP contributes to a bug in PowerPath, causing the host to leave a device in an All Paths Down state.
While the device is in All Paths Down it will continue to consume resources until the hostd daemon cannot start any new management threads.
When this happens the host becomes unresponsive because the hostd daemon cannot respond to management requests.
Change: Adding or removing a LUN to an ESXi host or cluster.
Resolution:
Solution:
Upgrade PowerPath to version 5.9 SP1 P02 (5.9.1.2) or later.
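To check a batch of hosts against that fixed level, a dotted version comparison is enough. Here is a minimal Python sketch, assuming the installed version is available as a plain dotted string such as 5.9.1.00 or 5.9.1.2. The helper names are hypothetical, and how you collect the version string from each host is left out.

```python
# Fixed level named in the EMC article: 5.9 SP1 P02 (5.9.1.2).
FIXED_LEVEL = (5, 9, 1, 2)

def parse_version(text):
    """Turn a dotted version string like '5.9.1.00' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def at_or_above_fix(version_text):
    """True if the version is at or above 5.9.1.2.

    This only compares version numbers. It does not know whether a
    given older release is actually affected by the bug.
    """
    return parse_version(version_text) >= FIXED_LEVEL

if __name__ == "__main__":
    for version in ("5.9.1.00", "5.9.1.2"):
        print(version, at_or_above_fix(version))
```

Comparing integer tuples rather than raw strings avoids the trap where "5.10" would sort below "5.9".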


Workaround:
Once the host has stopped responding there are two options to bring it back.  You can try undoing the storage change that triggered the host to become unresponsive, or you can reboot the host.
The VMs are still running normally, so they can be gracefully shut down.  Because the host is not responding you will not be able to vMotion virtual machines to another host.

This problem is partially triggered by the presence of an ACLX or LUNZ device.
Removing these devices will greatly reduce the chances of the host going unresponsive.

Symmetrix: Unmap any ACLX devices from the FAs that the ESXi hosts are zoned to.
CLARiiON / VNX: Add a real LUN as LUN 0 into the storage group and then reboot the ESXi host.

Place hosts in maintenance mode before adding or removing LUNs, so that they can be rebooted without affecting production if it is needed.
Notes: This issue is seen with Symmetrix, VNX, CLARiiON and VPLEX arrays.  There is a separate OPT for this issue when seen with VPLEX arrays.

The kernel.log from the ESXi host may show evidence of entering and exiting an All Paths Down state similar to this:
cpu4:16629)ScsiDevice: 4108: Setting Device naa.60000970000298701034533030333844 state back to 0x2
cpu4:16629)ScsiDevice: 6121: No Handlers registered!
cpu4:16629)ScsiDevice: 4126: Device naa.60000970000298701034533030333844 is Out of APD; token num:1
cpu4:16629)StorageApdHandler: 277: APD Timer killed for ident [naa.60000970000298701034533030333844]
cpu4:16629)StorageApdHandler: 402: Device or filesystem with identifier [naa.60000970000298701034533030333844] has exited the All Paths Down state.
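Since I haven't spotted APD in our own logs yet, a quick scan for these messages is an easy way to check. Here is a minimal Python sketch that counts the "has exited the All Paths Down state" lines per device. The pattern is built only from the sample lines above, so other builds may word the message differently; the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the StorageApdHandler sample line in the EMC article.
APD_EXIT = re.compile(
    r"StorageApdHandler: \d+: Device or filesystem with identifier "
    r"\[(?P<device>[^\]]+)\] has exited the All Paths Down state"
)

def count_apd_exits(lines):
    """Return a Counter of APD-exit messages keyed by device identifier."""
    counts = Counter()
    for line in lines:
        match = APD_EXIT.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts

# The sample lines quoted in the article.
SAMPLE_LOG = [
    "cpu4:16629)ScsiDevice: 4108: Setting Device naa.60000970000298701034533030333844 state back to 0x2",
    "cpu4:16629)ScsiDevice: 6121: No Handlers registered!",
    "cpu4:16629)ScsiDevice: 4126: Device naa.60000970000298701034533030333844 is Out of APD; token num:1",
    "cpu4:16629)StorageApdHandler: 277: APD Timer killed for ident [naa.60000970000298701034533030333844]",
    "cpu4:16629)StorageApdHandler: 402: Device or filesystem with identifier "
    "[naa.60000970000298701034533030333844] has exited the All Paths Down state.",
]

if __name__ == "__main__":
    for device, n in count_apd_exits(SAMPLE_LOG).items():
        print(device, n)
```

To scan a real log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.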
Article Metadata
Product: PowerPath/VE for VMware 5.9 SP1
Operating System: VMware ESX Server
Requested Publish Date: 8/1/2014 1:11 PM
