Friday, September 3, 2010

VMware guests and ISCSI/SAN storage

Last week I have had a problem with my Windows Server VMs residing on an iSCSI storage. Error eventid 11 (The driver detected a controller error on \Device\Harddisk0.) and eventid 15 (The device, \Device\Scsi\symmpi1, is not ready for access yet.) where recorded by some of my VMWARE virtual machines running Windows.

Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 11
User: N/A
Description:
The driver detected a controller error on \Device\Harddisk1.

Event Type: Error
Event Source: symmpi
Event Category: None
Event ID: 15
User: N/A
Description:
The device, \Device\Scsi\symmpi1, is not ready for access yet.

Furthermore, in our case I detected the following error on the ESX logs:


Aug 27 15:56:23 esx vmkernel: 106:23:16:50.187 cpu3:1058)StorageMonitor: 196: vmhba32:1:0:0 status = D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu6:1057)StorageMonitor: 196: vmhba32:1:0:0 status = D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu6:1057)WARNING: SCSI: 5492: vml.0200000534944: Too many failed retries 33 (32), Returning I/O failure. 0x16 D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu5:1043)WARNING: FS3: 4866: Reservation error: No connection
Aug 27 15:56:26 esx vmkernel: 106:23:16:52.881 cpu6:1141)iSCSI: bus 0 target 1 trying to establish session 0x35e180a0 to portal 0, address xx.xx.xx.xx port 3260 group 1
Aug 27 15:56:26 esx vmkernel: 106:23:16:52.888 cpu6:1141)iSCSI: bus 0 target 1 established session 0x35e180a0 #15, portal 0, address xx.xx.xx.xx port 3260 group 1
Aug 27 15:56:54 esx vmkernel: 106:23:17:20.958 cpu6:1141)Initialize the reset lun framework for session 0x35e180a0

and

Aug 27 15:56:54 esx vmkernel: 106:23:17:20.950 cpu6:1141)iSCSI: session 0x35e180a0 closed by target iqn.2003-10.com.lefthandnetworks:mgt-grp:46:vol-standalone at 924224212
Aug 27 15:56:54 esx vmkernel: 106:23:17:20.950 cpu6:1141)iSCSI: session 0x35e180a0 to iqn.2003-10.com.lefthandnetworks:mgt-grp:46:vol-standalone dropped

These errors simply meant that we had temporarily suffered a network failure on the iSCSI link. And our virtual machines had lost connectivity to their vmdk or raw disk mappings.

Once everything went back to normal, I got this message:

Aug 27 15:58:21 esx vmkernel: 106:23:18:47.627 cpu3:1039)LVM: 4523: [1a2] Unreachable device came back as vml.0200000944:1

This one is a typical behaviour of Windows, which, once lost iSCSI mapping, leaves the Windows VM running even without access to its storage. The time that Microsoft Windows will accept to wait for a hard disk to respond before a BSOD is defined by the TimeOutValue registry setting under:

HKEY_LOCAL_MACHINE/System/CurrentControlSet/Services/Disk


In a few words, this registry key is designed to help the guest operating system to survive high latency or temporary outage conditions such as SAN path failover or maybe a network failure in iSCSI based storage.

VMware Tools installation by default modify the standard value of 10 seconds for non-cluster nodes, 20 seconds for cluster nodes, to 60 seconds (or x03c hex).

This can be not enough, especially during SAN upgrades, so it can be a good idea to prevent problems by specifying a temporarily longer time-out value (190 seconds for NetApp), but carefully because exceeding caching possibilities would result in a BSOD.

To increase the disk timeout value:

  1. Run REGEDIT
  2. Go to HKEY_LOCAL_MACHINE/System/CurrentControlSet/Services/Disk.
  3. Right-click and select Edit/Add value.
  4. Set the value name to TimeOutValue.
  5. Set the data type to REG_DWORD.
  6. Set the data to 120 decimal.
  7. Reboot the virtual machine.

A good Powershell script that I sed to mass modify this value on our servers is the one here. Another Powershell solution can be found here. A related post is here. Finally, a good iSCSI SAN configuration guide for VMWARE can be found here.

0 commentaires:

Post a Comment

Related Posts Plugin for WordPress, Blogger...