Friday, September 3, 2010

VMware guests and iSCSI/SAN storage

Last week I had a problem with my Windows Server VMs residing on iSCSI storage. Event ID 11 (The driver detected a controller error on \Device\Harddisk1.) and Event ID 15 (The device, \Device\Scsi\symmpi1, is not ready for access yet.) were recorded by some of my VMware virtual machines running Windows.

Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 11
User: N/A
Description:
The driver detected a controller error on \Device\Harddisk1.

Event Type: Error
Event Source: symmpi
Event Category: None
Event ID: 15
User: N/A
Description:
The device, \Device\Scsi\symmpi1, is not ready for access yet.

Furthermore, in our case I found the following errors in the ESX logs:


Aug 27 15:56:23 esx vmkernel: 106:23:16:50.187 cpu3:1058)StorageMonitor: 196: vmhba32:1:0:0 status = D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu6:1057)StorageMonitor: 196: vmhba32:1:0:0 status = D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu6:1057)WARNING: SCSI: 5492: vml.0200000534944: Too many failed retries 33 (32), Returning I/O failure. 0x16 D:0x0/H:0x1 0x2 0x8 0x0
Aug 27 15:56:23 esx vmkernel: 106:23:16:50.189 cpu5:1043)WARNING: FS3: 4866: Reservation error: No connection
Aug 27 15:56:26 esx vmkernel: 106:23:16:52.881 cpu6:1141)iSCSI: bus 0 target 1 trying to establish session 0x35e180a0 to portal 0, address xx.xx.xx.xx port 3260 group 1
Aug 27 15:56:26 esx vmkernel: 106:23:16:52.888 cpu6:1141)iSCSI: bus 0 target 1 established session 0x35e180a0 #15, portal 0, address xx.xx.xx.xx port 3260 group 1
Aug 27 15:56:54 esx vmkernel: 106:23:17:20.958 cpu6:1141)Initialize the reset lun framework for session 0x35e180a0

and

Aug 27 15:56:54 esx vmkernel: 106:23:17:20.950 cpu6:1141)iSCSI: session 0x35e180a0 closed by target iqn.2003-10.com.lefthandnetworks:mgt-grp:46:vol-standalone at 924224212
Aug 27 15:56:54 esx vmkernel: 106:23:17:20.950 cpu6:1141)iSCSI: session 0x35e180a0 to iqn.2003-10.com.lefthandnetworks:mgt-grp:46:vol-standalone dropped

These errors simply meant that we had suffered a temporary network failure on the iSCSI link, and our virtual machines had lost connectivity to their VMDKs or raw device mappings.

Once everything went back to normal, I got this message:

Aug 27 15:58:21 esx vmkernel: 106:23:18:47.627 cpu3:1039)LVM: 4523: [1a2] Unreachable device came back as vml.0200000944:1

This is typical Windows behaviour: once the iSCSI mapping is lost, the Windows VM keeps running even without access to its storage. The time that Microsoft Windows will wait for a hard disk to respond before failing the I/O (and potentially blue-screening) is defined by the TimeOutValue registry setting under:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk


In a few words, this registry value is designed to help the guest operating system survive high-latency or temporary-outage conditions such as a SAN path failover or a network failure on iSCSI-based storage.

The VMware Tools installer changes the standard Windows value (10 seconds for non-cluster nodes, 20 seconds for cluster nodes) to 60 seconds (0x3C hex).

This may not be enough, especially during SAN upgrades, so it can be a good idea to prevent problems by temporarily setting a longer timeout (NetApp, for example, recommends 190 seconds). Do this carefully, though: an outage that outlasts what the guest and its applications can absorb will still end in a BSOD.

To increase the disk timeout value:

  1. Run regedit.
  2. Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk.
  3. Right-click and select Edit > Add Value.
  4. Set the value name to TimeOutValue.
  5. Set the data type to REG_DWORD.
  6. Set the data to 120 (decimal).
  7. Reboot the virtual machine.
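The same change can be scripted instead of clicked through, for example by importing a .reg file. A minimal sketch (the filename is mine; note that 120 decimal is 0x78 hex):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeOutValue"=dword:00000078
```

Import it silently with `regedit /s timeout.reg`, then reboot the virtual machine for the change to take effect.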

A good PowerShell script that I used to mass-modify this value on our servers is the one here. Another PowerShell solution can be found here. A related post is here. Finally, a good iSCSI SAN configuration guide for VMware can be found here.

3 comments:

  1. You are a genius!! Your solution helped me, because I have the same problem on a Linux server!

    •Check the current timeout for every generic SCSI device in Linux sysfs using the command:

    find /sys/class/scsi_generic/*/device/timeout -exec grep -H . '{}' \;


    •The timeout value for an individual disk can be modified using the sysfs interface. For example:

    echo 180 > /sys/block/sdc/device/timeout

    Note: This change does not persist across reboots.
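To apply the same change to every disk at once, the commands above can be combined into a small POSIX shell helper. This is a sketch: the function name is mine, and the sysfs root is passed as a parameter (instead of hard-coding /sys/block) purely so the loop is easy to exercise against a fake directory tree.

```shell
#!/bin/sh
# Set the SCSI I/O timeout (in seconds) for every sd* disk under a sysfs root.
# Usage: set_scsi_timeouts <sysfs-block-dir> <seconds>
set_scsi_timeouts() {
    sysroot="$1"
    seconds="$2"
    for f in "$sysroot"/sd*/device/timeout; do
        # Skip unmatched globs and entries we cannot write to.
        [ -w "$f" ] || continue
        echo "$seconds" > "$f"
        echo "set $f to $seconds"
    done
}

# Example (run as root): set_scsi_timeouts /sys/block 180
```

Like the plain echo, this does not persist across reboots; for a persistent setting you still need a udev rule such as the VMware Tools one shown below.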


    •The VMware Tools installer creates a udev rule at /etc/udev/rules.d/99-vmware-scsi-udev.rules that sets the timeout for each VMware virtual disk device and reloads the udev rules so that it takes effect immediately. This rule is applied during each subsequent startup. For example, the following is the udev rule from vSphere 4.x:

    # Redhat systems
    ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

    # Debian systems
    ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware " , ATTRS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

    # SuSE / Ubuntu systems
    ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware, " , ATTRS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

    1. Glad it helped and thanks for sharing the procedure for Linux guests!
      Carlo

  2. I ran into your blog while investigating Event ID 15 on a Windows file server with SAN storage, no VMware involved. At first sight it seemed similar to my case, and I have TimeOutValue set to 60. But then I found this http://blogs.msdn.com/b/san/archive/2011/09/01/the-windows-disk-timeout-value-understanding-why-this-should-be-set-to-a-small-value.aspx which actually recommends decreasing TimeOutValue.
    Any ideas here?

