Sunday, November 7, 2010

ESX server troubleshooting

Three days ago I was woken up by a problem on a critical ESX server for which I am on call. As you know, the duties of a system administrator are wide-ranging and, in particularly sensitive environments, you must be ready to answer the phone at any time of the year. So I duly spent most of the night alone with my laptop solving it, and the event made me think of a post summarizing the main ways and commands to troubleshoot an ESX server problem. This is that post.

First of all, when troubleshooting an ESX host, you have to find out whether the server has crashed. Ping your ESX server and, if it is not responding, connect to your out-of-band management console (HP iLO, IBM RSA or Dell DRAC in most cases) and check whether there is a purple screen of death (PSOD), which would mean that the crash has been caused by a hardware problem or by a bug in the VMware ESX code.
Take a screenshot of the error on your remote management console before rebooting. Save this screenshot for further analysis, because it can be a good piece of evidence to help you find the element responsible for the unplanned crash.

Once you have restarted your ESX server, move to the root directory (/) and look for a file whose name begins with vmkernel-zdump. Run the vmkdump utility against this file to extract the VMkernel log (vmkdump -l followed by the file name) and look for any clues as to the cause of the PSOD. You will often see (and I say that from personal experience) that this dump points to a memory failure. At this point, on older versions of VMware ESX you could run ramcheck, but unfortunately this utility was removed in ESX 3.5. So, if you are on an ESX 3.0 server, log on as root and run

service ramcheck start

This starts a background check of the server's RAM and writes its results to /var/log/vmware/ramcheck.log and ramcheck-err.log.

Otherwise, if you are in an ESX 3.5 or ESX 4 environment, your best option is to boot your ESX server from a Memtest86 live CD and test your RAM modules from there.
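
Going back to the dump file for a moment, here is a minimal shell sketch of that step, assuming the dump files sit in / as described above. The function names find_zdumps and extract_zdump_logs are my own and not part of any VMware tool, and the script only prints what it would run when vmkdump is not installed.

```shell
# Sketch: locate vmkernel-zdump* files and extract the VMkernel log from each.
# find_zdumps and extract_zdump_logs are hypothetical helper names.

# find_zdumps [DIR]: print one dump file path per line (default DIR is /).
find_zdumps() {
    dir="${1:-/}"
    for f in "${dir%/}"/vmkernel-zdump*; do
        # An unmatched glob stays literal, so check that the file exists.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# extract_zdump_logs [DIR]: run "vmkdump -l" on every dump found in DIR.
extract_zdump_logs() {
    find_zdumps "$1" | while IFS= read -r dump; do
        if command -v vmkdump >/dev/null 2>&1; then
            echo "Extracting VMkernel log from $dump"
            vmkdump -l "$dump"
        else
            echo "vmkdump not found; would run: vmkdump -l $dump"
        fi
    done
}
```

On the host you would simply call extract_zdump_logs with no argument once the server is back up.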

Once you have solved your RAM problems (if you had any…), log on again as root and run the vm-support tool, which gathers all your ESX server's log and configuration files together in a TAR file. This TAR file is usually sent to VMware support for analysis. You can also use the VI Client to generate the same compressed file: select Administration, then Export Diagnostic Data, select your host (including VirtualCenter data is optional) and a local directory on your PC to store the file that will be created.

vm-support also accepts parameters that help with specific problems:
  • -n Causes no core files to be included in the tar file
  • -s Takes performance snapshots in addition to other data
  • -S Takes only performance snapshots
  • -x Lists world IDs (wid) for running VMs
  • -X wid Grabs debug info for a hung VM
  • -w dir Sets working directory for output files
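
As an illustration, the two most common invocations can be sketched like this. The wrapper functions and the DRY_RUN variable are hypothetical helpers of my own; only the vm-support options listed above are used. By default the commands are printed, not executed.

```shell
# Sketch: build vm-support command lines from the options listed above.
# DRY_RUN, run_cmd, full_bundle and hung_vm_bundle are hypothetical names.

DRY_RUN="${DRY_RUN:-1}"

# run_cmd CMD [ARGS...]: print the command when DRY_RUN=1, otherwise run it.
run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# full_bundle WORKDIR: full bundle without core files, written to WORKDIR.
full_bundle() {
    run_cmd vm-support -n -w "$1"
}

# hung_vm_bundle WID WORKDIR: debug info for one hung VM, by world ID.
hung_vm_bundle() {
    run_cmd vm-support -X "$1" -w "$2"
}
```

For a hung VM you would first run vm-support -x to read its world ID, then call hung_vm_bundle with that ID and a directory that has enough free space, with DRY_RUN=0 once the printed command looks right.
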
Now comes a very important section of this post, the one where I list the important log files that you may have to walk through. Log files should be analysed carefully with commands such as cat, grep, tail and awk. Every troubleshooting task should start from the vmkernel log file.

/var/log/vmkernel: in this log you will find a record of ESX server and virtual machine activity. The current log has no extension; rotated copies get a numeric suffix, vmkernel.1 being the most recent. To get the last messages type tail /var/log/vmkernel

Your output should look something like this:
vmkernel: 101:20:53:00.082 cpu14:4117)FS3: 2798: [Requested mode: 1] Checking liveness of lock holders [type 10c00001 offset 40669184 v 361, hb offset 3194880
vmkernel: gen 1609, mode 1, owner 4be1fcb6-3c92adf1-14ee-18a9055877ec mtime 59294]on volume 'datastore1'.
vmkernel: 101:20:53:04.084 cpu14:4117)FS3: 2890: [Requested mode: 1] Lock [type 10c00001 offset 40669584 v 361, hb offset 3194880
vmkernel: gen 1609, mode 1, owner 4be1feb6-3d92adf1-14ee-19a9955177ec mtime 59294] is not free on volume 'datastore1'

Even better, try to grep an error string: tail /var/log/vmkernel | grep "error".
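
Since tail only shows the last few lines, it can be handy to filter the whole file first and then keep the tail of the matches. A minimal sketch follows; the function name scan_log and the default pattern are my own choices, so adjust them to whatever strings you are hunting for.

```shell
# Sketch: show the most recent lines of a log that match a pattern.
# scan_log and its default pattern are hypothetical, not a VMware tool.

# scan_log FILE [PATTERN]: print the last 20 lines of FILE matching PATTERN
# (extended regex, case-insensitive).
scan_log() {
    file="$1"
    pattern="${2:-error|warning|fail}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    grep -i -E "$pattern" "$file" | tail -n 20
}
```

For example, scan_log /var/log/vmkernel "scsi" would list the latest SCSI-related messages, and scan_log /var/log/vmkernel.1 does the same on the previous rotated log.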

Then have a look at /var/log/vmkwarning, which keeps a copy of everything logged at warning severity or higher in the vmkernel log. It is much simpler to look through this file for warnings and errors than to filter the full vmkernel logs. In an ideal world you could write a Perl script that checks this log and forwards its entries to HP SIM (after having loaded the appropriate MIB in HP SIM).
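
I mentioned Perl, but the checking half of that idea fits in a few lines of shell. This is only a sketch: the function name new_entries and the state file are hypothetical, and the forwarding to HP SIM is deliberately left out, since it depends on your SNMP setup. The function prints the lines added to a log since its previous run.

```shell
# Sketch: print only the lines appended to a log since the last call.
# new_entries and the state file location are hypothetical choices.

# new_entries LOG STATE: STATE remembers how many lines were already seen.
new_entries() {
    log="$1"
    state="$2"
    seen=0
    [ -f "$state" ] && seen=$(cat "$state")
    total=$(wc -l < "$log" | tr -d ' ')
    # Fewer lines than last time means the log was rotated: start over.
    [ "$total" -lt "$seen" ] && seen=0
    if [ "$total" -gt "$seen" ]; then
        tail -n +"$((seen + 1))" "$log"
    fi
    echo "$total" > "$state"
}
```

Run from cron, something like new_entries /var/log/vmkwarning /var/tmp/vmkwarning.seen piped into whatever forwarder you use (logger, for instance) gives you each warning exactly once.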

Secondary but still useful logs are the following ones:

/var/log/vmksummary, where you will find the uptime of the ESX server and statistics about its availability. Skip this and go straight to the human-readable version, /var/log/vmksummary.txt.

/var/log/vmware/hostd.log contains information about the agent in charge of managing and configuring the ESX host and its virtual machines.

If you want to push your analysis deeper, run cat on /var/log/vmware/esxcfg-firewall.log for firewall problems and on /var/log/vmware/esxupdate.log for a view of update jobs.

Don't forget to skim your /var/log/vmware/vpx/vpxa.log file as well, which contains information on the agent that communicates with vCenter Server.

Also read /var/log/messages, which is the log of the service console's Linux kernel. This log is potentially useful in the case of a host hang, a crash or an authentication issue. Remember that this log has absolutely nothing to do with VMs: the Service Console (based on the Red Hat kernel) has no awareness of the virtual machines (worlds) running on the VMkernel.

To wrap up, here is a list of commands that are often useful for collecting information before sending your logs to VMware support.
  • Enter vmware -v to check the version of ESX Server, such as VMware ESX Server 3.0.1 build-32039
  • Enter esxupdate -l query to list the installed patches
  • Enter vpxa -v to check the version of the ESX server management agent
  • Enter rpm -qa | grep VMware-esx-tools to check the version of VMware Tools installed on the ESX server
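
If you collect this information often, the four commands can be gathered into one text file to attach to the support request. A minimal sketch, assuming nothing beyond the commands listed above; the function name collect_info and the output layout are my own.

```shell
# Sketch: run the version commands above and save their output to one file.
# collect_info and the "== command ==" header layout are hypothetical.

# collect_info OUTFILE: write each command's output under its own header.
collect_info() {
    out="$1"
    : > "$out"
    for cmd in "vmware -v" "esxupdate -l query" "vpxa -v"; do
        echo "== $cmd ==" >> "$out"
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # Unquoted on purpose so the string splits into command and arguments.
            $cmd >> "$out" 2>&1 || echo "(command failed)" >> "$out"
        else
            echo "(not available on this host)" >> "$out"
        fi
    done
    echo "== rpm -qa | grep VMware-esx-tools ==" >> "$out"
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa 2>/dev/null | grep VMware-esx-tools >> "$out" \
            || echo "(no matching package)" >> "$out"
    else
        echo "(not available on this host)" >> "$out"
    fi
}
```

Calling collect_info /tmp/esx-info.txt on the service console leaves a small file that you can send along with the vm-support bundle.
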

Post scriptum: remember that this post applies mainly to VMware ESX and only partially to ESXi. The significant difference between ESX and ESXi is that the latter has no Service Console. That reduces the size of ESXi to under 32 MB, so the embedded version can run off a USB key. Removing the service console, which is Linux, also reduces the security exposure and thus the number of patches you will have to apply to the system. The downside today is that ESXi doesn't support everything that ESX 3.5 does: for example, HA has only experimental support and jumbo frames aren't supported.

2 comments:

  1. Nice one.. Thanks for posting here

  2. Tnx for the post! It's nice to have this info regrouped at the same place. :)
