Tuesday, December 13, 2011

Understanding Windows Services Recovery features

As you probably know, Windows can automatically perform a predefined action when a Windows service fails. The Recovery tab of the service's properties page lets you define the actions the system performs on the first failure, the second failure, and subsequent failures.

Valid options are "Take No Action", "Restart the Service", "Run a Program", and "Restart the Computer".

In my case I configured my test Trend ServerProtect service to restart after the first and second failures, and to reboot the system the next time the service fails.

To test this I wrote a basic batch script that repeatedly kills the service. Doing so, I discovered that, with the default setting, Windows always performs the action defined for the first failure (in my case my Trend ServerProtect test service is restarted) and never moves on to the later actions.

Furthermore, the event log reports the same diagnostic message every time, even when the service fails repeatedly:

Log Name:      System
Source:        Service Control Manager
Date:          07/12/2011 10:54:25
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      servername
The Trend ServerProtect service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

The "It has done this 1 time(s)" sentence looks wrong to me, because I am repeatedly killing this service and the failure counter should increase.

If I double-check the recovery parameters with sc.exe, the output shows the actions I configured:

sc qfailure spntsvc
[SC] QueryServiceConfig2 SUCCESS

   RESET_PERIOD (in seconds)    : 0
   REBOOT_MESSAGE               :
   COMMAND_LINE                 :
   FAILURE_ACTIONS              : 
     RESTART -- Delay = 60000 milliseconds.
     RESTART -- Delay = 60000 milliseconds.
     REBOOT -- Delay = 60000 milliseconds.

So why does the failure counter not increase? At first it looked to me like a bug in the way the Service Control Manager reads or interprets the parameters I had set.

After a deep investigation and many searches through technet.microsoft.com, I found that setting the "Reset fail count after:" option to 0 means that the failure count is not retained from one failure to the next. So I had completely misunderstood its meaning. At first I was lost for words when I discovered that this parameter did not do what I expected.
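A toy model can make this behavior easier to see. The sketch below is only an assumption about how the counter behaves, not the Service Control Manager's real code: it supposes that the count goes back to zero whenever at least the reset period has elapsed since the previous failure, and that failure number N triggers action number N (repeating the last action once the list runs out). The names `FailureCounter` and `simulate` are hypothetical, and the failure times are made-up values in seconds.

```python
from typing import List, Optional, Sequence, Tuple


class FailureCounter:
    """Toy model of a per-service failure counter (an assumption, not SCM code)."""

    def __init__(self, reset_period_s: int, actions: Sequence[str]) -> None:
        self.reset_period_s = reset_period_s
        self.actions = list(actions)
        self.count = 0
        self.last_failure_s: Optional[int] = None

    def on_failure(self, now_s: int) -> Tuple[int, str]:
        # Assumed rule: forget earlier failures once the service has gone
        # reset_period_s seconds without failing. With a period of 0 this
        # condition is always true, so the count never gets past 1.
        if (self.last_failure_s is not None
                and now_s - self.last_failure_s >= self.reset_period_s):
            self.count = 0
        self.count += 1
        self.last_failure_s = now_s
        # Failure N picks action N; the last action repeats afterwards.
        index = min(self.count, len(self.actions)) - 1
        return self.count, self.actions[index]


def simulate(reset_period_s: int, failure_times_s: Sequence[int]) -> List[Tuple[int, str]]:
    """Replay failures against the actions configured in this post."""
    counter = FailureCounter(reset_period_s, ["RESTART", "RESTART", "REBOOT"])
    return [counter.on_failure(t) for t in failure_times_s]


if __name__ == "__main__":
    # Three failures about 70 seconds apart (hypothetical timings).
    print(simulate(0, [0, 70, 140]))
    print(simulate(86400, [0, 70, 140]))
```

Under these assumptions, a reset period of 0 keeps the counter at 1 for every failure, which matches the "It has done this 1 time(s)" message above, while a one-day period lets the third failure reach the reboot action.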

Anyway, once you know that keeping this option set to 0 disables both the "second failure" and "subsequent failures" actions, the solution is pretty simple: set its value to 1 (or whatever you like; note that the Recovery tab expresses this value in days, while sc.exe uses seconds) and you'll get the desired behavior upon service failure (in my case the server will reboot upon the third failure).
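If you have many services to check, the same diagnosis can be scripted. The following is a minimal sketch, assuming the `sc qfailure` output always looks like the sample shown earlier in this post (I have not checked other Windows versions or languages). The function names are hypothetical, the one-day value of 86400 seconds is just an example, and the script only prints a suggested `sc failure` command line; it does not run it.

```python
import re
from typing import List, Optional, Tuple

# Output of "sc qfailure spntsvc" as shown earlier in this post.
SAMPLE = """[SC] QueryServiceConfig2 SUCCESS

   RESET_PERIOD (in seconds)    : 0
   REBOOT_MESSAGE               :
   COMMAND_LINE                 :
   FAILURE_ACTIONS              :
     RESTART -- Delay = 60000 milliseconds.
     RESTART -- Delay = 60000 milliseconds.
     REBOOT -- Delay = 60000 milliseconds.
"""


def parse_qfailure(text: str) -> Tuple[Optional[int], List[Tuple[str, int]]]:
    """Return (reset_period_seconds, [(action, delay_ms), ...])."""
    reset_period = None
    actions = []
    for line in text.splitlines():
        reset_match = re.search(r"RESET_PERIOD.*:\s*(\d+)", line)
        action_match = re.search(r"(\w+)\s+--\s+Delay = (\d+) milliseconds", line)
        if reset_match:
            reset_period = int(reset_match.group(1))
        elif action_match:
            # Only the RESTART and REBOOT labels seen in this post are handled.
            actions.append((action_match.group(1).lower(), int(action_match.group(2))))
    return reset_period, actions


def build_fix_command(service: str, text: str, reset_seconds: int = 86400) -> Optional[str]:
    """Suggest an "sc failure" command when the reset period is 0."""
    reset_period, actions = parse_qfailure(text)
    if reset_period != 0 or len(actions) < 2:
        return None  # nothing to fix, or nothing we know how to fix
    spec = "/".join(f"{name}/{delay}" for name, delay in actions)
    # sc.exe expects a space after "reset=" and "actions=".
    return f"sc failure {service} reset= {reset_seconds} actions= {spec}"


if __name__ == "__main__":
    print(build_fix_command("spntsvc", SAMPLE))
```

The suggested command keeps the actions exactly as they were configured and changes only the reset period, so you can review it before running it from an elevated prompt.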

I hope this post will help you and, if so, do not hesitate to comment!


  1. We've got a print spooler configured this way too but it reports 'The Print Spooler service terminated unexpectedly. It has done this 3 time(s).' As the Reset fail count is 0 I'm surprised it reports back 3 times. Any thoughts?

    1. Hi,
      OK, this might be because the operating system keeps track of all service terminations, but if the 'reset counter' is set to 0 it won't make use of that record. This is my understanding of this confusing behavior...

    2. Okay I've tested the behaviour of two services, one random service and one print spooler service. The random service, with a reset count of 0, reports '1 time' every time it fails: '...service terminated unexpectedly. It has done this 1 time(s).'

      However the Print Spooler increments this number and records it:

      The Print Spooler service terminated unexpectedly. It has done this 2 time(s).

      The Print Spooler service terminated unexpectedly. It has done this 3 time(s).

      This only seems to occur with 2003, not 2008...

      The fix?

      Set 'Subsequent failures' to restart as well for 2003 boxes with failing print spoolers.

    3. Thanks for the information on the reset counter value of 0!

  2. Carlos, man, thanks so much for this! I would never have guessed that about the counter - I thought that was, ya know, the counter, silly me. Why would that affect the subsequent actions? So weird. You saved me a ton of headache (as I'm trying to figure out when a service fails) and I was banging my head. Thanks!!

  3. Carlos, man, you saved me big time. Why would the counter affect the subsequent steps? So weird. Anyway, thanks for this!!

