As you probably know, Windows can automatically perform a predefined action in response to the failure of a Windows service. The Recovery tab of the service's property page lets you define the action that the system performs on the first failure, the second failure, and subsequent failures.
Valid options are "Take No Action", "Restart the Service", "Run a Program", and "Restart the Computer".
In my case I have configured my test Trend ServerProtect service to restart after the first and the second failure, and the system to reboot the next time this service fails.
To test this I wrote a basic batch script that repeatedly kills the service. Doing so, I discovered that, with the default setting, Windows always performs the action defined for the first failure (in my case my Trend ServerProtect test service is restarted) and never moves on to the later actions.
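A test loop of this kind could look roughly like the sketch below. It is not the script I actually used, only an illustration: the 90-second pause is an assumption chosen to outlast the 60-second restart delay, and the script must run from an elevated prompt.

```bat
@echo off
rem Illustrative sketch only: repeatedly kill the process behind a service
rem so that the Service Control Manager records an unexpected termination.
setlocal
set "SERVICE=spntsvc"

:loop
set "SVCPID=0"
rem "sc queryex" prints a line such as "PID : 1234"; the third token is the PID.
for /f "tokens=3" %%p in ('sc queryex %SERVICE% ^| findstr /c:"PID"') do set "SVCPID=%%p"

rem A PID of 0 means the service is not running at the moment.
if not "%SVCPID%"=="0" taskkill /F /PID %SVCPID%

rem Assumed pause: long enough for the 60000 ms restart delay to elapse.
timeout /t 90 /nobreak >nul
goto loop
```

Stop it with Ctrl+C once enough failures have been logged.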
Furthermore, the event log reports the same diagnostic message every time, even for recurring service failures:
Log Name: System
Source: Service Control Manager
Date: 07/12/2011 10:54:25
Event ID: 7031
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: servername
Description:
The Trend ServerProtect service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
The "It has done this 1 time(s)" sentence looks wrong to me, because I am repeatedly killing this service and the failure counter should increase.
If I double-check the recovery parameters with sc.exe, I am happy with the output:
sc qfailure spntsvc
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: spntsvc
RESET_PERIOD (in seconds) : 0
REBOOT_MESSAGE :
COMMAND_LINE :
FAILURE_ACTIONS :
RESTART -- Delay = 60000 milliseconds.
RESTART -- Delay = 60000 milliseconds.
REBOOT -- Delay = 60000 milliseconds.
So why does the failure counter not increase? It clearly looks like there is a bug in the way the Service Control Manager reads or interprets the parameters I have set.
After a deep investigation, and many searches through technet.microsoft.com, I found that setting the "Reset fail count after:" option to 0 means that the failure counter is not stored at all, which is why every failure is reported as the first one. So I had completely misunderstood its meaning. At first I was lost for words when I discovered that this parameter did not do what I expected.
Anyway, once you know that leaving this option at 0 disables both the "second failure" and "subsequent failures" actions, the solution is pretty simple: set its value to 1 (or whatever you like) and you will get the desired behavior upon service failure (in my case the server restarts on the third failure).
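The same change can be made from the command line with sc.exe. The lines below are a sketch under a few assumptions: the reset period of 86400 seconds (one day) is just an example value, the actions simply repeat the ones shown in the qfailure output above, and the command is run from an elevated prompt. Note that sc.exe expects the reset period in seconds, and that the space after each "=" is required.

```bat
rem Set a non-zero reset period so the failure count is kept between failures.
rem 86400 seconds = 1 day (example value; pick whatever suits you).
sc failure spntsvc reset= 86400 actions= restart/60000/restart/60000/reboot/60000

rem Verify: RESET_PERIOD should no longer be 0.
sc qfailure spntsvc
```

After this, repeated failures within the reset period should be counted, so the event log message should report 2 time(s), 3 time(s), and so on, and the later actions should be reached.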
I hope this post will help you and, if so, do not hesitate to comment!