I initially thought I had this solved with my two tasks [one to start the service at startup and a second running every hour]. However, this was not entirely effective in keeping all hosts online.
I have had to go a step further and add a task that stops and then starts the service at 4am every day. Without this task, I was finding that some [not all] hosts would go grey and never come back online. The service itself was running, but the host was simply offline. The issue does not occur every day; it seems quite random. My best theory is that if the host loses connectivity to the RU server for an extended period [exact length as yet unknown], it seems to give up trying and never reconnects. Maybe there is a connection retry limit somewhere in the host.
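For anyone wanting to set up the same 4am restart, a minimal sketch using the built-in schtasks command might look like the following. It assumes the host service name is RManService (the name used in the RUMonitor.cmd script later in this thread) and that the task should run as SYSTEM; the task name "RUMonitor Daily Restart" is hypothetical. Run it once from an elevated command prompt.

```batch
@echo off
REM Sketch only: create a daily 04:00 task that stops and then starts the RU host service.
REM Assumptions: service name RManService, hypothetical task name "RUMonitor Daily Restart".
REM Must be run from an elevated administrator command prompt.

REM A single ampersand is used so the start is attempted even if the stop fails,
REM for example because the service was already stopped.
schtasks /Create /F /TN "RUMonitor Daily Restart" /SC DAILY /ST 04:00 /RU SYSTEM /RL HIGHEST /TR "cmd.exe /c net stop RManService & net start RManService"
```

net stop waits for the service to finish stopping before returning, so the start command that follows it should not race the stop.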
So far, the 4am daily restart appears to be dealing with this issue for me, and as I typically do not need overnight sessions to stay active, it has no negative impact here. My gripe, however, is that if the remote host restarts while I have left a session running, the viewer shows the host as offline but does not close the window. When I then close the viewer window, each one takes about 5-8 seconds to close, which is frustrating.
I have only implemented these recurring tasks [I call it RUM: Remote Utilities Monitor] on a small number of our hosts, but overall, reliability has improved significantly on these, whilst the non-RUM'd machines are often problematic.
It would be great if RU had its own watchdog service built in, so we would not have to implement our own.
In my case, the issue is not the firewall, sleep, or connection interruptions. It is an RU issue. If it were not, it would be difficult to explain how it happens on dozens of machines, all in different geographical locations, on different internet providers, with different hardware, different antivirus, different routers, and varying sleep/no-sleep settings.
I have resorted to three scheduled tasks: one that checks on startup that the service is running, one that checks again hourly, and a third that restarts the service each day at 4am. This has made the issue less of a headache, but it still exists.
There is another issue that accompanies this for us. I am unsure whether it is part of the same problem, and it is less common than the startup problem.
We notice it on hosts that run for long periods without reboots: sometimes the host goes offline in the viewer, and the icon on the host machine turns grey. Attached are screenshots from one such host currently in this state. In this case, the RU service is running. If I manually restart the service, the host comes back online and I can then connect, but it will not recover on its own. We do not have logging enabled on the host machines, so I have no log information to accompany this.
The timeout error is interesting. If the host is trying to start while some service it depends on is not yet running, I guess that is a possible source of the issue. In my case, though, I have previously tried setting the service recovery options to Restart the Service for the first, second and subsequent failures, and this did not resolve the problem. Only a manual start request ever seemed to work.
I don't know that I will be able to provide this information. I do not always realise that the host is offline right away. We have almost 500 hosts and we do not log into them all every day. I will keep an eye out and see if I can catch a host going offline after a reboot.
Something I have found with the host service is that even if I set the service recovery options to Restart the Service on the first, second and third failures, this does not get the service online after a reboot. It seems that if the service does not start on boot, only starting it manually will bring it back online.
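To push the recovery settings described above to many hosts without going through the Services console, they can be applied with sc.exe. This is a sketch assuming the service name RManService; the one-day reset period and one-minute restart delays are arbitrary example values. Windows documents these failure actions as applying when a service terminates unexpectedly (or, with failureflag enabled, stops with an error), so they may not fire at all when the service never starts at boot. That would be consistent with the behaviour described here, but it is not confirmed.

```batch
@echo off
REM Sketch: set recovery actions for the RU host service. Run as administrator.
REM Assumed example values: the failure count resets after 86400 seconds, which is 1 day,
REM and each restart waits 60000 milliseconds, which is 1 minute.
REM sc.exe requires a space after each "option=" name.
sc failure RManService reset= 86400 actions= restart/60000/restart/60000/restart/60000

REM Also trigger the actions when the service stops with an error,
REM not only when its process crashes.
sc failureflag RManService 1

REM Print the resulting recovery configuration to check it.
sc qfailure RManService
```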
I am surprised you cannot replicate this issue. I would say it happens on most of our client hosts quite regularly.
...I just re-read your post and realised you are not using the script. I am unsure whether running the start command when the service is already running will have any effect. Probably not, but I guess you will know for sure after it runs for a few days.
The antivirus is probably flagging the script as an unknown, untrusted program. I would say it quarantined the RUMonitor.cmd file, which is why it did not work for you initially.
Your way will still work fine; I needed a way to deploy it to machines quickly, which is why I created the installer. Creating an antivirus exclusion will probably solve this for you if you wish to do that.
You should see no adverse effects from running this, as the script executes no commands when the RU service is already running; it only issues a start command if the reported state is anything other than RUNNING.
This thread got me thinking about the watchdog, so I have created a makeshift script version. Looking forward to a real fix, but in the meantime, here it is for anyone interested.
1. Create a batch file called CreateTask.cmd and insert the below code: [This code creates a folder C:\Program Files\RUMonitor to store the script and creates two scheduled tasks: the first runs the script at startup, the second runs it every hour on the hour. Feel free to modify the schedule to suit your needs.]
2. Create a second batch script called RUMonitor.cmd and insert the below code: [This code checks whether the RU host service is running. If it is, nothing is done; if it is not, a start command is sent to the service.]
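@echo off
REM RUMonitor.cmd: start the RU host service (RManService) if it is not running
for /F "tokens=3 delims=: " %%H in ('sc query "RManService" ^| findstr " STATE"') do (
  if /I "%%H" NEQ "RUNNING" (
    REM The service is not running, so send a start command
    net start "RManService"
  )
)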
3. Place both scripts together in one folder [it does not matter which folder], then run CreateTask.cmd as administrator. That's all that is required. You will see the two tasks in Task Scheduler.
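As a rough sketch of what step 1 describes (create C:\Program Files\RUMonitor, copy the monitor script there, then register a startup task and an hourly task), something like the following should work. It is not the original CreateTask.cmd: the task names "RUMonitor Startup" and "RUMonitor Hourly" are hypothetical, and running the tasks as SYSTEM is an assumption on my part.

```batch
@echo off
REM Sketch of a CreateTask.cmd equivalent; not the original script from this thread.
REM Assumptions: run as administrator from the folder that also holds RUMonitor.cmd,
REM hypothetical task names "RUMonitor Startup" and "RUMonitor Hourly", tasks run as SYSTEM.
set "RUDIR=C:\Program Files\RUMonitor"

REM Create the target folder and copy the monitor script into it.
REM %~dp0 is the folder this script lives in, so the current directory does not matter.
if not exist "%RUDIR%" mkdir "%RUDIR%"
copy /Y "%~dp0RUMonitor.cmd" "%RUDIR%\RUMonitor.cmd" >nul

REM Task 1: run the check once at system startup.
REM The \" sequences keep the path quoted because it contains a space.
schtasks /Create /F /TN "RUMonitor Startup" /SC ONSTART /RU SYSTEM /RL HIGHEST /TR "\"%RUDIR%\RUMonitor.cmd\""

REM Task 2: run the check every hour on the hour, starting at 00:00.
schtasks /Create /F /TN "RUMonitor Hourly" /SC HOURLY /MO 1 /ST 00:00 /RU SYSTEM /RL HIGHEST /TR "\"%RUDIR%\RUMonitor.cmd\""
```

The /F switch overwrites any existing task with the same name, so re-running the script after editing the schedule simply replaces the tasks.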