TaskForest

A simple, expressive, open-source, text-file-based Job Scheduler with console, HTTP, and RESTful API interfaces.

Documentation

Automatic Retries

TaskForest can be configured to automatically retry a job when it fails. The system-wide option num_retries specifies how many times a failed job will be retried. The retry_sleep option specifies how many seconds the system will wait before each retry.

The num_retries and retry_sleep options may also be specified for individual jobs. For example, if the configuration file contains this...

# This is the number of times to automatically
# retry running a job that fails 
num_retries              = 1

# Wait this many seconds before automatically
# retrying a job that fails
retry_sleep              = 300

...then if any job fails, the run wrapper will sleep for 300 seconds and then retry the job once. If the retry fails as well, the job will be considered to have failed. If the retry succeeds, the job will be considered to have run successfully. During the 300-second sleep period, and during the retries, the official status of the job will still be 'Running'.
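The retry behaviour described above can be sketched in a few lines of Python. This is only an illustration of the logic, not TaskForest's actual run wrapper code: the function name run_with_retries, the run_job callable, and the injectable sleep parameter are all assumptions made for this example.

```python
import time


def run_with_retries(run_job, num_retries=1, retry_sleep=300, sleep=time.sleep):
    """Run a job, retrying on failure (hypothetical sketch, not TaskForest code).

    run_job is a zero-argument callable that returns the job's exit code.
    The job is run once, then retried up to num_retries times, sleeping
    retry_sleep seconds before each retry. Returns (exit_code, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = run_job()
        if exit_code == 0:
            # The first run or a retry succeeded: the job counts as a success.
            return 0, attempts
        if attempts > num_retries:
            # All retries are used up: the job counts as a failure.
            return exit_code, attempts
        # While sleeping, the job would still be reported as 'Running'.
        sleep(retry_sleep)
```

With num_retries set to 1 and retry_sleep set to 300, a job that fails on its first run is run at most twice, with a single 300-second pause in between.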

The responsibility of implementing the auto-retries falls on the run wrapper. Even though TaskForest ships with two run wrappers, you really should use run_with_log and not run. When a job is being retried, run_with_log will note it in the log file like this:

*****************************************************************
Start Time:   Mon Mar 22 17:57:21 2010
Family:       RETRY
Job:          J_Retry
Job File:     J_Retry
Log Dir:      logs/20090503
Script Dir:   jobs
Pid File:     logs/20090503/RETRY.J_Retry.pid
Success File: logs/20090503/RETRY.J_Retry.0
Failure File: logs/20090503/RETRY.J_Retry.1
Out/Err File: logs/20090503/RETRY.J_Retry.19258.1269298641.stdout

*****************************************************************

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Current Time: Mon Mar 22 17:57:21 2010
!! Exit Code:    256
!! 
!! Job failed.  Sleeping 2 seconds and then retrying (retry 1 of 1).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

*****************************************************************
Start Time: Mon Mar 22 17:57:21 2010
End Time:   Mon Mar 22 17:57:23 2010
Duration:   2 seconds
Exit Code:  0
*****************************************************************

You can see that in this case, TaskForest had been instructed to sleep for 2 seconds before retrying, and that there was to be only one retry.

You can also override these configuration settings for individual jobs. Let's say you have your configuration file set up as shown above, with one retry after 300 seconds. Now suppose you want job J_Retry to be retried 10 times, with a two-minute sleep between retries. You could specify the overrides directly in the family file as follows:


   +-------------------------------------------------------
   | ...
   |
   | J_Retry (num_retries => 10, retry_sleep => 120)
   |
   | ...
   +-------------------------------------------------------

This local specification of ten retries spaced two minutes apart will override what you have specified in the configuration file. You can override these configuration options on as many jobs as you wish.
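The precedence rule can be pictured as a simple merge in which per-job values win over system-wide ones. The Python below is a sketch of that rule only. The function name effective_retry_options and the dictionary representation are assumptions for illustration and do not reflect how TaskForest parses family files internally.

```python
RETRY_OPTION_NAMES = ("num_retries", "retry_sleep")


def effective_retry_options(global_options, job_options):
    """Return the retry options that apply to one job (hypothetical sketch).

    global_options holds the values from the configuration file;
    job_options holds any overrides given for the job in the family file.
    Only the two retry options are considered here.
    """
    effective = {name: global_options[name] for name in RETRY_OPTION_NAMES}
    for name in RETRY_OPTION_NAMES:
        if name in job_options:
            # A per-job value overrides the system-wide setting.
            effective[name] = job_options[name]
    return effective
```

Under these assumptions, J_Retry would end up with ten retries spaced 120 seconds apart, while a job with no overrides would keep the system-wide single retry after 300 seconds.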