Re-trying failed jobs - PeriodicRelease

When submitting many jobs, a few may fail and go to the held state on some sites because of differences in architecture, system library requirements, operating system versions, and so on. In such cases, periodic release automatically retries the failed jobs. The following HTCondor expression

PeriodicRelease = ( (CurrentTime - EnteredCurrentStatus) > $RANDOM_INTEGER(60, 7200, 120) ) && ((NumJobStarts < 4))

would retry a failed job up to three times: a held job is released only while NumJobStarts is less than 4, so it gets at most four starts in total. $RANDOM_INTEGER(60, 7200, 120) picks a random integer between 60 and 7200 seconds with a step size of 120 seconds. With the above expression, each failed job is released after a randomly chosen time in the held state, somewhere between 1 minute and about 2 hours. Releasing many jobs at the same time stresses the login node, so the random spread is a good way to release the failed jobs periodically.
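
The arithmetic behind the expression can be checked with a small Python sketch. This is not HTCondor code: the function names are made up for illustration, and it assumes that the random value is always the minimum plus a whole multiple of the step, and that the job fails and is held after every start.

```python
# Illustrative sketch only; candidate_delays and count_starts are
# hypothetical helpers, not part of HTCondor.

def candidate_delays(lo=60, hi=7200, step=120):
    """Delays (seconds) the macro could pick, ASSUMING values are
    lo, lo + step, lo + 2*step, ... without exceeding hi."""
    return list(range(lo, hi + 1, step))

def count_starts(max_starts=4):
    """Total starts of a job that fails every time, when it is released
    only while NumJobStarts < max_starts."""
    num_job_starts = 1                   # first run fails, job is held
    while num_job_starts < max_starts:   # the (NumJobStarts < 4) clause
        num_job_starts += 1              # released, started again, fails again
    return num_job_starts

delays = candidate_delays()
print("number of possible delays:", len(delays))
print("shortest delay (min):", delays[0] / 60)
print("longest delay (min):", delays[-1] / 60)
print("retries after the first start:", count_starts() - 1)
```

Under these assumptions the longest delay is slightly under the 7200 second upper bound, because 7200 is not 60 plus a whole multiple of 120.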


Very nice Bala.  I wonder if others have experience with job retry settings in HTCondor?
