What is the best way to re-run jobs that exit with a certain error?

I'm running a batch of jobs and have noticed that some fail to obtain their data through stashcp, while others are fine.


I would like a way to rerun only the minority of jobs that fail, and not the jobs that run perfectly fine. I appreciate any tips!


Best Answer

Scott,


First, make sure your jobs fail with a non-zero exit code in case of failure. If you are using a bash job wrapper, you can do this by adding set -e to the wrapper. For example:


#!/bin/bash

set -e

stashcp ....


This will fail the wrapper as soon as a command in the wrapper fails. Then, in your submit file, add rules to handle failed jobs. For example:


# Send the job to Held state on failure. 

on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)


# Release held jobs after they have been held for more than 1 hour, as long as they have started fewer than 5 times.

periodic_release = (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 60*60)


This will catch the non-zero exit code from the failing job wrapper, and allow failed jobs to be retried automatically until they have started 5 times. For a full job submit file example, see:


https://github.com/OSGConnect/tutorial-quickstart/blob/master/osg-template-job.submit 
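
To see the effect of set -e in isolation, here is a minimal sketch that runs the same short wrapper body with and without it. The `( exit 3 )` line is only a stand-in for a failing stashcp call, and the value 3 is an arbitrary assumption, not what stashcp actually returns.

```shell
#!/bin/bash
# Wrapper body shared by both runs. "( exit 3 )" is a hypothetical stand-in
# for a failed transfer; the value 3 is arbitrary.
body='( exit 3 ); echo "continuing after the failed step"'

# Without set -e: the script keeps going, and its exit code is that of the
# last command (echo), so the job looks successful.
rc_plain=0
bash -c "$body" >/dev/null || rc_plain=$?

# With set -e: the script stops at the failing command and exits with its code.
rc_strict=0
bash -c "set -e; $body" >/dev/null || rc_strict=$?

echo "without set -e: exit code $rc_plain"
echo "with set -e:    exit code $rc_strict"
```

The first case is why a job can report a zero exit code even though a transfer inside it failed.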


Thank you! As for the exit code, where can I find that in the log/error files?


For an example failed job, in my error file, I see:

[ERROR] Server responded with an error: [3011] No servers are available to read the file.

...


And in the job file, I see: 

...

005 (31090602.010.000) 05/01 15:25:32 Job terminated.

 (1) Normal termination (return value 0)

...


So I am guessing that the error code is either 3011 or 0. Which is it?


Is that stashcp giving that error? What should have happened is that stashcp would have failed with a non-zero exit code and, because of the set -e, your job script should have stopped there and also exited with a non-zero exit code.


Can you point me to the full log file for that job? You can just give a path on the submit host and I can take a look there.
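
In the meantime, the value HTCondor recorded can be read straight from the job log: it is the number in the "return value" line of the termination event. Below is a minimal sketch that uses the two lines quoted above as sample input. The function name `last_return_value` is made up for illustration; for a real job you would pass the file named by `log =` in your submit file.

```shell
#!/bin/bash
# Print the number from the last "(return value N)" line in a job log.
# last_return_value is a hypothetical helper, not an HTCondor tool.
last_return_value() {
    sed -n 's/.*(return value \([0-9][0-9]*\)).*/\1/p' "$1" | tail -n 1
}

# Sample input: the termination event quoted earlier in this thread.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
005 (31090602.010.000) 05/01 15:25:32 Job terminated.
 (1) Normal termination (return value 0)
EOF

value=$(last_return_value "$sample_log")
echo "return value: $value"
rm -f "$sample_log"
```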

Sorry, I've since deleted those log and error files since I didn't need them anymore.


What I pasted above was from a job before I made the modifications you suggested (e.g. 'set -e'). I was wondering if 3011 was what you meant by 'exit code' in that case.

Ah, well, I don't know what the exact exit code is. The 3011 is the error number reported by the server, so the exit code of stashcp itself is probably a different non-zero value. You could check for specific exit codes, but the general convention is that non-zero means some type of failure. In most cases we don't care which particular failure it was.
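
As an illustration of why a number like 3011 is not itself the exit code, and of the zero versus non-zero convention: on Linux an exit code is a single byte (0 to 255), so larger values wrap around. This is a minimal sketch, and `describe_exit` is a made-up helper name.

```shell
#!/bin/bash
# A process that asks to exit with 3011 reports only the low byte of that value.
rc=0
bash -c 'exit 3011' || rc=$?
echo "requested 3011, got $rc"

# Hypothetical helper: apply the zero / non-zero convention.
describe_exit() {
    if [ "$1" -eq 0 ]; then
        echo "success"
    else
        echo "failure"
    fi
}

echo "exit code 0 -> $(describe_exit 0)"
echo "exit code $rc -> $(describe_exit "$rc")"
```

For retry rules like the ones in the submit file above, the zero versus non-zero distinction is all that `ExitCode != 0` looks at.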
