Identifying jobs with problems
The connect q command reports the status of submitted jobs:
$ connect q <osgconnect-username>
-- Submitter: login01.osgconnect.net : <18.104.22.168:40814> : login01.osgconnect.net
ID      OWNER     SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
1234.0  username  4/29 16:42   0+00:00:00  I   0    0.0   short.sh
1234.1  username  4/29 16:42   0+00:00:00  I   0    0.0   short.sh
1234.2  username  4/29 16:42   0+00:00:49  R   0    0.0   short.sh
1234.3  username  4/29 16:42   0+00:00:49  R   0    0.0   short.sh
1234.4  username  4/29 16:42   0+00:00:49  R   0    0.0   short.sh
1234.5  username  4/29 16:42   0+00:00:00  I   0    0.0   short.sh
1234.6  username  4/29 16:42   0+00:00:49  R   0    0.0   short.sh
1234.7  username  4/29 16:42   0+00:00:49  R   0    0.0   short.sh
1234.8  username  4/29 16:42   0+00:00:49  H   0    0.0   short.sh
1234.9  username  4/29 16:42   0+00:00:49  H   0    0.0   short.sh
10 jobs; 0 completed, 0 removed, 3 idle, 5 running, 2 held, 0 suspended
Notice that the last two jobs have a state (the ST column) of H. This indicates that the jobs encountered some sort of problem and have been held. Jobs that are held remain in a suspended state in the system and are not run.
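As a rough sketch, held jobs can be picked out of a queue listing with a small shell function. The name held_jobs is hypothetical and is not part of the connect client; the function assumes the column layout shown above, with the job ID in the first column and the state in the sixth.

```shell
# held_jobs (hypothetical helper): read a queue listing on stdin and
# print the ID of every job whose ST column is H.
# Assumes ID is field 1 and ST is field 6, as in the listing above.
held_jobs() {
    # Only rows whose first field looks like CLUSTER.PROC are job rows;
    # this skips the submitter banner, the header, and the summary line.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $6 == "H" { print $1 }'
}
```

For example, piping the output of connect q into held_jobs would print 1234.8 and 1234.9 for the listing above.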
Troubleshooting job errors
In order to troubleshoot jobs, you'll need to use the
connect shell command as follows:
$ connect shell
[connected to connect://email@example.com/connectusername; ^D to disconnect]
sh-4.1$
Once this is done, you can use condor_q to diagnose jobs that are not running, and condor_ssh_to_job to troubleshoot a job that's running.
Diagnostics with condor_q
The condor_q command shows the status of jobs and can be used to diagnose why jobs are not running. With the -better-analyze option, condor_q can show you detailed information about why a job isn't running:
sh-4.1$ condor_q -better-analyze JOB-ID
This will indicate why the job is being held. Often, the problem is due to missing input or output files. If that's the case, check your submit files to make sure that you have the correct filenames for inputs and outputs.
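One way to check the input filenames is sketched below. It assumes the submit file lists its inputs on a single transfer_input_files line, that the paths are relative to the directory you run the check from, and that they contain no macros or URLs. The function name check_inputs and the file name job.submit are hypothetical.

```shell
# check_inputs (hypothetical helper): report every file named in the
# transfer_input_files line of a submit file that does not exist.
# Usage: check_inputs job.submit
check_inputs() {
    submit_file=$1
    missing=0
    while IFS= read -r f; do
        [ -n "$f" ] || continue          # skip blank entries
        if [ ! -e "$f" ]; then
            echo "missing input: $f"
            missing=1
        fi
    done <<EOF
$(grep -i '^[[:space:]]*transfer_input_files[[:space:]]*=' "$submit_file" \
    | cut -d= -f2- | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
EOF
    # Return 1 if at least one listed input file was not found.
    return $missing
}
```

The function prints one line per missing file and returns a non-zero status if any are missing, so it can be run before resubmitting a held job.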
The condor_ssh_to_job command allows you to ssh to the compute node where the job is running. After running condor_ssh_to_job, you will be connected to the remote system, and you will be able to use normal shell commands to investigate your job:
sh-4.1$ condor_ssh_to_job JOB-ID
This page was updated on Oct 16, 2017 at 16:00 from connectbook/client-troubleshooting.md.