Organizing and Submitting HTC Workloads

Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author.

This tutorial starts with the same setup as our Wordcount Tutorial for Submitting Multiple Jobs, but focuses on how to organize that example more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future.

Our Workload

We can analyze one book by running the wordcount.py script with the name of the book we want to analyze as its argument:

$ ./wordcount.py Alice_in_Wonderland.txt

Try running the command to see what output the script produces. Once you have done that, delete the output file it created (rm counts.Alice_in_Wonderland.txt).
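If you're curious what such a script might look like, here is a minimal sketch. This is an assumption for illustration only -- the tutorial's actual wordcount.py may differ -- but it captures the behavior described here: read the book file named on the command line, tally word frequencies, and write them to a counts.<book> file.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a wordcount script; the tutorial's actual
# wordcount.py may differ in its details.
import sys
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    return Counter(text.lower().split())

# Only run the file-handling part when invoked with a book filename.
if __name__ == "__main__" and len(sys.argv) == 2:
    book = sys.argv[1]                  # e.g. Alice_in_Wonderland.txt
    with open(book) as f:
        counts = count_words(f.read())
    # Write one "word count" pair per line to counts.<book>,
    # matching the output filename used elsewhere in this tutorial.
    with open("counts." + book, "w") as out:
        for word, n in counts.most_common():
            out.write(f"{word} {n}\n")
```

Note that the script writes its output into the current working directory, which is why a counts.Alice_in_Wonderland.txt file appears next to wherever you ran the command.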

We want to run this script on all the books we have copies of.

  1. What is the input set for this HTC workload?
  2. What is the output set?

Make an Organization Plan

Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the access point?

There will also be system and HTCondor files produced when we submit a job -- how would you organize the log, standard error and standard output files?

Try making those changes before moving on to the next section of the tutorial.

Organize Files

There are many different ways to organize files; a simple example that works for most workloads is having a directory for your input files and a directory for your output files. We can set up this structure on the command line by running:

$ mkdir input
$ mv *.txt input/
$ mkdir output/

We can view our current directory and its subdirectories by using the recursive flag with the ls command:

$ ls -R
README.md    books.submit input        output       wordcount.py

./input:
Alice_in_Wonderland.txt Huckleberry_Finn.txt    Ulysses.txt
Dracula.txt             Pride_and_Prejudice.txt

./output:

We are also going to create directories for the HTCondor log files and the standard error and standard output files (in one directory):

$ mkdir logs
$ mkdir errout

Submit One Job

Now we want to submit a test job that uses this organizing scheme, using just one item in our input set -- in this example, we'll use the Alice_in_Wonderland.txt file from our input/ directory. The lines that need to be filled in are shown below and can be edited using the nano text editor:

$ nano books.submit

executable    = wordcount.py
arguments     = Alice_in_Wonderland.txt

transfer_input_files    = input/Alice_in_Wonderland.txt
transfer_output_files   = counts.Alice_in_Wonderland.txt
transfer_output_remaps  = "counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt"

Note that to tell HTCondor the location of the input file, we need to include the input directory. We're also using a submit file option called transfer_output_remaps that will essentially move the output file to our output/ directory by renaming or remapping it.

We also want to edit the submit file lines that tell HTCondor where to put the log, standard error, and standard output files:

$ nano books.submit
output        = errout/job.$(ClusterID).$(ProcID).out
error         = errout/job.$(ClusterID).$(ProcID).err
log           = logs/job.$(ClusterID).$(ProcID).log

Once you've made the above changes to the books.submit file, you can submit it, and monitor its progress:

$ condor_submit books.submit
$ condor_watch_q

(Type CTRL-C to stop the condor_watch_q command.)

Submit Multiple Jobs

We are now sufficiently organized to submit our whole workload.

First, we need to create a file with our input set -- in this case, it will be a list of the book files we want to analyze. We can do this by using the shell's listing command ls and redirecting the output to a file. Running ls from the submit directory, rather than from inside input/, keeps booklist.txt itself out of the listing (the shell creates the redirect target before ls runs):

$ ls input > booklist.txt
$ cat booklist.txt
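The same list could also be generated with a short Python sketch -- an alternative to ls, shown here only to make the step explicit. It assumes the input/ directory layout created earlier and writes one filename per line to booklist.txt:

```python
# Build booklist.txt from the .txt files in input/ (alternative to `ls`).
# Assumes the input/ directory created earlier in this tutorial.
from pathlib import Path

def list_books(input_dir):
    """Return the sorted names of the .txt files in input_dir."""
    return sorted(p.name for p in Path(input_dir).glob("*.txt"))

if __name__ == "__main__":
    if Path("input").is_dir():
        books = list_books("input")
        # One book filename per line, as expected by `queue book from`.
        Path("booklist.txt").write_text("\n".join(books) + "\n")
```

Either way, the result is a plain text file with one book filename per line.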

Then, we modify our submit file to reference this input list and replace the static values from our test job (Alice_in_Wonderland.txt) with a variable -- we've chosen $(book) below:

$ nano books.submit

executable    = wordcount.py
arguments     = $(book)

transfer_input_files    = input/$(book)
transfer_output_files   = counts.$(book)
transfer_output_remaps  = "counts.$(book)=output/counts.$(book)"

# other options

queue book from booklist.txt
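Putting the pieces together, the complete books.submit might look something like the following sketch. Keep whatever other options your submit file already contains (resource requests, requirements, and so on) in place of the comment:

```
executable    = wordcount.py
arguments     = $(book)

transfer_input_files    = input/$(book)
transfer_output_files   = counts.$(book)
transfer_output_remaps  = "counts.$(book)=output/counts.$(book)"

output        = errout/job.$(ClusterID).$(ProcID).out
error         = errout/job.$(ClusterID).$(ProcID).err
log           = logs/job.$(ClusterID).$(ProcID).log

# other options (resource requests, etc.) from your original submit file

queue book from booklist.txt
```

HTCondor will submit one job per line of booklist.txt, substituting each filename for $(book) throughout the file.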

Once this is done, you can submit the jobs as usual:

$ condor_submit books.submit

 

This page was updated on May 20, 2022 at 18:09 from tutorials/tutorial-organizing/README.md.