Scaling up computational resources is a big advantage for certain large-scale calculations on OSG. Consider the extensive sampling needed for a multi-dimensional Monte Carlo integration, or a molecular dynamics simulation with several initial conditions. These types of calculations require submitting a lot of jobs.
In the previous example, we submitted the job to a single worker machine. About a million CPU hours per day are available to OSG users on an opportunistic basis. Learning how to scale up and control large numbers of jobs is key to realizing the full potential of distributed high throughput computing on the OSG.
In this section, we will see how to scale up the calculations with a simple example. Once we understand the basic HTCondor script, it is easy to scale up.
    $ tutorial ScalingUp-R
    $ cd tutorial-ScalingUp-R
As we discussed in the previous section on HTCondor scripts, we need to prepare the job execution and the job submission scripts. Here again is our job execution script:
Submitting jobs concurrently
If we want to submit several jobs, we need to track the log, output, and error files for each job. An easy way to do this is to add the $(Cluster) and $(Process) macros to the HTCondor submit file.
    universe = vanilla

    Executable = R-wrapper.sh
    arguments = mcpi.R $(Process)
    # mcpi.R is the R program we want to run
    transfer_input_files = mcpi.R

    output = Log/job.out.$(Cluster).$(Process)
    error = Log/job.error.$(Cluster).$(Process)
    log = Log/job.log.$(Cluster).$(Process)

    # Checks if OASIS available
    requirements = (HAS_CVMFS_oasis_opensciencegrid_org =?= TRUE)
    queue 100
The queue 100 line tells HTCondor to enqueue 100 copies of this job as one cluster.
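To make the macro expansion concrete, here is a small shell sketch that prints the output file names produced by output = Log/job.out.$(Cluster).$(Process) for the first few jobs of a cluster. HTCondor performs this expansion itself; expand_names is a hypothetical helper written only for illustration, and 837 is just the cluster number from the sample submission in this tutorial.

```shell
#!/bin/bash
# Illustrative only: mimic how $(Cluster) and $(Process) expand in
#   output = Log/job.out.$(Cluster).$(Process)
# expand_names is a hypothetical helper, not an HTCondor command.
expand_names() {
    cluster=$1
    njobs=$2
    process=0
    # Within one cluster, $(Process) runs from 0 to njobs-1.
    while [ "$process" -lt "$njobs" ]; do
        echo "Log/job.out.${cluster}.${process}"
        process=$((process + 1))
    done
}

# First three of the names that "queue 100" would produce for cluster 837.
expand_names 837 3
```

Because every job in the cluster gets a different process number, no two jobs write to the same file.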
Let us take a look at the execution script, R-wrapper.sh:
    #!/bin/bash
    source /cvmfs/oasis.opensciencegrid.org/osg/modules/lmod/current/init/bash
    module load R
    Rscript $1 > mcpi.$2.out
The wrapper loads the R module and then runs the R program with the Rscript utility. From the submit file described above, the first argument is the name of the R program, mcpi.R, and the second argument is the process number. The process number is a sequence of integers, one per job (0 to 99 for queue 100), and is used here to name the output file.
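As a sketch of how the two arguments flow through the wrapper, the variant below checks that both were supplied and quotes them before use. It is not the tutorial's R-wrapper.sh: run_job is a hypothetical function, and the RSCRIPT variable is an assumed override added here so the logic can be tried on a machine without R.

```shell
#!/bin/bash
# Sketch of a more defensive wrapper body; NOT the tutorial's R-wrapper.sh.
# run_job is hypothetical. RSCRIPT is an assumed override (defaults to Rscript).
run_job() {
    if [ "$#" -ne 2 ]; then
        echo "usage: run_job <R program> <process number>" >&2
        return 1
    fi
    program=$1   # first argument: the R program, e.g. mcpi.R
    process=$2   # second argument: the value of $(Process)
    # Same naming scheme as the wrapper: mcpi.<process>.out
    "${RSCRIPT:-Rscript}" "$program" > "mcpi.${process}.out"
}

# Usage inside a wrapper script would be:
#   run_job "$1" "$2"
```

With arguments = mcpi.R $(Process) in the submit file, job number 7 would call run_job mcpi.R 7 and write mcpi.7.out.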
You'll see something like the following upon submission:
    $ condor_submit R.submit
    Submitting job(s).........................
    100 job(s) submitted to cluster 837.
Use what you know about connect watch to follow this job's progress. Once the jobs finish, a bash script (mcpi_ave.bash) computes the average from all the jobs.
Interlude: utilization plots
Before we continue, let's look at a URL: your OSG Connect home page. If you have not signed in, you'll be redirected back to the main site. Sign in as you did when you first signed up, and then click the link to your OSG Connect home page again.
You see a number of graphs and plots here showing things happening in OSG Connect. We'll go over these briefly, then return later.
We're waiting on 100 jobs. Let's use connect watch to watch for job completions. As soon as you see some jobs enter the R (running) state, press Control-C, and let's introduce a new command:
    $ connect histogram
    Val                               |Ct (Pct)    Histogram
    unl.edu                           |46 (68.66%) ████████████████████████████████▏
    bu.edu                            |13 (19.40%) █████████▏
    uconn.edu                         |2 (2.99%)   █▌
    CRUSH-OSG-10-5-220-34             |1 (1.49%)   ▊
    ufhpc                             |1 (1.49%)   ▊
    LAW-D-SBA01-S2-its-c6-osg-20141013|1 (1.49%)   ▊
    CRUSH-OSG-10-5-10-33              |1 (1.49%)   ▊
    iu.edu                            |1 (1.49%)   ▊
    vt.edu                            |1 (1.49%)   ▊
This command gives us a simple histogram of where on the grid our jobs are running. The column on the left is, for the most part, a list of sites that OSG jobs run on. At times job locations are not grouped together correctly. For example, the two rows for CRUSH-* above are really the same site, but histogram doesn't know about that site (yet), so it displays them as two. Most of the big sites are mapped correctly, though. You can see that in my case, 67 of my 100 jobs have begun running, and among them 69% (46 of 67) are running at the University of Nebraska at Lincoln.
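The Pct column is each site's job count divided by the number of jobs currently running. The following shell sketch reproduces that arithmetic from a list of site names, one per line. It is not how connect histogram is implemented: site_histogram and emit are hypothetical helpers, and the input is made-up data that mirrors the counts quoted above.

```shell
#!/bin/bash
# Hypothetical sketch; NOT how connect histogram is implemented.
# Reads one site name per line on stdin and prints, largest first,
# each site's job count and its percentage of the total.
site_histogram() {
    sort | uniq -c | sort -rn | awk '
        { count[NR] = $1; site[NR] = $2; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%-10s |%d (%.2f%%)\n", site[i], count[i], 100 * count[i] / total
        }'
}

# Helper that prints a site name N times (stand-in for real job data).
emit() {
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "$1"
        i=$((i + 1))
    done
}

# 46 jobs at unl.edu, 13 at bu.edu, 8 elsewhere: 67 running jobs in total.
{ emit unl.edu 46; emit bu.edu 13; emit elsewhere 8; } | site_histogram
```

The first line of output reports 46 jobs (68.66%) for unl.edu, the same figure as in the unl.edu row of the histogram above.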
connect histogram gives metrics on current jobs. As jobs complete, they no longer appear. How can you see where jobs have already run? connect histogram --last shows the run sites of your last job cluster.
    $ connect histogram --last
    Val                               |Ct (Pct)    Histogram
    uc3                               |49 (49.00%) ████████████████████████████████▏
    bu.edu                            |21 (21.00%) █████████████▊
    uconn.edu                         |11 (11.00%) ███████▎
    unl.edu                           |9 (9.00%)   ██████
    mwt2.org                          |3 (3.00%)   ██
    c5a-s22.ufhpc                     |3 (3.00%)   ██
    LCS-215-021-S2-its-c6-osg-20141013|3 (3.00%)   ██
    cinvestav.mx                      |1 (1.00%)   ▊
Once the jobs are completed, you might want to invoke the script
~~~
$ ./mcpi_ave.bash
~~~
to compute the average value of pi from all the available outputs.
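The contents of mcpi_ave.bash are not reproduced here. As a rough sketch of what such an averaging step could look like, the function below assumes that the last whitespace-separated field on the last line of each mcpi.N.out file is that job's estimate of pi (for example, a line such as [1] 3.1416 printed by R). Both that output format and the name average_estimates are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical averaging sketch; NOT the tutorial's mcpi_ave.bash.
# Assumption: the last field on the last line of each output file
# is that job's numeric estimate of pi.
average_estimates() {
    for f in "$@"; do
        tail -n 1 "$f"
    done | awk '
        { sum += $NF; n++ }
        END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Usage, from the directory holding the job outputs:
#   average_estimates mcpi.*.out
```

Averaging over more completed jobs should reduce the statistical noise in the estimate, which is the practical payoff of queueing many independent samples.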
- [x] Scaling up the computational resources on OSG is crucial to taking full advantage of grid computing.
- [x] Changing the value of Queue allows the user to scale up the resources.
- [x] Arguments allows you to pass parameters to a job script.
- [x] $(Process) can be used to name log files uniquely.
- [x] connect histogram gives a nice plot of resource assignments.
This page was updated on Oct 23, 2019 at 10:00 from tutorials/tutorial-ScalingUp-R/README.md.