Overview
For input files larger than 100MB and output files larger than 1GB, the default HTCondor file transfer mechanisms run the risk of over-taxing the login nodes and their network capacity. This is exactly why the OSG Data Federation exists for researchers with larger per-job data! Users on an OSG Connect login node can handle such files via the OSG Connect data caching origin (mounted and visible as the /public location) and use OSG's caching tools to scalably transfer them between the running jobs and the origin. The OSG caching tools ensure faster delivery to and from execute nodes by taking advantage of regional data caches in the OSG Data Federation, while preserving login node performance.
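As a quick check before choosing a transfer mechanism, standard shell tools on the login node can report whether a file or directory crosses these size thresholds; the file and directory names below are placeholders, not values from this guide:

  # human-readable size of a single prospective input file
  ls -lh my_input.tar.gz

  # total size of a software directory you plan to stage
  du -sh my_software/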
Important Considerations and Best Practices
- As described in OSG Connect's Introduction to Data Management on OSG Connect, the /public location must be used for:
  - Any input data or software larger than 100MB, for transfer to jobs using OSG caching tools
  - Any per-job output larger than 1GB and smaller than 10GB, which should ONLY be transferred back to the origin using a stashcp command within the job executable.
- Users must never submit jobs from the /public location, and should continue to ONLY submit jobs from within their /home directory. All log, error, and output files, and any other files smaller than the above values, should ONLY ever exist within the user's /home directory, unless otherwise directed by an OSG staff member. Thus, files within the /public location should only be referenced within the submit file using the methods described further below, and should never be listed for direct HTCondor transfer via transfer_input_files, transfer_output_files, or transfer_output_remaps. The /public location is a mount of the OSG Connect origin filesystem; it is mounted to the OSG Connect login nodes only so that users can appropriately stage large job inputs or retrieve outputs via the login nodes.
- Because of impacts to the filesystem of the data origin, files in the data origin (/public) should be organized as one or very few files per job. The filesystem is likely to encounter performance issues if/when the files accumulated there are highly numerous and/or small; for example, bundle many small inputs into a single archive per job, as sketched after this list.
- The /public location is otherwise unnecessary for smaller files, which can and should be served via the user's /home directory and regular HTCondor file transfer. Smaller files should only be handled via /public with explicit instruction from an OSG staff member.
- Files placed within a user's /public directory are publicly accessible, discoverable, and readable by anyone via the web. Data is made public via stash transfer (and, thus, via HTTP addresses) and is mirrored to a shared data repository available on a large number of systems around the world.
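One way to keep the number of files per job low on the data origin, as suggested in the list above, is to bundle many small inputs into a single archive before staging it in /public and to unpack it inside the job. This is a minimal sketch with placeholder file and directory names:

  # on the login node: bundle many small input files into one archive per job
  tar czf job_inputs_v1.tar.gz inputs_for_job/
  mkdir -p /public/<username>/project_inputs_v1/
  mv job_inputs_v1.tar.gz /public/<username>/project_inputs_v1/

  # inside the job executable: unpack the archive after it has been transferred
  tar xzf job_inputs_v1.tar.gz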
Use a 'stash' URL to Transfer Large Input Files to Jobs
Jobs submitted from the OSG Connect login nodes will transfer data from the origin when files are indicated with an appropriate stash:/// URL in the transfer_input_files line of the submit file:
- Upload your larger input and/or software files to your /public directory, which is accessible via your OSG Connect login node at /public/username; our Using scp To Transfer Files To OSG Connect guide may be helpful (see also the scp sketch at the end of this section). Because of the way your files in /public are cached across the Open Science Pool, any changes or modifications that you make after placing a file in /public will not be propagated. This means that if you add a new version of a file to your /public directory, it must first be given a unique name (or directory path) to distinguish it from previous versions of that file. Adding a date or version number to directory or file names is strongly encouraged to manage your files in /public.
- Add the necessary details to your HTCondor submit file to tell HTCondor which files to transfer, and that your jobs must run on execute nodes that have access to the Open Science Data Federation.

  # Submit file example of large input/software transfer
  log = my_job.$(Cluster).$(Process).log
  error = my_job.$(Cluster).$(Process).err
  output = my_job.$(Cluster).$(Process).out

  # Transfer input files
  transfer_input_files = stash:///osgconnect/public/<username>/<dir>/<filename>, <other files>

  ...other submit file details...
Note how the /public mount (visible on the login node) corresponds to the /osgconnect/public namespace across the Open Science Data Federation. For example, if the data file is located at /public/<username>/samples/sample01.dat, then the stash:/// URL to transfer this file to the job's working directory on the execute point would be:

  stash:///osgconnect/public/<username>/samples/sample01.dat
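As referenced in the first step above, uploading a large input to a versioned directory in /public with scp might look like the following sketch; the login node address, username, and file names are placeholders rather than values from this guide:

  # run from your own computer; create a versioned directory and copy the file into it
  ssh <username>@<osgconnect-login-node> "mkdir -p /public/<username>/inputs_v2"
  scp large_input_v2.tar.gz <username>@<osgconnect-login-node>:/public/<username>/inputs_v2/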
Use stashcp to Transfer Larger Job Outputs to the Data Origin
For output, users should use the stashcp command within their job executable, which will transfer the specified file to the desired location in the data origin.
- Add the necessary details to your HTCondor submit file to tell HTCondor that your jobs must run on execute nodes that have access to the stashcp module (among other supported modules). Note that the output files are NOT listed anywhere in the submit file for transfer purposes.

  # submit file example for large output
  log = my_job.$(Cluster).$(Process).log
  error = my_job.$(Cluster).$(Process).err
  output = my_job.$(Cluster).$(Process).out

  requirements = (OSGVO_OS_STRING =?= "RHEL 7") && (HAS_MODULES =?= true)

  ...other submit file details...
- Add a stashcp command at the end of your executable to transfer the data files back to the OSG Connect data origin (within /public). You will need to prepend your /public directory path with stash:///osgconnect as follows:

  #!/bin/bash
  # other commands to be executed in job:

  # transfer large output to public
  stashcp <filename> stash:///osgconnect/public/username/path/<filename>
For example, if you wish to transfer output.dat to the directory /public/<username>/output/, then the stashcp command would be:

  stashcp output.dat stash:///osgconnect/public/<username>/output/output.dat

Note that the output file name must also be included at the end of the /public path where the file will be transferred, which also allows you to rename the file.
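A slightly fuller executable sketch, with a placeholder analysis command and /public path, can also check that the stashcp transfer succeeded so that a failed transfer surfaces as a failed job rather than silently lost output:

  #!/bin/bash
  # main computation (placeholder command) producing output.dat
  ./my_analysis input.dat > output.dat

  # transfer the large output back to the data origin (placeholder /public path)
  stashcp output.dat stash:///osgconnect/public/<username>/output/output.dat

  # exit non-zero if the transfer failed so HTCondor records the job as failed
  if [ $? -ne 0 ]; then
      echo "stashcp transfer to the origin failed" >&2
      exit 1
  fi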
stashcp Command Manual
More usage options are described in the stashcp help message:
$ stashcp -h
Usage: stashcp [options] source destination
Options:
-h, --help show this help message and exit
-d, --debug debug
-r recursively copy
--closest Return the closest cache and exit
-c CACHE, --cache=CACHE
Cache to use
-j CACHES_JSON, --caches-json=CACHES_JSON
The JSON file containing the list of caches
--methods=METHODS Comma separated list of methods to try, in order.
Default: cvmfs,xrootd,http
-t TOKEN, --token=TOKEN
Token file to use for reading and/or writing
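For example, the options above can be combined as in the following sketch; the path is a placeholder, and this assumes the same stash:/// URL form used elsewhere on this page is also accepted as a stashcp source:

  # report the closest cache and exit
  stashcp --closest

  # copy a file from the origin into the job's working directory, with debug output
  stashcp -d stash:///osgconnect/public/<username>/samples/sample01.dat ./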
Get Help
For assistance or questions, please email the OSG Research Facilitation team at support@opensciencegrid.org or visit the help desk and community forums.