This tutorial will put together several OSG tools and ideas - handling a larger data file, splitting a large file into smaller pieces, and transferring a portable software program.
Job components and plan
To run BLAST, we need three things:
1. the BLAST program (specifically the
2. a reference database (this is usually a larger file)
3. the file we want to query against the database
The database and the input file will each get special treatment. The database we are using
is large enough that we will want to use OSG Connect's
stashcache capability (more information
about that here). The input
file is large enough that a) it is near the upper limit of what is practical to transfer,
b) it would take hours to complete a single
analysis for it, and c) the resulting output file would be huge.
Because the BLAST process is run over the input file line by line, it is scientifically valid to split up the input query file, analyze the pieces, and then put the results back together at the end! By splitting the input query file into smaller pieces, each of the queries can be run as separate jobs. On the other hand, BLAST databases should not be split, because the blast output includes a score value for each sequence that is calculated relative to the entire length of the database.
Get materials and set up files
Run the tutorial command:
Once the tutorial has downloaded, move into the folder and run the
download_files.sh script to download the remaining files:
cd tutorial-blast-split ./download_files.sh
This command will have downloaded and unzipped the BLAST program (
ncbi-blast-2.9.0+), the file we want to query
mouse_rna.fa) and a set of tools that will split the file into smaller pieces
Next, we will use the command
gt from the
genome tools package to split our input query file into 2 MB chunks as indicated by the -targetsize flag. To split the file, run this command:
./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize 2 mouse_rna.fa
Later, we'll need a list of the split files, so run this command to generate that list:
ls mouse_rna.fa.* > list.txt
Examine the submit file
The submit file,
blast.submit looks like this:
executable = run_blast.sh arguments = $(inputfile) transfer_input_files = ncbi-blast-2.9.0+/bin/blastx, $(inputfile), stash:///osgconnect/public/osg/BlastTutorial/pdbaa.tar.gz output = logs/job_$(process).out error = logs/job_$(process).err log = logs/job_$(process).log requirements = OSGVO_OS_STRING == "RHEL 7" && Arch == "X86_64" request_memory = 2GB request_disk = 1GB request_cpus = 1 queue inputfile from list.txt
run_blast.sh is a script that runs blast and takes in a file to
query as its argument. We'll look at this script in more detail in a minute.
Our job will need to transfer the
blastx executable and the input file being used for
queries, shown in the
transfer_input_files line. Because of the size of our database,
we'll be using
stash:/// to transfer the database to our job.
stash:///: In this job, we're copying the file from a particular
osg/BlastTutorialV1), but you have your own
/publicfolder that you could use for the database. If you wanted to try this, you would want to navigate to your
/publicfolder, download the
pdbaa.tar.gzfile, return to your
/homefolder, and change the path in the
stash:///command above. This might look like:
cd /public/username wget http://stash.osgconnect.net/public/osg/BlastTutorialV1/pdbaa.tar.gz cd /home/username
Finally, you may have already noticed that instead of listing the individual input file
by name, we've used the following syntax:
$(inputfile). This is a variable that represents
the name of an individual input file. We've done this so that we can set the variable as
a different file name for each job.
We can set the variable by using the
queue syntax shown at the bottom of the file:
queue inputfile from list.txt
This command will pull file names from the
list.txt file that we created earlier, and
submit one job per file and set the "inputfile" variable to that file name.
Examine the wrapper script
The submit file had a script called
#!/bin/bash # get input file from arguments inputfile=$1 # Prepare our database and unzip into new dir tar -xzvf pdbaa.tar.gz rm pdbaa.tar.gz # run blast query on input file ./blastx -db pdbaa/pdbaa -query $inputfile -out $inputfile.result
It saves the name of the input file, unpacks our database, and then runs the BLAST query from the input file we transferred and used as the argument.
Submit the jobs
Our jobs should be set and ready to go. To submit them, run this command:
And you should see that 51 jobs have been submitted:
Submitting job(s)................................................ 51 job(s) submitted to cluster 90363.
You can check on your jobs' progress using
Bonus: a BLAST workflow
We had to go through multiple steps to run the jobs above. There was an initial step to split the files and generate a list of them; then we submitted the jobs. These two steps can be tied together in a workflow using the HTCondor DAGMan workflow tool.
First, we would create a script (
split_files.sh) that does the file splitting steps:
#!/bin/bash filesize=$1 ./gt-1.5.10-Linux_x86_64-64bit-complete/bin/gt splitfasta -targetsize $filesize mouse_rna.fa ls mouse_rna.fa.* > list.txt
This script will need executable permissions:
chmod +x split_files.sh
Then, we create a DAG workflow file that ties the two steps together:
## DAG: blastrun.dag JOB blast blast.submit SCRIPT PRE blast split_files.sh 2
To submit this DAG, we use this command:
This page was updated on Sep 15, 2022 at 15:41 from tutorials/tutorial-blast-split/README.md.