Blast Search on a large database with Pegasus
Blast search against a large sequence data base is a challenging task. Because a large database requires large memory on a computing machine and the size of the sequence database grows every year. In this example, we split the reference database and perform the blast search. The current workflow is handled with Pegasus Workflow Manager.
In the command prompt, type
$ tutorial # Without an argument, it shows the list of available tutorials. $ tutorial pegasus-blast # The files to run the pegasus-blast tutorial are created under the directory tutorial-pegasus-blast $ cd tutorial-pegasus-blast # Change to this directory
Let's take a look at the files inside the directory
tutorial-pegasus-vina. The following are two input sequence files (just one is enough to run our demo).
BRAC1.txt # Input sequence (example 1) HLA.txt # Input sequence (example 2, small sequeunce)
The following files are related to Pegasus workflow management:
dax-generator-blast.py # python script that generates dax.xml pegasusrc # pegasus config file sites-generator.bash # bash script generates sites.xml file submit.bash # Job submit script
pegasusrc contains the Pegasus configuration information. We
can simply keep this file in the current working directory without
worrying much about the details (if you would like to know the details,
please visit the Pegasus home page). The files
contain the information about the work flow and data management.
The following is the blast executable file
blast_exe.bash # blast executable file
The sequence data base is around 15 GB which requires approximately same amount of memory to process the files. Therefore we split the input database into smaller chunks. The chunks are stored in stash public storage. To see them
$ ls /stash/public/blast/database/nt.5-30-2014/*.gz
there are 19 files. This means a given input sequence is matched against these 19 chunks that results in 19 independent jobs. This workflow is implemented with Pegasus.
Let's review a few parts of the
submit.bash script to understand
how the workflow is submitted. Open the file
submit.bash and take a
... line 9 ./dax-generator-blast.py ### Execution of "dax-generator-blast.py" script. Generates dax.xml. ... line 14 ./sites-generator.bash ### Execution of "sites-generator.bash" script. Generates sites.xml. ... line 17 pegasus-plan \ ### Executes the pegasus-plan with the following arguments line 18 --conf pegasusrc \ ### pegasus configuration file line 19 --sites condorpool \ ### jobs are executed in condorpool line 20 --dir $PWD/workflows \ ### The path of the workflow directory line 21 --output-site local \ ### Outputs are directed to the local site. line 22 --cluster horizontal \ ### Cluster the jobs horizontally line 23 --dax dax.xml \ ### Name of the dax file line 24 --submit ### Type of action is submit
The purpose of
sites-generator.bash script is to generate
sites.xml file. There are several lines declared in the
sites-generator.bash script. We need to understand the lines defining
the scratch and output directories.
... line 4 cat >sites.xml <<EOF ### creates the file "sites.xml" and appends the following lines. ... line 11 <file-server protocol="file" url="file://" mount-point="$PWD/scratch"/> ### Define the path of scratch directory line 12 <internal-mount-point mount-point="$PWD/scratch"/> ### Define the path of scratch directory ... line 17 <file-server protocol="file" url="file://" mount-point="$PWD/outputs"/> ### Define the path of output directory line 18 <internal-mount-point mount-point="$PWD/outputs"/> ### Define the path of output directory ... line 32 EOF ### End of sites.xml file
sites-generator.bash will not change very
much for a new workflow. We need to edit these two files, when we
change the name of the dax-generator and/or the path of outputs, scratch
dax.xml contains the workflow information, including the
description about the jobs and required input files. We could manually
write the dax.xml file but it is not very pleasant for the human eye
to deal with the xml format. Here, dax.xml is generated via the python
dax-generator-blast.py. Take a look at the python script,
it is self explanatory with lots of comments. If you have difficulty to
understand the script, please feel free to send us an email. Here is the
brief description about dax-generator python script.
... line 8 query_filename = "HLA.txt" # Name of the input querry sequence file line 9 database_dir = "/stash/public/blast/database/nt.5-30-2014" # Location of database chucks
For example, if you would like to match "BRAC1.txt" change
querry_filename as follows
line 8 query_filename = "BRAC1.txt" # Provide name of the input querry sequence file
To submit the job
To submit the job, run:
To check the status of the submitted job:
You can also check with the
$ condor_q username ### username is your login ID
Pegasus creates the following directories:
scratch/ ### Contains all the files (including input, parameter and execution) required to run the job are copied in this directory. workflows/ ### Contains the workflow files including DAGMan, data transfer scripts and condor job files. outputs/ ### Where the NAMD output files are stored at the end of each job.
- Pegasus requires
pegasusrcfiles. These files contain the information about executable, input and output files and the relation between them while executing the jobs.
- It is convenient to generate the XML files via scripts. In our
dax.xmlis generated via a Python script and
sites.xmlis generated via a bash script.
- To implement a new workflow, edit the existing dax-generator, sites-generator and submit scripts.
This page was updated on Aug 24, 2019 at 19:00 from tutorials/tutorial-pegasus-blast/README.md.