Workflows on HPC systems

High-throughput

Learning Objectives

  • Learn to break a job up into parts to reduce execution time
  • Learn to use the debug queue to ensure the job is running
  • Learn job dependencies to ensure correct job ordering

Create submission shell script submit-stats-workflow.sh

# Calculate stats for Site A and Site B data files at J = 100 c/bp
for datafile in *[AB].txt
do
    qsub -v datafile=$datafile run-stats.sh
done
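
The -v option places the datafile variable into each job's environment, so every job runs the same job script on a different input file. The first iteration of the loop, for example, expands to:

qsub -v datafile=NENE01729A.txt run-stats.sh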

Create job script run-stats.sh

# run-stats.sh: process the single file named by $datafile,
# which qsub -v placed in the job's environment
cd $HOME/script-data
echo $datafile
bash goostats -J 100 -r $datafile stats-$datafile
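
The job script relies on the scheduler's defaults for queue and resources. On most PBS/Torque systems you can make the request explicit with #PBS directives at the top of the script; bash treats them as ordinary comments, so interactive testing still works. A sketch, where the specific limits are assumptions (the 2-hour walltime matches the limit visible in the showq output later):

#!/bin/bash
#PBS -l nodes=1:ppn=1        # goostats is serial: one core on one node is enough
#PBS -l walltime=2:00:00     # matches the 2:00:00 limit shown by showq

cd $HOME/script-data
echo $datafile
bash goostats -J 100 -r $datafile stats-$datafile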

Make sure run-stats.sh works correctly

$ debug
qsub: waiting for job 103248.mountaineer to start
qsub: job 103248.mountaineer ready

[mcarlise@compute-01-25 ~]$
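
Here debug is a site-provided shortcut for starting an interactive session on the debug queue. Assuming a standard PBS/Torque setup with a queue named debug, a rough equivalent would be:

$ qsub -I -q debug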

Set up environment and run an instance of the script

$ export datafile=NENE01729A.txt
$ cd $HOME/script-data
$ bash run-stats.sh
NENE01729A.txt

Check the output

$ pwd
/users/mcarlise/script-data

$ ls
do-stats.sh  goostats        NENE01729B.txt  NENE01751A.txt  NENE01843B.txt  stats-NENE01729A.txt
goodiff      NENE01729A.txt  NENE01736A.txt  NENE01751B.txt  run-stats.sh    submit-stats-workflow.sh

You can see that stats-NENE01729A.txt exists, and you can verify its contents using the less or cat command. Once you are satisfied, type exit to leave the interactive debug job.

Submit the entire workflow

$ bash submit-stats-workflow.sh
103250.mountaineer
103251.mountaineer
103252.mountaineer
103253.mountaineer
103254.mountaineer
103255.mountaineer
103256.mountaineer

Verify that all the jobs are running

$ showq -u mcarlise
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

103250             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103251             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103253             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103252             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103256             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103254             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12
103255             mcarlise    Running     1     1:59:47  Sat Jun 11 16:33:12

7 active jobs             7 of 384 processors in use by local jobs (1.82%)
                          22 of 32 nodes active      (68.75%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 blocked jobs   

Total jobs:  7

You can use ls to verify that all the stats files appeared, and use ls -l on the job error files to check whether any errors occurred.

$ ls -l run-stats.sh.e??????
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103250
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103251
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103252
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103253
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103254
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103255
-rw------- 1 mcarlise wvu-hpc 0 Jun 11 16:33 run-stats.sh.e103256

You can see that all the error output files have a size of 0, meaning no job wrote anything to standard error.
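
If a job had failed, its error file would usually be non-empty. One quick way to list only the non-empty error files (a sketch using GNU find; no output means every job ran cleanly):

$ find . -maxdepth 1 -name 'run-stats.sh.e*' -size +0c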

Adding a second job to the workflow

The workflow is not complete with only the stats step; we also have another program to run, called goodiff. goodiff compares the output of goostats to a validated dataset and either outputs the difference or tells us they are identical. The original serial shell script that runs goodiff is do-diff.sh:

    # Compare each stats output file against the validated dataset
    for datafile in stats-*[AB].txt
    do
        echo $datafile
        bash goodiff $datafile validated-data.txt > diff-$datafile
    done

As with do-stats.sh, we need to break this up into a job script that runs only goodiff (run-diff.sh) and a submission script, submit-diff-workflow.sh, that runs qsub to launch all of the jobs:

    # Submit one goodiff job per stats output file
    for datafile in stats-*[AB].txt
    do
        qsub -v datafile=$datafile run-diff.sh
    done

Create job script run-diff.sh

    # run-diff.sh: compare one stats file ($datafile) to the validated data
    cd $HOME/script-data
    echo $datafile
    bash goodiff $datafile validated-data.txt > diff-$datafile

This is very similar to how we handled the stats portion of the workflow. Now we can use the debug queue to make sure run-diff.sh works correctly for a single data file.

$ debug -v datafile=stats-NENE01729A.txt
qsub: waiting for job 103269.mountaineer to start
qsub: job 103269.mountaineer ready

[mcarlise@compute-01-25 ~]$

Verify the value of datafile, then run run-diff.sh and check its output to make sure it works

$ cd $HOME/script-data
$ echo $datafile
stats-NENE01729A.txt
$ bash run-diff.sh
stats-NENE01729A.txt
$ cat diff-stats-NENE01729A.txt

Typing exit will get you out of the interactive job. Now submit the entire workflow:

$ bash submit-diff-workflow.sh
103271.mountaineer
103272.mountaineer
103273.mountaineer
103274.mountaineer
103275.mountaineer
103276.mountaineer

You can use showq to verify that the jobs are queued/running.

$ cat diff-stats-NENE01*
0.21598
0.3136
0.29846
0.1382
0.29863
0.4571

Automating through job dependencies

So far we have had to execute two submission scripts that run the qsub commands, which ensures that the goostats program runs before the goodiff program. However, this requires that you run the goostats portion, wait until it finishes, and come back later to run the goodiff portion. If your workflow has 10 or 12 steps, this can add a considerable amount of time.

Combine the two submission scripts into a single submit-stats-workflow.sh:

    # For each data file: submit the stats job, then a goodiff job
    # that runs only after the stats job finishes successfully
    for datafile in *[AB].txt
    do
        JOBID=`qsub -v datafile=$datafile run-stats.sh`
        qsub -v datafile=stats-$datafile -W depend=afterok:$JOBID run-diff.sh
    done

The backticks capture the output of the first qsub command, the new job's ID, in the JOBID variable. The -W option allows you to define additional attributes of the job; one of them is dependencies. In this instance, depend=afterok:$JOBID means the second submitted job (running goodiff) will not start until the run-stats.sh job has completed without error.
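
Dependencies can also fan in: afterok accepts a colon-separated list of job IDs, so a single final step can wait on many jobs. A sketch of how the workflow could end with one summary job (summarize-results.sh is hypothetical, not part of this lesson):

    # Submit the stats/diff pairs, collecting the diff job IDs as we go
    DIFF_IDS=""
    for datafile in *[AB].txt
    do
        STATS_ID=`qsub -v datafile=$datafile run-stats.sh`
        DIFF_ID=`qsub -v datafile=stats-$datafile -W depend=afterok:$STATS_ID run-diff.sh`
        DIFF_IDS="$DIFF_IDS:$DIFF_ID"    # builds :id1:id2:... for the dependency list
    done
    # Hypothetical summary job: starts only after every run-diff.sh job succeeds
    qsub -W depend=afterok$DIFF_IDS summarize-results.sh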

$ bash submit-stats-workflow.sh
103284.mountaineer
103286.mountaineer
103288.mountaineer
103290.mountaineer
103292.mountaineer
103294.mountaineer

Notice that you only get 6 job IDs even though 12 jobs were submitted, and they skip a number. The script only prints the IDs of the run-diff.sh jobs; each run-stats.sh job ID was captured into the JOBID variable instead of being echoed to the terminal, which is why every other number is missing.
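
If you want to see the stats job IDs as well, echo the captured variable inside the loop:

    for datafile in *[AB].txt
    do
        JOBID=`qsub -v datafile=$datafile run-stats.sh`
        echo $JOBID    # print the captured run-stats.sh job ID too
        qsub -v datafile=stats-$datafile -W depend=afterok:$JOBID run-diff.sh
    done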

$ showq -u mcarlise
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

103293             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50
103291             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50
103285             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50
103289             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50
103287             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50
103283             mcarlise    Running     1     1:59:55  Sun Jun 12 13:25:50

6 active jobs             6 of 384 processors in use by local jobs (1.56%)
                          21 of 32 nodes active      (65.62%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

103284             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:10
103286             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:10
103288             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:10
103290             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:10
103292             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:11
103294             mcarlise       Hold     1     2:00:00  Sun Jun 12 13:25:11

6 blocked jobs   

Total jobs:  12

Notice the 6 blocked jobs: these are the goodiff jobs, held until their corresponding goostats jobs complete successfully. The scheduler is enforcing the correct job order for us.
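
showq comes from the Moab scheduler, and its companion command checkjob will tell you why a specific job is blocked, including the dependency it is waiting on:

$ checkjob 103284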

Check your output

$ cat diff-stats-NENE01*
0.1531
0.12977
0.26874
0.27960
0.9072
0.10941

You get differences for 6 datasets, exactly what we expect.

Better organization

Can we rewrite the scripts to be better organized, instead of dumping everything into a single directory?
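
One possible approach, sketched below, is to have each job script write its results into a dedicated subdirectory instead of alongside the raw data (the stats/ directory name is our choice, not part of the lesson):

    # run-stats.sh, writing results under stats/ instead of the data directory
    cd $HOME/script-data
    mkdir -p stats
    echo $datafile
    bash goostats -J 100 -r $datafile stats/stats-$datafile

The submission and diff scripts would need matching changes, for example looping over stats/stats-*[AB].txt and writing the goodiff output into a diffs/ directory.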