Submitting jobs to HPC systems
Submit your first job
Learning Objectives
- Learn to write a submit script
- Learn to submit, check, and cancel jobs
- Learn to add PBS directives to control job behavior
Write initial PBS submit script
We want to submit the do-stats.sh script to the system. We write a second script that tells the scheduler what to do:
- Change to the script-data directory
- Execute the do-stats.sh script
cd $HOME/script-data
bash do-stats.sh
Let’s call this script submit-stats.sh. Notice that these two commands are exactly what you would type to run the script yourself. Now submit the job.
$ qsub submit-stats.sh
546324.srih001.hpc.wvu.edu
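The identifier that qsub prints combines a numeric job ID with the name of the scheduler server. The commands later in this lesson, and the names of the output files, use only the numeric part. A minimal sketch of extracting it in a script, assuming the identifier always has the form number.server as in the example above (the variable names are illustrative):

```shell
# Identifier as printed by qsub in the example above. On a real system you
# could capture it with: full_id=$(qsub submit-stats.sh)
full_id="546324.srih001.hpc.wvu.edu"

# Strip everything from the first dot onward to keep the numeric job ID.
job_id="${full_id%%.*}"

echo "$job_id"
```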
You can see a list of all your running jobs using showq.
$ showq -u training01
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
546324 training01 Running 1 1:59:02 Fri Jun 10 15:19:21
1 active job 1 of 384 processors in use by local jobs (0.26%)
21 of 32 nodes active (65.62%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total job: 1
After a few minutes, new files will appear in the script-data directory.
$ ls
do-stats.sh NENE01812A.txt NENE02040B.txt stats-NENE01751B.txt stats-NENE02040B.txt
goodiff NENE01843A.txt NENE02040Z.txt stats-NENE01812A.txt stats-NENE02043A.txt
goostats NENE01843B.txt NENE02043A.txt stats-NENE01843A.txt stats-NENE02043B.txt
NENE01729A.txt NENE01971Z.txt NENE02043B.txt stats-NENE01843B.txt submit-stats.sh
NENE01729B.txt NENE01978A.txt stats-NENE01729A.txt stats-NENE01978A.txt submit-stats.sh.e546324
NENE01736A.txt NENE01978B.txt stats-NENE01729B.txt stats-NENE01978B.txt submit-stats.sh.o546324
NENE01751A.txt NENE02018B.txt stats-NENE01736A.txt stats-NENE02018B.txt
NENE01751B.txt NENE02040A.txt stats-NENE01751A.txt stats-NENE02040A.txt
Check our output
$ cat submit-stats.sh.o546324
...
NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
...
The .e file holds error messages (standard error). The .o file holds regular output (standard output).
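The split follows the usual shell distinction between standard output and standard error. The small sketch below can be run on any machine, without a scheduler, to see which stream ends up where. The file names job.o and job.e and the fake_job function are made up for illustration, and the redirections only imitate what the scheduler does for you:

```shell
workdir=$(mktemp -d)

# A stand-in for a job: one line on standard output, one on standard error.
fake_job() {
    echo "NENE01729A.txt"
    echo "something went wrong" >&2
}

# Send the two streams to separate files, as the scheduler does with .o and .e.
fake_job > "$workdir/job.o" 2> "$workdir/job.e"

regular_output=$(cat "$workdir/job.o")
error_output=$(cat "$workdir/job.e")
echo "o file: $regular_output"
echo "e file: $error_output"

rm -rf "$workdir"
```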
Canceling a job is covered in the “Canceling a job” exercise below.
Checkjob
$ checkjob 546324
job 546324
AName: submit-stats.sh
State: Completed Completion Code: 0 Time: Fri Jun 10 15:35:08
Creds: user:training01 group:trainingc class:batch qos:community
WallTime: 00:01:31 of 2:00:00
SubmitTime: Fri Jun 10 15:33:04
(Time Queued Total: 00:00:33 Eligible: 00:00:33)
TemplateSets: DEFAULT
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: torque
Dedicated Resources Per Task: PROCS: 1 SWAP: 4096M
Allocated Nodes:
[compute-01-25:1]
SystemID: Moab
SystemJID: 546324
Notification Events: JobFail Notification Address: training01@theacademy.com
StartCount: 1
Execution Partition: torque
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 100000
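The State line of the report tells you whether the job finished and with what completion code. A minimal sketch of extracting those two values with standard text tools, using a few lines copied from the report above as sample input. It assumes the fields are laid out exactly as shown here, which may differ between scheduler versions:

```shell
# Sample lines copied from the checkjob report above. On a real system you
# would pipe the output of `checkjob <jobID>` into the same commands.
report='job 546324
AName: submit-stats.sh
State: Completed Completion Code: 0 Time: Fri Jun 10 15:35:08
WallTime: 00:01:31 of 2:00:00'

# The job state is the second whitespace-separated field on the State line.
state=$(printf '%s\n' "$report" | awk '$1 == "State:" { print $2 }')

# The completion code is the number that follows "Completion Code: ".
exit_code=$(printf '%s\n' "$report" | sed -n 's/.*Completion Code: \([0-9-]*\).*/\1/p')

echo "state=$state exit_code=$exit_code"
```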
Controlling some job behavior
The qsub command accepts many options that control the behavior of a job. For instance, -N changes the job name and, as a result, the names of the output files.
$ qsub -N north-stats submit-stats.sh
You will now notice that the output files are named north-stats.o546325 and north-stats.e546325. You can also tell the system which e-mail address to use for notifications about the job with the -M option.
$ qsub -M training01@theacademy.com submit-stats.sh
Mail at different job status
Each job goes through several status changes during its life cycle. If a queued job is cancelled before it starts, the job is said to be ‘aborted’ or failed. A status change also occurs when the job moves from queued to executing, and another when the job completes execution (whether successfully or not). You can control when the system sends you an e-mail using the -m option.
- Submit a job with the -m be -M youremail@server.com options. What kind of e-mails did you receive from the system?
- Submit a job with the -m n -M youremail@server.com options. Did you receive any e-mails from the system about this job?
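The letters passed to -m correspond to the status changes described above. The sketch below is a small lookup written for this lesson, not part of PBS. It covers only the letters a, b, e and n, and the function name explain_mail_flags is hypothetical:

```shell
# Translate a -m argument into words. Only a, b, e and n are handled here.
explain_mail_flags() {
    flags="$1"
    result=""
    # n means no mail at all, so it is checked first.
    case "$flags" in
        *n*) echo "no mail"; return 0 ;;
    esac
    case "$flags" in *a*) result="$result aborted" ;; esac
    case "$flags" in *b*) result="$result begins" ;; esac
    case "$flags" in *e*) result="$result ends" ;; esac
    echo "mail when job:$result"
}

explain_mail_flags be
explain_mail_flags n
```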
Controlling the name of output files
Running submit-stats.sh produced two files, a .e file and a .o file. The names follow the templates jobName.ejobID and jobName.ojobID. You can control the names of these output files.
- Submit a job with the -o statOutput.txt option. What were the names of the output files?
- Submit a job with the -e statError.txt option. What were the names of the output files?
- Submit a job with the -j oe option. How many files were produced? What were their names?
- Can you submit a job with a single output file produced and named singleStream.txt?
- How do you eliminate output files altogether? Hint: /dev/null can be used in place of output files to throw away output.
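The hint about /dev/null and the idea of joining two streams both have direct counterparts in ordinary shell redirection. The sketch below runs without a scheduler and only illustrates the concepts. It is not the qsub syntax the exercise asks for, and the names noisy and combined.txt are made up:

```shell
workdir=$(mktemp -d)

# A stand-in for a job that writes to both streams.
noisy() {
    echo "regular output"
    echo "error output" >&2
}

# Join standard error into standard output so one file receives both.
noisy > "$workdir/combined.txt" 2>&1
combined_lines=$(wc -l < "$workdir/combined.txt" | tr -d ' ')

# Throw both streams away: no new file is written.
noisy > /dev/null 2>&1

files_created=$(ls "$workdir" | wc -l | tr -d ' ')
echo "lines in combined.txt: $combined_lines"
echo "files created: $files_created"

rm -rf "$workdir"
```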
Canceling a job
The sleep command makes a job wait for the number of seconds specified. For instance, sleep 3600 would make a shell wait 1 hour before continuing.
Write a submit script that sleeps for 2 hours. Submit the job, and verify that it’s running using the showq command.
- Terminate the job using the canceljob jobID command. Were any output files produced?
- Re-submit the job, but now with -m n -M youremail@server.com. When you cancelled the job, did you get any e-mails?
- Can you submit the job and cancel it without getting any e-mails? Hint: try using nomail@hpc.wvu.edu as your e-mail address.
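Because sleep takes its argument in seconds, a two-hour wait needs the hours converted first. A possible starting point for the submit script in this exercise is sketched below. The file name submit-sleep.sh is an assumption, and the script is only syntax-checked here rather than submitted:

```shell
workdir=$(mktemp -d)
script="$workdir/submit-sleep.sh"

# 2 hours * 60 minutes * 60 seconds
seconds=$((2 * 60 * 60))

# Write the submit script; $seconds is expanded while the file is written.
cat > "$script" <<EOF
sleep $seconds
EOF

# bash -n parses the script without running it.
bash -n "$script" && syntax_ok=yes
script_body=$(cat "$script")
echo "$script_body"

rm -rf "$workdir"
```

On the cluster you would keep the generated file and pass it to qsub in the same way as submit-stats.sh.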
Placing qsub options inside a submit script
Another way you can specify qsub options is by placing them inside the submit script. Any line that starts with #PBS and appears before the first command is treated as an option by the system.
- Re-submit your sleep script with #PBS -M youremail@server.com placed in the file after the sleep command. Cancel the job. Did you get an e-mail?
- Now put the #PBS -M youremail@server.com line before the sleep command. Re-submit and cancel the job. Did you receive an e-mail?
- Re-submit the job, keeping the -M line in the submit script but also passing -M noemail@hpc.wvu.edu on the qsub command line. Did you receive an e-mail? Use checkjob to verify which e-mail address is attributed to the job.
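The rule that #PBS lines must come before the first command can be checked mechanically. The sketch below writes a sample submit script and lists the #PBS lines that sit above the first command. It is a simplified reading of the rule stated above (blank lines and other comments are skipped, anything else counts as a command), and the file name, job names and awk filter are illustrative rather than part of PBS:

```shell
workdir=$(mktemp -d)
script="$workdir/submit-sleep.sh"

# The quoted 'EOF' keeps the text literal. The second -N line is placed
# after the sleep command on purpose.
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -N early-name
#PBS -M youremail@server.com
sleep 7200
#PBS -N late-name
EOF

# Print #PBS lines until the first line that is neither a comment nor blank.
leading=$(awk '
    /^#PBS/      { print; next }
    /^#/ || /^$/ { next }
                 { exit }
' "$script")

echo "$leading"
rm -rf "$workdir"
```

Only the two lines above sleep are listed; the late-name line comes after a command and is left out by this filter.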