Job execution (Hexagon)


Batch system

To ensure fair use of the clusters, all users have to run their computations via the batch system. A batch system is a program that manages the queuing, scheduling, starting and stopping of the jobs/programs users run on the cluster. It is usually divided into a resource-manager part and a scheduler part. To start jobs, users specify to the batch system which executable(s) they want to run, the number of processors and amount of memory needed, and the maximum amount of time the execution should take.

Hexagon uses "Torque" as the resource manager, but on Cray architecture the syntax is slightly different. To schedule jobs hexagon uses Moab, a commercial version of the Maui scheduler. In addition hexagon uses aprun to execute jobs on the compute nodes, independent of the job being a MPI job or sequential job. The user therefore has to make sure to call "aprun ./executable", and not just the executable if it is to run on the compute part instead of the login-node part of the Cray.
Fimm uses "Slurm" as resource manager.

Node configuration

[Illustration: Hwloc.png, the core and memory topology of a Hexagon node]

Each Hexagon node has the following configuration:

  • 2 x 16-core AMD Interlagos CPUs
  • 32 GB of RAM


For the core-to-memory allocation, please refer to the illustration above (Hwloc.png):

NOTE: Only one job can run on each node (nodes are dedicated). Therefore, for better node utilization, please specify as few limitations as possible in the job and leave the rest to be decided by the batch system.

Batch job submission

There are essentially two ways to execute jobs via the batch system.

  • Interactive. The batch system allocates the requested resources or waits until these are available. Once the resources are allocated, interaction with these resources and your application is via the command-line and very similar to what you normally would do on your local (Linux) desktop. Note that you will be charged for the entire time your interactive session is open, not just during the time your application is running.
  • Batch. One writes a job script that specifies the required resources and executables and arguments. This script is then given to the batch system that will then schedule this job and start it as soon as the resources are available.

Running jobs in batch is the more common way on a compute cluster. Here, one can e.g. log off and log on again later to see what the status of a job is. We recommend running jobs in batch mode.

Create a job (scripts)

Jobs are normally submitted to the batch system via shell scripts, and are often called job scripts or batch scripts. Lines in the scripts that start with #PBS are interpreted by Torque as instructions for the batch system. (Please note that these lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)

Scripts can be created in any text editor, e.g. vim or emacs.

A job script should start with an interpreter line, for example:

#!/bin/bash

Next it should contain directives to the queue system, specifying at least the execution time and the number of CPUs requested:

#PBS -l walltime=00:60:00
#PBS -l mppwidth=32

The rest consists of regular shell commands. Please note: any #PBS directives that appear after the first regular command will be ignored.
All commands written in the script are executed on the login node. This is important to remember for several reasons:

  1. Commands like gzip/bzip2, or even cp for many files, can create a heavy load on the CPU and network interface. This will result in low or unstable performance for such operations.
  2. Overuse of memory or CPU resources on the login node can crash it. This means that all jobs (from all users) that were started from that login node will crash.

With this in mind, all IO/CPU-intensive tasks should be prefixed with the aprun command. aprun will execute the command on the compute nodes, resulting in higher performance. Note that this should also reduce the charging of the job, since the total time the script is running should be shorter (charging does not take into account whether the compute nodes are actually used while the script is running).

Real computational tasks (the main program) should of course be prefixed with aprun as well.
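
For instance, a minimal sketch of the relevant lines in a job script (the file name here is hypothetical):

# Compress results on a compute node rather than on the login node
aprun -n 1 gzip /work/$USER/large_output.dat

# The main computation is likewise launched through aprun
aprun -B ./program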

You can find examples below.

Manage a job (submission, monitoring, suspend/resume, canceling)

Please find below the most important batch system job management commands:

To submit a job use the qsub command.

qsub job.pbs   # submit the script job.pbs

Queues and priorities are chosen automatically by the system. The command qsub returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution). This number may also be requested by the support staff.
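
For example, the returned identifier can be captured in a shell variable and reused with the monitoring commands described below (a sketch; the script name is just an example):

JOBID=$(qsub job.pbs)   # qsub prints the job identifier
qstat -f $JOBID         # full status of this job
checkjob $JOBID         # detailed scheduler view of this job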

To monitor job status use the qstat command.

qstat          # display a listing of jobs
qstat -a      # display all jobs in alternative format
qstat -f       # display full status of all jobs (long output) 
qstat -f <jobid> # display full status of a specific job 

To cancel a job use the qdel command.

qdel <jobid>   # delete a specific batch job

To display the actual job ordering for the scheduler, separated into three lists (active, eligible, and blocked jobs):

showq          # display jobs
showq -u $USER # display jobs for $USER
showq -i      # display only jobs in eligible (idle) queue waiting for execution

To display detailed job state information use:

checkjob <jobid> # display status for job

List of useful commands, incl. short description

Here is a list of the most important commands in tabular form (manual pages are recommended):

PBS Purpose
qsub Submit a job
qdel Cancel a job
qstat Get job status
qstat -Q Get available queues
qstat -Q -f Show queue information
qstat -B -f Show PBS Server status
qhold Temporarily stop job
qrls Resume job
qhold Checkpoint job
qrls Restart from checkpoint
qcat <jobid> Displays the standard output or the standard error of a running job
showq Displays the job ordering of the scheduler
showq -u $USER Displays the job ordering for $USER
showstart <jobid> Displays estimated start time of job
checkjob <jobid> Displays status for job
apstat Provides status information for Cray XT systems applications
xtnodestat Shows information about compute and service partition processors and the jobs running in each partition


List of useful job script parameters

-A : a job script must specify a valid project name for accounting; otherwise it will not be possible to submit the job to the batch system.

-l : resources are specified with the -l option (lowercase L). There are a number of resources that can be specified. See the example above for the correct syntax. Jobs must specify the number of processors (CPUs), and the maximum allowed wall-clock time for execution. Make sure that you specify a correct amount of memory or you will risk crashing the node for lack of memory. Note that mppmem=XXXmb is a per-process amount and that the nodes have 32000mb total (not 32768mb). You can find all attributes and their description in:

man pbs_resources_linux

Below are the most important attributes:

-l mppwidth : the number of processing elements. So mppwidth=16 means that sixteen processing elements are requested. This argument must match the -n argument for aprun (or use aprun -B).

-l mppnppn : the number of processing elements per node. If one needs only one processor on each compute node, one can use the attribute mppnppn (e.g. mppnppn=1 or mppnppn=2). The default is mppnppn=32. If included, this argument has to match the -N argument for aprun (or "aprun -B" must be used). It is important to note that even if mppnppn=1 is set, the job will still be accounted as mppnppn=32 in the CPU accounting (i.e. 32x accounting).

-l walltime : the maximum allowed wall-clock time for execution of the job. If the specified time is too short, the job will be killed before it completes.

-l mppmem : an upper limit for the memory usage per process for a job. An explanation of how to request more memory can be found here. If the memory requirement is exceeded, the job may be killed by the system. Use "aprun -B" or pass the corresponding "-m" option to aprun.

-l mppdepth : DEPRECATED, see Job execution (Hexagon)#Parallel/OpenMP jobs for more information on OpenMP.

-o, -e : see example below. If the attributes are not used and thus filenames are not specified, the standard output and standard error from the job will be stored in the files mpijob.o## and mpijob.e## where ## is the job number assigned by PBS when submitting the job.

-c enabled : enables the checkpoint feature for the job. When this option is specified the job can be checkpointed during execution and switched to the hold state; later it can be "unpaused" and execution continues from the point where it was stopped (or after a machine/node crash). To use this option the application must be compiled with the checkpointing libraries. See Application development (Hexagon)#Checkpoint and restart of applications and man qsub for more info.

-c periodic,interval=120,depth=2 : enables periodic checkpoints for the job with an interval of 2 hours, keeping only the two latest checkpoint images; see man qsub and Application development (Hexagon)#Checkpoint and restart of applications for more info.
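
As a sketch of how the checkpoint directive combines with the hold/release commands from the table above (the job id is a placeholder):

# In the job script:
#PBS -c enabled

# Later, from the command line:
qhold 123456   # checkpoint the running job and put it on hold
qrls 123456    # release the job; it continues from the checkpoint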

For additional PBS switches please refer to:

man qsub

APRUN arguments

The resources you request in PBS have to match the arguments for aprun. So if you ask for "#PBS -l mppmem=900mb" you will need to add the argument "-m 900M" to aprun, or use the new "-B" option.

-B use the parameters from the batch system (mppwidth, mppnppn, mppmem, mppdepth)
-N processors per node; should be equal to the value of mppnppn
-n processing elements; should be equal to the value of mppwidth
-d number of threads; should be equal to the value of mppdepth
-m memory per element; should be equal to the amount of memory requested with mppmem. The suffix should be M.

A complete list of aprun arguments can be found on the man page of aprun.
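
As an example, a sketch of matching PBS directives and aprun arguments (the program name is just a placeholder):

#PBS -l mppwidth=64,mppnppn=16,mppmem=900mb,walltime=01:00:00

# Either repeat the values explicitly ...
aprun -n 64 -N 16 -m 900M ./program
# ... or let aprun pick them up from the batch system
aprun -B ./program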

List of classes/queues, incl. short description and limitations

Hexagon uses a default batch queue named "batch". It is a routing queue which based on job attributes can forward jobs to the debug, small or normal queues. Therefore there is no need to specify any execution queue in the PBS script.

Please keep in mind that we have priority-based job scheduling. This means that, based on the requested number of CPUs and amount of time for the job, as well as previous usage history, jobs will get a higher or lower priority in the queue. Please find a more detailed explanation in Job execution (Hexagon)#Scheduling policy on the machine.

List of queues:
Name Description Limitations
batch general routing queue job limitations
normal queue for normal jobs limited only by job limitations
small queue for small jobs, jobs will get higher priority max 512 CPUs, max 1 hour walltime
debug queue for debugging, jobs will get higher priority max 64 CPUs, max 20 minutes walltime

NOTE: There is no need to specify a queue in the job script, the correct queue will automatically be selected.

Relevant examples

We illustrate the use of batch job scripts and submission with a few examples.

Sequential jobs

To use 1 processor (CPU) for at most 60 hours wall-clock time and all the memory on 1 node (32000mb), the PBS job script must contain the line:

#PBS -l mppwidth=1,walltime=60:00:00

Please note that Fimm, Stallo and Titan are much better suited for sequential jobs. Hexagon should therefore only be used for parallel jobs.

Below is a complete example of a PBS script for executing a sequential job.

#!/bin/bash
#
# Give the job a name (optional)
#PBS -N "seqjob"
#
# Specify the project the job should be accounted on (obligatory)
#PBS -A replace_with_correct_cpuaccount
#
# The job needs at most 60 hours wall-clock time on 1 CPU (obligatory)
#PBS -l mppwidth=1,walltime=60:00:00
#
# Write the standard output of the job to file 'seqjob.out' (optional)
#PBS -o seqjob.out
#
# Write the standard error of the job to file 'seqjob.err' (optional)
#PBS -e seqjob.err
#
# Make sure I am in the correct directory
mkdir -p /work/$USER/seqwork
cd /work/$USER/seqwork

# Invoke the executable on the compute node
aprun -B ./program

Parallel/MPI jobs

To use 512 CPUs (cores) for at most 60 hours wall-clock time, the PBS job script must contain the line:

#PBS -l mppwidth=512,walltime=60:00:00

Below is an example of a PBS script for the execution of an MPI job.

#!/bin/bash
#
#  Give the job a name
#PBS -N "mpijob"
#
#  Specify the project the job belongs to
#PBS -A replace_with_correct_cpuaccount
#
#  We want 60 hours on 512 CPUs (cores):
#PBS -l walltime=60:00:00,mppwidth=512
#
#  Send me an email on  a=abort, b=begin, e=end
#PBS -m abe
#
#  Use this email address (check that it is correct):
#PBS -M your.email.address@example.com
#
#  Write the standard output of the job to file 'mpijob.out' (optional)
#PBS -o mpijob.out
#
#  Write the standard error of the job to file 'mpijob.err' (optional)
#PBS -e mpijob.err
#
#  Make sure I am in the correct directory
mkdir -p /work/$USER/mpiwork
cd /work/$USER/mpiwork

aprun -B ./program

Parallel/OpenMP jobs

To use 32 threads on 1 node for at most 60 hours wall-clock time, the PBS job script must contain the line:

#PBS -l mppwidth=32,walltime=60:00:00

You will need to declare the number of threads:

export OMP_NUM_THREADS=32

and the aprun line:

aprun -n1 -d32 ./program

Below is an example of a PBS script for the execution of an OpenMP job.

#!/bin/bash
#
#  Give the job a name
#PBS -N "openmp_job"
#
#  Specify the project the job belongs to
#PBS -A replace_with_correct_cpuaccount
#
#  We want 60 hours on 32 CPUs (cores):
#PBS -l walltime=60:00:00,mppwidth=32
#
#  Send me an email on  a=abort, b=begin, e=end
#PBS -m abe
#
#  Use this email address (check that it is correct):
#PBS -M your.email.address@example.com
#
#  Write the standard output of the job to file 'openmpjob.out' (optional)
#PBS -o openmpjob.out
#
#  Write the standard error of the job to file 'openmpjob.err' (optional)
#PBS -e openmpjob.err
#
#  Make sure I am in the correct directory
mkdir -p /work/$USER/openmp-work
cd /work/$USER/openmp-work

export OMP_NUM_THREADS=32
aprun -n1 -d32 ./program

IMPORTANT! aprun -B is not supported in OpenMP mode; you have to specify the number of OpenMP threads with export OMP_NUM_THREADS= and the aprun -d option.

IMPORTANT! Do not use more than 32 OpenMP threads per node, since each node has 32 cores. If you try to use more threads than there are available cores, you will get a message like this:

WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources!  Performance may be degraded.

In this case, decrease OMP_NUM_THREADS or increase the number of nodes.

Parallel/mixed MPI-OpenMP jobs

To use 2 MPI processes with 32 threads each for at most 60 hours wall-clock time, the PBS job script must contain the line:

#PBS -l mppwidth=64,walltime=60:00:00

You will need to declare the number of threads:

export OMP_NUM_THREADS=32

and the aprun line:

aprun -n2 -d32 ./program

Below is an example of a PBS script for the execution of an MPI-OpenMP job.

#!/bin/bash
#
#  Give the job a name
#PBS -N "mpi-openmp_job"
#
#  Specify the project the job belongs to
#PBS -A replace_with_correct_cpuaccount
#
#  We want 60 hours on 64 CPUs (cores):
#PBS -l walltime=60:00:00,mppwidth=64
#
#  Send me an email on  a=abort, b=begin, e=end
#PBS -m abe
#
#  Use this email address (check that it is correct):
#PBS -M your.email.address@example.com
#
#  Write the standard output of the job to file 'mpi-openmp_job.out' (optional)
#PBS -o mpi-openmp_job.out
#
#  Write the standard error of the job to file 'mpi-openmp_job.err' (optional)
#PBS -e mpi-openmp_job.err
#
#  Make sure I am in the correct directory
mkdir -p /work/$USER/mpi-openmp-work
cd /work/$USER/mpi-openmp-work

export OMP_NUM_THREADS=32
aprun -n2 -d32 ./program

Please refer to the IMPORTANT statements in the Job execution (Hexagon)#Parallel/OpenMP jobs paragraph.

Creating dependencies between jobs

Basic single-step job dependencies are supported through completed/failed step evaluation; this does not require special configuration and is activated by default. For TORQUE's qsub and the Moab msub command, the semantics listed in the table below can be used with the -W depend=<DEPENDENCY>:<STRING> flag.

Job Dependency Syntax:

Dependency Format Description
after after:<job>[:<job>]... The job may start at any time after the specified jobs have started execution.
afterany afterany:<job>[:<job>]... The job may start at any time after all the specified jobs have completed regardless of completion status.
afterok afterok:<job>[:<job>]... The job may start at any time after all the specified jobs have successfully completed.
afternotok afternotok:<job>[:<job>]... The job may start at any time after all the specified jobs have completed unsuccessfully.
before before:<job>[:<job>]... The job may start at any time before the specified jobs have started execution.
beforeany beforeany:<job>[:<job>]... The job may start at any time before all the specified jobs have completed regardless of completion status.
beforeok beforeok:<job>[:<job>]... The job may start at any time before all the specified jobs have successfully completed.
beforenotok beforenotok:<job>[:<job>]... The job may start at any time before any of the specified jobs have completed unsuccessfully.
on on:<count> The job may start after <count> dependencies on other jobs have been satisfied.

where <job>={jobname|jobid}

Any of the dependencies containing "before" must be used in conjunction with the "on" dependency. So, if job A must run before job B, job B must be submitted with depend=on:1, and job A with depend=before:B. This means that job B cannot run until one dependency of another job on job B has been fulfilled, which prevents job B from running until job A has been successfully submitted.
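
For the more common "afterok" case, a minimal sketch (the script names are hypothetical):

# Submit the first job and capture its job identifier
FIRST=$(qsub preprocess.pbs)

# The second job starts only after the first has completed successfully
qsub -W depend=afterok:$FIRST main.pbs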

Combining multiple tasks in a single job

In some cases it is preferable to combine several aprun calls inside one batch script. This can be useful in the following cases:

  • Several executions must be started one after another with the same number of CPUs (a better use of resources for this can be to use dependencies in your qsub script).
  • The runtime of each aprun is short, e.g. less than one minute. By combining several of these short tasks you avoid a job that spends more time waiting in the queue and starting up than actually executing.

It should be written like this in the script:

aprun -B ./cmd args
aprun -B ./cmd args
...
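
As a concrete sketch (the program and input names are hypothetical), several short runs can also be chained in a loop:

for step in 1 2 3; do
    aprun -B ./simulate input_${step}.dat
done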


Interactive job submission

PBS allows the user to use compute nodes interactively. Interactive jobs are typically used for:

  • testing batch scripts
  • debugging applications/code
  • code development that involves e.g. a sequence of short test jobs

A job run with the interactive option will run normally, but stdout and stderr will be connected directly to the user's terminal. This also allows stdin from the user to be sent directly to the application.

To request one processor, 1000MB memory and reserve it for 2 hours, use the command:

qsub -l mppwidth=1,walltime=2:00:00,mppmem=1000mb -A replace_with_correct_cpuaccount -I

To request 2 CPUs, with 1000mb per process/cpu and reserve them for 2 hours, use the command:

qsub -l mppwidth=2,walltime=2:00:00,mppmem=1000mb -A replace_with_correct_cpuaccount -I

Here, replace_with_correct_cpuaccount must be replaced with your project name for accounting. The option that specifies interactive use is -I.

You can use it with scripts as well:

qsub -I ~/myscript

Note that you will be charged for the full time this job allocates the CPUs/nodes, even if you are not actively using these resources. Therefore, exit the job (shell) as soon as the interactive work is done. To launch your program on the compute nodes, go to /work/$USER and then you HAVE to use "aprun". If "aprun" is omitted, the program is executed on the login node, which in the worst case can crash the login node. Since /home is not mounted on the compute nodes, the job has to be started from /work/$USER.
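
A typical interactive session could therefore look like this sketch (the program name is a placeholder):

qsub -l mppwidth=32,walltime=1:00:00 -A replace_with_correct_cpuaccount -I
cd /work/$USER
aprun -n 32 ./program
exit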

General job limitations

The default values if not specified are 1 CPU with 1000mb of memory for 60 minutes.

Maximum amount of resources per user:

  • 4096 cpu cores, total number used by all running jobs
  • 8-22 running (active) jobs (depending on load)
  • 2 idle jobs

Asking for an amount of resources greater than specified above will result in the job being blocked. You can see this by the fact that the job appears in the "showq -b" output. You can also use the "checkjob <jobid>" command to get the status and the reason for the blocking.

Default CPU and job maximums may be changed by sending an application to Support.

Recommended environment variable settings

The environment variables recommended for regular shells are loaded automatically. The exception is if your default shell is tcsh and your job script has the header #!/bin/bash or #!/bin/sh; in that case you have to add to the job script:

#PBS -S /bin/bash

If your job script is written in tcsh you don't need to apply the procedure above.
Please avoid using #PBS -S in all other situations; using this directive can lead to issues with the environment variables, especially with /bin/ksh.

Sometimes there can be a problem with the proper export of module functions. If you get "module: command not found", try adding the following to your job script:

export -f module

If you still cannot get the module functions in your job script, try adding this:

#PBS -V
source ${MODULESHOME}/init/REPLACE_WITH_YOUR_SHELL_NAME
# ksh example:
# source ${MODULESHOME}/init/ksh

MPI on Hexagon is highly tunable. Sometimes you may receive messages saying that some MPI variables have to be adjusted. In that case just add the recommended export line to your job script before the aprun command. The recommended messages are normally quite verbose. For example (bash syntax):

export MPICH_UNEX_BUFFER_SIZE=90000000

Redirect the output of a running application to the /work file system. See Data (Hexagon)#Disk quota and accounting:

aprun .... >& /work/$USER/combined.out
aprun .... >/work/$USER/app.out 2>/work/$USER/app.err
aprun .... >/work/$USER/app.out 2>/dev/null

Scheduling policy on the machine

The scheduler on Hexagon has a fairshare setup in place. This ensures that all users get adjusted priorities based on initial and historical data from running jobs. Please check Queue priorities (Hexagon) for a better understanding of the queuing system on Hexagon.

Types of jobs that are prioritized

Jobs with a high number of cores are always prioritized.

The exceptions are short, small and debugging jobs during working days and working hours:

  • up to 128 cores and walltime up to 20 minutes
  • up to 512 cores and walltime up to 1 hour

These types of jobs will then get a higher priority.

Types of jobs that are discouraged

All types of serial or sequential jobs. Hexagon is optimized for massively parallel MPI jobs.

If you have serial or sequential code to run, please consider applying for other NOTUR resources at http://www.notur.no.

Types of jobs that are not allowed (will be rejected or never start)

Jobs requesting more resources than allowed by the general job limitations above will be rejected or will never start.

CPU-hour quota and accounting

To execute jobs on the supercomputer facilities one needs a user account and password. People working at UiB, Uni Research AS or IMR can apply via this link. Others can apply at http://www.notur.no.

Each user account is connected to at least one project (account). Each project has been allocated a number of CPU hours (its quota). CPU-hour usage is defined as the elapsed (wall-clock) time of the user's job multiplied by the number of processors used; for example, a job that runs for 10 hours on 512 cores consumes 5120 CPU hours. The quota is the maximum number of CPU hours that all the users connected to a project together can consume. After the quota is exhausted, it is no longer possible to submit jobs, and one needs to apply for additional quota first.

How to list quota and usage per user and per project

Before using these commands please load the "mam" module (module load mam).

You can use the gstatement command to generate a usage report per project:

gstatement -h -a my_projectname

A report for a specified period only can be generated as well (-s start time, -e end time):

gstatement -h -a my_projectname -s 2010-01-01 -e 2010-02-01

To display project usage per user use:

glsusage -h -a my_projectname

Example with time-period limit:

glsusage -h -a my_projectname -s 2010-01-01 -e 2010-02-01

You can see your available cpuaccounts and how much quota they have by using either the "cost" command or:

gbalance -h -u $USER

Cost command

The cost command shows the CPU hours used by all accounts connected to the current user in the current allocation period.

cost

To show total used CPU hours of your connected projects:

cost -p

To show usage for another NOTUR period add -P, e.g.:

cost -P 2010.1 -p

Other commands

Command Description
cost will show CPU-hour accounting for the current user
gbalance will show resources per project
gstatement is used to generate statement for account activities
glsusage will show charge totals for each user that completed jobs
greserve is used to obtain a lien for usage

For all the commands mentioned above, the mam module should be loaded.

FAQ / troubleshooting

Please refer to our general FAQ (Hexagon)