Application development (Hexagon)
- 1 Modules
- 2 Compilers and programming languages
- 3 Debugging tools
- 4 Application optimization
- 4.1 Performance optimization. General recommendations.
- 4.2 Performance analysis
- 5 Parallel applications
- 6 Recommended reading
Environment Modules allows you to dynamically modify your user environment by using information provided by "modulefiles". This makes it easy to switch between environments or settings, e.g. the Intel compiler environment and the PGI compiler environment. If you run into problems when compiling, the "module list" command can help you see whether environment modules are missing or the wrong ones are loaded.
When writing a PBS job script (see Job execution for more information), the desired environment has to be set inside the script using the module command, because the user environment is not inherited by the PBS script. The same applies to interactive jobs (i.e. qsub -I).
The "module" command has several subcommands, e.g. "module avail".
The following list shows some of the subcommands used with "module".
|avail||Lists all available modules|
|list||Lists the modules you are using|
|load "module_name"||Loads module "module_name"|
|unload "module_name"||Unloads module "module_name"|
|show "module_name"||Displays "module_name"'s configuration settings|
|swap "old_mod" "new_mod"||Unloads the "old_mod" and loads the "new_mod"|
To load the netcdf module into your environment you type:
module load netcdf
If you want a specific version of the module you instead specify:
module load netcdf/3.6.2
Please avoid using version numbers unless strictly necessary since older versions of packages may be removed at a later time.
If you want to change from the Cray compiler (default) to the Intel compiler you type:
module swap PrgEnv-cray PrgEnv-intel
You should also use swap if you want to load a different version of a module that is already loaded. For example, this replaces your current pgi version with 12.2.0:
module swap pgi pgi/12.2.0
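The choice between "load" and "swap" described above can be sketched as a small helper. This is only an illustration: the function name module_cmd_for is hypothetical, and it assumes that the colon-separated LOADEDMODULES variable maintained by Environment Modules reflects what is currently loaded. It prints the command instead of running it.

```shell
# Print the module command needed to get "$1" (name or name/version),
# given a colon-separated list of loaded modules (defaults to $LOADEDMODULES).
# Hypothetical helper for illustration; it only prints, it does not call "module".
module_cmd_for() {
    local want="$1"
    local loaded="${2:-${LOADEDMODULES:-}}"
    local name="${want%%/*}"          # strip an optional /version suffix
    case ":$loaded:" in
        *":$name:"*|*":$name/"*)
            # some version of the module is already loaded: swap it
            echo "module swap $name $want" ;;
        *)
            # nothing loaded under that name: a plain load is enough
            echo "module load $want" ;;
    esac
}

# Example with a made-up list of loaded modules
module_cmd_for pgi/12.2.0 "PrgEnv-pgi/4.0.46:pgi/11.10.0"
module_cmd_for netcdf "PrgEnv-pgi/4.0.46:pgi/11.10.0"
```

The first call prints a swap command because a pgi module is already in the list; the second prints a plain load.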
A complete list of subcommands can be found in the module man page or here.
Please note: if the module command does not work inside your job scripts, add the line "export -f module" to your ~/.bashrc file. This should already be set for new users and only applies if your shell is bash. For other shells, source the corresponding file in /opt/modules/default/init/ inside your qsub script before you use any "module" command.
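As a minimal sketch of setting the environment inside a job, the snippet below writes a PBS job script that loads its modules itself. The resource request, account name, working directory and executable are placeholders, and the init file name "bash" under /opt/modules/default/init/ is an assumption; adjust all of them to your own setup.

```shell
# Write a small PBS job script that sets up its own module environment.
# All values marked "placeholder" are assumptions for illustration.
cat > job_modules.pbs <<'EOF'
#!/bin/bash
#PBS -l mppwidth=32,walltime=00:10:00
#PBS -A my_account
# placeholder account and resources above

# Make the "module" command available inside the job
# (assumption: the bash init file is called "bash").
source /opt/modules/default/init/bash

# The user environment is not inherited, so load everything here.
module swap PrgEnv-cray PrgEnv-intel
module load netcdf
module list

cd /work/$USER/test_dir      # placeholder working directory
aprun -B ./a.out             # placeholder executable
EOF

echo "wrote job_modules.pbs"
```

The script can then be submitted with qsub job_modules.pbs.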
Compilers and programming languages
Four different compilers are available on Hexagon:
- Cray (default)
- PGI
- GNU
- Intel
All compilation for compute nodes must be done using the compiler wrappers. To switch between compilers, use the module command:
module switch PrgEnv-pgi PrgEnv-gnu
By default the latest available version will be loaded. You can switch to another compiler version with e.g.:
module switch pgi pgi/12.2.0
How to invoke the compiler
Compiling an application for use on the compute node should be done by the wrappers specified below. Running the command "module list" will give you one entry like "PrgEnv-###", where ### is either cray, pgi, gnu or intel.
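To check from a script which programming environment is active, the PrgEnv entry can be extracted from the list of loaded modules. The sketch below is illustrative only: the function name current_prgenv is hypothetical, and it assumes the colon-separated LOADEDMODULES variable that Environment Modules maintains.

```shell
# Print the compiler family of the loaded PrgEnv module (cray, pgi, gnu or intel).
# Takes a colon-separated module list; defaults to $LOADEDMODULES.
# Returns non-zero if no PrgEnv module is found.
current_prgenv() {
    local modules="${1:-${LOADEDMODULES:-}}"
    local IFS=':'
    local m
    for m in $modules; do
        case "$m" in
            PrgEnv-*)
                m="${m#PrgEnv-}"    # drop the "PrgEnv-" prefix
                echo "${m%%/*}"     # drop a trailing "/version"
                return 0
                ;;
        esac
    done
    return 1
}

# Example with a made-up module list
current_prgenv "modules/3.2.6.7:PrgEnv-cray/4.0.46:craype-interlagos"
```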
Compiling programs for compute nodes
When you use the compiler wrappers, they take care of MPI and of the switches/settings for all additional loaded modules automatically.
|Compute node compiler wrappers|
|C programs||cc|
|C++ programs||CC|
|Fortran 90/95 programs||ftn|
|Fortran 77 programs||f77|
NOTE: These wrappers also handle MPI and OpenMP, so you should not compile with mpicc, mpif90 or similar, nor should you need to add any reference to MPI libraries in CFLAGS or similar variables.
Compiling the C program test.c can be done by the command:
cc -o test.out test.c
Here test.out is the chosen name of the executable file.
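In a build script the wrapper can be picked from the source file name. The following is a rough sketch: the function name wrapper_for and the mapping from file extensions to wrappers are assumptions made for illustration, not a site convention.

```shell
# Print the compute node compiler wrapper for a source file.
# The extension mapping below is an assumption; extend it as needed.
wrapper_for() {
    case "$1" in
        *.c)                     echo cc  ;;   # C
        *.cc|*.cpp|*.cxx|*.C)    echo CC  ;;   # C++
        *.f90|*.F90|*.f95)       echo ftn ;;   # Fortran 90/95
        *.f|*.F|*.f77)           echo f77 ;;   # Fortran 77
        *)                       return 1 ;;   # unknown file type
    esac
}

# Example: print the compile command for test.c
src=test.c
echo "$(wrapper_for "$src") -o test.out $src"
```

For test.c this prints the same cc command as shown above.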
Compiling programs for login nodes
An executable compiled for the login node will not be able to run on the compute nodes, and neither OpenMP nor MPI will be supported.
The general rule in this case is to call the compiler directly (like pgcc for PGI).
NOTE: You can compile code for login nodes using the compute node wrappers; just keep in mind that in this case MPI and the other libraries that are loaded as modules will be included.
Currently installed Programming Environments for compilers:
Frequently used compiler options
Compiling OpenMP programs
To activate OpenMP directives, compile and link with
Fortran:
|-mp=nonuma||for the PGI compiler|
C and C++:
|-mp||for the PGI compiler|
Recommended compiler options
Normally if you use compiler wrappers all recommended options will be included.
In some cases you may need to use "--enable-static" during configure for running on compute nodes.
Useful optimization flags for the AMD "Interlagos"
When using PGI, the "-tp bulldozer-64" flag will improve the performance of your code. This flag is automatically provided by the module craype-interlagos. NOTE: To compile code that should run on the login nodes, this module should NOT be loaded.
Recommended environment variable settings
We recommend that you keep the module craype-interlagos loaded. It automatically adds the recommended optimization flags. NOTE: To compile code that should run on the login nodes, this module should NOT be loaded.
Additionally, the "cray-libsci" module contains optimized versions of common scientific/math libraries (e.g. LAPACK, BLAS).
List of tools and usage summary
Several tools are available on hexagon for debugging.
Abnormal Termination Processing (ATP) is a system that monitors Cray XT System user applications. Should an application take a system trap, ATP performs analysis on the dying application. With release 1.0, all of the stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file "atpMergedBT.dot". The stack backtrace for the first process to die is sent to stderr, as is the number of the signal that caused the death.
You can load the ATP environment with:
module load atp
Further information on ATP can be found in the intro_atp man page.
lgdb is a gdb-based debugger and launcher that allows users to attach to and debug codes that execute multiple processes or threads.
You can load the lgdb environment with:
module load cray-lgdb
Usage documentation can be found in the manpage:
The following example shows how to connect to an already running program:
qstat -f JOBID | grep exec_host
ssh loginX # take from exec_host of previous command
ps x | grep aprun # find your aprun
module load cray-lgdb
# to connect to the first rank
lgdb --pes=0 --pid=APRUNPID # You use APRUNPID from ps x command above
# to connect to a list of ranks (from first to 8th)
lgdb --pes=0-7 --pid=APRUNPID # You use APRUNPID from ps x command above
TotalView is a graphical, source-level, multiprocess debugger.
When using this debugger you need to turn on X forwarding when you log in via ssh. This is done by adding the -Y option on newer ssh versions, and -X on older ones. The following is an example using a new version of ssh.
ssh -Y firstname.lastname@example.org
If you don't know whether you have an old or a new version of ssh, run "man ssh" and look for an explanation of "-X" and/or "-Y".
The program you want to debug has to be compiled with the debug option. Normally this is the "-g" option, but that depends on the compiler. The executable from this compilation will in the following examples be called "filename".
First, load the totalview module to get the correct environment variables set:
module load totalview
To start debugging run:
Which will start a graphical user interface.
Once inside the debugger, if you cannot see any source code, and keep the source files in a separate directory, add the search path to this directory via the main menu item File->Search path.
Source lines where it is possible to insert a breakpoint are marked with a box in the left column. Click on a box to toggle a breakpoint.
Double clicking a function/subroutine name in a source file should open the sourcefile. You can go back to the previous view by clicking on the left arrow on the top of the window.
The button "Go" runs the program from the beginning until the first breakpoint. "Next" and "Step" take you one line / statement forward. "Out" will continue until the end of the current subroutine/function. "Run to" will continue until the next breakpoint.
The value of variables can be inspected by right clicking on the name, then choosing "add to expression list". The variable will now be shown in a pop up window. Scalar variables will be shown with their value, arrays with their dimensions and type. To see all values in the array, right click on the variable in the pop up window and choose "dive". You can now scroll through the list of values. Another useful option is to visualize the array: after choosing "dive", open the menu item "Tools->Visualize" of the pop up window. If you did this with a 2D array, use the middle button and drag the mouse to rotate the surface that popped up, shift+middle button to pan, Ctrl+middle button to zoom in/out.
Running totalview inside the batch system (compute nodes)
qsub -I -l mppwidth=[#procs],walltime=[time] -A [account] -j oe -X
mkdir -p /work/$USER/test_dir
cp $HOME/test_dir/a.out /work/$USER/test_dir
cd /work/$USER/test_dir
module load xt-totalview
totalview aprun -a -B ./a.out
Replace [#procs] with the core-count for the job. Note that totalview is licensed for a limited number of cores.
Note: When totalview starts it will bring up 'aprun' first. Click GO and YES.
More information about TotalView can be found in the product knowledge base at http://kb.roguewave.com/kb/. Complete TotalView documentation is available from vendor - Rogue Wave Software, Inc. - at http://www.roguewave.com/help-support/documentation/totalview. Good, in-depth documentation is available also from Lawrence Livermore National Laboratory at https://computing.llnl.gov/tutorials/totalview/.
TotalView Remote Display Client (RDC) is a great tool, which provides integrated remote display capability to debug your code on a host machine. RDC is superior to X11 forwarding through SSH because of significant speed improvements and better GUI response. RDC will launch TotalView on the supercomputer and display it on the client machine.
For usage examples please go to notur.no/devtools
STAT (the Stack Trace Analysis Tool) is a highly scalable, lightweight tool that gathers and merges stack traces from all of the processes of a parallel application.
Introduction from LLNL:
"The scale of today's fastest supercomputers surpasses the capabilities of even the most advanced debuggers. For instance, Lawrence Livermore National Laboratory's Sequoia boasts 1.6 million cores—far beyond the reach of the most advanced, full-featured parallel debuggers. With future architectures this gap will only grow wider. To help fill this gap, we developed the Stack Trace Analysis tool (STAT) to help identify groups of processes in a parallel application that exhibit similar behaviour. A single representative of these groups can then be examined with a full-featured debugger like TotalView or DDT for more in-depth analysis."
You can attach to an already running aprun or you can start aprun under STAT control.
Attaching to an already running process
module load stat
aprun -B ./myexe arg1 arg2 &
13017
Launching application under STAT
module load stat
stat-cl -C aprun -B ./myexe arg1 arg2
Upon successful completion, STAT will write its output into the working directory, e.g.:
Results written to /work-common/user/stat_results/myexe.0000
Later you can open results (.dot files) with stat-view.
Please see more info in "man intro_stat" and on the STAT website.
Performance optimization. General recommendations.
Compilation flags and environment settings
Correct optimization flags will be automatically selected if you use compiler wrappers and module craype-interlagos.
Enable FMA optimizations
Users should be aware that results obtained using FMA operations may differ in the lowest bits from results obtained on other X64 processors. The intermediate result fed from the multiplier to the adder is not rounded to 64 bits. Article at PGI.
Cray compiler: with -hfp3 or -hfp2
PGI: is default when you have -tp=bulldozer
GNU(set of optimizations): -march=bdver1 -Ofast -mprefer-avx128 -funroll-all-loops -ftree-vectorize
Please always verify that the result produced by the optimized version is correct. If it is not, try reducing the optimization level.
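Because FMA can change the lowest bits of a result, a byte-for-byte comparison of outputs may report differences that are harmless. One possible way to check results is to compare the numbers with a relative tolerance, as in the sketch below. The function name compare_results, the file names and the default tolerance of 1e-12 are assumptions for illustration; both files are assumed to hold the same layout of whitespace-separated numbers.

```shell
# Compare two files of whitespace-separated numbers with a relative tolerance.
# Usage: compare_results reference.txt optimized.txt [tolerance]
# Returns 0 if all values agree within the tolerance, 1 otherwise.
compare_results() {
    local ref="$1"
    local opt="$2"
    local tol="${3:-1e-12}"
    # paste puts matching lines side by side: first half of the fields
    # comes from the reference, second half from the optimized run
    paste "$ref" "$opt" | awk -v tol="$tol" '
        {
            half = NF / 2
            for (i = 1; i <= half; i++) {
                a = $i + 0
                b = $(i + half) + 0
                diff = (a > b) ? a - b : b - a
                scale = (a < 0) ? -a : a
                if (scale < 1) scale = 1
                if (diff > tol * scale) bad++
            }
        }
        END { if (bad > 0) exit 1 }'
}

# Example with made-up data: the last-bit difference is accepted
printf '1.0 2.0\n3.0 4.0\n' > reference.txt
printf '1.0 2.0000000000000004\n3.0 4.0\n' > optimized.txt
if compare_results reference.txt optimized.txt; then
    echo "results agree within tolerance"
else
    echo "results differ"
fi
```

A suitable tolerance depends on the application; the value used here is only a starting point.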
Please also check AMD Compiler Options Quick Reference Guide
Recommended optimized libraries
The following modules are optimized by Cray and are therefore recommended:
- cray-libsci - BLAS, LAPACK, ScaLAPACK, BLACS, IRT, SuperLU, CRAFFT. See : General software and libraries (Hexagon)#Performance libraries
- petsc - MUMPS, SuperLU, ParMETIS, HYPRE. See : General software and libraries (Hexagon)#Performance libraries
- acml - ACML: Fast Fourier Transform (FFT) routines for real and complex data, etc. See : General software and libraries (Hexagon)#Performance libraries
Correct use of file systems
There is no local disk available on the compute nodes.
Only a shared file system is available: the /work file system, which is a Lustre file system. Note that this file system is not optimized for use as local scratch. Please avoid many small reads/writes; instead, change the access pattern to use bigger chunks, creating well-formed I/O.
Dedicating FPUs per core
Due to the specific Interlagos design, one FPU is shared between 2 cores (see more about Bulldozer at Wiki).
If you have heavy floating-point calculations, you can gain performance by dedicating each FPU to a single core. (This will double your CPU time usage.) This is how to do it:
If you are using the Cray compiler, you can make it aware of your plans with:
module load craype-interlagos-cu
With other compilers, just compile the code as usual.
Next you will need to allocate tasks per core properly with aprun and the queuing system.
Run with OpenMP or with precise placement
If you are using OpenMP you need to specifically map your cores:
#PBS -l mppwidth=xx
#PBS -l mppnppn=16
aprun -n xx -N 16 -S 4 -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode
This example will use 16 MPI processes per node, 1 per FPU, leaving 16 cores on each node unused.
You may also want to add "-ss", which requests strict memory placement (memory may not be placed in another NUMA domain), though depending on your memory usage you may then get out-of-memory errors. Avoiding cross-NUMA memory access helps the code through lower memory latency.
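The core list passed to "-cc" in the example above selects every second core. Rather than typing it by hand, it can be generated, as in this small sketch. The variable names are illustrative, and the value of 32 cores per node is taken from the node size mentioned elsewhere on this page.

```shell
# Build the "-cc" list that selects every second core (one core per FPU).
cores_per_node=32     # cores in one node
cc_list=""
c=0
while [ "$c" -lt "$cores_per_node" ]; do
    # append with a comma, except before the first entry
    cc_list="${cc_list:+$cc_list,}$c"
    c=$((c + 2))
done

# Print the resulting aprun line (xx stays a placeholder for the task count)
echo "aprun -n xx -N 16 -S 4 -cc $cc_list ./mycode"
```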
List of tools and usage summary
Allinea Performance Reports
Allinea Performance Reports is a lightweight profiling tool, available on all NOTUR sites. It produces a single-page HTML file with a CPU, MPI, IO (not available on Hexagon) and memory breakdown and in-line recommendations.
The instructions below are for static linking. If you are using dynamic linking, please contact us for help.
For your program to be able to produce a report, it has to be re-linked with the "perf-reports" module loaded, e.g.
module load perf-reports
cc my.c -o myexe
When you run your program, precede "aprun" with "perf-report":
module load perf-reports
perf-report aprun -B myexe
At the end of the run, an HTML file and a text file will be produced.
You can find more information about Allinea Performance Reports at http://www.allinea.com/products/performance
The Cray performance analysis tool.
CrayPat is a performance analysis tool for evaluating program execution on Cray systems. CrayPat consists of three major components:
- pat_build - used to instrument the program to be analyzed (see "man pat_build")
- pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution (see "man pat_report")
- Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report, to further explore and visualize the data generated by instrumented program execution (see "man app2")
- Load the newest version of CrayPat:
module load perftools
- Compile your application:
make clean
make
- Instrument the application to generate a sampling profile:
pat_build -O apa a.out
This will create an executable "a.out+pat".
- Run your application (in batch) using the executable "a.out+pat".
This will create the file "a.out+pat+<*>.xf".
- Create Sampling report files:
pat_report a.out+pat+<*>.xf > my_report.txt
This command will automatically create a report file "a.out+pat+<*>.ap2", which can be viewed by Apprentice2.
The command will also create two text files in ascii format: "a.out+pat+<*>.apa" and "my_report.txt".
- For Hardware Counting, instrument application for further analysis:
pat_build -O a.out+pat+<*>.apa
This will create an executable "a.out+apa".
- Modify run script to run the executable "a.out+apa", and add the environment variables
export PAT_RT_MPI_SYNC=0
export PAT_RT_HWPC=[2|3|...]
Running this instrumented application will create a file "a.out+apa+<*>.xf".
- Convert raw data:
pat_report a.out+apa+<*>.xf > my_hwcp_report.txt
This command will automatically create a report file "a.out+apa+<*>.ap2", which can be viewed by Apprentice2. The command will also create a new text file in ascii format: "my_hwcp_report.txt"
- View the results by Apprentice2:
app2 a.out+pat+<*>.ap2 & # for visualizing sampling results
app2 a.out+apa+<*>.ap2 & # for visualizing hardware counting results
Apprentice2 generates a variety of interactive graphical reports. For more info, see the man page (man app2).
This summary is based on the slides of Luiz DeRose at the Cray XT4 workshop.
More information can be found in the corresponding manpages (man intro_craypat) or at http://docs.cray.com.
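As an illustration of the hardware counter run described above, the snippet below writes a job script that sets the two environment variables and runs the "a.out+apa" executable. The resource request, account, working directory, task count and the chosen counter group (2) are placeholders picked for this sketch.

```shell
# Write a PBS job script for the hardware counter run of the
# instrumented executable. Values marked "placeholder" are assumptions.
cat > craypat_hwpc_job.pbs <<'EOF'
#!/bin/bash
#PBS -l mppwidth=32,walltime=00:30:00
#PBS -A my_account
# placeholder account and resources above

cd /work/$USER/test_dir      # placeholder working directory

export PAT_RT_MPI_SYNC=0
export PAT_RT_HWPC=2         # placeholder: pick the counter group you need

aprun -n 32 ./a.out+apa      # placeholder task count
EOF

echo "wrote craypat_hwpc_job.pbs"
```

After the job has finished, the resulting .xf file is converted with pat_report as shown in the steps above.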
Hexagon has wrappers that should be used when compiling programs for the compute nodes. More information about the wrappers can be found here. These wrappers handle MPI automatically, by using a module called cray-mpich. "cray-mpich" is based on mpich3.
If you want to change from the default compiler to one of the others, you can do that by swapping the PrgEnv module with the module command (see Compilers and programming languages).
Not all MPI-2 features are supported, for a complete list - see:
At hexagon you can run OpenMP jobs within the node, i.e. on a maximum of 32 cores/node. Since hexagon is intended for jobs with high core counts, the use of pure OpenMP is discouraged; see below for an explanation of the MPI/OpenMP hybrid.
To activate OpenMP directives, compile with
Fortran:
|-h omp||Cray compiler|
C and C++:
|-mp||for the PGI compiler|
|-h omp||Cray compiler|
In the batch-script set (replace "threads_per_node" with 1-32)
#PBS -l nodes=threads_per_node:ppn=threads_per_node
export OMP_NUM_THREADS=threads_per_node
This number should correspond to
aprun -n 1 -d threads_per_node ...
You can run a hybrid MPI + OpenMP job where MPI is used between the nodes and OpenMP within the node.
No special compiler options are needed to activate MPI, but to activate the OpenMP directives, compile and link with the following.
Fortran:
|-mp=nonuma||for the PGI compiler|
C and C++:
|-mp||for the PGI compiler|
In the batch-script set (total_pes = threads_per_node * mpi_processes)
#PBS -l nodes=total_pes:ppn=threads_per_node
export OMP_NUM_THREADS=threads_per_node
These numbers should correspond to
aprun ... -n mpi_processes -d threads_per_node ...
Note: the threads_per_node must be <= 32.
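The relation between the numbers above can be worked through in a short sketch that computes total_pes and writes the matching batch lines. The values of mpi_processes and threads_per_node, the output file name and the executable name ./mycode are placeholders chosen for illustration.

```shell
# Compute total_pes for a hybrid MPI + OpenMP job and write the
# corresponding batch script lines. Example values are placeholders.
mpi_processes=8
threads_per_node=32      # must be <= 32

if [ "$threads_per_node" -gt 32 ]; then
    echo "threads_per_node must be <= 32" >&2
    exit 1
fi

total_pes=$((threads_per_node * mpi_processes))

cat > hybrid_job.pbs <<EOF
#!/bin/bash
#PBS -l nodes=${total_pes}:ppn=${threads_per_node}
export OMP_NUM_THREADS=${threads_per_node}
aprun -n ${mpi_processes} -d ${threads_per_node} ./mycode
EOF

echo "total_pes=$total_pes"
```

With the example values this requests 8 * 32 = 256 processing elements.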
Cray Programming Environment User's Guide - contains everything needed to start to work with examples on the Cray XE machine.