# Application development (Hexagon)

## Modules

Environment Modules allows you to dynamically modify your user environment using information provided by "modulefiles". This makes it easy to switch between environments or settings, e.g. between the Intel compiler environment and the PGI compiler environment. If you have problems during compilation, running the "module list" command can help you see whether you have missing or wrong environment modules loaded.

When writing a PBS job script (see Job execution for more information), the desired environment has to be set inside the script using the module command, because the user environment is not inherited by the PBS script. The same applies to interactive jobs (i.e. qsub -I).

The "module" command has several subcommands, e.g. "module avail".

The following list shows some of the subcommands used with "module".

| Subcommand | Description |
| --- | --- |
| avail | Lists all available modules |
| list | Lists the modules you are using |
| load "module_name" | Loads module "module_name" |
| unload "module_name" | Unloads module "module_name" |
| show "module_name" | Displays "module_name"'s configuration settings |
| swap "old_mod" "new_mod" | Unloads "old_mod" and loads "new_mod" |

If you want a specific version of a module, specify the version explicitly, e.g. "module load pgi/12.2.0".

Please avoid using version numbers unless strictly necessary since older versions of packages may be removed at a later time.
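The subcommands above can be combined into a short session; the package and version names below are illustrative only (check "module avail" for what is actually installed):

```shell
module avail                 # list everything that is installed
module load pgi              # load the default (latest) version
module swap pgi pgi/12.2.0   # replace it with a specific version
module list                  # verify what is loaded now
```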

If you want to change from the Cray compiler (default) to the Intel compiler you type:

module swap PrgEnv-cray PrgEnv-intel

You should also use swap if you want to load a different version of the same module. This will, for example, replace your current pgi version with 12.2.0:

module swap pgi pgi/12.2.0

A complete list of subcommands can be found in the module man page.

Please note: if the module command does not work inside your job scripts, add the line "export -f module" to your ~/.bashrc file. This should be set automatically for new users and is only valid if your shell is bash. For other shells you may source the corresponding file in /opt/modules/default/init/ inside your qsub script before you use any "module" command.
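A minimal sketch of a job script that initializes the module system explicitly (useful for non-bash shells) before loading a compiler environment; the resource requests are placeholders:

```shell
#!/bin/sh
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00

# Initialize the module command for sh-compatible shells
. /opt/modules/default/init/sh

# Now "module" works inside the script
module swap PrgEnv-cray PrgEnv-intel
module list
```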

## Compilers and programming languages

Four different compilers are available on Hexagon:

• Cray (default)
• PGI
• GNU
• Intel

All compilation for compute nodes must be done using the compiler wrappers. To switch between compilers, the module command must be used:

module switch PrgEnv-pgi PrgEnv-gnu

By default the latest available version will be loaded. You can switch to another compiler version with e.g.:

module switch pgi pgi/12.2.0

### How to invoke the compiler

Compiling an application for use on the compute nodes should be done using the wrappers specified below. Running the command "module list" will give you an entry like "PrgEnv-###", where ### is either cray, pgi, gnu or intel.

#### Compiling programs for compute nodes

When using the compiler wrappers, the wrappers take care of MPI and all additional modules switches/settings automatically.

Compute node compiler wrappers:

| Language | Wrapper |
| --- | --- |
| Fortran 90/95 programs | ftn |
| Fortran 77 programs | f77 |
| C programs | cc |
| C++ programs | CC |

NOTE: These wrappers also handle MPI and OpenMP, so you should not compile with mpicc, mpif90 or similar, nor should you need to add any reference to MPI libraries in CFLAGS or similar variables.

Compiling the C program test.c can be done by the command:

cc -o test.out test.c

where test.out is the chosen name of the executable file.

#### Compiling programs for login nodes

When compiling for the login nodes, the executable will not be able to run on the compute nodes, and neither OpenMP nor MPI will be supported.

The general rule in this case is to call the compiler directly (like pgcc for PGI).

NOTE: You can compile code for login nodes using the compute node wrappers; just keep in mind that in this case you will link in MPI and other libraries which are loaded as modules.
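As a sketch, a login-node-only build calls the underlying compiler directly (pgcc assumes PrgEnv-pgi is loaded; the file name is hypothetical):

```shell
# Serial tool intended to run on the login node only
pgcc -o mytool mytool.c
./mytool
```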

### Compiler version(s)

Currently installed Programming Environments for compilers:

| Compiler | Module |
| --- | --- |
| Cray | PrgEnv-cray |
| PGI | PrgEnv-pgi |
| GNU | PrgEnv-gnu |
| Intel | PrgEnv-intel |

### Frequently used compiler options

#### Compiling OpenMP programs

To activate OpenMP directives, compile and link with:

Fortran:

 -mp=nonuma for the PGI compiler

C and C++:

 -mp for the PGI compiler
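For example, a hypothetical OpenMP Fortran build through the wrapper with the PGI flag above (PrgEnv-pgi loaded; the file name is an example):

```shell
# The wrapper passes the source to the PGI Fortran compiler with OpenMP enabled
ftn -mp=nonuma -o omp_test omp_test.f90
```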

### Recommended compiler options

Normally if you use compiler wrappers all recommended options will be included.

In some cases you may need to use "--enable-static" during configure for running on compute nodes.

Useful optimization flags for the AMD "Interlagos" architecture

When using PGI, the "-tp bulldozer-64" flag will improve the performance of your code. These options are automatically provided by the craype-interlagos module. NOTE: to compile code that should run on the login nodes, this module should NOT be loaded.

### Recommended environment variable settings

We recommend having the craype-interlagos module loaded. It will automatically add the recommended optimization flags. NOTE: to compile code that should run on the login nodes, this module should NOT be loaded.

Additionally, the "cray-libsci" module contains optimized versions of common scientific/math libraries (e.g. LAPACK, BLAS).
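A sketch of using the optimized libraries: with cray-libsci loaded, the wrappers should link BLAS/LAPACK automatically, so no explicit -l flags ought to be needed (the file name is an example):

```shell
module load cray-libsci
# The wrapper links the optimized BLAS/LAPACK without extra flags
ftn -o solver solver.f90
```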

## Debugging tools

### List of tools and usage summary

Several tools are available on Hexagon for debugging.

#### ATP

Abnormal Termination Processing (ATP) is a system that monitors Cray XT system user applications; should an application take a system trap, ATP performs analysis on the dying application. With release 1.0, all of the stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file "atpMergedBT.dot". The stack backtrace for the first process to die is sent to stderr, as is the number of the signal that caused the death.

You can load the ATP environment with "module load atp".

Further information on ATP can be found in the intro_atp man page:

man intro_atp

#### Lgdb

This gdb-based debugger and launcher allows users to attach to and debug codes which execute multiple processes or threads.

You can load the lgdb environment by loading the corresponding module.

Usage documentation can be found in the manpage:

man lgdb

The following example shows how to connect to an already running program:

qstat -f JOBID | grep exec_host
ssh loginX                    # take loginX from exec_host in the previous command
ps x | grep aprun             # find your aprun process
# to connect to the first rank:
lgdb --pes=0 --pid=APRUNPID   # use the aprun PID from "ps x" above
# to connect to a range of ranks (the first through the 8th):
lgdb --pes=0-7 --pid=APRUNPID

#### TotalView

TotalView is a graphical, source-level, multiprocess debugger.

The license is limited in the number of cores; the maximum is 66.

When using this debugger you need to turn on X forwarding, which is done when you log in via ssh: add the -Y option on newer ssh versions, or -X on older ones. If you don't know whether you have an old or new version of ssh, run "man ssh" and look for an explanation of "-X" and/or "-Y".

The program you want to debug has to be compiled with the debug option. Normally this is the "-g" option, but that depends on the compiler. The executable from this compilation will in the following examples be called "filename".

First, load the totalview module to get the correct environment variables set:

module load totalview

If you are going to run TotalView on more than 64 cores (up to 512):

To start debugging run:

totalview "filename"

This will start a graphical user interface.

Once inside the debugger, if you cannot see any source code, and keep the source files in a separate directory, add the search path to this directory via the main menu item File->Search path.

Source lines where it is possible to insert a breakpoint are marked with a box in the left column. Click on a box to toggle a breakpoint.

Double clicking a function/subroutine name in a source file should open that source file. You can go back to the previous view by clicking the left arrow at the top of the window.

The button "Go" runs the program from the beginning until the first breakpoint. "Next" and "Step" takes you one line / statement forward. "Out" will continue until the end of the current subroutine/function. "Run to" will continue until the next breakpoint.

The value of variables can be inspected by right clicking on the name, then choose "add to expression list". The variable will now be shown in a pop up window. Scalar variables will be shown with their value, arrays with their dimensions and type. To see all values in the array, right click on the variable in the pop up window and choose "dive". You can now scroll through the list of values. Another useful option is to visualize the array: after choosing "dive", open the menu item "Tools->Visualize" of the pop up window. If you did this with a 2D array, use middle button and drag mouse to rotate the surface that popped up, shift+middle button to pan, Ctrl+middle button to zoom in/out.

Running totalview inside the batch system (compute nodes)

qsub -I -l mppwidth=[#procs],walltime=[time] -A [account] -j oe -X
mkdir -p /work/$USER/test_dir
cp $HOME/test_dir/a.out /work/$USER/test_dir
cd /work/$USER/test_dir
totalview aprun -a -B ./a.out

Replace [#procs] with the core count for the job. Note that totalview is licensed for a limited number of cores.

Note: when totalview starts it will bring 'aprun' up first. Click GO and YES.

More information about TotalView can be found in the product knowledge base at http://kb.roguewave.com/kb/. Complete TotalView documentation is available from vendor - Rogue Wave Software, Inc. - at http://www.roguewave.com/help-support/documentation/totalview. Good, in-depth documentation is available also from Lawrence Livermore National Laboratory at https://computing.llnl.gov/tutorials/totalview/.

TotalView Remote

TotalView Remote Display Client (RDC) provides integrated remote display capability to debug your code on a host machine. RDC is superior to X11 forwarding through SSH because of significant speed improvements and better GUI responsiveness. RDC launches TotalView on the supercomputer and displays it on the client machine.

For usage examples please go to notur.no/devtools

## Application optimization

### Performance optimization. General recommendations.

#### Compilation flags and environment settings

Correct optimization flags will be automatically selected if you use compiler wrappers and module craype-interlagos.

#### Enable FMA optimizations

Users should be aware that results obtained using FMA operations may differ in the lowest bits from results obtained on other x64 processors: the intermediate result fed from the multiplier to the adder is not rounded to 64 bits. See the article at PGI for details.

• Cray compiler: enabled with -hfp3 or -hfp2
• PGI: default when you use -tp=bulldozer
• GNU (set of optimizations): -march=bdver1 -Ofast -mprefer-avx128 -funroll-all-loops -ftree-vectorize

Please always verify that the result produced by the optimized version is correct. If it is not, try reducing optimizations.
Please also check the AMD Compiler Options Quick Reference Guide.

#### Recommended optimized libraries

The following modules are optimized by Cray and are therefore recommended to use:

#### Correct use of file systems

There is no local disk available on the compute nodes.

Only a shared file system is available - the /work file system, which is a Lustre FS. Note that this file system is not optimized to be accessed as local scratch. Please avoid many small reads/writes; instead access data in bigger chunks, creating well-formed I/O.

#### Dedicating FPUs per core

Due to the specific Interlagos design, one FPU is shared between 2 cores (see more about Bulldozer on Wikipedia).
If you do massive floating-point calculations, you can get a performance increase by dedicating one FPU per core. (This will double your CPU time usage.) This is how to do it:

If you are using the Cray compiler, you can make it aware of your plans with the corresponding compiler option.

With other compilers, just compile the code as usual.

Next you will need to properly allocate tasks per core with aprun and queuing system.

##### Run w/o OpenMP
#PBS -l mppwidth=xx
#PBS -l mppnppn=16
#PBS -l mppdepth=2
aprun -n xx -N 16 -d 2 -S 4 ./mycode

where xx is the number of cores you want to use.

(i.e. you pretend to use OpenMP on half the cores; this reserves one core in each pair to avoid sharing an FPU. The "-S 4" option says how many cores to put in each NUMA domain, of which there are 4 on a node.)

##### Run with OpenMP or with precise placement

If you are using OpenMP you need to explicitly map your cores (since the depth given with -d is used for the OpenMP threads); the equivalent of the above "-d 2 -N 16" is:

#PBS -l mppwidth=xx
#PBS -l mppnppn=16
aprun -n xx -N 16 -S 4 -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode

This example will use 16 MPI processes per node, one per FPU, leaving the other 16 cores on each node unused.

You may also want to add "-ss", which enforces strict memory placement (a process is not allowed to have memory placed in another NUMA domain), though depending on your memory usage you may then get an out-of-memory error. Avoiding cross-NUMA memory access helps the code by lowering memory latency.
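Combining the options above into a single launch line (the rank count and binary name are illustrative):

```shell
# 32 MPI ranks on 2 nodes, one rank per FPU pair,
# 4 ranks per NUMA domain, strict memory placement
aprun -n 32 -N 16 -S 4 -ss \
      -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode
```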

### Performance analysis

#### List of tools and usage summary

##### Allinea Performance Reports

Allinea Performance Reports is a lightweight profiling tool, available on all NOTUR sites. It produces a single-page HTML file with a CPU, MPI, IO (not available on Hexagon) and memory breakdown, together with in-line recommendations.

For your program to be able to produce a report, it has to be re-linked with the "perf-reports" module loaded, e.g.:

cc my.c -o myexe

When you run your program precede "aprun" with "perf-reports":

perf-reports aprun -B myexe

At the end of the run an HTML file and a text file will be produced.

##### CrayPat

The Cray performance analysis tool.

CrayPat is a performance analysis tool for evaluating program execution on Cray systems. CrayPat consists of three major components:

• pat_build - used to instrument the program to be analyzed (see "man pat_build")
• pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution (see "man pat_report")
• Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report to further explore and visualize the data generated by instrumented program execution (see "man app2")

Example:

• Rebuild the application:
make clean
make
• Instrument the application to generate a sampling profile:
pat_build -O apa a.out

This will create an executable "a.out+pat".

• Run your application (in batch) using the executable "a.out+pat".

This will create the file "a.out+pat+<*>.xf".

• Create Sampling report files:
pat_report a.out+pat+<*>.xf > my_report.txt

This command will automatically create a report file "a.out+pat+<*>.ap2", which can be viewed by Apprentice2.
The command will also create two text files in ASCII format: "a.out+pat+<*>.apa" and "my_report.txt".

• For Hardware Counting, instrument application for further analysis:
pat_build -O a.out+pat+<*>.apa

This will create an executable "a.out+apa".

• Modify run script to run the executable "a.out+apa", and add the environment variables
export PAT_RT_MPI_SYNC=0
export PAT_RT_HWPC=[2|3|...]

Running this instrumented application will create a file "a.out+apa+<*>.xf".

• Convert raw data:
pat_report a.out+apa+<*>.xf > my_hwcp_report.txt

This command will automatically create a report file "a.out+apa+<*>.ap2", which can be viewed by Apprentice2. The command will also create a new text file in ASCII format: "my_hwcp_report.txt".

• View the results by Apprentice2:
app2 a.out+pat+<*>.ap2 &     -for visualizing sampling results
app2 a.out+apa+<*>.ap2 &     -for visualizing hardware counting results

Apprentice2 generates a variety of interactive graphical reports. For more information, see the man page:

man app2

This summary is based on the slides of Luiz DeRose at the Cray XT4 workshop.

More information can be found in the corresponding manpages (man intro_craypat) or at http://docs.cray.com.
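The CrayPat steps above can be sketched as one session (file names follow the pattern in the text; the batch runs themselves are omitted):

```shell
# 1. Rebuild and instrument for sampling
make clean && make
pat_build -O apa a.out                 # produces a.out+pat

# 2. Run a.out+pat in batch; it writes a.out+pat+<*>.xf

# 3. Generate the sampling report (also creates the .apa and .ap2 files)
pat_report a.out+pat+<*>.xf > my_report.txt

# 4. Re-instrument for hardware counters using the generated .apa
pat_build -O a.out+pat+<*>.apa         # produces a.out+apa

# 5. Run a.out+apa in batch with PAT_RT_HWPC set, then:
pat_report a.out+apa+<*>.xf > my_hwcp_report.txt

# 6. Visualize in Apprentice2
app2 a.out+pat+<*>.ap2 &
```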

##### IPM

Re-link your program with the IPM module loaded, so that the IPM library is added at link time:

cc -o a.out main.c

The next time you execute your binary, it will generate an IPM report.

To parse results into HTML:

ipm_parse -html IPM_result_file.0

More in-depth IPM usage is covered on the NOTUR pages.

## Parallel applications

### MPI

Hexagon has wrappers that should be used when compiling programs for the compute nodes; see the compiler section above for more information. These wrappers handle MPI automatically through a module called cray-mpich, which is based on mpich3.

If you want to change from the default compiler to another one (e.g. GNU or Intel), you can do that by changing the PrgEnv module, as described in the Modules section.

Not all MPI-2 features are supported, for a complete list - see:

man intro_mpi
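A minimal sketch, assuming a hypothetical MPI C source mpi_hello.c; the cc wrapper links cray-mpich automatically, and the program is launched with aprun inside a batch job:

```shell
# Compile with the wrapper; no mpicc and no -lmpich needed
cc -o mpi_hello mpi_hello.c

# Inside the job script, launch 32 ranks
aprun -n 32 ./mpi_hello
```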

### OpenMP

On Hexagon you can run OpenMP jobs within a single node only. Since Hexagon is intended for jobs with high core counts, the use of pure OpenMP is discouraged; see below for an explanation of the MPI/OpenMP hybrid.

To activate OpenMP directives, compile with:

Fortran:

| Flag | Compiler |
| --- | --- |
| -mp=nonuma | PGI |
| -h omp | Cray |
| -openmp | Intel |
| -fopenmp | GNU |

C and C++:

| Flag | Compiler |
| --- | --- |
| -mp | PGI |
| -h omp | Cray |
| -openmp | Intel |
| -fopenmp | GNU |

In the batch script set OMP_NUM_THREADS (replace "threads_per_node" with a value from 1 to 31):

export OMP_NUM_THREADS=threads_per_node

This number should correspond to the depth given to aprun:

aprun ... -d threads_per_node ...
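For example, a pure-OpenMP job script running 8 threads on a single node (resource values and binary name are illustrative):

```shell
#!/bin/bash
#PBS -l mppwidth=1
#PBS -l mppnppn=1
#PBS -l mppdepth=8
#PBS -l walltime=00:30:00

cd /work/$USER

export OMP_NUM_THREADS=8
aprun -n 1 -d 8 ./omp_program
```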

### Hybrid MPI/OpenMP

You can run a hybrid MPI + OpenMP job where MPI is used between the nodes and OpenMP within the node.

No special compiler directives are needed to activate MPI, but to activate the OpenMP directives, compile and link with the following.

Fortran:

 -mp=nonuma for the PGI compiler

C and C++:

 -mp for the PGI compiler

In the batch-script set

#PBS -l mppnppn=mpi_processes_per_node
#PBS -l mppwidth=total_mpi_processes
#PBS -l mppdepth=threads_per_process

These numbers should correspond to

aprun ... -n total_mpi_processes -N mpi_processes_per_node -d threads_per_process ...

Note: the mppnppn and mppdepth values must be chosen such that mppnppn x mppdepth <= 32.
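Putting the pieces together, a hybrid example with 4 MPI processes per node and 8 OpenMP threads each (4 x 8 = 32, satisfying the constraint above; names and sizes are illustrative):

```shell
#!/bin/bash
#PBS -l mppwidth=8
#PBS -l mppnppn=4
#PBS -l mppdepth=8
#PBS -l walltime=01:00:00

cd /work/$USER

export OMP_NUM_THREADS=8
# 8 ranks total, 4 per node, 8 threads per rank -> 2 nodes
aprun -n 8 -N 4 -d 8 ./hybrid_program
```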

## Checkpoint and restart of applications

To use the checkpointing feature the application must be compiled with blcr: