Application development (Hexagon)


Modules

Environment Modules allows you to dynamically modify your user environment using information provided by "modulefiles". This makes it easy to switch between environments or settings, e.g. the Intel compiler environment and the PGI compiler environment. If you have problems during compiling, running the "module list" command can help you see whether any environment modules are missing or wrong.

When writing a PBS job script (see Job execution for more information), the desired environment has to be set inside the script using module commands, because the user environment is not inherited by the PBS script. The same applies to interactive jobs (i.e. qsub -I).
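
As a sketch, a minimal job script could look like the following (the account name, core count, work directory and program name are placeholders; adjust them to your own job):

#!/bin/bash
#PBS -A account_name
#PBS -l mppwidth=32,walltime=00:10:00
# load the required modules here, inside the script,
# since the login environment is not inherited
module swap PrgEnv-cray PrgEnv-intel
module load netcdf
cd /work/$USER/mycase
aprun -n 32 ./myprog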

The "module" command have several subcommands, e.g. "module avail".

The following list shows some of the subcommands used with "module".

Subcommand                Description
avail                     Lists all available modules
list                      Lists the modules you are using
load "module_name"        Loads module "module_name"
unload "module_name"      Unloads module "module_name"
show "module_name"        Displays the configuration settings of "module_name"
swap "old_mod" "new_mod"  Unloads "old_mod" and loads "new_mod"

To load the netcdf module into your environment you type:

module load netcdf

If you want a specific version of the module you instead specify:

module load netcdf/3.6.2

Please avoid using version numbers unless strictly necessary since older versions of packages may be removed at a later time.

If you want to change from the Cray compiler (default) to the Intel compiler you type:

module swap PrgEnv-cray PrgEnv-intel

You should also use swap if you want to load a different version of the same module. For example, this replaces your currently loaded pgi version with 12.2.0:

module swap pgi pgi/12.2.0

A complete list of subcommands can be found in the module man page ("man module").

Please note: if the module command does not work inside your job scripts, add the line "export -f module" to your ~/.bashrc file. This should be set automatically for new users and is only valid if your shell is bash. For other shells you can source the corresponding file in /opt/modules/default/init/ inside your qsub script before you use any "module" command.
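
If the module function is still not defined inside a bash job script, a workaround sketch is to source the Modules init file directly before any module command (assuming a bash init file exists in the directory mentioned above):

#!/bin/bash
# make the "module" function available inside the script
source /opt/modules/default/init/bash
module load netcdf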

Compilers and programming languages

Four different compilers are available on Hexagon:

  • Cray (default)
  • PGI
  • GNU
  • Intel

All compilation for the compute nodes must be done using the compiler wrappers. To switch between compilers, use the module command:

module switch PrgEnv-pgi PrgEnv-gnu

By default the latest available version will be loaded. You can switch to another compiler version with e.g.:

module switch pgi pgi/12.2.0

How to invoke the compiler

Compiling an application for use on the compute nodes should be done with the wrappers specified below. Running the command "module list" will show one entry like "PrgEnv-###", where ### is either cray, pgi, gnu or intel.

Compiling programs for compute nodes

When using the compiler wrappers, the wrappers take care of MPI and all additional modules switches/settings automatically.

Compute node compiler wrappers
Fortran 90/95 programs ftn
Fortran 77 programs f77
C programs cc
C++ programs CC

NOTE: These wrappers also handle MPI and OpenMP, so you should not compile with mpicc, mpif90 or similar, nor should you need to add any reference to MPI libraries in CFLAGS or similar variables.

Compiling the C program test.c can be done by the command:

cc -o test.out test.c

where test.out is the chosen name of the executable file.
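
Similarly, a Fortran program (here a hypothetical source file test.f90) can be compiled with the ftn wrapper:

ftn -o test.out test.f90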

Compiling programs for login nodes

When compiling for the login nodes, the executable will not be able to run on the compute nodes, nor will OpenMP or MPI be supported.

The general rule in this case is to call the compiler directly (like pgcc for PGI).

NOTE: You can compile code for the login nodes using the compute node wrappers, but keep in mind that in this case you will also link in MPI and other libraries that are loaded as modules.
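
As a sketch, the same test.c could be compiled for the login nodes only by calling the PGI compiler directly (assuming PrgEnv-pgi is loaded):

pgcc -o test_login.out test.c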

Compiler version(s)

Currently installed Programming Environments for compilers:

Compiler Module
Cray PrgEnv-cray
PGI PrgEnv-pgi
GNU PrgEnv-gnu
Intel PrgEnv-intel

Frequently used compiler options

Compiling OpenMP programs

To activate OpenMP directives, compile and link with:

Fortran:

-mp=nonuma for the PGI compiler

C and C++:

-mp for the PGI compiler
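
For example, a hypothetical OpenMP C program omp_test.c could be compiled through the wrapper with the PGI environment loaded:

cc -mp -o omp_test.out omp_test.c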

Recommended compiler options

Normally if you use compiler wrappers all recommended options will be included.

In some cases you may need to use "--enable-static" during configure for running on compute nodes.
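
A sketch of such a configure invocation for a typical autotools package, pointing configure at the compiler wrappers (the package directory and install prefix are placeholders):

./configure CC=cc FC=ftn --enable-static --prefix=$HOME/sw/mypackage
make
make install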

Useful optimization flags for the AMD "Interlagos"

When using PGI, the "-tp bulldozer-64" flag will improve the performance of your code. These options are provided automatically by the craype-interlagos module. NOTE: to compile code that should run on the login nodes, this module should NOT be loaded.

Recommended environment variable settings

We recommend having the craype-interlagos module loaded. It will automatically add the recommended optimization flags. NOTE: to compile code that should run on the login nodes, this module should NOT be loaded.

Additionally, the "cray-libsci" module contains optimized versions of common scientific/math libraries (e.g. LAPACK, BLAS).
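
For example, a hypothetical Fortran program solver.f90 that calls LAPACK/BLAS routines can be built without explicit -l flags, since the wrapper adds the link options when the module is loaded:

module load cray-libsci
ftn -o solver.out solver.f90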

Debugging tools

List of tools and usage summary

Several tools are available on Hexagon for debugging.

ATP

Abnormal Termination Processing (ATP) is a system that monitors Cray XT system user applications; should an application take a system trap, ATP performs analysis on the dying application. With release 1.0, all of the stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file "atpMergedBT.dot". The stack backtrace of the first process to die is sent to stderr, as is the number of the signal that caused the death.

You can load the ATP environment with:

module load atp

Further information on ATP can be found in the intro_atp man page:

man intro_atp
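
A sketch of a typical ATP workflow (the source file, core count and job parameters are placeholders; the ATP_ENABLED runtime variable is an assumption based on common Cray setups):

# relink the application with the atp module loaded
module load atp
cc -o myapp.out myapp.c

# inside the job script, enable ATP at run time
export ATP_ENABLED=1
aprun -n 32 ./myapp.out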

Lgdb

This gdb-based debugger and launcher allows users to attach to and debug codes that execute multiple processes or threads.

You can load the lgdb environment with:

module load cray-lgdb

Usage documentation can be found in the manpage:

man lgdb

The following example shows how to connect to an already running program:

qstat -f JOBID | grep exec_host
ssh loginX                      # use the login node listed in exec_host above
ps x | grep aprun               # find your aprun process
module load cray-lgdb
# to connect to the first rank
lgdb --pes=0 --pid=APRUNPID     # use the aprun PID found with "ps x" above
# to connect to a range of ranks (first to 8th)
lgdb --pes=0-7 --pid=APRUNPID   # use the aprun PID found with "ps x" above

TotalView

TotalView is a graphical, source-level, multiprocess debugger.

The license is limited in the number of cores; the maximum is 66.

When using this debugger you need to turn on X forwarding when you log in via ssh. This is done by adding -Y with newer ssh versions, or -X with older ones. The following is an example using a newer version of ssh:

ssh -Y username@hexagon.bccs.uib.no

If you don't know whether you have an old or new version of ssh, run "man ssh" and look for the explanation of "-X" and/or "-Y".

The program you want to debug has to be compiled with the debug option. Normally this is the "-g" option, but it depends on the compiler. The executable from this compilation will be called "filename" in the following examples.

First, load the totalview module to get the correct environment variables set:

module load xt-totalview

If you are going to run TotalView on more than 64 cores (up to 512):

module load xt-totalview-notur

To start debugging, run:

totalview "filename"

This will start a graphical user interface.

Once inside the debugger, if you cannot see any source code and you keep the source files in a separate directory, add the search path to that directory via the main menu item File->Search path.

Source lines where it is possible to insert a breakpoint are marked with a box in the left column. Click on a box to toggle a breakpoint.

Double-clicking a function/subroutine name in a source file should open that source file. You can go back to the previous view by clicking the left arrow at the top of the window.

[Screenshot: TotalView main window (Tv-full.png)]

The button "Go" runs the program from the beginning until the first breakpoint. "Next" and "Step" takes you one line / statement forward. "Out" will continue until the end of the current subroutine/function. "Run to" will continue until the next breakpoint.

The value of variables can be inspected by right-clicking on the name and choosing "add to expression list". The variable will then be shown in a pop-up window. Scalar variables are shown with their value, arrays with their dimensions and type. To see all values in an array, right-click on the variable in the pop-up window and choose "dive". You can then scroll through the list of values. Another useful option is to visualize the array: after choosing "dive", open the menu item "Tools->Visualize" in the pop-up window. If you did this with a 2D array, use the middle button and drag the mouse to rotate the surface that popped up, shift+middle button to pan, and Ctrl+middle button to zoom in/out.

Running totalview inside the batch system (compute nodes)

qsub -I -l mppwidth=[#procs],walltime=[time] -A [account] -j oe -X   # interactive job with X forwarding
mkdir -p /work/$USER/test_dir
cp $HOME/test_dir/a.out /work/$USER/test_dir    # copy the executable to /work, which is available on the compute nodes
cd /work/$USER/test_dir
module load xt-totalview
totalview aprun -a -B ./a.out                   # -a passes the remaining options to aprun

Replace [#procs] with the core count for the job. Note that TotalView is licensed for a limited number of cores.

Note: when TotalView starts, it will bring up "aprun" first. Click GO and YES.

More information about TotalView can be found in the product knowledge base at http://www.roguewave.com/support/knowledge-base.aspx
TotalView documentation is available at http://www.roguewave.com/support/product-documentation/totalview.aspx#totalview

Application optimization

General recommendations for performance optimization.

Compilation flags and environment settings

The correct optimization flags will be selected automatically if you use the compiler wrappers and the craype-interlagos module.

Enable FMA optimizations

Users should be aware that results obtained using FMA operations may differ in the lowest bits from results obtained on other x64 processors, because the intermediate result fed from the multiplier to the adder is not rounded to 64 bits. See the article at PGI.

  • Cray compiler: enabled with -hfp3 or -hfp2
  • PGI compiler: enabled by default when you use -tp=bulldozer
  • GNU compiler (set of optimizations): -march=bdver1 -Ofast -mprefer-avx128 -funroll-all-loops -ftree-vectorize
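
As a sketch, compiling a hypothetical myprog.c through the wrapper with these flags:

# with PrgEnv-cray loaded
cc -hfp3 -o myprog.out myprog.c

# with PrgEnv-gnu loaded
cc -march=bdver1 -Ofast -mprefer-avx128 -funroll-all-loops -ftree-vectorize -o myprog.out myprog.c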

Please always verify that the results produced with the optimized version are correct. If they are not, try reducing the optimizations.
Please also check the AMD Compiler Options Quick Reference Guide.

Recommended optimized libraries

Modules provided by Cray, such as cray-libsci (see the previous section), contain optimized libraries and are therefore recommended.

Correct use of file systems

There is no local disk available on the compute nodes.

Only a shared file system is available: the /work file system, which is a Lustre file system. Note that this file system is not optimized for use as local scratch. Please avoid many small reads/writes per chunk; instead access the data in larger chunks, creating well-formed I/O.

Dedicating FPUs per core

Due to the Interlagos design, one FPU is shared between 2 cores (see more about Bulldozer on Wikipedia).
If your code does heavy floating-point computation, you can get a performance increase by dedicating one FPU to each core. (This will double your CPU time usage.) This is how to do it:

If you are using the Cray compiler, you can make it aware of this by loading:

module load craype-interlagos-cu

With other compilers, just compile the code as usual.

Next, you will need to allocate tasks per core properly with aprun and the queuing system.

Run without OpenMP

#PBS -l mppwidth=xx
#PBS -l mppnppn=16
#PBS -l mppdepth=2
aprun -n xx -N 16 -d 2 -S 4 ./mycode

where xx is the number of cores you want to use.

(That is, you pretend to use OpenMP on half the cores; this reserves one core in each pair to avoid sharing an FPU. The "-S 4" option specifies how many cores to put in each NUMA domain, of which there are 4 on a node.)

Run with OpenMP or with precise placement

If you are using OpenMP you need to map your cores explicitly (since the depth given with -d is used for OpenMP); the equivalent of the "-d 2 -N 16" above is:

#PBS -l mppwidth=xx
#PBS -l mppnppn=16
aprun -n xx -N 16 -S 4 -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode

This example will use 16 MPI processes per node, 1 per FPU, leaving 16 cores per node unused.

You may also want to add "-ss", which enforces strict memory placement (memory is not allowed to be placed in another NUMA domain), though depending on your memory usage you may then get an out-of-memory error. Avoiding cross-NUMA memory access helps the code by lowering the latency to memory.
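
For example, the placement line above with strict memory containment added would look like this:

aprun -n xx -N 16 -S 4 -ss -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode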

Performance analysis

List of tools and usage summary

Allinea Performance Reports

Allinea Performance Reports is a lightweight profiling tool, available on all NOTUR sites. It produces a single-page HTML file with a CPU, MPI, I/O (not available on Hexagon) and memory breakdown, together with in-line recommendations.

The instructions below are for static linking. If you are using dynamic linking, please contact us for help.

In order for your program to be able to produce a report, it has to be re-linked with the "perf-reports" module loaded, e.g.:

module load perf-reports
cc my.c -o myexe

When you run your program, precede "aprun" with "perf-reports":

module load perf-reports
perf-reports aprun -B myexe

At the end of the run an HTML file and a text file will be produced.


You can find more information about Allinea Performance Reports at http://www.allinea.com/products/performance

CrayPat

The Cray performance analysis tool.

CrayPat is a performance analysis tool for evaluating program execution on Cray systems. CrayPat consists of three major components:

  • pat_build - used to instrument the program to be analyzed (see "man pat_build")
  • pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution (see "man pat_report")
  • Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report to further explore and visualize the data generated by instrumented program execution (see "man app2")

Example:

  • Load the newest version of CrayPat:
module load perftools
  • Compile your application:
make clean
make 
  • Instrument the application to generate a sampling profile:
pat_build -O apa a.out

This will create an executable "a.out+pat".

  • Run your application (in batch) using the executable "a.out+pat" (a minimal job script sketch is shown after this walkthrough).

This will create the file "a.out+pat+<*>.xf".

  • Create Sampling report files:
pat_report a.out+pat+<*>.xf > my_report.txt

This command will automatically create a report file "a.out+pat+<*>.ap2", which can be viewed with Apprentice2.
The command will also create two text files in ASCII format: "a.out+pat+<*>.apa" and "my_report.txt".

  • For Hardware Counting, instrument application for further analysis:
pat_build -O a.out+pat+<*>.apa

This will create an executable "a.out+apa".

  • Modify run script to run the executable "a.out+apa", and add the environment variables
export PAT_RT_MPI_SYNC=0
export PAT_RT_HWPC=[2|3|...] 

Running this instrumented application will create a file "a.out+apa+<*>.xf".

  • Convert raw data:
pat_report a.out+apa+<*>.xf > my_hwcp_report.txt

This command will automatically create a report file "a.out+apa+<*>.ap2", which can be viewed with Apprentice2. The command will also create a new text file in ASCII format: "my_hwcp_report.txt".

  • View the results by Apprentice2:
app2 a.out+pat+<*>.ap2 &     -for visualizing sampling results
app2 a.out+apa+<*>.ap2 &     -for visualizing hardware counting results
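
As referenced in the walkthrough above, a minimal job script sketch for running the instrumented executables could look like this (account name, core count and working directory are placeholders; the hardware-counter variables apply only to the second pass with "a.out+apa"):

#!/bin/bash
#PBS -A account_name
#PBS -l mppwidth=32,walltime=00:20:00
cd /work/$USER/craypat_test
module load perftools
# first pass: sampling
aprun -n 32 ./a.out+pat
# second pass: hardware counters (after rebuilding with the .apa file)
# export PAT_RT_MPI_SYNC=0
# export PAT_RT_HWPC=2
# aprun -n 32 ./a.out+apa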

Apprentice2 generates a variety of interactive graphical reports. For more info, see the man page:

man app2

This summary is based on the slides of Luiz DeRose at the Cray XT4 workshop.

More information can be found in the corresponding manpages (man intro_craypat) or at http://docs.cray.com.

IPM

You can find the short version for Hexagon below. Loading the module will add all required libraries for linking to the cc wrapper.

module load ipm
cc -o a.out main.c

The next time you execute your binary, it will generate an IPM report.
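
A minimal run sketch (core count and working directory are placeholders):

#PBS -l mppwidth=32,walltime=00:10:00
cd /work/$USER/ipm_test
aprun -n 32 ./a.out
# an IPM result file (e.g. the IPM_result_file used below) appears in the working directory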

To parse results into HTML:

module load ipm
ipm_parse -html IPM_result_file.0

More detailed IPM usage is covered on the NOTUR pages:

IPM userguide

Parallel applications

MPI

Hexagon has wrappers that should be used when compiling programs for the compute nodes; see the compiler section above for more information. These wrappers handle MPI automatically, using a module called cray-mpich, which is based on MPICH3.

If you want to change from the default Cray compiler to PGI, GNU or Intel, you can do that by changing the PrgEnv module. This is done by using modules.

Not all MPI-2 features are supported; for a complete list, see:

man intro_mpi

OpenMP

On Hexagon you can run OpenMP jobs within a single node only, i.e. on at most the 32 cores of one node. Since Hexagon is intended for jobs with high core counts, the use of pure OpenMP is discouraged; see below for an explanation of the MPI/OpenMP hybrid approach.

To activate OpenMP directives, compile with:

Fortran:

-mp=nonuma for the PGI compiler
-h omp     for the Cray compiler
-openmp    for the Intel compiler
-fopenmp   for the GNU compiler

C and C++:

-mp       for the PGI compiler
-h omp    for the Cray compiler
-openmp   for the Intel compiler
-fopenmp  for the GNU compiler

In the batch script, set the following (replace "threads_per_node" with a value between 1 and 31):

#PBS -l mppnppn=1,mppwidth=1,mppdepth=threads_per_node
export OMP_NUM_THREADS=threads_per_node

This number should correspond to

aprun ... -d threads_per_node ...
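
Putting this together, a sketch of a pure OpenMP job script using 8 threads (account name, work directory and executable name are placeholders):

#!/bin/bash
#PBS -A account_name
#PBS -l mppnppn=1,mppwidth=1,mppdepth=8,walltime=00:10:00
cd /work/$USER/omp_test
export OMP_NUM_THREADS=8
aprun -n 1 -d 8 ./omp_test.out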

Hybrid MPI/OpenMP

You can run a hybrid MPI + OpenMP job where MPI is used between the nodes and OpenMP within the node.

No special compiler directives are needed to activate MPI, but to activate the OpenMP directives, compile and link with the following.

Fortran:

-mp=nonuma for the PGI compiler

C and C++:

-mp for the PGI compiler

In the batch script, set:

#PBS -l mppnppn=mpi_processes_per_node
#PBS -l mppdepth=threads_per_node
#PBS -l mppwidth=total_mpi_processes
export OMP_NUM_THREADS=threads_per_node

These numbers should correspond to

aprun ... -n total_mpi_processes -N mpi_processes_per_node -d threads_per_node ...

Note: the mppnppn and mppdepth values must be chosen such that mppnppn x mppdepth <= 32.
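
Putting this together, a sketch of a hybrid job script using 4 MPI processes per node with 8 OpenMP threads each, 64 MPI processes in total (account name, work directory and executable name are placeholders):

#!/bin/bash
#PBS -A account_name
#PBS -l mppwidth=64
#PBS -l mppnppn=4
#PBS -l mppdepth=8
#PBS -l walltime=00:30:00
cd /work/$USER/hybrid_test
export OMP_NUM_THREADS=8
aprun -n 64 -N 4 -d 8 ./hybrid.out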

Checkpoint and restart of applications

To use the checkpointing feature, the application must be compiled with the blcr module loaded:

module load blcr

With the module loaded, all necessary options will be added automatically by the compiler wrapper. Please recompile your application to include BLCR support. Note that only the MPI and SHMEM programming models are supported.

The Cray checkpoint/restart solution uses the BLCR software from Berkeley Lab and inherits its limitations. For more information, refer to the BLCR documentation: http://upc-bugs.lbl.gov/blcr/doc/html/index.html.

The job must be submitted with the "-c enabled" parameter. Please see Job execution (Hexagon)#List of useful job script parameters.
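
A sketch of the full workflow (the source file name and job script name are placeholders):

module load blcr
ftn -o myapp.out myapp.f90      # recompile so the wrapper adds BLCR support
qsub -c enabled job.pbs         # submit with checkpointing enabled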

Recommended reading

Cray XT Programming Environment User's Guide - contains everything needed to start working on the Cray XT machine, with examples.