FAQ (Hexagon)

From HPC documentation portal
Revision as of 12:12, 12 November 2015 by Alexander Oltu (Talk | contribs) (Compiling with older PGI versions in CLE5.2UP02)

How do I log in on hexagon?

To log in on hexagon you need an SSH client installed on your desktop. The syntax for logging in depends on which client you use. From a Linux desktop, "ssh username@hexagon.hpc.uib.no" is sufficient. See also Secure Shell.

I typed my password wrong several times, and now it seems I cannot log in. Has my account been closed?

Your account has most likely not been closed. Your computer (IP address) has been temporarily blocked in our firewall to prevent brute-force attacks. Try again in 15 minutes. If you still cannot connect, please contact Support.

How do I compile my software with MPI?

When compiling your software on hexagon you have to use the compiler wrappers provided by Cray: ftn, cc, CC and f77. These wrappers include MPI.

See Application development (Hexagon) for more information.

How do I change the compiler?

By default you will have the PGI compiler loaded when you log on to the system. If this compiler for some reason does not work correctly or optimally for your program, you can change to GNU or PathScale. This is done with the module command.

For example, "module swap PrgEnv-pgi PrgEnv-gnu" changes the compiler from PGI to GNU. You will still use the same wrapper to compile your program.

If you are uncertain which compiler you are using, "module list" will show you a list of the modules you currently have loaded. Either PrgEnv-pgi, PrgEnv-gnu or PrgEnv-pathscale will be listed.
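
As a minimal sketch, the check above can also be scripted. The helper name current_prgenv is hypothetical, and the sample line fed to it is illustrative text, not real output from hexagon. The sketch also assumes that "module list" writes to stderr, as the classic Environment Modules implementation does.

```shell
# Hypothetical helper: read "module list" text on stdin and print the
# name of the first PrgEnv-* module found (without its version).
current_prgenv() {
    grep -o 'PrgEnv-[A-Za-z0-9_]*' | head -n 1
}

# Illustrative input only; on the system you would instead run:
#   module list 2>&1 | current_prgenv
printf '  1) modules  2) PrgEnv-pgi/1.2.3  3) pgi\n' | current_prgenv
```

The helper only looks for the PrgEnv- prefix, so it works the same way for PrgEnv-gnu and PrgEnv-pathscale.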

See Application development (Hexagon) for more information.

Mpiexec or mpirun does not seem to be available. How do I run my MPI program?

Cray systems do not use mpiexec or mpirun. Instead they provide aprun, which MUST be used in order to run programs on the compute nodes.

You have to provide aprun with some flags depending on how you want your software to run.

aprun -n 32 (-N 32) ./a.out
    

The above example will run a.out on 32 cpus, on ONE node. Hexagon has 32 cpus per node. (-N 32) is the default and could in this case be omitted, which is why it is shown in parentheses.

aprun -n 4 -N 2 ./a.out
    

The above example will run a.out on 4 cpus, where each node will use 2 cpus. Hence, TWO nodes will be used.

Please note that the last example uses only 4 cpus but will be charged as if it were running on all cpus of the two nodes. This is because each node is reserved completely for your job, that is, no other job can run on the free cpus.
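
The charging rule can be illustrated with a little shell arithmetic. This is only a sketch: the function names nodes_needed and charged_cores are hypothetical, and the figure of 32 cpus per node is taken from the text above.

```shell
# From the text above: hexagon has 32 cpus per node.
CORES_PER_NODE=32

# nodes_needed TOTAL PER_NODE
# Nodes used by "aprun -n TOTAL -N PER_NODE", rounded up.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# charged_cores TOTAL PER_NODE
# Cpus charged, since every node in use is reserved as a whole.
charged_cores() {
    echo $(( $(nodes_needed "$1" "$2") * CORES_PER_NODE ))
}

nodes_needed 4 2     # prints 2
charged_cores 4 2    # prints 64
charged_cores 32 32  # prints 32
```

So "aprun -n 4 -N 2" is charged for 64 cpus, twice as much as "aprun -n 32", although it uses far fewer of them.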

See Job execution (Hexagon) for more information.

When I try to run my software through aprun I get the error message: "No such file or directory" to my home directory. What is wrong?

The /home directory is not mounted on the compute nodes. In order to run your software, the executable has to be located somewhere under /work/$USER. Additionally, your current directory must also be somewhere in the /work file system.
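
A job script could guard against this mistake before calling aprun. The following is a minimal sketch; the function name is_under_work is hypothetical, and the check is purely textual (it does not resolve symbolic links).

```shell
# Hypothetical helper: succeed if the given path starts with /work/.
# Textual check only; symbolic links are not resolved.
is_under_work() {
    case "$1" in
        /work/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Warn if the current directory is outside the /work file system.
if ! is_under_work "$PWD"; then
    echo "warning: current directory $PWD is not under /work" >&2
fi
```

The same function can be applied to the full path of the executable before it is passed to aprun.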

I get this strange error message. What am I doing wrong?

The error message:

[unset]: _pmi_init: _pmi_preinit encountered an internal error
Assertion failed in file /tmp/ulib/mpt/nightly/3.1/040709/mpich2/src/mpid/cray/src/adi/mpid_init.c 
at line 178: 0  aborting job:
(null)

This error message (or something very similar) is returned if you try to run a program that was compiled for the compute nodes on a login node. The program has to be executed with aprun. See Job execution (Hexagon) for more information about the batch system and aprun.

What is the OOM killer?

The error message:

_pmii_daemon(SIGCHLD): PE 4 exit signal Killed
[NID 21]Apid 611039: initiated application termination
[NID 00021] Apid 611039: OOM killer terminated this process.

This error message is returned in the output if your program uses more memory than is available on one or several compute nodes. OOM stands for Out-Of-Memory. The OOM killer is a standard Linux kernel feature that kills a process that uses up all the memory on a machine. The issue can be fixed or worked around in several ways. Running on more (or sometimes fewer) cores per job can help reduce the memory used per core and give the job as a whole more memory. You can also ask for more memory per core. You control this with the batch system parameters mppmem ("memory per core") and mppnppn ("cores per node"). See Job execution (Hexagon) for more information about the batch system and aprun.
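
To see why placing fewer cores on each node helps, consider the following sketch. The function name mem_per_core_mb is hypothetical, and the node memory of 32768 MB is an assumed figure used only for illustration; check the hardware documentation for the real value. The result ignores memory needed by the operating system.

```shell
# mem_per_core_mb NODE_MEM_MB CORES_PER_NODE
# Memory available to each core if the node memory is divided evenly.
mem_per_core_mb() {
    echo $(( $1 / $2 ))
}

# 32768 MB per node is an ASSUMED value for illustration only.
mem_per_core_mb 32768 32   # prints 1024
mem_per_core_mb 32768 16   # prints 2048
```

Halving the number of cores used per node (the mppnppn value) doubles the memory each remaining core can use, at the cost of occupying more nodes for the same number of cores.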

How do I change my password?

The passwords on hexagon are stored in a read-only file system, so it is not possible for a user to change the password directly. Please contact Support.

I do not have enough disk quota to run my job. What should I do?

If you are running out of space in your home folder, or your job produces a large amount of stdout and stderr output, it is recommended to redirect stdout and stderr into a file on the /work file system. If possible, avoid producing stdout or stderr output from the application altogether, since it creates a high load on the login node and may slow down your program. Some example usage (note that your shell may require a different syntax):

# Both stderr and stdout into one file
aprun .... >& /work/$USER/combined.out
# Stderr and stdout into different files
aprun .... >/work/$USER/app.out 2>/work/$USER/app.err
# Stdout into file and dropping stderr
aprun .... >/work/$USER/app.out 2>/dev/null
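
If your shell does not accept ">&", the combined redirection can be written in plain POSIX sh syntax. The sketch below uses a stand-in function, fake_app, in place of "aprun ....", and a temporary directory in place of /work/$USER, so that it can be tried anywhere; both are assumptions for illustration.

```shell
# Stand-in for "aprun .... ./a.out": writes one line to each stream.
fake_app() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Temporary directory used here instead of /work/$USER so the sketch
# runs on any machine.
out_dir=$(mktemp -d)

# Both stderr and stdout into one file (POSIX sh syntax)
fake_app > "$out_dir/combined.out" 2>&1

# Stdout into a file and dropping stderr
fake_app > "$out_dir/app.out" 2>/dev/null

cat "$out_dir/combined.out"
```

Note that the order matters: "2>&1" must come after the redirection of stdout, otherwise stderr still goes to the terminal.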

How do I get compiler version information?

If the executable has been compiled with debugging symbols, you can fetch the compiler version from it:

readelf -wi a.out|grep DW_AT_producer

How can I see all flags included by the compiler wrapper?

When you change modules or module versions, the flags used by the compiler wrapper change as well.

To see all flags that the wrapper passes to the compiler, simply execute the wrapper command with -v:

cc -v

I get the error "Illegal instruction"

You get "Illegal instruction" because the module xtpe-interlagos enables optimizations for Interlagos CPUs.
The login nodes have Istanbul CPUs. Hence, if you compile code with the module xtpe-interlagos loaded and try to run it on a login node, you will get "Illegal instruction", while it will run fine on the compute nodes.
When you compile your code for the login node, you have to unload the xtpe-interlagos module.

I get error "/opt/pgi/VERSION/linux86-64/12.2/libso/libpgc.so: undefined reference to `_mp_slave'"

You are probably compiling with pgf90 (or similar) instead of the ftn wrapper. To work around this, either add "-L /opt/pgi/default/linux86-64/default/libso -lpgmp" when linking, or export CRAYPE_LINK_TYPE=dynamic and use ftn to link the objects.

I get issues with bit-reproducibility after switching compiler from PGI 12 to PGI 13 or later.

PGI version 13 introduced vectorisation with the "-O2" flag. This significantly improves performance, but results may no longer be bit-reproducible.

If you have issues with this, you can try adding the "-Mnovec" or "-Mvect=nosse" flag when compiling.
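
To check whether two builds really produce identical results, the output files can be compared byte by byte. This is a minimal sketch: the function name same_bits and the file names are hypothetical, and the sample files are created in a temporary directory only so that the example can be run anywhere.

```shell
# same_bits FILE_A FILE_B
# Succeed only if the two files are byte-for-byte identical.
same_bits() {
    cmp -s "$1" "$2"
}

# Illustrative data standing in for the output of two program runs.
tmp=$(mktemp -d)
printf '1.000000\n' > "$tmp/run_a.dat"
printf '1.000000\n' > "$tmp/run_b.dat"
printf '1.000001\n' > "$tmp/run_c.dat"

same_bits "$tmp/run_a.dat" "$tmp/run_b.dat" && echo "a and b: identical"
same_bits "$tmp/run_a.dat" "$tmp/run_c.dat" || echo "a and c: differ"
```

Comparing the output of a build with and without "-Mnovec" in this way shows whether vectorisation is the source of the difference.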

Compiling with older PGI versions in CLE5.2UP02

Static linking with PGI 12 is no longer supported due to security updates in GLIBC. Dynamic linking should still work. The following method applies when compiling software with older PGI versions (12 and 13).

module unload PrgEnv-cray 
module load PrgEnv-pgi 
module swap pgi pgi/12.10.0 
module swap cray-mpich cray-mpich2 
module swap cray-libsci cray-libsci/12.2.0 
module swap xtpe-interlagos craype-barcelona 
export CRAYPE_LINK_TYPE=dynamic 

For PGI 12, to avoid linking errors you will in addition have to either:

  1. specify the '-mp=nonuma' option;
  2. or create a dynamic executable, using the '-dynamic' flag or setting the environment variable with "export CRAYPE_LINK_TYPE=dynamic".

When you run your code, you will need to repeat the module operations above and add:

export CRAY_ROOTFS=DSL

NOTE: Please always try to use the latest PGI compiler and use the above method only when this is really needed.

I get module: command not found in my job script.

Most probably your default shell is /bin/tcsh while your job script header has #!/bin/bash. If this is the case, you can add the following to your job script:

#PBS -S /bin/bash

Alternatively, you can contact Support to have your default shell changed to /bin/bash.

More on this in Job execution (Hexagon)#Recommended environment variable settings.

Can I use the resources_used.cput, resources_used.mem and resources_used.vmem information from the job output file?

No. This information is incorrect and reflects only login node usage during job script execution.

Why does my job wait so long in the queue before it starts?

Please check the Queue priorities (Hexagon) page for a better understanding of the queuing system on Hexagon.

I have: export OMP_NUM_THREADS=32. Why is the program so much slower on hexagon?

The default core allocation policy on Hexagon is to use a shared FPU, while on other machines you probably get 1 FPU per core. Floating-point-heavy code can therefore be up to 2 times slower on Hexagon. If your program does floating-point work, we recommend using 1 FPU per core:

Application_development_(Hexagon)#Dedicating_FPUs_per_core

and/or FMA,AVX:

Application_development_(Hexagon)#Enable_FMA_optimizations

PS. Properly compiled and executed code should not show any big differences in arithmetic performance per core between CPUs of similar generations (only a slight one due to CPU frequency differences).
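
As a rough sketch of the arithmetic behind the recommendation: if two cores share one FPU (an assumption based on the factor of 2 mentioned above), a 32-cpu node offers 16 FPUs, so 16 threads can each have an FPU to themselves. The function name threads_with_own_fpu is hypothetical. Setting the thread count alone does not control where threads are placed; the placement options are described on the Application development (Hexagon) page and are not reproduced here.

```shell
CORES_PER_NODE=32   # from this FAQ
CORES_PER_FPU=2     # ASSUMPTION: two cores share one FPU

# Number of threads per node that can each use a dedicated FPU.
threads_with_own_fpu() {
    echo $(( CORES_PER_NODE / CORES_PER_FPU ))
}

OMP_NUM_THREADS=$(threads_with_own_fpu)
export OMP_NUM_THREADS
echo "$OMP_NUM_THREADS"   # prints 16
```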

My problem is not listed here. What do I do?

Send an email to our administrators at Support describing your problem. It helps if you provide the number of the job that failed and the paths to the output file, error file, submit script and Makefile. One of the engineers will then help you as soon as possible.