Running Hybrid MPI Programs


There are two techniques in common use for parallelising programs - MPI and OpenMP. Each of these has different benefits and drawbacks - it is common to use them together to create a 'hybrid' program using MPI for large-scale parallelism between nodes and OpenMP for parallelism within a single compute node.

These instructions focus on how to run such programs (rather than how to design them, which is rather more complicated). It also shows how to run multiple such programs alongside each other (for instance if you're coupling multiple models).

Threads, Cores, Sockets and Nodes


The CPUs in a supercomputer are part of a hierarchy of computing elements. At the highest level you have the supercomputer as a whole, a large room full of computers all wired together. Each computer within the supercomputer is called a 'Node', and is much like the computer on your desk. It has a pair of CPUs, RAM and a hard drive, however these are much more efficient than the ones normally used in consumer hardware.

Each physical CPU on the node is referred to as a 'Socket' in MPI parlance. The CPUs themselves are made up of multiple 'Cores' (8 cores per CPU on Raijin), each of these can independently perform computations. Going even further down than this it is possible for each core to use hyperthreading to run multiple 'Threads' of instructions at once, however this can be inefficient for large numerical codes so is disabled on Raijin.

MPI can send messages between processes at any level, so for instance two processes running on different nodes can communicate (although there is more latency the further up the hierarchy you need to go - communications are faster within a node than between nodes and communications between cores in a socket are faster still). On the other hand OpenMP only works within a single node since it relies on the node's memory being shared amongst the processes.

Affinity


An important concept when relating parallel programs to the hardware it runs on is processor affinity. Affinity is the mapping of processes (MPI ranks, OpenMP threads) to hardware (CPU sockets and cores). If processes aren't locked to a single core then the operating system can move them around, adding an overhead as the state of the core is saved & then loaded on a new core. Both MPI and OpenMP have ways to specify how the processes should be assigned to CPU cores.

Laying out MPI processes


By using different options to the mpirun command you are able to control how MPI assigns each process to physical cores. A useful option to use for this is --report-bindings, which shows a schematic of the process layout when you run a command.
$ mpirun --report-bindings -n 8 echo "Hello"
[raijin1:11266] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .]
[raijin1:11266] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . .][. . . . . . . .]
[raijin1:11266] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . .][. . . . . . . .]
[raijin1:11266] MCW rank 3 bound to socket 0[core 3]: [. . . B . . . .][. . . . . . . .]
[raijin1:11266] MCW rank 4 bound to socket 0[core 4]: [. . . . B . . .][. . . . . . . .]
[raijin1:11266] MCW rank 5 bound to socket 0[core 5]: [. . . . . B . .][. . . . . . . .]
[raijin1:11266] MCW rank 6 bound to socket 0[core 6]: [. . . . . . B .][. . . . . . . .]
[raijin1:11266] MCW rank 7 bound to socket 0[core 7]: [. . . . . . . B][. . . . . . . .]
Hello
...

The default MPI layout on Raijin is --bycore, MPI processes will be assigned to consecutive cores, filling sockets one by one. An alternative is --bysocket, which will assign to consecutive sockets (looping to the next core on the socket once it reaches the end):
$ mpirun --report-bindings --bysocket -n 8 echo "Hello"
[raijin1:11766] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .]
[raijin1:11766] MCW rank 1 bound to socket 1[core 0]: [. . . . . . . .][B . . . . . . .]
[raijin1:11766] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . .][. . . . . . . .]
[raijin1:11766] MCW rank 3 bound to socket 1[core 1]: [. . . . . . . .][. B . . . . . .]
[raijin1:11766] MCW rank 4 bound to socket 0[core 2]: [. . B . . . . .][. . . . . . . .]
[raijin1:11766] MCW rank 5 bound to socket 1[core 2]: [. . . . . . . .][. . B . . . . .]
[raijin1:11766] MCW rank 6 bound to socket 0[core 3]: [. . . B . . . .][. . . . . . . .]
[raijin1:11766] MCW rank 7 bound to socket 1[core 3]: [. . . . . . . .][. . . B . . . .]
Hello
...

There are also options for controlling the maximum number of processes on a socket or node --npersocket N and --npernode N, e.g.:
$ mpirun --report-bindings --npersocket 4 -n 8 echo "Hello"
[raijin3:01591] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .]
[raijin3:01591] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . .][. . . . . . . .]
[raijin3:01591] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . .][. . . . . . . .]
[raijin3:01591] MCW rank 3 bound to socket 0[core 3]: [. . . B . . . .][. . . . . . . .]
[raijin3:01591] MCW rank 4 bound to socket 1[core 0]: [. . . . . . . .][B . . . . . . .]
[raijin3:01591] MCW rank 5 bound to socket 1[core 1]: [. . . . . . . .][. B . . . . . .]
[raijin3:01591] MCW rank 6 bound to socket 1[core 2]: [. . . . . . . .][. . B . . . . .]
[raijin3:01591] MCW rank 7 bound to socket 1[core 3]: [. . . . . . . .][. . . B . . . .]
Hello
...

Laying out OpenMP


The process control of MPI is entirely separate to that of OpenMP. OpenMP controls processor assignments using environment variables, namely OMP_NUM_THREADS and KMP_AFFINITY (for programs compiled with Intel compilers).

The value of OMP_NUM_THREADS tells the program how many threads to assign to each process, while KMP_AFFINITY describes how to lay them out. Adding "verbose" to the KMP_AFFINITY will print out a schematic of the OpenMP threads.
$ export OMP_NUM_THREADS=2
$ export KMP_AFFINITY=none,verbose
$ ./test
OMP: Info #179: KMP_AFFINITY: 2 packages x 8 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
Hello
...

In the above example there is no OpenMP affinity, so each of the two threads can run on any CPU core. Other valid values for affinity are:

  • compact: Assign threads as close together as possible (e.g. on adjacent cores)
  • scatter: Assign threads as far apart as possible (e.g. on separate sockets)

$ export OMP_NUM_THREADS=2
$ export KMP_AFFINITY=scatter,verbose
$ ./test
OMP: Info #179: KMP_AFFINITY: 2 packages x 8 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
Hello
...

Putting things together


The interfaces for controlling MPI and OpenMP are entirely separate, however OpenMP will respect MPI's process assignments. This means that if you give an MPI process a single core to run on then OpenMP will try and run all of its threads on that one core, with the threads conflicting for resources:
$ export OMP_NUM_THREADS=2
$ export KMP_AFFINITY=scatter,verbose
$ mpirun -n 1 ./test
OMP: Info #159: KMP_AFFINITY: 1 packages x 1 cores/pkg x 1 threads/core (1 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0}
Hello
...

To avoid this conflict you need assign more than one core to each MPI process, which you can do using the --cpus-per-proc N argument to mpirun. Usually you'll want to use a number of cores equal to the number of OpenMP threads:
$ export OMP_NUM_THREADS=2
$ export KMP_AFFINITY=scatter,verbose
$ mpirun --report-bindings --cpus-per-proc $OMP_NUM_THREADS -n 1 ./test
[raijin3:17569] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . .][. . . . . . . .]
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
Hello
...

The binding information gets complicated when you use multiple MPI processes, you can add --tag-output to see which MPI process information is coming from. In this case there are 2 MPI processes running on separate sockets, each having 2 OpenMP ranks on adjacent cores:
$ mpirun --report-bindings --cpus-per-proc $OMP_NUM_THREADS --bysocket --tag-output -n 2 ./a.out
[1,0]<stderr>:[raijin3:18805] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . .][. . . . . . . .]
[1,0]<stderr>:OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
[1,0]<stderr>:OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
[1,0]<stderr>:OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
[1,0]<stderr>:OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
[1,0]<stderr>:OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
[1,0]<stderr>:OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
[1,1]<stderr>:[raijin3:18805] MCW rank 1 bound to socket 1[core 0-1]: [. . . . . . . .][B B . . . . . .]
[1,1]<stderr>:OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
[1,1]<stderr>:OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
[1,1]<stderr>:OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 1 core 0
[1,1]<stderr>:OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1
[1,1]<stderr>:OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {8}
[1,1]<stderr>:OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {9}
[1,0]<stdout>:Hello World
...

Multiple Programs


If you want to run multiple programs in the same MPI session (e.g. for coupling multiple climate models) things get really interesting.

The basic syntax for a multiple program MPI run is
mpirun [global options] [program1 options] -n N1 program1 : [program2 options] -n N2 program2 [: ...]

As an example:
$ mpirun --report-bindings -n 1 echo "Hello" : -n 2 echo "World"
[raijin3:20270] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .]
[raijin3:20270] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . .][. . . . . . . .]
[raijin3:20270] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . .][. . . . . . . .]
World
Hello
World

Options like --bysocket and --cpus-per-proc are global options, so if one of the programs uses OpenMP then all programs must use the same cpus-per-proc, potentially wasting resources.

As an alternative you can create a 'rankfile', to manually assign cores to each process. Rankfiles look like:
rank 0=r1234 slot=0:0
rank 1=r1234 slot=1:0,1
rank 2=r1235 slot=0:4,5

Each line in the rankfile specifies a host (e.g. compute node r1234) and a CPU socket & one or more cores in the socket. In the example rank 0 will use the first core in the first socket of node r1234, rank 1 will use cores 0 and 1 of the second socket on that node and rank 2 will use cores 4 & 5 on r1235.

$ cat ranks
rank 0=raijin3 slot=0:0
rank 1=raijin3 slot=1:0,1
rank 2=raijin3 slot=0:4,5
 
$ mpirun --report-bindings --rf ranks -n 1 echo "hello" : -n 2 echo "world"
[raijin3:09881] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .] (slot list 0:0)
[raijin3:09881] MCW rank 1 bound to socket 1[core 0-1]: [. . . . . . . .][B B . . . . . .] (slot list 1:0,1)
[raijin3:09881] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . .][. . . . . . . .] (slot list 0:4,5)
world
hello
world

This gives a large amount of flexibility, at the cost of being somewhat tedious to organise. When running on the compute nodes it is best to use a script to set this up; the environment variable PBS_NODEFILE gives the name of a file which describes all the compute nodes available for your job.

Once you've set up the correct bindings you can use the mpirun option '-x VARIABLE' to tell each program how many OpenMP threads to use:
$ mpirun --report-bindings --rf ranks -n 1 echo "hello" : -n 2 -x OMP_NUM_THREADS=2 ./a.out
mpirun: WARNING: unable to determine OpenMPI version of echo, runtime version may be incompatible
[raijin3:12295] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . .][. . . . . . . .] (slot list 0:0)
[raijin3:12295] MCW rank 1 bound to socket 1[core 0-1]: [. . . . . . . .][B B . . . . . .] (slot list 1:0,1)
[raijin3:12295] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . .][. . . . . . . .] (slot list 0:4,5)
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {8}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {9}
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {5}
hello
...