Running MPQC

This chapter explains how to run MPQC in a variety of environments.

The first sections give general information on running MPQC. The final sections give specific information on running MPQC in different environments, as well as optimization hints.

Command Line Options

The MPQC executable can be given the following command-line options:

option accepts value? description
-i mandatory The name of the input file. MPQC will attempt to parse the given input file using the JSON, XML, and INFO formats (in that order). If -i is not given, and no options with omitted optional values are used (e.g., -D), the last command-line argument that does not specify an option will be assumed to give the input file name.
-o mandatory The name of the file to which all output (including error) from rank 0 will be directed. By default MPQC sends (normal) output to the standard output and the error output to the standard error.
-O mandatory Same as -o, except keeps the output from all ranks. If # of MPI ranks is greater than 1 then the per-rank output from ranks other than 0 will be written to files with ".<MPI rank>" appended. All empty per-rank output files are removed at the end of the execution.
-p mandatory The prefix for all relative file paths in the input file
-W mandatory The working directory in which to compute
-D optional Create a debugger at start (unless the "debugger" keyword is already given in the input KeyVal), with the optional argument as its JSON KeyVal input
-v no print the version number and exit
-w no print the warranty and exit
-L no print the license and exit
-k no print all registered (KeyVal-constructible) DescribedClass classes
-h no print the usage info and exit
-d no start the program and attach a debugger, if it is specified via the "debugger" keyword in the input, or via the -D option
-t no throw if a deprecated keyword is read
-meminfo no enable MADWorld's print_meminfo memory profiler

MPQC Usage Examples

  1. prints the usage info:
    mpqc -h
  2. run MPQC using input.json as the input file:
    mpqc input.json
  3. same as 2, but launches a debugger upon failure using the default method (if the DISPLAY environment variable is defined, launches gdb or lldb via xterm; otherwise spins waiting for the user to attach a debugger to each process):
    mpqc -D '{}' input.json
  4. same as 2, but launches gdb via xterm upon failure:
    mpqc -D '{"cmd": "gdb_xterm"}' input.json
  5. same as 2, but launches lldb via xterm upon failure:
    mpqc -D '{"cmd": "lldb_xterm"}' input.json
  6. same as 2, but broken since -D "consumes" input.json as its argument:
    mpqc -D input.json
  7. fixed version of 6, input.json is passed via -i:
    mpqc -D -i input.json
  8. same as 5, but also attaches the debugger at startup (via -d) and redirects all output to output.log:
    mpqc -d -D '{"cmd": "lldb_xterm"}' -o output.log input.json
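The options that take values can be combined freely with those shown above. For instance, the following sketch (all directory and file names here are purely illustrative) prefixes relative file paths in the input file with ./data, computes in the working directory /tmp/mpqc.work, and sends the output to output.log:

# illustrative paths: -p prefixes relative paths inside the input file, -W sets the working directory
mpqc -p ./data -W /tmp/mpqc.work -o output.log -i input.json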

MPQC Environment Variables

  • MPQC_WORK_DIR: the directory for POSIX I/O of large text/binary files; it needs to be valid in every MPI process; the default on each process is the current working directory of that MPI process
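For example, to direct large scratch files to a parallel filesystem, export the variable before launching MPQC (the path shown is purely illustrative):

# illustrative path; it must be valid on every MPI process
export MPQC_WORK_DIR=/scratch/$USER/mpqc
mpqc -i input.json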

Running on a Distributed Memory Multiprocessor with MPI

MPQC requires a high-performance MPI implementation that supports MPI_THREAD_MULTIPLE operation. Modern MPI implementations expose a large number of parameters that can be controlled by the user via environment variables or other means. Correct execution of MPQC with most MPI implementations usually does not require changing any parameters; the few exceptions are listed below.

MVAPICH2

The environment variables MV2_ENABLE_AFFINITY and MV2_USE_LAZY_MEM_UNREGISTER must both be set to zero, e.g.:

export MV2_ENABLE_AFFINITY=0
export MV2_USE_LAZY_MEM_UNREGISTER=0

OpenMPI

When running OpenMPI with 1 MPI rank per node, the default is to bind all threads to 1 core. This is clearly suboptimal. The simplest solution is to pass --bind-to none to mpirun; however, that may not be optimal either. The user should refer to the OpenMPI mpirun documentation to understand how to pin threads appropriately.
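As an illustrative sketch (the mapping policy is an assumption, not a recommendation for every platform), the following OpenMPI invocation launches 1 rank per node and leaves threads unbound:

# illustrative: 1 rank per allocated node, no core binding; see mpirun(1) for finer-grained mapping/binding
mpirun --map-by ppr:1:node --bind-to none mpqc -i input.json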

MPQC Optimization Hints

MPQC was designed to execute correctly and efficiently on a typical platform without too much user intervention. To achieve optimal performance, however, manual tuning is necessary.

Number of MADWorld Threads

The total number of MADWorld threads by default is set to the number of available cores (this may be affected by whether hyperthreading is enabled). Of these, one thread will always be used for messaging (note that this is distinct from any threads used internally by the MPI implementation, which typically employs one or more threads for its own messaging operations) and the rest will be used for computation. The environment variable MAD_NUM_THREADS can be used to control the total number of threads used by MADWorld. It is recommended to set it to the number of hardware threads that each MPI rank can access without contention (e.g. the number of cores in each physical node).
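For example, if each MPI rank owns 16 cores (the core count here is illustrative), a reasonable starting point is:

# 1 rank per 16 cores: MADWorld will use one of these threads for messaging and the rest for computation
export MAD_NUM_THREADS=16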

N.B. The actual number of threads created by MADWorld may be greater than the value given by MAD_NUM_THREADS; however, the total number of active threads will not exceed its value. This also assumes that user tasks are single-threaded (i.e. do not spawn their own threads).

Communication Buffers

It is recommended to increase the number and size of communication buffers used by the active messaging system of MADWorld runtime. The following environment variables can be used to control the active messaging:

Environment Variable Description
MAD_SEND_BUFFERS The number of buffers that track pending outgoing messages. The default value depends on the platform; MAD_SEND_BUFFERS=128 by default for Linux and MacOS.
MAD_RECV_BUFFERS The number of preallocated buffers for receiving incoming messages eagerly. The default value is MAD_RECV_BUFFERS=128.
MAD_BUFFER_SIZE The size of each preallocated buffer for receiving incoming messages eagerly. The default is MAD_BUFFER_SIZE=1.5MB.

The most important factor for performance is the receive buffer size: the tiles of TA Arrays must fit into these buffers to ensure best performance. For data that exceeds the eager buffer size, a rendezvous protocol will be used, which increases messaging latency and thus can significantly impact message processing and overall performance.
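As an illustrative sketch, the following raises the buffer counts and the eager buffer size above their defaults; the particular values are assumptions to be adapted to your tile sizes, and the buffer size is assumed here to be given in bytes:

# illustrative values; buffer size assumed to be in bytes (~4 MB); increase it if your largest tiles exceed it
export MAD_SEND_BUFFERS=256
export MAD_RECV_BUFFERS=256
export MAD_BUFFER_SIZE=4000000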

MPI Ranks per Node

Although in an ideal scenario one MPI rank per node is optimal for intra-node communication efficiency, in practice better performance may be obtained by using more than 1 MPI rank per node. Two factors play into this: (1) communication thread performance and (2) NUMA regions. The former determines whether a single communication thread can process incoming messages at a rate sufficient to keep compute threads busy. On massively multicore machines such as Intel Xeon Phi, one communication thread is not sufficient, so it is better to use more than 1 MPI rank per node (this is only applicable if using more than 1 node). NUMA regions also factor into this. Since MADWorld and TiledArray do not assume any structure about the memory accessed by a single MPI rank, to improve memory locality in the presence of NUMA regions it is recommended to create 1 MPI rank per NUMA region (e.g. 1 per socket in a multisocket CPU node, or 1 per NUMA region on Xeon Phi) and bind that rank's threads to its region. See your MPI implementation's documentation for how to bind threads to NUMA regions.
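For example, with OpenMPI an illustrative way to launch 1 rank per socket with that rank's threads confined to its socket (the flags are OpenMPI-specific; other implementations use different mechanisms) is:

# illustrative OpenMPI mapping: 1 rank per socket, threads bound to that socket's NUMA region
mpirun --map-by ppr:1:socket --bind-to socket mpqc -i input.json

With such a layout, MAD_NUM_THREADS should be set to the number of cores in each socket.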

BLAS/LAPACK Library

It is recommended to use sequential BLAS/LAPACK library interfaces. N.B. It should be possible to correctly use multithreaded versions of MKL that use Intel Threading Building Blocks (TBB), assuming that MADWorld also uses TBB.
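For example, if MPQC is linked against a threaded (non-TBB) MKL build, one illustrative way to force sequential execution at run time is MKL's own environment variable (this applies to MKL only; other BLAS/LAPACK libraries use different controls):

# illustrative: restrict a threaded MKL build to 1 thread per call, leaving the cores to MADWorld
export MKL_NUM_THREADS=1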