SKIRT 9
advanced radiative transfer in dusty systems
Parallelization

This concept note describes the hydrid parallelization implementation in SKIRT to some level of detail. Topics discussed below include:

For information on how to use and configure the parallelization capabilities of SKIRT, refer to:

For an introduction to parallel computing concepts in general, refer to:

For an introduction to the message passing interface (MPI) used to implement multi-processing, refer to:

Hybrid parallelization in SKIRT

SKIRT allows both multi-threading and multi-processing parallelization, in any combination. This is illustrated in the following diagram:

UserHybridSkirt.png

With multi-threading, the code executed by the different processors uses the same memory locations. The threads share the entire process state, with all variables and functions. Multiple threads may attempt to read from and write to the same memory location at the same time, which may lead to race conditions and unpredictable behavior. There are mechanisms to avoid these problems, but the fact remains that, with a large number of threads, performance goes down because all threads are competing for a common resource.

With multi-processing, the execution of parallel code is performed by multiple, independent processes, each with their own memory addresses and process state. This avoids the performance issues from which multi-threading suffers. On the other hand, this kind of parallelization requires the implementation of explicit calls to a Message Passing Interface (MPI) at any point where communication is needed between processes. If implemented efficiently with a minimal amount of communication, multi-processing can scale much better for large number of processors than multi-threading. Also, processes can be allocated to different compute nodes, distributing the workload across a potentially large system with a distributed memory architecture.

The combination of multi-threading and multi-processing, called hybrid parallelization, allows SKIRT to perform efficiently on a wide range of system architectures, from laptops to supercomputers.

Note
In the current implementation, SKIRT requires all memory data structures to be duplicated in each process. In other words, there is no domain composition or other mechanism to distribute the data structures of the simulation across multiple processes. See When to use MPI for more information on how to best configure SKIRT in a multi-processing context.

The SKIRT parallelization support classes

The Parallel class hierarchy

Parallel is an abstract base class for subclasses that implement various parallelization schemes using one or more execution threads and/or one or more processes. A Parallel subclass instance can be created only through the ParallelFactory class, which constructs the appropriate Parallel subclass instance depending on the requested task allocation mode and the available parallel resources (see below). The client accesses the returned Parallel subclass instance using the common interface defined in this abstract Parallel base class.

The Parallel::call() function offered by the base class interface executes a specified target function \(N\) times as if it were part of a for loop over a range of indices from zero to \(N-1\). Each index in the range represents a particular task. To reduce the overhead of handing out the tasks, the loop is actualy chopped into chunks of consecutive indices. Rather than a single index, the target function is handed the first index of the chunk and the number of indices (tasks) in the chunk, and it is expected to iterate over the specified index range. The chunk sizes are determined automatically to achieve optimal load balancing given the available parallel resources, while still maximally reducing the overhead of handing out the chunks.

The Parallel subclasses and the parallelization schemes they implement are listed in the table below.

Shorthand Parallel subclass Description
S SerialParallel Single thread in the current process; isolated from any other processes
MT MultiThreadParallel Multiple coordinated threads in the current process; isolated from any other processes
MP MultiProcessParallel Single thread in each of multiple, coordinated processes
MTP MultiHybridParallel Multiple threads in each of multiple processes, all coordinated as a group
0 NullParallel No operation; any requests for performing tasks are ignored

The level of overhead differs substantially between the various schemes, so implementing each scheme separately allows optimizing performance in all use cases. For example, with the SerialParallel scheme all tasks are simply serialized and overhead is minimal.

In the subclasses that actually do implement parallelism, the chunks of tasks are handed out dynamically as the work progresses. This approach maximizes load balancing even if some task chunks take longer to complete than others. Depending on the parallelization scheme, one or more extra threads are used to serve work to the other threads and/or processes. These extra threads are not counted towards the number of threads specified by the user because they do not consume significant resources.

The ParallelFactory class

An instance of the ParallelFactory class serves as a factory for instances of Parallel subclasses, called its children. An important property of a factory object is the maximum number of parallel execution threads per process to be handed to its children, which is set during construction to the value of the number of threads per process specified by the user on the command line (or to the default number of threads determined from the hardware). A factory object assumes ownership for all its children. If a child of the appropriate type and with the appropriate number of threads already exists, it will be handed out again. As a result, a particular Parallel instance may be reused several times, reducing the overhead of creating and destroying the threads.

ParallelFactory clients can request a Parallel instance for one of the three task allocation modes described in the table below.

Task mode Description
Distributed All threads in all processes perform the tasks in parallel
Duplicated Each process performs all tasks; the results should be identical on all processes
RootOnly All threads in the root process perform the tasks in parallel; the other processes ignore the tasks

Depending on the requested task mode and the current run-time configuration (number of processes and number of threads in each process), a ParallelFactory object hands out the appropriate Parallel subclass as listed in the table below:

Mode/Runtime 1P 1T 1P MT MP 1T MP MT
Distributed S MT MP# MTP#
Duplicated S MT S S*
RootOnly S MT S/0 MT/0

(#) In Distributed mode with multiple processes, all threads require different random number sequences. Therefore, the MultiProcessParallel and MultiHybridParallel classes swith the Random instance associated with the simulation to arbitrary mode before performing tasks, and back to predictable mode after performing the tasks.

(*) In Duplicated mode with multiple processes, all tasks are performed by a single thread (in each process) because parallel threads executing tasks in an unpredictable order would see different random number sequences, possibly causing differences in the calculated results.

The ProcessManager class

The ProcessManager class is an interface to the Message Passing Interface (MPI) library (see The Message Passing Interface). Its implementation is the only place in the SKIRT code where explicit calls to this library are allowed.

The ProcessManager class offers functions for the following purposes:

Some of these functions are obviously used extensively by the implementations of the Parallel and ParallelFactory (sub)classes. The data synchronization functions are also called from other places in the SKIRT code.

Parallel coding pattern

The code block below illustrates the typical coding pattern for a parallel calculation.

size_t numT = ...;
Array Tv(numT); // result array is initialized to zeros
find<ParallelFactory>()->parallelDistributed()->call(numT, [this,&Tv] (size_t firstIndex, size_t numIndices)
{
for (size_t p=firstIndex; p!=firstIndex+numIndices; ++p)
{
...
Tv[p] = ...;
}
});

If the calculation may take longer than a few seconds to complete, the coding pattern should be extended with facilities to perform progress logging, as shown in the code block below.

size_t numT = ...;
auto log = find<Log>();
log->info("Calculating " + std::to_string(numT) + " temperatures...");
log->infoSetElapsed(numT);
Array Tv(numT); // result array is initialized to zeros
find<ParallelFactory>()->parallelDistributed()->call(numT, [this,log,&Tv] (size_t firstIndex, size_t numIndices)
{
const size_t logProgressChunkSize = ...;
while (numIndices)
{
size_t currentChunkSize = min(logProgressChunkSize, numIndices);
for (size_t p=firstIndex; p!=firstIndex+currentChunkSize; ++p)
{
...
Tv[p] = ...;
}
log->infoIfElapsed("Calculated temperatures: ", currentChunkSize);
firstIndex += currentChunkSize;
numIndices -= currentChunkSize;
}
});

Use of parallelization in SKIRT

The photon life cycle

For most SKIRT simulations, nearly all of the execution time is spent performing the photon life cycle. This is thus the most important code loop to be parallelized. The topmost level of the photon life cycle is implemented in the MonteCarloSimulation::performLifeCycle() function. This function is invoked as the target of a Parallel::call() from a separate function for each simulation phase. For example, the MonteCarloSimulation::runPrimaryEmission() function implements the primary emission phase.

As can be expected, the photon life cycle loop indirectly calls on a large number of classes in many areas of the code base, including sources, SEDs, geometries, material mixes, spatial grids, medium state, and instruments. Many of the data structures in the simulation are fully initialized during setup and are immutable during the photon life cycle phases. These data structures can be accessed from parallel execution threads without further concern. On the other hand, some data structures are updated during the photon life cycle; notably the data structures tracking the radiation field in every spatial cell and those recording observed fluxes in the instruments. Special care must be taken when updating these mutable data structures from parallel execution threads.

Fortunately, the layout and size of most mutable data structures can be determined and initialized during setup. As a result, the only updates happening in parallel consist of accumulating floating point values in shared locations. These updates are accomplished by calling the LockFree:add() function, which uses efficient low-level atomic operations (rather than high-level software locking) to avoid data races. There are, however, two situations where things are more complicated.

The first complexity arises with the calculation of emission spectra for imported primary sources and for secondary sources. Both types of sources consist of many separate entities (smoothed particles or spatial cells) for which distinct emission spectra must be determined. Obtaining these spectra may be time-consuming because of the interpolation in an SED template family or a complex stochastic heating calculation. Usually, multiple photon packets are emitted from the same entity, and recalculating the spectrum for each photon packet would be very inefficient. The solution to this problem has two components. Firstly, each source entity (particle or cell) is assigned a number of photon packets ahead of time, and all packets for a particular entity are emitted consecutively. Secondly, the calculated emission spectra are cached between photon packet launches in thread-local storage, so that each execution thread manages its own cache. For more information, see the SourceSystem, ImportedSource and SecondarySourceSystem classes.

The second complexity arises when recording statistics for observed fluxes in the instruments (an optional feature that can be enabled as part of the user configuration). This process requires accumulating the contributions of each indvidual photon packet to a given flux bin (for example, caused by the peel-off from multiple scattering events). The intermediate result for each photon packet, and per flux bin, must thus be cached during the packet's life cycle. In this case, we need a copy of the cache for each execution thread and for each instrument. To achieve this, the FluxRecorder class uses the ThreadLocalMember class template, which provides a thread-local instance data member using some fairly sophisticed C++ machinery.

Setup

For a typical "production" simulation, the setup phase usually represents only a small fraction of the total execution time. However, designing a new simulation configuration often requires many iterations, in which case reducing the number of photon packets being launched can help speed up the process. Similarly, when testing new features during a development cycle, a limited number of photon packets is often sufficient. For those use cases, it is still relevant to speed up time-consuming setup calculations through parallelization.

Many of the tasks performed during the setup phase are not worth considering for parallelization because they complete in a short time. Several other tasks cannot be parallelized easily because they involve creating and adjusting complex data structures. Some time-consuming tasks, however, can be parallelized with limited effort. For example, setup tasks that have been parallelized (using the Distributed task allocation mode discussed in the previous section) include:

Probes

Similarly, the time consumed by the various Probe subclasses is usually not overly significant, but in some cases it is still worth the limited effort to implement parallelization. For example, probe tasks that have been parallelized (using the RootOnly task allocation mode discussed in the previous section) include: