This section describes various methods for improving SPARTA performance for different classes of problems running on different kinds of machines.
Currently the only option is to use the KOKKOS accelerator packages provided with SPARTA that contains code optimized for certain kinds of hardware, including multi-core CPUs, GPUs, and Intel Xeon Phi coprocessors.
The Benchmark page of the SPARTA web site gives performance results for the various accelerator packages discussed in Section 5.2, for several of the standard SPARTA benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms.
Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are.
The best way to do this is run the your system (actual number of particles) for a modest number of timesteps (say 100 steps) on several different processor counts, including a single processor if possible. Do this for an equilibrium version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for 1000s of timesteps to get accurate timings; you can simply extrapolate from short runs.
For the set of runs, look at the timing data printed to the screen and log file at the end of each SPARTA run. This section of the manual has an overview.
Running on one (or a few processors) should give a good estimate of the serial performance and what portions of the timestep are taking the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability. I.e. if the simulation runs 16x faster on 16 processors, its 100% parallel efficient; if it runs 8x faster on 16 processors, it's 50% efficient.
The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the FFTs will have little impact if they only consume 10% of the run time. If the collide time is dominating, you may want to look at the KOKKOS package, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling.
Another important detail in the timing info are the histograms of particles counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait when communication occurs for other processors to catch up. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load-imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile SPARTA, to obtain synchronized timings.
Accelerated versions of various collide_style, fixes, computes, and other commands have been added to SPARTA via the KOKKOS package, which may run faster than the standard non-accelerated versions.
All of these commands are in the KOKKOS package provided with SPARTA. An overview of packages is give in Section packages.
SPARTA currently has acceleration support for three kinds of hardware, via the KOKKOS package: Many-core CPUs, NVIDIA GPUs, and Intel Xeon Phi.
Whether you will see speedup for your hardware may depend on the size problem you are running and what commands (accelerated and non-accelerated) are invoked by your input script. While these doc pages include performance guidelines, there is no substitute for trying out the KOKKOS package.
Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that uses the style is identical, their functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects, and differences in random numbers.
For example, the KOKKOS package provides an accelerated variant of the Temperature Compute compute temp, namely compute temp/kk
To see what accelerate styles are currently available, see Section 3.5 of the manual. The doc pages for individual commands (e.g. compute temp) also list any accelerated variants available for that style.
To use an accelerator package in SPARTA, and one or more of the styles it provides, follow these general steps:
|install the accelerator package
|make yes-fft, make yes-kokkos, etc
|add compile/link flags to Makefile.machine in src/MAKE
or, using CMake from a build directory:
|install the accelerator package
|cmake -DPKG_FFT=ON -DPKG_KOKKOS=ON, etc
|add compile/link flags
|cmake -C /path/to/sparta/cmake/presets/kokkos_cuda.cmake -DKokkos_ARCH_PASCAL60=ON
Then do the following:
|prepare and test a regular SPARTA simulation
|lmp_kokkos_cuda -in in.script; mpirun -np 32 lmp_kokkos_cuda -in in.script
|enable specific accelerator support via '-k on' command-line switch,
|-k on g 1
|set any needed options for the package via "-pk" command-line switch or package command,
|only if defaults need to be changed, -pk kokkos reduction atomic
|use accelerated styles in your input via "-sf" command-line switch or suffix command
|lmp_kokkos_cuda -in in.script -sf kk
Note that the first 3 steps can be done as a single command with suitable make command invocations. This is discussed in Section 4 of the manual, and its use is illustrated in the individual accelerator sections. Typically these steps only need to be done once, to create an executable that uses one or more accelerator packages.
The last 4 steps can all be done from the command-line when SPARTA is launched, without changing your input script, as illustrated in the individual accelerator sections. Or you can add package and suffix commands to your input script.
The Benchmark page of the SPARTA web site gives performance results for the various accelerator packages for several of the standard SPARTA benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms.
Here is a brief summary of what the KOKKOS package provides.
The KOKKOS accelerator package doc page explains: