On Clusters
The element state determination, as well as the solution of the global system, can be distributed over a process grid. With MPI enabled, the application can be executed on clusters. The architecture design is explained on this page.
There are a few caveats. The distributed solvers for linear systems rely on external libraries, typically an implementation of ScaLAPACK. There are not many choices: one can use the reference implementation, which may not offer great performance, or a vendor implementation such as AMD's or Intel's.
Prerequisites
Since manually compiling the dependencies is cumbersome, we only support Intel's ecosystem for the moment. That means `MKL` must be enabled via the option `-DSP_ENABLE_MKL=ON`. `MKL` can be installed beforehand.
For example, according to the official documentation, with APT, it can be installed via the following commands.
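The following is a sketch based on Intel's published APT instructions; the exact repository setup and package names may change between oneAPI releases.

```sh
# Add Intel's oneAPI APT repository (signing key and source list).
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
    | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
    | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
# Install the MKL development package; add intel-oneapi-mpi-devel if Intel MPI is also needed.
sudo apt install intel-oneapi-mkl-devel
```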
It is likely that Intel's MPI library is also required; otherwise, a working implementation needs to be pre-installed. Note that the MPI packages bundled by your distribution may not work.
Configuration
The minimum configuration requires two flags: `-DSP_ENABLE_MKL=ON` and `-DSP_ENABLE_MPI=ON`. CMake will try to locate the `MKL` installation and configure the project automatically.
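As a sketch, assuming the usual out-of-source layout, a minimal configuration might look like this:

```sh
# Configure and build with MKL and MPI enabled.
cmake -S . -B build -DSP_ENABLE_MKL=ON -DSP_ENABLE_MPI=ON
cmake --build build --parallel
```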
To ensure everything works as intended, it may be necessary to activate the `oneAPI` environment. This can be done via, for example, the environment file entry in the toolchain configuration in CLion, or `environmentSetupScript` in `cmake-kits.json` in VS Code.
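From a plain terminal, the same effect can be achieved by sourcing the script shipped with oneAPI (the default installation prefix is assumed here):

```sh
# Load the oneAPI (MKL, MPI, compiler) environment into the current shell.
source /opt/intel/oneapi/setvars.sh
```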
By default, Intel's MPI is used. If another implementation, for example MPICH or OpenMPI, is used, set `MPI_HOME` to override the path, for example, `-DMPI_HOME=~/Documents/OpenMPI`.
The project tries to detect the MPI vendor automatically from its path, but the detection may fail. You may need to set `MKL_MPI` to a proper value. The available values can be found in `/opt/intel/oneapi/mkl/latest/lib/cmake/mkl/MKLConfig.cmake`. For example, on my machine, that file lists several supported vendors.
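One quick way to see which values the config file accepts is to search it directly (the default oneAPI installation prefix is assumed):

```sh
# List the lines in MKLConfig.cmake that mention the MKL_MPI variable and its allowed values.
grep MKL_MPI /opt/intel/oneapi/mkl/latest/lib/cmake/mkl/MKLConfig.cmake
```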
!!! note
    If two different implementations of MPI are mixed, the compilation may succeed, but the application can still crash at runtime.
Typically, for problems that need to be run on clusters, the default 32-bit indexing is not sufficient. The limit of a signed 32-bit integer is slightly above 2 billion, meaning that, if the global matrix is stored in dense full format, the maximum matrix dimension is 46340. If each node has 2 DoFs (2D node with translational DoFs only), this corresponds to 23170 nodes. If each node has 6 DoFs (3D node with both translational and rotational DoFs), this corresponds to 7723 nodes. In FEM, however, full storage is rarely used, so these figures are only a lower bound. With banded storage, the bandwidth is typically only a small fraction of the global size; with a bandwidth of a few hundred DoFs, for example, 32-bit indexing does not overflow until several million DoFs, which is on the order of 800000 3D shell nodes.
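For reference, the rough arithmetic behind these figures, assuming a signed 32-bit index and an illustrative bandwidth of 450 DoFs, is:

$$
n_{\text{dense}} \le \left\lfloor \sqrt{2^{31}-1} \right\rfloor = 46340,
\qquad
n_{\text{banded}} \, b \le 2^{31}-1 \;\Rightarrow\; n_{\text{banded}} \approx \frac{2.1 \times 10^{9}}{450} \approx 4.8 \times 10^{6} \text{ DoFs} \approx 8 \times 10^{5} \text{ shell nodes}.
$$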
If the problem is sizeable, it is possible to enable 64-bit indexing via `-DSP_ENABLE_64BIT_INDEXING=ON`. This links the `ilp64` version of `MKL` and compiles all dependencies with the proper settings.
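Putting the pieces together, a configuration for a large run with a non-default MPI might look like the following sketch (the `MPI_HOME` path and the `MKL_MPI` value are illustrative and depend on the local installation):

```sh
# Full configuration: MKL, MPI, 64-bit indexing, and an explicit OpenMPI installation.
cmake -S . -B build \
    -DSP_ENABLE_MKL=ON \
    -DSP_ENABLE_MPI=ON \
    -DSP_ENABLE_64BIT_INDEXING=ON \
    -DMPI_HOME=~/Documents/OpenMPI \
    -DMKL_MPI=openmpi
```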