Tweak Performance

suanPan prioritizes performance, and its analysis logic is designed with parallel execution in mind.

Although most of the common analysis types can be parallelized, some parts have strong data dependencies and must run serially. According to Amdahl's law, this serial portion places an upper bound on the theoretical speedup.
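As a reminder of what that bound looks like, Amdahl's law can be written as below. The symbols are introduced here for illustration only: $p$ is the fraction of the run time that can be parallelized and $n$ is the number of threads. The value $p = 0.9$ is a made-up figure, not a measurement of suanPan.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.

% Illustrative numbers only: with p = 0.9 and n = 8,
S(8) = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7,
\qquad
S(\infty) = \frac{1}{0.1} = 10.
```

In this hypothetical case, adding threads beyond a certain point yields little benefit, since the serial 10% dominates the run time.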

For example, in a static analysis of a simple model with a sufficiently large number of elements, no local iteration is required to update element status, so the major tasks are to assemble the global stiffness matrix and solve the resulting system. In such a case, the performance is likely governed by the compute capacity of the CPU, and a large GFLOPS value (close to the practical limit) can often be achieved.

However, if one chooses to perform a dynamic analysis of the same model with a fairly sophisticated time integration algorithm, such as GSSSS, the effective stiffness is the sum of scaled versions of several global matrices. The analysis may then be bound by memory operations, which eventually leads to a lower GFLOPS value.

In the nonlinear context, it is even more complicated. Several additional factors, such as the complexity of the material models used, the use of constraints, and the element type, can all affect the performance.

Nevertheless, experience has shown that the performance is generally good enough for most cases. Users are encouraged to profile the analysis types they use.

Analysis Configurations

Here are some tips that may improve the performance.

  1. If the analysis is known to be linear elastic, use set linear_system true to skip the convergence test and iteration. Note that the analysis must be both materially and geometrically linear.

  2. If the global system is known to be symmetric, use set symm_mat true to enable symmetric storage. Analyses involving 1D materials are mostly (not always) symmetric. Analyses involving 2D and 3D materials are mostly (not always) asymmetric.

  3. Consider a proper stepping strategy. A fixed stepping size may be unnecessarily expensive. A proper adaptive stepping strategy can significantly improve the performance.

  4. Prefer a dense solver over a sparse solver if the system is small. A dense solver is generally faster than a sparse solver for small systems.

  5. Prefer a mixed-precision algorithm (set precision mixed) over a full-precision algorithm if the system is large. A mixed-precision algorithm is generally faster than a full-precision algorithm for large systems.

  6. The performance of different sparse solvers can vary significantly. It is recommended to try several solvers to find the best one for the problem at hand.
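The three settings named in the tips above could be collected in an input file as in the fragment below. This is only a sketch: it assumes that lines starting with # are treated as comments and that the settings sit alongside the rest of the analysis definition. Whether all three apply at once depends on the model, so enable only those whose conditions actually hold.

```text
# assumption: the model is materially and geometrically linear
set linear_system true
# assumption: the global system is symmetric
set symm_mat true
# assumption: the system is large enough to benefit from mixed precision
set precision mixed
```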

Tweaks

It is possible to tweak the performance in the following ways, which may or may not improve the performance.

OpenMP Threads

OpenMP is used by MKL and OpenBLAS to parallelize matrix operations, alongside SIMD instructions. The number of threads can be controlled manually by setting OMP_NUM_THREADS. Be careful to avoid over-subscription, that is, running more threads than there are cores available.

OMP_DYNAMIC may affect cache locality and thus the performance. For compute-intensive tasks, it is recommended to set it to false.
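A minimal shell sketch of the two settings above is given here. The thread count of 4 is a placeholder, not a recommendation; choose a value that does not exceed the cores available to the run.

```shell
# Assumption: 4 cores are free for this run; adjust to the machine.
export OMP_NUM_THREADS=4
# Keep the thread count fixed instead of letting the runtime adjust it.
export OMP_DYNAMIC=false
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS} OMP_DYNAMIC=${OMP_DYNAMIC}"
# Then launch the analysis as usual, e.g. suanpan -f input.sp
```

Exporting the variables in the launching shell keeps the settings local to that session.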

Affinity

CPU affinity can also affect the performance. Tweaking affinity, for example with KMP_AFFINITY, may improve performance.
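As an illustration, one commonly used KMP_AFFINITY value is shown below. It is only a starting point: the variable is read by the Intel OpenMP runtime, so it has an effect only if the binary is linked against that runtime, and the best setting depends on the machine topology.

```shell
# Assumption: the Intel OpenMP runtime is in use; other runtimes ignore this variable.
# Pin threads at the finest granularity and pack them onto neighbouring cores.
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "KMP_AFFINITY=${KMP_AFFINITY}"
```

Comparing run times with and without the setting is the only reliable way to tell whether it helps for a given model.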

Memory Allocation

Memory fragmentation may degrade analysis performance, especially in finite element analysis, which allocates a large number of small matrices and vectors. It is recommended to use a performant general-purpose memory allocator, for example mimalloc.

On Linux, it is fairly easy to replace the default memory allocator. For example,

LD_PRELOAD=/path/to/libmimalloc.so suanpan -f input.sp
