Tweak Performance
suanPan prioritizes performance and designs the analysis logic in a parallel context accordingly.
Although most common types of analysis can be parallelized, certain parts with strong data dependencies cannot be. The achievable speed-up is therefore largely determined by the remaining serial code: according to Amdahl's law, there is an upper bound on the theoretical speed-up.
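For reference, Amdahl's law can be written as follows, where p denotes the parallelizable fraction of the work and N the number of threads (notation introduced here for illustration only):

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
```

For instance, if 5% of the work is serial (p = 0.95), the speed-up can never exceed 20, regardless of how many threads are used.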
Benchmark
Benchmarking is hard. Here we try to present a baseline of performance on recent PCs.
There is a `benchmark` command designed to benchmark the platform by solving a large matrix repeatedly. The matrix occupies 200 MB of memory, which is large enough to expose a potential memory bandwidth bottleneck, so the command effectively reports the performance of the `dgesv` subroutine in LAPACK.
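As a rough sketch only (this is not suanPan's actual source; the matrix size, iteration count, and random filling are assumptions chosen so that the matrix occupies roughly 200 MB, and the headers assume a CBLAS/LAPACKE installation such as OpenBLAS), such a benchmark conceptually boils down to repeatedly calling `dgesv` on a large dense system:

```cpp
// sketch of a dgesv-based benchmark; illustration only, not suanPan's implementation
#include <lapacke.h>

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const lapack_int n = 5000; // 5000 x 5000 doubles ~ 200 MB

    std::vector<double> a(static_cast<size_t>(n) * n), b(n);
    std::vector<lapack_int> ipiv(n);

    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(-1., 1.);

    for(int repeat = 0; repeat < 10; ++repeat) {
        // regenerate the data each run since dgesv overwrites the matrix in place
        for(auto& v : a) v = dist(gen);
        for(auto& v : b) v = dist(gen);

        const auto start = std::chrono::steady_clock::now();
        const auto info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, a.data(), n, ipiv.data(), b.data(), n);
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

        std::printf("run %d: info = %d, %.3f s\n", repeat, static_cast<int>(info), elapsed.count());
    }

    return 0;
}
```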
As FEM essentially boils down to solving large linear systems, such a benchmark is practical and close to actual performance. Profiling it on an average laptop shows that this particular platform achieves a CPI (cycles per instruction) of 0.686, which serves as the baseline and upper bound of practical performance.
Large-sized Elastic Analysis
This example is a linear elastic analysis. There are 20124 nodes and 39990 shell elements; since each node has six DoFs, the model contains 120744 DoFs in total.
Because an elastic material model is used, element state updating is trivial and requires only matrix multiplication; the main computation is assembling the global stiffness matrix and solving it. Profiling this example on the same platform shows a CPI close to the baseline.
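For context, global assembly is conceptually nothing more than scattering each element stiffness matrix into the global one. The following generic sketch uses hypothetical data structures and is not suanPan's actual code:

```cpp
// generic illustration of global stiffness assembly; not suanPan's actual code
#include <vector>

struct Element {
    std::vector<int> dof;    // global DoF indices of this element
    std::vector<double> k;   // dense local stiffness, row-major, dof.size() x dof.size()
};

// scatter every local stiffness into a dense global matrix of size n_dof x n_dof
void assemble(std::vector<double>& global_k, const int n_dof, const std::vector<Element>& elements) {
    global_k.assign(static_cast<size_t>(n_dof) * n_dof, 0.);
    for(const auto& e : elements) {
        const auto n = static_cast<int>(e.dof.size());
        for(int i = 0; i < n; ++i)
            for(int j = 0; j < n; ++j)
                global_k[static_cast<size_t>(e.dof[i]) * n_dof + e.dof[j]] += e.k[static_cast<size_t>(i) * n + j];
    }
}
```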
Medium-sized Plastic Analysis
This example contains 2990 nodes and uses a plastic material model, so element state updating now involves local plasticity integration. A CPI of 0.979 is achieved in this case.
Assuming all (or at least the majority of) instructions are useful ones, one can conclude that decent performance is practically achievable even when the problem size is not particularly large.
Analysis Configurations
Here are some tips that may improve performance.

- If the analysis is known to be linear elastic, use `set linear_system true` to skip the convergence test and iteration. Note that the analysis should be both materially and geometrically linear.
- If the global system is known to be symmetric, use `set symm_mat true` to switch to symmetric storage. Analyses involving 1D materials are mostly (not always) symmetric; analyses involving 2D and 3D materials are mostly (not always) not symmetric.
- Consider a proper stepping strategy. A fixed step size may be unnecessarily expensive; a proper adaptive stepping strategy can significantly improve performance.
- Prefer a dense solver over a sparse solver if the system is small. A dense solver is generally faster than a sparse solver for small systems.
- Prefer a mixed-precision algorithm (`set precision mixed`) over a full-precision algorithm if the system is large. A mixed-precision algorithm is generally faster than a full-precision algorithm for large systems; see the next section for details.
- The performance of various sparse solvers can vary significantly. It is recommended to try different solvers to find the best one.
Mixed-Precision Algorithm
On some platforms, the mixed-precision algorithm can perform significantly better than the full-precision one. It converts the full-precision matrix to a lower precision, solves the system in that lower precision, and then recovers full accuracy by iterative refinement. Typically only two to three refinement iterations are required, as each iteration reduces the relative error by a factor of roughly the machine epsilon of the lower precision.
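A minimal sketch of this idea, assuming a single-precision factorisation combined with double-precision refinement (an illustration of the general technique, not suanPan's implementation; headers assume a CBLAS/LAPACKE installation such as OpenBLAS, and the tolerance and iteration cap are arbitrary), could look like the following. LAPACK also provides a ready-made routine, `dsgesv`, that implements the same scheme.

```cpp
// mixed-precision solve of A x = b by iterative refinement; illustration only
#include <cblas.h>
#include <lapacke.h>

#include <algorithm>
#include <cmath>
#include <vector>

// solve the n x n system a * x = b, with a and b stored in double precision, column-major
bool mixed_precision_solve(const std::vector<double>& a, const std::vector<double>& b, std::vector<double>& x, const int n) {
    // 1. demote the matrix to single precision and factorise it once
    std::vector<float> a_low(a.begin(), a.end());
    std::vector<lapack_int> ipiv(n);
    if(LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, a_low.data(), n, ipiv.data()) != 0) return false;

    x.assign(n, 0.);
    std::vector<double> r = b; // residual r = b - A x, initially b since x = 0
    std::vector<float> d(n);

    for(int iter = 0; iter < 10; ++iter) {
        // 2. solve the correction equation A d = r using the low-precision factorisation
        std::copy(r.begin(), r.end(), d.begin());
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, a_low.data(), n, ipiv.data(), d.data(), n);

        // 3. update the solution and recompute the residual in full precision
        for(int i = 0; i < n; ++i) x[i] += d[i];
        r = b;
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1., a.data(), n, x.data(), 1, 1., r.data(), 1);

        double norm = 0.;
        for(const auto v : r) norm = std::max(norm, std::fabs(v));
        if(norm < 1e-12) return true; // converged, typically after two or three iterations
    }

    return false;
}
```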
The built-in tests include benchmarks of the mixed-precision algorithm; running them compares the two algorithms on a given platform.
The results show that the mixed-precision algorithm is around three times faster than the full-precision algorithm. Note that these results are obtained with MKL on a platform with a 13th-generation Intel CPU; on platforms with slower memory bandwidth, the performance gain may not be as significant.
One could always benchmark the platform to find the best algorithm.
Tweaks
Performance can be tweaked in the following ways, which may or may not bring an improvement.
OpenMP Threads
OpenMP is used by MKL and OpenBLAS, alongside SIMD instructions, to parallelize matrix operations. It is possible to set `OMP_NUM_THREADS` manually to control the number of threads used; pay attention to over-subscription (running more active threads than available physical cores).
`OMP_DYNAMIC` may affect cache locality and thus performance. For computation-intensive tasks, it is recommended to set it to `false`.
Affinity
CPU affinity can also affect performance. Tweaking affinity, for example with `KMP_AFFINITY`, can bring further improvement.
Memory Allocation
Memory fragmentation may degrade analysis performance, especially in finite element analysis, where a large number of small matrices and vectors are allocated. It is recommended to use a performant memory allocator, for example a general-purpose allocator such as mimalloc.
On Linux, it is fairly easy to replace the default memory allocator, for example by preloading mimalloc through the dynamic linker's `LD_PRELOAD` mechanism.