




























































































Modern Processors
This Best Practice Guide (BPG) extends the previously developed series of BPGs [1] (these older guides remain relevant as they provide valuable background information) by providing an update on new technologies and systems, in further support of the European High Performance Computing (HPC) user community in achieving high performance with their large-scale applications. It covers existing systems and aims to support scientists in porting, building and running their applications on these systems. While some benchmarking is part of this guide, the results provided mainly illustrate the characteristics of the different systems; they should not be used to compare the systems presented, nor for system procurement considerations. Procurement [2] and benchmarking [3] are well covered by other PRACE work packages and are outside this BPG's scope.
This BPG document has grown into a hybrid of a field guide and a textbook. The system and processor coverage provides the technical information needed by users who want a deeper knowledge of the system in order to fully utilise the hardware, while the field guide approach provides hints and starting points for porting and building scientific software. For this, a range of compilers, libraries, debuggers, performance analysis tools, etc. are covered. While recommendations for compilers, libraries and flags are given, we acknowledge that there is no magic bullet, as all codes are different. Unfortunately, there is often no way around a trial and error approach.
Some in-depth documentation of the covered processors is provided. This includes some background on the inner workings of the processors considered: the number of threads each core can handle, how these threads are implemented, and how these threads (instruction streams) are scheduled onto the different execution units within the core. In addition, this guide describes how the vector units of different lengths (256 bit, 512 bit, or, in the case of SVE, variable and generally unknown until execution time) are implemented. As most HPC work up to now has been done in 64 bit floating point, the emphasis is on this data type, especially for vectors. In addition to the processor execution units, memory in its many levels of hierarchy is important. The different implementations of Non-Uniform Memory Access (NUMA) are also covered in this BPG.
The guide gives a description of the hardware for a selection of relevant processors currently deployed in some PRACE HPC systems. It includes ARM64 (Huawei/HiSilicon and Marvell)^1 and x86-64 (AMD and Intel). It provides information on the programming models and development environment, as well as information about porting programs. Furthermore, it provides sections on strategies for analysing and improving the performance of applications. While this guide does not provide an update on all recent processors, some of the previous BPG releases [1] do cover other processor architectures not discussed in this guide (e.g. the Power architecture) and should be considered as a starting point for such work.
This guide also aims to increase user awareness of the energy and power consumption of individual applications by providing some analysis of the usefulness of maximum CPU frequency scaling depending on the type of application considered (e.g. CPU-bound, memory-bound, etc.).
As mentioned earlier, this guide covers processors and technologies deployed in European systems (with the small exception of ARM SVE, soon to be deployed in the EU). While European ARM and RISC-V processors are just over the horizon, systems using them are not deployed for production at the time of this writing (Q3-2020). As this European technology is still a couple of years from deployment it is not covered in the current guides; however, this new technology will require substantial documentation. Different types of accelerators are covered by other BPGs [1] and a corresponding update for these is currently being developed.
Emphasis has been given to providing relevant tips and hints, via examples, for scientists not deeply involved in the art of HPC programming. This document aims to provide a set of best practices that will make adaptation to these modern processors easier. It is not intended to replace an in-depth textbook, nor the documentation for the different tools described. The hope is that it will be the first document to reach for when starting to build scientific software.
The programming languages used in the examples are either C or Fortran. C is convenient as it maps closely to machine instructions for the simple examples and, together with Fortran, makes up the major languages used in HPC. While knowledge of assembly programming is not as widespread as it once was, the examples are kept at a level of simplicity that should make them easily accessible.
^1 While the vector enabled Fujitsu A64FX processor with SVE is not covered, a short section about the Scalable Vector Extension (SVE) is included.
This guide merges several separate guides from the past; the previous approach, where each processor had its own guide, has been discontinued. This guide covers all relevant processors (see above), but each processor still has its own chapter. The merger is not yet fully completed.
Furthermore, this guide provides information on the following recently deployed European flagship supercomputing systems:
Figure 1. Kunpeng 920 block diagram [54]
Huawei reports that the core supports almost all of the ARMv8.4-A ISA features, with a few exceptions, and includes the dot product and FP16 FML extensions; see [67].
The ThunderX2 is an evolution in the ThunderX family of processors from Cavium, now part of Marvell. The ThunderX2 provides full 128-bit NEON vector support, Simultaneous Multi Threading (SMT) support [75] and up to eight DDR4 memory controllers for high memory bandwidth (see later for measurements).
Figure 2. ThunderX2 block diagram [55]
Figure 3. ThunderX2 block diagram [56]
From the core diagram above we can establish that there are two 128-bit NEON SIMD execution units and two scalar floating point execution units, in addition to the integer and load/store units.
Simultaneous Multi Threading [75] is a technique that makes it possible to run multiple threads (instruction streams) on a single core, also known as Thread Level Parallelism (TLP). The core provides several parallel paths for threads, as well as a separate context for each thread. The implementation is somewhat more complex, see Figure 3, “ThunderX2 block diagram [56]”.
The practical manifestation is that it looks as if there are twice (SMT-2) or four times (SMT-4) as many cores in the processor. This can be beneficial for some applications where less context switching can improve performance. In HPC the benefit is not always observed, as most applications are limited by memory bandwidth or by the fact that the different threads share the same execution units. It may look like there are four times as many cores, but in reality there are no more compute resources. This is documented in Section 2.6, “Simultaneous Multi Threading (SMT) performance impact”.
With the release of ThunderX3, Marvell brings a new generation of its ARM processors to the market. Press coverage [68] reports a major upgrade in performance: up to 96 cores with SMT-4, yielding 384 logical cores, and four 128-bit NEON SIMD units (no SVE yet), with an expected 3x performance increase over ThunderX2.
Table 2. Suggested compiler flags for ThunderX2
Compiler            Suggested flags
GNU                 -O3 -march=armv8-a+simd -mcpu=thunderx2t99 -fomit-frame-pointer
ARM HPC compiler    -Ofast -march=armv8-a+simd -mcpu=thunderx2t99 -fomit-frame-pointer
The NEON SIMD 128-bit vector execution unit can improve performance significantly. Even a short vector unit of 128 bits can handle two 64-bit floating point numbers at a time, in principle doubling the floating point performance.
The NEON unit is IEEE-754 compliant with a few minor exceptions (mostly related to rounding and comparisons) [70]. With regard to numerical precision, using the NEON instructions should therefore be as safe as using the scalar floating point instructions. Hence vector instructions can be applied almost universally.
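As an illustration of the kind of loop the compiler maps onto the NEON unit, the minimal C sketch below (written for this guide, not taken from the benchmark codes discussed later) can be auto-vectorised with the "+simd" flags from Table 2; the restrict qualifiers and the suggested build line in the comment are assumptions, not requirements.

    /* Minimal sketch of a loop the compiler can map onto the 128-bit NEON unit,
     * processing two 64-bit doubles per instruction.
     * Example build (assumption): gcc -O3 -march=armv8-a+simd -mcpu=thunderx2t99 -c daxpy.c */
    void daxpy(long n, double scalar,
               const double *restrict x, double *restrict y)
    {
        /* restrict tells the compiler that x and y do not overlap, which
         * helps it vectorise the loop using NEON fused multiply-add
         * instructions, two 64-bit lanes per operation. */
        for (long i = 0; i < n; i++)
            y[i] = y[i] + scalar * x[i];
    }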
Testing has shown that the performance gain is significant. Tests compiling the reference implementation of matrix-matrix multiplication, the NAS Parallel Benchmarks (NPB) [122] BT kernel and a real-life scientific stellar atmosphere code (Bifrost, using its own internal performance units) [130] demonstrate the effect:
Table 3. Effect of NEON SIMD
Benchmark                      Flags                                              Performance
Matrix matrix multiplication   -Ofast -march=armv8.2-a+nosimd                     1.88 GFLOPS
Matrix matrix multiplication   -Ofast -march=armv8.2-a+simd                       3.18 GFLOPS
NPB BT                         -Ofast -march=armv8.2-a+nosimd -mcpu=tsv110        142929 Mop/s total
NPB BT                         -Ofast -march=armv8.2-a+simd -mcpu=tsv110          158920 Mop/s total
Stellar atmosphere (Bifrost)   -Ofast -march=armv8-a+nosimd -mcpu=thunderx2t99    13.82 Mz/s
Stellar atmosphere (Bifrost)   -Ofast -march=armv8-a+simd -mcpu=thunderx2t99      16.10 Mz/s
Speedups ranging from 1.11 to 1.69 were recorded for the benchmarks, while a speedup of 1.17 was demonstrated for the real scientific code. These tests demonstrate the importance of a vector unit, even a narrow 128-bit unit like NEON.
Given the significant performance gain experienced with NEON, there is a general expectation that SVE will provide a substantial floating point performance boost. The experience from the x86-64 architecture leaves room for optimism. However, without real benchmarks this is still uncharted territory.
The optimisation flags -O3 and -Ofast invoke different optimisations for the GNU and the ARM HPC compiler, and the code generation differs between -O3 and -Ofast. -Ofast can invoke unsafe optimisations that might not yield the same numerical result as -O2 or even -O3. Please consult the compiler documentation about high levels of optimisation. Typical for the -Ofast flag is what the GNU manual page states: "Disregard strict standards compliance." This might be fine, or it might cause problems for some codes. The simple paranoia.c [95] program might be enough to demonstrate the possible problems.
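The toy C example below (an illustration written for this guide, not taken from paranoia.c) shows why: floating point addition is not associative, so when -Ofast allows the compiler to reassociate or vectorise a reduction, the result may change in the last bits.

    /* Toy example showing that a floating point reduction depends on the
     * evaluation order. -Ofast permits the compiler to reassociate and
     * vectorise such reductions, so an -Ofast build may not reproduce the
     * -O2 result bit for bit. */
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double forward = 0.0, backward = 0.0;

        for (int i = 1; i <= n; i++)      /* small terms added last */
            forward += 1.0 / (double)i;

        for (int i = n; i >= 1; i--)      /* small terms added first */
            backward += 1.0 / (double)i;

        /* The two sums usually differ in the last bits; printing with %.17g
         * makes the difference visible. */
        printf("forward    = %.17g\n", forward);
        printf("backward   = %.17g\n", backward);
        printf("difference = %.3g\n", forward - backward);
        return 0;
    }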
The ARM HPC compiler comes with a performance library built for the relevant ARM implementations. It contains the usual routines, see [136] for details.
The syntax when using the ARM HPC compiler is shown in the table below.
Table 4. Compiler flags for ARM performance library using ARM compiler
Library type                                  Flag
Serial                                        -armpl
Serial, infer the processor from the system   -armpl=native
Parallel / multithreaded                      -armpl=parallel
Just adding one of these flags to the compile command line will invoke linking of the appropriate performance library.
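As a minimal sketch of calling a routine provided by the performance library (assuming the library exposes the standard CBLAS interface; the exact header and the build line in the comment are assumptions, consult [136]):

    /* Minimal sketch: calling a BLAS routine provided by the performance library.
     * Assumed build line (hypothetical): armclang -Ofast -armpl dgemm_demo.c
     * cblas.h and cblas_dgemm are the standard CBLAS interface; consult the
     * ARM performance library documentation for the exact header to include. */
    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        enum { N = 2 };
        double a[N * N] = { 1.0, 2.0,
                            3.0, 4.0 };
        double b[N * N] = { 5.0, 6.0,
                            7.0, 8.0 };
        double c[N * N] = { 0.0 };

        /* C = 1.0 * A * B + 0.0 * C, row-major storage. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, b, N, 0.0, c, N);

        printf("c = [%g %g; %g %g]\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

The same source can be linked against the serial or the parallel variant of the library simply by switching between the -armpl flags listed in Table 4.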
The scalar math function library, commonly known as libm, as distributed by the operating system, is also available in an optimised version called libamath.
Before the ARM compiler links to the libm library, it automatically links to the libamath library to provide enhanced performance through the optimised functions. This is the default behaviour and does not require any specific compiler flags from the user.
A well-known benchmark called Savage [129], which exercises trigonometric functions, experiences about a 15% speedup when using the ARM libamath library compared to the standard libm.
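A sketch of the kind of libm-heavy kernel that benefits is shown below; it is written in the spirit of the Savage benchmark (the exact published source is not reproduced here, and the iteration count is arbitrary). No source changes are needed to pick up libamath, since the ARM compiler links it in by default.

    /* Sketch of a libm-heavy kernel in the spirit of the Savage benchmark.
     * Not the exact Savage source; the loop simply hammers the same
     * transcendental functions. Built with the ARM compiler, the calls below
     * resolve to the optimised libamath versions automatically. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a = 1.0;
        for (int i = 0; i < 2500000; i++)
            a = tan(atan(exp(log(sqrt(a * a))))) + 1.0;
        /* Nominally the iteration count plus one, up to accumulated rounding error. */
        printf("a = %.15g\n", a);
        return 0;
    }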
For detailed information please consult the ARM HPC compiler documentation [136].
SVE [73] will represent a major leap in floating point performance, see [71]. The 256- and 512-bit vector units of other architectures provide a major advantage in floating point performance. While NEON can only perform a limited set of 128-bit vector instructions, the introduction of SVE will put the floating point capabilities on par with the major architectures currently deployed for HPC.
At the time of writing there is only one ARM processor supporting SVE [62]. However, the compilers and libraries are ready. Both the GNU compilers and the ARM HPC compilers support SVE (Cray and Fujitsu should also support SVE, but this is not tested in this guide). The code generation has been tested using a range of scientific applications and SVE code generation works well. The exact performance speedup remains to be evaluated due to the lack of suitable hardware.
Most scientific applications only need to be recompiled with SVE enabled. Few problems have surfaced when selecting SVE code generation; from the compiler and building point of view there should be no major issues. However, without real hardware to run on, the performance gain is hard to measure.
Table 5. Flags to enable SVE code generation
Compiler           SVE enabling flags
GNU                -march=armv8.1-a+sve
ARM HPC compiler   -march=armv8.1-a+sve
The SVE flag applies to armv8-a and all later architecture versions. As the vector length of the target hardware is unknown, the GNU flag "-msve-vector-bits" defaults to "scalable" (-msve-vector-bits=scalable). This flag can be set to match specific hardware, but this is not recommended, as the hardware vector length can vary and can even be set at boot time. Best practice is to leave the vector length unknown and build vector-length-agnostic code.
Consider a commonly known loop (Stream benchmark):
      DO 60 j = 1,n
         a(j) = b(j) + scalar*c(j)
   60 CONTINUE
The compiler will generate NEON instructions if "+simd" is used; look for the floating-point fused multiply-add to accumulator instruction (fmla) and the fused multiply-add vectors instruction (fmad) in the generated code.
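For comparison, the sketch below shows what vector-length-agnostic code for the same triad kernel looks like at the intrinsics level, using the ACLE SVE intrinsics from arm_sve.h. It is an untested illustration (no SVE hardware was available); intrinsic names follow the ACLE specification. The number of 64-bit lanes is queried at run time with svcntd() and the loop tail is handled by the predicate, so the same binary runs on any SVE vector length. In most cases, however, simply recompiling with the flags from Table 5 and letting the compiler auto-vectorise is sufficient.

    /* Vector-length-agnostic STREAM triad using ACLE SVE intrinsics.
     * A sketch only: compile with e.g. -O2 -march=armv8.1-a+sve;
     * intrinsic names follow the ACLE specification. */
    #include <arm_sve.h>
    #include <stdint.h>
    #include <stddef.h>

    void triad_sve(double *a, const double *b, const double *c,
                   double scalar, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntd()) {
            /* Predicate covering the lanes still inside the arrays. */
            svbool_t pg = svwhilelt_b64((uint64_t)i, (uint64_t)n);
            svfloat64_t vb = svld1_f64(pg, &b[i]);
            svfloat64_t vc = svld1_f64(pg, &c[i]);
            /* a(j) = b(j) + scalar*c(j), mapped onto a predicated fused multiply-add. */
            svfloat64_t va = svmla_n_f64_x(pg, vb, vc, scalar);
            svst1_f64(pg, &a[i], va);
        }
    }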
Sum:   a(i) = b(i) + c(i)
Triad: a(i) = b(i) + q * c(i)
The following table shows the memory bandwidth measured using the OpenMP version of the STREAM benchmark. The table shows the highest bandwidth obtained across the different processor binding settings tried. The bar chart in Figure 4 illustrates the need for thread/rank core binding; it is vital to get this right in order to obtain decent performance.
The STREAM benchmark was built using gcc with OpenMP support and compiled with a moderate level of optimisation, -O2.
Table 6. Stream performance, all numbers in MB/s
Processor     Binding                   Copy     Scale    Add      Triad
Kunpeng 920   PROC BIND                 322201   322214   324189   323997
Kunpeng 920   GOMP_CPU_AFFINITY=0-127   275666   275142   282769   280507
Kunpeng 920   numactl -l                237508   238412   232506   232840
Figure 4. Stream memory bandwidth Kunpeng 920, OpenMP, 128 cores, 224GiB footprint (87% of installed RAM)
The table and figures show that thread binding is important, while the actual binding settings matter less. Just setting the environment variable OMP_PROC_BIND to "true" makes all the difference.
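To verify what a given binding setting actually does, a small check program such as the sketch below (an illustration written for this guide, not part of the STREAM sources) can be run with different values of OMP_PROC_BIND and OMP_PLACES, and the reported placements compared.

    /* Sketch: report where each OpenMP thread ended up. Run it, for example, with
     * OMP_PROC_BIND=true OMP_PLACES=cores ./check_bind and compare with an unbound run.
     * Requires an OpenMP 4.5 capable compiler, e.g. gcc -fopenmp check_bind.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* omp_get_place_num() returns the place the thread is bound to,
             * or -1 if the thread is not bound to any place. */
            printf("thread %3d of %3d -> place %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_place_num(), omp_get_num_places());
        }
        return 0;
    }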
The memory bandwidth numbers are impressive compared to the majority of currently available systems. For applications that are memory bound, this relatively high memory bandwidth will yield a performance benefit. From the benchmark results published by the vendors it is also evident that the ARM systems have a memory bandwidth advantage.
Here we show STREAM performance again for the ThunderX2 processor, but using the MPI version (built with OpenMPI) rather than the OpenMP version. Using the MPI version allows scaling and benchmarking across nodes, although in this case we are strictly focused on the single node case.
Figure 5. Stream memory bandwidth on ThunderX2 with OpenMPI, Size N=400000000, including SMT settings impact
The above performance plot shows MPI STREAM for three different SMT configurations: SMT-1 (up to 64 processes), SMT-2 (up to 128 processes, 2 hardware threads per core) and SMT-4 (up to 256 processes, 4 hardware threads per core). The array size chosen for this test is 400 million elements, which satisfies the rule that the memory requirement for the test should be greater than 3 times the LLC size. For this test, STREAM was compiled with GCC v9.2 and OpenMPI v4.0.2, using the compilation flags:
-O2 -mcmodel=large -fno-PIC
The results shown are the maximum performance achieved over 5 runs. As can be seen from the performance numbers, the highest bandwidth is achieved when using 64 MPI processes, regardless of the SMT configuration. The performance drops by almost 10MB/s when using 256 MPI processes in SMT-4 mode; given that the SMT-4 configuration is using all the available hardware threads, this is still very good performance.
The table below shows the results from running High Performance Linpack (HPL) on Fulhame. The tests were run with different SMT values but with the processor frequency set at 2.5GHz, i.e. its highest value. To achieve this, memory turbo was disabled to allow the frequency to boost above the nominal 2.2GHz. Compilation was performed with GCC 9.2 and OpenMPI 4.0.2, linking to the ARM performance libraries.
Table 7. HPL performance numbers for Fulhame. All numbers in GFLOPS
SMT   64 ranks   128 ranks   256 ranks
Off   742        N/A         N/A
2     741        626         N/A
4     636        636         548
As might reasonably be expected, SMT-1 gives the best performance results: all hardware cores are fully engaged and there is no contention for their resources from additional hardware threads. For comparison, with the frequency fixed at 1.0GHz (the lowest available on the ThunderX2), the performance of the SMT-4 case with 256 processes is 184GFLOPS.
Figure 7. HPCG (MPI) performance using four Kunpeng 920 nodes with RoCE, 100GbE, ARM performance libraries
The ThunderX2 results are obtained from a single ThunderX2 node. The number of cores indicates the number of apparent cores, i.e. including the logical SMT cores. At 256 cores, SMT-4 is used.
Figure 8. HPCG (MPI) performance using a single ThunderX2 node
Memory bandwidth is an important parameter of a system, and the impact of Simultaneous Multi Threading (SMT) on threaded memory bandwidth should be documented.
Figure 9. Memory bandwidth impact of SMT setting using OpenMP version of Stream
The Simultaneous Multi Threading (SMT) supported on the ThunderX2 processor has an impact on application performance. Some benchmarks and applications have been run to explore the impact of this setting. The runs were done using all cores (64, 128 and 256 for the different SMT settings; this might explain the poor scaling for NPB in the OpenMP versions versus the good scaling in the MPI versions).
Figure 11. Application performance impact of SMT setting using the stellar atmosphere simulation MPI code Bifrost [130]
From the numbers presented in the figure above it is clear that turning SMT off yields the best performance. Similar numbers have been reported for other applications, such as the materials science code VASP.
The IOR benchmark suite is used to test file system performance on a variety of different supercomputers. It is easily configurable and uses MPI to coordinate multiple processes each reading or writing to stress a file system.
For this test, a single ThunderX2 node was used, with SMT-1. Version 3.2.1 of the benchmark was compiled with the GCC v9.2 toolchain and OpenMPI v4.0.2. Tests were carried out against an HDD-based LUSTRE file system mounted across an InfiniBand link, with the rest of the system idle. Two different tests were configured and run: File Per Process (FPP), with one file per MPI rank, and Single Shared File (SF), with a single shared MPI-IO file.
Table 8. IOR performance. HDD-based LUSTRE Filesystem. All numbers in MiB/s
Num. Procs   Read (FPP)   Write (FPP)   Read (SF)   Write (SF)
1            921          962           1090        966
2            1422         1114          1400        1106
4            2062         1089          2411        1068
8            2729         1027          4173        818
16           3043         777           2959        425
32           4921         803           2923        303
64           3880         645           4482        321
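To make the two access patterns concrete, the sketch below (an illustration written for this guide, not the IOR source; the file names and the 1 MiB-per-rank size are arbitrary) contrasts them: in FPP mode every rank writes its own file, while in SF mode all ranks write to disjoint offsets of one shared file through MPI-IO.

    /* Sketch contrasting the two IOR access patterns used above.
     * Not the IOR source; file names and the 1 MiB-per-rank size are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    #define CHUNK (1 << 20)   /* 1 MiB per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(CHUNK);
        memset(buf, rank & 0xff, CHUNK);

        /* File Per Process (FPP): each rank writes its own file. */
        char fname[64];
        snprintf(fname, sizeof fname, "fpp_rank%04d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        fwrite(buf, 1, CHUNK, fp);
        fclose(fp);

        /* Single Shared File (SF): all ranks write disjoint offsets of one
         * file through MPI-IO, here with a collective write. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                              MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }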
Fulhame, named after Elizabeth Fulhame, a Scottish chemist who invented the concept of catalysis and discovered photoreduction, is a 64-node fully ARM-based cluster housed at EPCC's ACF datacentre. It is part of the
Catalyst collaboration between HPE, ARM, Cavium, Mellanox and three UK universities: Edinburgh, Bristol and Leicester.
The Fulhame system architecture is identical to that of the other Catalyst systems. They comprise 64 compute nodes, each with two 32-core Cavium ThunderX2 processors. These processors run at a base frequency of 2.1GHz with boost available up to 2.5GHz, controlled via Dynamic Voltage and Frequency Scaling (DVFS). Each processor has eight 2666MHz DDR4 channels and 128GiB of memory, giving 4GiB/core (1GiB/thread in SMT-4) for a total of 256GiB/node. Fulhame uses Mellanox InfiniBand for the interconnect between compute nodes, login/admin nodes and the storage system. Fulhame and the other Catalyst systems are fully ARM-based and use ThunderX2 nodes for the storage servers and all login/admin nodes.
Account requests can be made to the EPCC Helpdesk: epcc-support@epcc.ed.ac.uk
The production environment on Fulhame is SuSE Linux Enterprise Server v15.0. SuSE is a partner in the Catalyst project and provides support specifically for this architecture. The other two Catalyst systems run SLES 12sp3.
The programming environment is that of SLES v15.0 with the GNU v9.2 suite of compilers and libraries. ARM supplies v20.0 of its performance libraries and compilers, with the libraries providing modules for both the GNU and ARM toolchains. Libraries, compilers and software are provided via the normal Linux modules environment. Mellanox provides the relevant InfiniBand performance drivers and libraries directly to the Catalyst sites for this architecture.