




























































































Modern Processors
This Best Practice Guide (BPG) extends the previously developed series of BPGs [1] (these older guides remain relevant as they provide valuable background information) by providing an update on new technologies and systems, in further support of the European High Performance Computing (HPC) user community in achieving high performance with their large-scale applications. It covers existing systems and aims to support scientists in porting, building and running their applications on these systems. While some benchmarking is part of this guide, the results provided mainly illustrate the characteristics of the different systems; they should not be used to compare the systems presented, nor for system procurement considerations. Procurement [2] and benchmarking [3] are well covered by other PRACE work packages and are outside this BPG's scope.
This BPG document has grown into a hybrid of a field guide and a textbook. The system and processor coverage provides the technical information needed by users who want a deeper knowledge of the system in order to fully utilise the hardware, while the field guide approach provides hints and starting points for porting and building scientific software. For this, a range of compilers, libraries, debuggers, performance analysis tools, etc. are covered. While recommendations for compilers, libraries and flags are given, we acknowledge that there is no magic bullet, as all codes are different. Unfortunately, there is often no way around a trial and error approach.
Some in-depth documentation of the covered processors is provided. This includes some background on the inner workings of the processors considered: the number of threads each core can handle, how these threads are implemented, and how these threads (instruction streams) are scheduled onto the different execution units within the core. In addition, this guide describes how the vector units of different lengths (256 bit, 512 bit, or, in the case of SVE, variable and generally unknown until execution time) are implemented. As most HPC work up to now has been done in 64 bit floating point, the emphasis is on this data type, especially for vectors. In addition to the processor execution units, memory in its many levels of hierarchy is important. The different implementations of Non-Uniform Memory Access (NUMA) are also covered in this BPG.
The guide gives a description of the hardware for a selection of relevant processors currently deployed in some PRACE HPC systems. It includes ARM64 (Huawei/HiSilicon and Marvell)^1 and x86-64 (AMD and Intel). It provides information on the programming models and development environment, as well as information about porting programs. Furthermore, it provides sections on strategies for analysing and improving the performance of applications. While this guide does not provide an update on all recent processors, some of the previous BPG releases [1] do cover other processor architectures not discussed in this guide (e.g. the Power architecture) and should be considered as a starting point for such work.
This guide also aims to increase user awareness of the energy and power consumption of individual applications by providing some analysis of the usefulness of maximum CPU frequency scaling depending on the type of application considered (e.g. CPU-bound, memory-bound, etc.).
As mentioned earlier, this guide covers processors and technologies deployed in European systems (with the small exception of ARM SVE, soon to be deployed in the EU). While European ARM and RISC-V processors are just over the horizon, systems using them are not deployed for production at the time of this writing (Q3-2020). As this European technology is still a couple of years from deployment it is not covered in the current guides; however, this new technology will require substantial documentation. Different types of accelerators are covered by other BPGs [1] and a corresponding update for these is currently being developed.
Emphasis has been given to providing relevant tips and hints, via examples, for scientists not deeply involved in the art of HPC programming. This document aims to provide a set of best practices that will make adaptation to these modern processors easier. It is not intended to replace an in-depth textbook, nor the documentation for the different tools described. The hope is that it will be the first document to reach for when starting to build scientific software.
The programming languages used in the examples are either C or Fortran. C is convenient as it maps closely to machine instructions for the simple examples and, together with Fortran, makes up the major languages used in HPC. While knowledge of assembly programming is not as widespread as it once was, the examples are kept at a level of simplicity that should make them easily accessible.
^1 While the vector enabled Fujitsu A64FX processor with SVE is not covered, a short section about the Scalable Vector Extension (SVE) is included.
This guide merges several separate guides from the past; the previous approach, where each processor had its own guide, has been discontinued. This guide covers all relevant processors (see above), but each processor still has its own chapter. The merger is not yet fully completed.
Furthermore, this guide provides information on the following recently deployed European flagship supercomputing systems:
Figure 1. Kunpeng 920 block diagram [54]
Huawei reports that the core supports almost all of the ARMv8.4-A ISA features, with a few exceptions, and includes the dot product and FP16 FML extensions; see [67].
The ThunderX2 is an evolution in the ThunderX family of processors from Cavium, now part of Marvell. The ThunderX2 provides full 128-bit NEON vector support, Simultaneous Multi Threading (SMT) support [75] and up to eight DDR4 memory controllers for high memory bandwidth (see later for measurements).
Figure 2. ThunderX2 block diagram [55]
Figure 3. ThunderX2 block diagram [56]
From the core diagram above we can establish that there are two 128-bit NEON SIMD execution units and two scalar floating point execution units, in addition to the integer and load/store units.
Simultaneous Multi Threading [75] is a technique that makes it possible to run multiple threads (instruction streams) on a single core, also known as Thread Level Parallelism (TLP). The core provides several parallel paths for threads, as well as a separate context for each thread. The implementation is somewhat more complex, see Figure 3, “ThunderX2 block diagram [56]”.
The practical manifestation is that it looks as if there are twice (SMT-2) or four times (SMT-4) as many cores in the processor. This can be beneficial for some applications where less context switching can improve performance. In HPC the benefit is not always observed, as most applications are limited by memory bandwidth or by the fact that the different threads share the same execution units. It may look like there are four times as many cores, but in reality there are no more compute resources. This is documented in Section 2.6, “Simultaneous Multi Threading (SMT) performance impact”.
With the release of ThunderX3, Marvell brings a new generation of its ARM processors to the market. Press coverage [68] reports a major upgrade in performance: up to 96 cores with SMT-4, yielding 384 logical cores, and four 128-bit NEON SIMD units (no SVE yet), with an expected 3x performance increase over ThunderX2.
Table 2. Suggested compiler flags for ThunderX2
Compiler            Suggested flags
GNU                 -O3 -march=armv8-a+simd -mcpu=thunderx2t99 -fomit-frame-pointer
ARM HPC compiler    -Ofast -march=armv8-a+simd -mcpu=thunderx2t99 -fomit-frame-pointer
The NEON SIMD 128-bit vector execution unit can improve performance significantly. Even a short vector unit of 128 bits can handle two 64-bit floating point numbers at a time, in principle doubling the floating point performance.
The NEON unit is IEEE-754 compliant with a few minor exceptions (mostly related to rounding and comparisons) [70]. With regard to numerical precision, using the NEON instructions should therefore be as safe as using the scalar floating point instructions. Hence vector instructions can be applied almost universally.
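As an illustration of the kind of loop the compiler maps onto the NEON unit, the minimal C sketch below (written for this guide, not taken from the benchmark codes discussed later) can be auto-vectorised with the "+simd" flags from Table 2; the restrict qualifiers and the suggested build line in the comment are assumptions, not requirements.

    /* Minimal sketch of a loop the compiler can map onto the 128-bit NEON unit,
     * processing two 64-bit doubles per instruction.
     * Example build (assumption): gcc -O3 -march=armv8-a+simd -mcpu=thunderx2t99 -c daxpy.c */
    void daxpy(long n, double scalar,
               const double *restrict x, double *restrict y)
    {
        /* restrict tells the compiler that x and y do not overlap, which
         * helps it vectorise the loop using NEON fused multiply-add
         * instructions, two 64-bit lanes per operation. */
        for (long i = 0; i < n; i++)
            y[i] = y[i] + scalar * x[i];
    }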
Testing has shown that the performance gain is significant. Tests compiling the reference implementation of matrix-matrix multiplication, the NAS Parallel Benchmarks (NPB) [122] BT kernel and a real-life scientific stellar atmosphere code (Bifrost, using its own internal performance units) [130] demonstrate the effect:
Table 3. Effect of NEON SIMD
Benchmark                      Flags                                              Performance
Matrix matrix multiplication   -Ofast -march=armv8.2-a+nosimd                     1.88 GFLOPS
Matrix matrix multiplication   -Ofast -march=armv8.2-a+simd                       3.18 GFLOPS
NPB BT                         -Ofast -march=armv8.2-a+nosimd -mcpu=tsv110        142929 Mop/s total
NPB BT                         -Ofast -march=armv8.2-a+simd -mcpu=tsv110          158920 Mop/s total
Stellar atmosphere (Bifrost)   -Ofast -march=armv8-a+nosimd -mcpu=thunderx2t99    13.82 Mz/s
Stellar atmosphere (Bifrost)   -Ofast -march=armv8-a+simd -mcpu=thunderx2t99      16.10 Mz/s
Speedups ranging from 1.11 to 1.69 were recorded for the benchmarks, while a speedup of 1.17 was demonstrated for the real scientific code. These tests demonstrate the importance of a vector unit, even a narrow 128-bit unit like NEON.
Given the significant performance gain experienced with NEON, there is a general expectation that SVE will provide a substantial floating point performance boost. The experience from the x86-64 architecture leaves room for optimism. However, without real benchmarks this is still uncharted territory.
The optimisation flags -O3 and -Ofast invoke different optimisations for the GNU and the ARM HPC compiler, and the code generation differs between -O3 and -Ofast. -Ofast can invoke unsafe optimisations that might not yield the same numerical result as -O2 or even -O3. Please consult the compiler documentation about high levels of optimisation. Typical for the -Ofast flag is what the GNU manual page states: "Disregard strict standards compliance." This might be fine, or it might cause problems for some codes. The simple paranoia.c [95] program might be enough to demonstrate the possible problems.
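The toy C example below (an illustration written for this guide, not taken from paranoia.c) shows why: floating point addition is not associative, so when -Ofast allows the compiler to reassociate or vectorise a reduction, the result may change in the last bits.

    /* Toy example showing that a floating point reduction depends on the
     * evaluation order. -Ofast permits the compiler to reassociate and
     * vectorise such reductions, so an -Ofast build may not reproduce the
     * -O2 result bit for bit. */
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double forward = 0.0, backward = 0.0;

        for (int i = 1; i <= n; i++)      /* small terms added last */
            forward += 1.0 / (double)i;

        for (int i = n; i >= 1; i--)      /* small terms added first */
            backward += 1.0 / (double)i;

        /* The two sums usually differ in the last bits; printing with %.17g
         * makes the difference visible. */
        printf("forward    = %.17g\n", forward);
        printf("backward   = %.17g\n", backward);
        printf("difference = %.3g\n", forward - backward);
        return 0;
    }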
The ARM HPC compiler comes with a performance library built for the relevant ARM implementations. It contains the usual routines, see [136] for details.
The syntax when using the ARM HPC compiler is shown in the table below.
Table 4. Compiler flags for ARM performance library using ARM compiler
Library type                                  Flag
Serial                                        -armpl
Serial, infer the processor from the system   -armpl=native
Parallel / multithreaded                      -armpl=parallel
Just adding one of these flags to the compile command line will invoke linking of the appropriate performance library.
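As a minimal sketch of calling a routine provided by the performance library (assuming the library exposes the standard CBLAS interface; the exact header and the build line in the comment are assumptions, consult [136]):

    /* Minimal sketch: calling a BLAS routine provided by the performance library.
     * Assumed build line (hypothetical): armclang -Ofast -armpl dgemm_demo.c
     * cblas.h and cblas_dgemm are the standard CBLAS interface; consult the
     * ARM performance library documentation for the exact header to include. */
    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        enum { N = 2 };
        double a[N * N] = { 1.0, 2.0,
                            3.0, 4.0 };
        double b[N * N] = { 5.0, 6.0,
                            7.0, 8.0 };
        double c[N * N] = { 0.0 };

        /* C = 1.0 * A * B + 0.0 * C, row-major storage. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, b, N, 0.0, c, N);

        printf("c = [%g %g; %g %g]\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

The same source can be linked against the serial or the parallel variant of the library simply by switching between the -armpl flags listed in Table 4.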
The scalar math function library, commonly known as libm, as distributed by the operating system, is also available in an optimised version called libamath.
Before the ARM compiler links to the libm library, it automatically links to the libamath library to provide enhanced performance through the optimised functions. This is the default behaviour and does not require any specific compiler flags from the user.
A well-known benchmark called Savage [129], which exercises trigonometric functions, experiences about a 15% speedup when using the ARM libamath library compared to the standard libm.
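A sketch of the kind of libm-heavy kernel that benefits is shown below; it is written in the spirit of the Savage benchmark (the exact published source is not reproduced here, and the iteration count is arbitrary). No source changes are needed to pick up libamath, since the ARM compiler links it in by default.

    /* Sketch of a libm-heavy kernel in the spirit of the Savage benchmark.
     * Not the exact Savage source; the loop simply hammers the same
     * transcendental functions. Built with the ARM compiler, the calls below
     * resolve to the optimised libamath versions automatically. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a = 1.0;
        for (int i = 0; i < 2500000; i++)
            a = tan(atan(exp(log(sqrt(a * a))))) + 1.0;
        /* Nominally the iteration count plus one, up to accumulated rounding error. */
        printf("a = %.15g\n", a);
        return 0;
    }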
For detailed information please consult the ARM HPC compiler documentation [136].
SVE [73] will represent a major leap in floating point performance, see [71]. The 256- and 512-bit vector units of other architectures provide a major advantage in floating point performance. While NEON can only perform a limited set of 128-bit vector instructions, the introduction of SVE will put the floating point capabilities on par with the major architectures currently deployed for HPC.
At the time of writing there is only one ARM processor supporting SVE [62]. However, the compilers and libraries are ready. Both the GNU compilers and the ARM HPC compilers support SVE (Cray and Fujitsu should also support SVE, but this is not tested in this guide). The code generation has been tested using a range of scientific applications and SVE code generation works well. The exact performance speedup remains to be evaluated due to the lack of suitable hardware.
Most scientific applications only need to be recompiled with SVE enabled. Few problems have surfaced when selecting SVE code generation; from the compiler and building point of view there should be no major issues. However, without real hardware to run on, the performance gain is hard to measure.
Table 5. Flags to enable SVE code generation
Compiler           SVE enabling flags
GNU                -march=armv8.1-a+sve
ARM HPC compiler   -march=armv8.1-a+sve
The SVE flag applies to armv8-a and all later architecture versions. As the vector length of the target hardware is unknown, the GNU flag "-msve-vector-bits" defaults to "scalable" (-msve-vector-bits=scalable). This flag can be set to match specific hardware, but this is not recommended, as the hardware vector length can vary and can even be set at boot time. Best practice is to leave the vector length unknown and build vector-length-agnostic code.
Consider a commonly known loop (Stream benchmark):
      DO 60 j = 1,n
         a(j) = b(j) + scalar*c(j)
   60 CONTINUE
The compiler will generate NEON instructions if "+simd" is used; look for the floating-point fused multiply-add to accumulator instruction (fmla) and the fused multiply-add vectors instruction (fmad) in the generated code.
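For comparison, the sketch below shows what vector-length-agnostic code for the same triad kernel looks like at the intrinsics level, using the ACLE SVE intrinsics from arm_sve.h. It is an untested illustration (no SVE hardware was available); intrinsic names follow the ACLE specification. The number of 64-bit lanes is queried at run time with svcntd() and the loop tail is handled by the predicate, so the same binary runs on any SVE vector length. In most cases, however, simply recompiling with the flags from Table 5 and letting the compiler auto-vectorise is sufficient.

    /* Vector-length-agnostic STREAM triad using ACLE SVE intrinsics.
     * A sketch only: compile with e.g. -O2 -march=armv8.1-a+sve;
     * intrinsic names follow the ACLE specification. */
    #include <arm_sve.h>
    #include <stdint.h>
    #include <stddef.h>

    void triad_sve(double *a, const double *b, const double *c,
                   double scalar, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntd()) {
            /* Predicate covering the lanes still inside the arrays. */
            svbool_t pg = svwhilelt_b64((uint64_t)i, (uint64_t)n);
            svfloat64_t vb = svld1_f64(pg, &b[i]);
            svfloat64_t vc = svld1_f64(pg, &c[i]);
            /* a(j) = b(j) + scalar*c(j), mapped onto a predicated fused multiply-add. */
            svfloat64_t va = svmla_n_f64_x(pg, vb, vc, scalar);
            svst1_f64(pg, &a[i], va);
        }
    }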
Sum:   a(i) = b(i) + c(i)
Triad: a(i) = b(i) + q * c(i)
The following table shows the memory bandwidth measured using the OpenMP version of the STREAM benchmark. The table shows the highest bandwidth obtained across the different processor binding settings tried. The bar chart in Figure 4 illustrates the need for thread/rank core binding; it is vital to get this right in order to obtain decent performance.
The STREAM benchmark was built using gcc with OpenMP support and compiled with a moderate level of optimisation, -O2.
Table 6. Stream performance, all numbers in MB/s
Processor     Binding                   Copy     Scale    Add      Triad
Kunpeng 920   PROC BIND                 322201   322214   324189   323997
Kunpeng 920   GOMP_CPU_AFFINITY=0-127   275666   275142   282769   280507
Kunpeng 920   numactl -l                237508   238412   232506   232840
Figure 4. Stream memory bandwidth Kunpeng 920, OpenMP, 128 cores, 224GiB footprint (87% of installed RAM)
The table and figures show that thread binding is important, while the actual binding settings matter less. Just setting the environment variable OMP_PROC_BIND to "true" makes all the difference.
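To verify what a given binding setting actually does, a small check program such as the sketch below (an illustration written for this guide, not part of the STREAM sources) can be run with different values of OMP_PROC_BIND and OMP_PLACES, and the reported placements compared.

    /* Sketch: report where each OpenMP thread ended up. Run it, for example, with
     * OMP_PROC_BIND=true OMP_PLACES=cores ./check_bind and compare with an unbound run.
     * Requires an OpenMP 4.5 capable compiler, e.g. gcc -fopenmp check_bind.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* omp_get_place_num() returns the place the thread is bound to,
             * or -1 if the thread is not bound to any place. */
            printf("thread %3d of %3d -> place %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_place_num(), omp_get_num_places());
        }
        return 0;
    }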
The memory bandwidth numbers are impressive compared to the majority of currently available systems. For applications that are memory bound, this relatively high memory bandwidth will yield a performance benefit. From the benchmark results published by the vendors it is also evident that the ARM systems have a memory bandwidth advantage.
Here we show STREAM performance again for the ThunderX2 processor, but using the MPI version (built with OpenMPI) rather than the OpenMP version. Using the MPI version allows scaling and benchmarking across nodes, although in this case we are strictly focused on the single node case.
Figure 5. Stream memory bandwidth on ThunderX2 with OpenMPI, Size N=400000000, including SMT settings impact
The above performance plot shows MPI STREAM for three different SMT configurations: SMT-1 (up to 64 processes), SMT-2 (up to 128 processes, 2 hardware threads per core) and SMT-4 (up to 256 processes, 4 hardware threads per core). The array size chosen for this test is 400 million elements, which satisfies the rule that the memory requirement for the test should be greater than 3 times the LLC size. For this test, STREAM was compiled with GCC v9.2 and OpenMPI v4.0.2, using the compilation flags:
-O2 -mcmodel=large -fno-PIC
The results shown are the maximum performance achieved over 5 runs. As can be seen from the performance numbers, the highest bandwidth is achieved when using 64 MPI processes, regardless of the SMT configuration. The performance drops by almost 10MB/s when using 256 MPI processes in SMT-4 mode; given that the SMT-4 configuration is using all the available hardware threads, this is still very good performance.
The table below shows the results from running High Performance Linpack (HPL) on Fulhame. The tests were run with different SMT values but with the processor frequency set at 2.5GHz, i.e. its highest value. To achieve this, memory turbo was disabled to allow the frequency to boost above the nominal 2.2GHz. Compilation was performed with GCC 9.2 and OpenMPI 4.0.2, linking to the ARM performance libraries.
Table 7. HPL performance numbers for Fulhame. All numbers in GFLOPS
SMT   64 ranks   128 ranks   256 ranks
Off   742        N/A         N/A
2     741        626         N/A
4     636        636         548
As might reasonably be expected, SMT-1 gives the best performance results: all hardware cores are fully engaged and there is no contention for their resources from additional hardware threads. For comparison, with the frequency fixed at 1.0GHz (the lowest available on the ThunderX2), the performance of the SMT-4 case with 256 processes is 184GFLOPS.
Figure 7. HPCG (MPI) performance using four Kunpeng 920 nodes with RoCE, 100GbE, ARM performance libraries
The ThunderX2 results are obtained from a single ThunderX2 node. The number of cores indicates the number of apparent cores, i.e. including the logical SMT cores. At 256 cores, SMT-4 is used.
Figure 8. HPCG (MPI) performance using a single ThunderX2 node
Memory bandwidth is an important parameter of a system, and the impact of Simultaneous Multi Threading (SMT) on threaded memory bandwidth should be documented.
Figure 9. Memory bandwidth impact of SMT setting using OpenMP version of Stream
The Simultaneous Multi Threading (SMT) supported on the ThunderX2 processor has an impact on application performance. Some benchmarks and applications have been run to explore the impact of this setting. The runs were done using all cores (64, 128 and 256 for the different SMT settings; this might explain the poor scaling for NPB in the OpenMP versions versus the good scaling in the MPI versions).
Figure 11. Application performance impact of SMT setting using the stellar atmosphere simulation MPI code Bifrost [130]
From the numbers presented in the figure above it is clear that turning SMT off yields the best performance. Similar numbers have been reported for other applications, such as the materials science code VASP.
The IOR benchmark suite is used to test file system performance on a variety of different supercomputers. It is easily configurable and uses MPI to coordinate multiple processes each reading or writing to stress a file system.
For this test, a single ThunderX2 node was used, with SMT-1. Version 3.2.1 of the benchmark was compiled with the GCC v9.2 toolchain and OpenMPI v4.0.2. Tests were carried out against an HDD-based LUSTRE file system mounted across an InfiniBand link, with the rest of the system idle. Two different tests were configured and run: File Per Process (FPP), with one file per MPI rank, and Single Shared File (SF), with a single shared MPI-IO file.
Table 8. IOR performance. HDD-based LUSTRE Filesystem. All numbers in MiB/s
Num. Procs   Read (FPP)   Write (FPP)   Read (SF)   Write (SF)
1            921          962           1090        966
2            1422         1114          1400        1106
4            2062         1089          2411        1068
8            2729         1027          4173        818
16           3043         777           2959        425
32           4921         803           2923        303
64           3880         645           4482        321
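To make the two access patterns concrete, the sketch below (an illustration written for this guide, not the IOR source; the file names and the 1 MiB-per-rank size are arbitrary) contrasts them: in FPP mode every rank writes its own file, while in SF mode all ranks write to disjoint offsets of one shared file through MPI-IO.

    /* Sketch contrasting the two IOR access patterns used above.
     * Not the IOR source; file names and the 1 MiB-per-rank size are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    #define CHUNK (1 << 20)   /* 1 MiB per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(CHUNK);
        memset(buf, rank & 0xff, CHUNK);

        /* File Per Process (FPP): each rank writes its own file. */
        char fname[64];
        snprintf(fname, sizeof fname, "fpp_rank%04d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        fwrite(buf, 1, CHUNK, fp);
        fclose(fp);

        /* Single Shared File (SF): all ranks write disjoint offsets of one
         * file through MPI-IO, here with a collective write. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                              MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }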
Fulhame, named after Elizabeth Fulhame, a Scottish chemist who invented the concept of catalysis and discovered photoreduction, is a 64-node fully ARM-based cluster housed at EPCC's ACF datacentre. It is part of the
Catalyst collaboration between HPE, ARM, Cavium, Mellanox and three UK universities: Edinburgh, Bristol and Leicester.
The Fulhame system architecture is identical to that of the other Catalyst systems. They comprise 64 compute nodes, each with two 32-core Cavium ThunderX2 processors. These processors run at a base frequency of 2.1GHz with boost available up to 2.5GHz, controlled via Dynamic Voltage and Frequency Scaling (DVFS). Each processor has eight 2666MHz DDR4 channels and 128GiB of memory, giving 4GiB/core (1GiB/thread in SMT-4) for a total of 256GiB/node. Fulhame uses Mellanox InfiniBand for the interconnect between compute nodes, login/admin nodes and the storage system. Fulhame and the other Catalyst systems are fully ARM-based and use ThunderX2 nodes for the storage servers and all login/admin nodes.
Account requests can be made to the EPCC Helpdesk: epcc-support@epcc.ed.ac.uk
The production environment on Fulhame is SuSE Linux Enterprise Server v15.0. SuSE is a partner in the Catalyst project and provides support specifically for this architecture. The other two Catalyst systems run SLES 12sp3.
The programming environment is that of SLES v15.0 with the GNU v9.2 suite of compilers and libraries. ARM supplies v20.0 of its performance libraries and compilers, with the libraries providing modules for both the GNU and ARM toolchains. Libraries, compilers and software are provided via the normal Linux modules environment. Mellanox provides the relevant InfiniBand performance drivers and libraries directly to the Catalyst sites for this architecture.