These lecture slides from a digital system design course cover: Taking CUDA to Ludicrous Speed, Righteous Performance, Compiling with Optimizations, Brandon's Particle Code, Code Modifications, Bandwidth Bound, Thread Occupancy, Instruction Stream, and Branch Divergence.
Performance: How Much Is Enough? (CPU Edition)
Could I be getting better performance?
Probably a little bit. Most of the performance is handled in HW.
How much better?
If you compile with -O3, you can get faster (maybe 2x).
If you are careful about tiling your memory accesses, you can get faster on codes that benefit from that (maybe 2-3x).
Is that much performance worth the work?
Compiling with optimizations is a no-brainer (and yet...).
Tiling is useful, but takes an investment (see the sketch below).
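As a rough, hypothetical illustration of both points (not code from the slides; the names N, B, a, b, c and the tile size are assumptions):

/* Cache-blocked (tiled) matrix multiply on the CPU.
   Build with optimizations, e.g.:  gcc -O3 -o matmul matmul.c              */
#define N 1024          /* matrix dimension (assumed)                        */
#define B 64            /* tile size chosen so three BxB tiles fit in cache  */

/* c must be zero-initialized by the caller; the kernel accumulates into it. */
void matmul_tiled(const float *a, const float *b, float *c)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* work on one BxB tile at a time so the data stays cache-resident */
                for (int i = ii; i < ii + B; ++i)
                    for (int j = jj; j < jj + B; ++j) {
                        float sum = c[i*N + j];
                        for (int k = kk; k < kk + B; ++k)
                            sum += a[i*N + k] * b[k*N + j];
                        c[i*N + j] = sum;
                    }
}

The outer three loops pick one B x B tile of each matrix so the working set stays in cache; -O3 can then unroll and vectorize the innermost loop.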
While some threads wait for memory, other threads can get work done (latency hiding).
Use the on-chip memory spaces: shared, constant, and texture memory.
Unroll loops and minimize branch divergence (a sketch follows below).
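As a hedged illustration (the kernel saxpy4 and its parameters are made up, not from the slides), unrolling a short fixed-length loop while keeping branches warp-friendly looks like this in CUDA:

__global__ void saxpy4(const float *x, float *y, float a, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // 4 elements per thread
    #pragma unroll                     // unroll the fixed-trip-count loop
    for (int k = 0; k < 4; ++k) {
        int i = base + k;
        // A simple bounds check like this is cheap; large data-dependent
        // if/else bodies are what make threads in a warp diverge.
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
}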
Use multiple threads to exploit memory-level parallelism.
Each thread can efficiently make multiple passes over any data element.
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
    Pd[Row * Width + Col] = Pvalue;
}
Two memory accesses (8 bytes) per floating-point multiply-add means 4 bytes of memory traffic per FLOP. Reaching the peak rating of 346.5 GFLOPS would require 4 * 346.5 = 1386 GB/s of memory bandwidth; the actual 86.4 GB/s limits the code to 86.4 / 4 = 21.6 GFLOPS.
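To make the unit bookkeeping explicit, here is the same estimate as a tiny C program, using only the figures quoted on this slide:

#include <stdio.h>

int main(void)
{
    double bytes_per_madd = 8.0;     /* two 4-byte loads per multiply-add      */
    double flops_per_madd = 2.0;     /* one multiply + one add                 */
    double bytes_per_flop = bytes_per_madd / flops_per_madd;     /* 4 B/FLOP   */
    printf("needed: %.0f GB/s\n",  bytes_per_flop * 346.5);      /* 1386 GB/s  */
    printf("limit : %.1f GFLOPS\n", 86.4 / bytes_per_flop);      /* 21.6 GFLOPS */
    return 0;
}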
Figure: the CUDA memory hierarchy. The host transfers data to and from the grid's global and constant memory; each thread block (Block (0, 0), Block (1, 0)) has its own shared memory, and each thread (Thread (0, 0), Thread (1, 0)) has private registers.
Figure: tiled matrix multiplication. Md, Nd, and Pd are WIDTH x WIDTH matrices. Each thread block, indexed by (bx, by), computes one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of Pd; threads within the block are indexed by (tx, ty), each running from 0 to TILE_WIDTH-1.
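The kernel that accompanies this figure is not legible in this extract; below is a minimal sketch of the standard shared-memory tiled kernel it depicts, assuming Width is a multiple of TILE_WIDTH and TILE_WIDTH is #defined as on the following slides:

#define TILE_WIDTH 16   // value used on the later slides

__global__ void TiledMatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];   // tile of Md staged on chip
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];   // tile of Nd staged on chip

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;

    // Walk across the tiles of Md's row strip and Nd's column strip
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Each thread loads one element of each tile from global memory
        Mds[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();                 // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();                 // wait before the tiles are overwritten
    }
    Pd[Row * Width + Col] = Pvalue;
}

Every element brought in from global memory is reused TILE_WIDTH times from shared memory, which is what cuts the global memory traffic discussed below.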
CUDA Code: Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH,
Width / TILE_WIDTH);
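The matching kernel launch is not shown in this extract; with these names it would take the usual form:

// Launch one thread per Pd element with the configuration above
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);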
SM size is implementation dependent! For TILE_WIDTH = 16, each thread block uses 2 * 16 * 16 * 4 B = 2 KB of shared memory (one Md tile plus one Nd tile), so up to 8 thread blocks can potentially be executing at once.
TILE_WIDTH = 32 would lead to 2 * 32 * 32 * 4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time per SM.
Since each loaded element is reused TILE_WIDTH = 16 times from shared memory, the 86.4 GB/s bandwidth can now support (86.4 / 4) * 16 = 345.6 GFLOPS!
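As a back-of-the-envelope check (assuming the G80-era figure of 16 KB of shared memory per SM, which the slide itself leaves implementation dependent):

#include <stdio.h>

int main(void)
{
    const int smem_per_sm = 16 * 1024;                 /* assumed shared memory per SM   */
    for (int tile = 16; tile <= 32; tile *= 2) {
        int smem_per_block = 2 * tile * tile * 4;      /* Md tile + Nd tile, 4 B floats  */
        printf("TILE_WIDTH %2d: %5d B/block -> at most %d blocks/SM (shared memory alone)\n",
               tile, smem_per_block, smem_per_sm / smem_per_block);
    }
    return 0;                                          /* prints 8 blocks, then 2 blocks */
}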
Chart: measured performance in GFLOPS (0 to 100) for the untiled kernel and for 4x4, 8x8, 12x12, and 16x16 tiles, with each tiled version shown both tiled only and tiled & unrolled.