Taking CUDA to Ludicrous Speed - Computer Organization and Design - Lecture Slides

This series of lecture slides from a digital system design course covers how to get good performance from CUDA programs. The major points in these lecture slides are: Taking CUDA to Ludicrous Speed, Righteous Performance, Compiling with Optimizations, Brandon's Particle Code, Code Modifications, Bandwidth Bound, Thread Occupancy, Instruction Stream, Branch Divergence.

Typology: Slides
2012/2013
Uploaded on 04/24/2013 by baijayanthi


Taking CUDA to Ludicrous Speed

Getting Righteous Performance from your GPU

1

Performance: How Much Is Enough?

(CPU Edition)

2

- Could I be getting better performance?
  - Probably a little bit. Most of the performance is handled in HW.
- How much better?
  - If you compile with -O3, you can get faster (maybe 2x).
  - If you are careful about tiling your memory, you can get faster on codes that benefit from that (maybe 2-3x; see the tiling sketch below).
- Is that much performance worth the work?
  - Compiling with optimizations is a no-brainer (and yet...)
  - Tiling is useful, but takes an investment.
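As a rough illustration of that tiling point (not from the original slides; the function name and tile size are assumptions), a cache-blocked CPU matrix multiply, compiled with -O3, looks something like this:

#include <stddef.h>

#define TILE 32   /* tile size; a tunable assumption */

/* Cache-blocked (tiled) matrix multiply: C += A * B, all n x n, row-major.
   Each inner loop nest works on TILE x TILE sub-blocks so the data it
   re-uses stays resident in cache instead of streaming from DRAM. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}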

What’s Limiting My Code?

4

- Am I bandwidth bound? (How do I tell?)
  - Make sure I have high thread occupancy to tolerate latencies (lots of threads)
    - These threads can get some work done while we wait for memory
  - Move re-used values to closer memories
    - Shared
    - Constant/Texture
- Am I not bandwidth bound - what is now my limit?
  - Take a closer look at the instruction stream (see the sketch below)
    - Unroll loops
    - Minimize branch divergence
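A minimal sketch of those last two points (illustrative kernel, not from the slides): a fixed-trip-count loop the compiler can fully unroll, and a branch chosen so that only the boundary warp diverges:

__global__ void scale_and_bias(float *data, int n, float scale, float bias)
{
    // Each thread handles 4 consecutive elements.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

    // A fixed trip count lets #pragma unroll remove the loop overhead
    // from the instruction stream entirely.
    #pragma unroll
    for (int r = 0; r < 4; ++r) {
        int idx = base + r;
        // Branching on the index only diverges in the last warp of the grid;
        // branching on data values (e.g. data[idx] > 0) could split every
        // warp into both paths and serialize them.
        if (idx < n)
            data[idx] = data[idx] * scale + bias;
    }
}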

CUDA Memories

Locality Matters!

5

CUDA Variable Type Qualifiers

7

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register
  - Except arrays, which reside in local memory

Variable declaration                       | Memory   | Scope  | Lifetime
__device__ __local__    int LocalVar;      | local    | thread | thread
__device__ __shared__   int SharedVar;     | shared   | block  | block
__device__              int GlobalVar;     | global   | grid   | application
__device__ __constant__ int ConstantVar;   | constant | grid   | application
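A small sketch (not from the slides; names are illustrative) showing where each qualifier from the table places data:

__constant__ float coeff[16];     // __constant__: constant memory, grid scope, application lifetime
__device__   float globalVar;     // __device__ alone: global memory, grid scope, application lifetime

// Assumes it is launched with 256 threads per block.
__global__ void qualifiers_demo(const float *in, float *out, int n)
{
    __shared__ float tile[256];   // __shared__: one copy per block, block lifetime

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float localVal = 0.0f;        // unqualified automatic scalar: a per-thread register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();              // whole block must finish loading before anyone reads the tile

    if (i < n) {
        localVal = tile[threadIdx.x] * coeff[threadIdx.x & 15];
        out[i] = localVal;
        if (i == 0) globalVar = localVal;   // touch the global just to show its scope
    }
}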

A Common Programming Strategy

8

- Global memory resides in device memory (DRAM)
  - Much slower access than shared memory (200x!)
  - ...but also much larger
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory (a skeleton of this pattern is sketched after this list):
  - Partition data into subsets that fit into shared memory
  - Each block will then:
    1. Load its subset from global memory to shared memory
       - using multiple threads to exploit memory-level parallelism
    2. Perform the computation on the subset from shared memory
       - each thread can efficiently multi-pass over any data element
    3. Copy results from shared memory back to global memory
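A minimal skeleton of that load / sync / compute / write-back pattern (illustrative 1-D example, not from the slides):

#define TILE_SIZE 256   // must match the block size used at launch

__global__ void process_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE_SIZE];   // this block's subset, in fast shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    // 1. Load the block's subset from global to shared memory, one element per thread.
    tile[t] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // the whole tile must be present before anyone reads it

    // 2. Compute from shared memory; neighbouring elements are re-read cheaply.
    int lo = (t > 0) ? t - 1 : t;
    int hi = (t < TILE_SIZE - 1) ? t + 1 : t;
    float result = 0.25f * tile[lo] + 0.5f * tile[t] + 0.25f * tile[hi];

    // 3. Copy results back to global memory.
    if (i < n)
        out[i] = result;
}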

Review

10

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    // Each thread computes one element of the block sub-matrix
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

How about performance on G80?

11

- All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating-point multiply-add
  - That is 4 B of memory bandwidth per FLOP
  - 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  - 86.4 GB/s limits the code to 86.4 / 4 = 21.6 GFLOPS (a quick arithmetic check follows the figure below)
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: CUDA device memory model - the host reads/writes global and constant memory; each block in the grid has its own shared memory; each thread has its own registers]
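A quick back-of-the-envelope check of the numbers quoted above (a sketch using the G80 figures from this slide):

#include <cstdio>

int main()
{
    const double peak_gflops    = 346.5;          // G80 peak single-precision rate
    const double bytes_per_flop = (2 * 4.0) / 2;  // two 4-byte loads feed one mul-add (2 FLOPs)

    printf("bandwidth needed for peak: %.0f GB/s\n", peak_gflops * bytes_per_flop);  // ~1386
    printf("achievable at 86.4 GB/s:   %.1f GFLOPS\n", 86.4 / bytes_per_flop);       // ~21.6
    return 0;
}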

Tiled Multiply

13

- Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) with threads (tx, ty) computes one Pdsub tile of Pd]

Tiled Multiply

14

- Two-step process:
  1. Threads load all M and N values in the tile into shared memory
  2. Compute all the multiply-adds within that tile and add them to the sum
- Note: must enforce a barrier between steps 1 and 2! (see the kernel sketch below)

[Figure: same tiling diagram as the previous slide]
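Putting the two steps together gives a shared-memory kernel along the following lines. This is a sketch consistent with the naive kernel shown earlier (the original deck's exact listing is not part of this preview); it assumes Width is a multiple of TILE_WIDTH:

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;

    // March across the tiles of Md and Nd needed for this Pd element.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Step 1: each thread loads one element of the Md tile and one of the Nd tile.
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();              // barrier between step 1 and step 2

        // Step 2: all multiply-adds for this tile, entirely out of shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();              // finish reading the tile before it is overwritten
    }

    Pd[Row * Width + Col] = Pvalue;
}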

First-order Size Considerations in G80

16

- Each thread block should have many threads
  - TILE_WIDTH of 16 gives 16 * 16 = 256 threads
- There should be many thread blocks
  - A 1024 x 1024 Pd gives 64 x 64 = 4,096 thread blocks
- Each thread block performs 2 * 256 = 512 float loads from global memory for 256 * (2 * 16) = 8,192 mul/add operations
  - The compute-to-memory ratio is now 16:1!
  - Memory bandwidth is no longer a limiting factor

CUDA Code:

Kernel Execution Configuration

17

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
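For context, a hedged host-side sketch around that configuration (the allocation and copy calls are standard CUDA runtime API; the host array names M, N, P are assumptions):

// Host-side setup and launch (sketch). Assumes Width is a multiple of TILE_WIDTH
// and M, N, P are Width x Width float arrays already populated on the host.
size_t bytes = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaMalloc((void**)&Md, bytes);
cudaMalloc((void**)&Nd, bytes);
cudaMalloc((void**)&Pd, bytes);
cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);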

G80 Shared Memory and Threading

19

- Each SM in G80 has 16 KB of shared memory
  - SM size is implementation dependent!
  - For TILE_WIDTH = 16, each thread block uses 2 * 256 * 4 B = 2 KB of shared memory
  - So each SM can potentially have up to 8 thread blocks actively executing (see the occupancy query sketched below)
    - This allows up to 8 * 512 = 4,096 pending loads (2 per thread, 256 threads per block)
  - TILE_WIDTH = 32 would use 2 * 32 * 32 * 4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time per SM
- Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  - The 86.4 GB/s bandwidth can now support (86.4 / 4) * 16 = 345.6 GFLOPS!
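On current CUDA toolkits, this "blocks per SM" reasoning can also be asked of the runtime directly; a small sketch using the standard occupancy API against the tiled kernel assumed earlier:

#include <cstdio>

// Host-side helper (sketch): how many thread blocks of the tiled kernel can be
// resident on one SM, given its register and static shared-memory usage?
void report_occupancy()
{
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM,
        MatrixMulKernel,            // the tiled kernel sketched earlier
        TILE_WIDTH * TILE_WIDTH,    // threads per block (16 x 16 = 256)
        0);                         // no extra dynamic shared memory
    printf("Resident thread blocks per SM: %d\n", maxBlocksPerSM);
}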

Tiling Size Effects

[Chart: measured GFLOPS (0-100) for the untiled kernel and for tiled-only vs. tiled & unrolled versions at 4x4, 8x8, 12x12, and 16x16 tile sizes]

20