Taking CUDA to Ludicrous Speed - Computer Organization and Design - Lecture Slides

This series of lecture slides from a digital system design course covers how to get good performance from CUDA programs. The major points in these lecture slides are: Taking CUDA to Ludicrous Speed, Righteous Performance, Compiling with Optimizations, Brandon's Particle Code, Code Modifications, Bandwidth Bound, Thread Occupancy, Instruction Stream, Branch Divergence.

Typology: Slides
2012/2013
Uploaded on 04/24/2013 by baijayanthi


Taking CUDA to Ludicrous Speed

Getting Righteous Performance from your GPU

1

Performance: How Much Is Enough?

(CPU Edition)

2

- Could I be getting better performance?
  - Probably a little bit. Most of the performance is handled in HW.
- How much better?
  - If you compile with -O3, you can get faster (maybe 2x).
  - If you are careful about tiling your memory, you can get faster on codes that benefit from that (maybe 2-3x; see the tiling sketch below).
- Is that much performance worth the work?
  - Compiling with optimizations is a no-brainer (and yet...)
  - Tiling is useful, but takes an investment.
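As a rough illustration of that tiling point (not from the original slides; the function name and tile size are assumptions), a cache-blocked CPU matrix multiply, compiled with -O3, looks something like this:

#include <stddef.h>

#define TILE 32   /* tile size; a tunable assumption */

/* Cache-blocked (tiled) matrix multiply: C += A * B, all n x n, row-major.
   Each inner loop nest works on TILE x TILE sub-blocks so the data it
   re-uses stays resident in cache instead of streaming from DRAM. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}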

What’s Limiting My Code?

4

- Am I bandwidth bound? (How do I tell?)
  - Make sure I have high thread occupancy to tolerate latencies (lots of threads)
    - These threads can get some work done while we wait for memory
  - Move re-used values to closer memories
    - Shared
    - Constant/Texture
- Am I not bandwidth bound - what is now my limit?
  - Take a closer look at the instruction stream (see the sketch below)
    - Unroll loops
    - Minimize branch divergence
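A minimal sketch of those last two points (illustrative kernel, not from the slides): a fixed-trip-count loop the compiler can fully unroll, and a branch chosen so that only the boundary warp diverges:

__global__ void scale_and_bias(float *data, int n, float scale, float bias)
{
    // Each thread handles 4 consecutive elements.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

    // A fixed trip count lets #pragma unroll remove the loop overhead
    // from the instruction stream entirely.
    #pragma unroll
    for (int r = 0; r < 4; ++r) {
        int idx = base + r;
        // Branching on the index only diverges in the last warp of the grid;
        // branching on data values (e.g. data[idx] > 0) could split every
        // warp into both paths and serialize them.
        if (idx < n)
            data[idx] = data[idx] * scale + bias;
    }
}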

CUDA Memories

Locality Matters!

5

CUDA Variable Type Qualifiers

7

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register
  - Except arrays, which reside in local memory

Variable declaration                       | Memory   | Scope  | Lifetime
__device__ __local__    int LocalVar;      | local    | thread | thread
__device__ __shared__   int SharedVar;     | shared   | block  | block
__device__              int GlobalVar;     | global   | grid   | application
__device__ __constant__ int ConstantVar;   | constant | grid   | application
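A small sketch (not from the slides; names are illustrative) showing where each qualifier from the table places data:

__constant__ float coeff[16];     // __constant__: constant memory, grid scope, application lifetime
__device__   float globalVar;     // __device__ alone: global memory, grid scope, application lifetime

// Assumes it is launched with 256 threads per block.
__global__ void qualifiers_demo(const float *in, float *out, int n)
{
    __shared__ float tile[256];   // __shared__: one copy per block, block lifetime

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float localVal = 0.0f;        // unqualified automatic scalar: a per-thread register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();              // whole block must finish loading before anyone reads the tile

    if (i < n) {
        localVal = tile[threadIdx.x] * coeff[threadIdx.x & 15];
        out[i] = localVal;
        if (i == 0) globalVar = localVal;   // touch the global just to show its scope
    }
}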

A Common Programming Strategy

8

- Global memory resides in device memory (DRAM)
  - Much slower access than shared memory (200x!)
  - ...but also much larger
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory (a skeleton of this pattern is sketched after this list):
  - Partition data into subsets that fit into shared memory
  - Each block will then:
    1. Load its subset from global memory to shared memory
       - using multiple threads to exploit memory-level parallelism
    2. Perform the computation on the subset from shared memory
       - each thread can efficiently multi-pass over any data element
    3. Copy results from shared memory back to global memory
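A minimal skeleton of that load / sync / compute / write-back pattern (illustrative 1-D example, not from the slides):

#define TILE_SIZE 256   // must match the block size used at launch

__global__ void process_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE_SIZE];   // this block's subset, in fast shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    // 1. Load the block's subset from global to shared memory, one element per thread.
    tile[t] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // the whole tile must be present before anyone reads it

    // 2. Compute from shared memory; neighbouring elements are re-read cheaply.
    int lo = (t > 0) ? t - 1 : t;
    int hi = (t < TILE_SIZE - 1) ? t + 1 : t;
    float result = 0.25f * tile[lo] + 0.5f * tile[t] + 0.25f * tile[hi];

    // 3. Copy results back to global memory.
    if (i < n)
        out[i] = result;
}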

Review

10

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    // Each thread computes one element of the block sub-matrix
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

How about performance on G80?

11

- All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating-point multiply-add
  - That is 4 B of memory bandwidth per FLOP
  - 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  - 86.4 GB/s limits the code to 86.4 / 4 = 21.6 GFLOPS (a quick arithmetic check follows the figure below)
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: CUDA device memory model - the host reads/writes global and constant memory; each block in the grid has its own shared memory; each thread has its own registers]
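A quick back-of-the-envelope check of the numbers quoted above (a sketch using the G80 figures from this slide):

#include <cstdio>

int main()
{
    const double peak_gflops    = 346.5;          // G80 peak single-precision rate
    const double bytes_per_flop = (2 * 4.0) / 2;  // two 4-byte loads feed one mul-add (2 FLOPs)

    printf("bandwidth needed for peak: %.0f GB/s\n", peak_gflops * bytes_per_flop);  // ~1386
    printf("achievable at 86.4 GB/s:   %.1f GFLOPS\n", 86.4 / bytes_per_flop);       // ~21.6
    return 0;
}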

Tiled Multiply

13

- Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) with threads (tx, ty) computes one Pdsub tile of Pd]

Tiled Multiply

14

- Two-step process:
  1. Threads load all M and N values in the tile into shared memory
  2. Compute all the multiply-adds within that tile and add them to the sum
- Note: must enforce a barrier between steps 1 and 2! (see the kernel sketch below)

[Figure: same tiling diagram as the previous slide]
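Putting the two steps together gives a shared-memory kernel along the following lines. This is a sketch consistent with the naive kernel shown earlier (the original deck's exact listing is not part of this preview); it assumes Width is a multiple of TILE_WIDTH:

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;

    // March across the tiles of Md and Nd needed for this Pd element.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Step 1: each thread loads one element of the Md tile and one of the Nd tile.
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();              // barrier between step 1 and step 2

        // Step 2: all multiply-adds for this tile, entirely out of shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();              // finish reading the tile before it is overwritten
    }

    Pd[Row * Width + Col] = Pvalue;
}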

First-order Size Considerations in G80

16

- Each thread block should have many threads
  - TILE_WIDTH of 16 gives 16 * 16 = 256 threads
- There should be many thread blocks
  - A 1024 x 1024 Pd gives 64 x 64 = 4,096 thread blocks
- Each thread block performs 2 * 256 = 512 float loads from global memory for 256 * (2 * 16) = 8,192 mul/add operations
  - The compute-to-memory ratio is now 16:1!
  - Memory bandwidth is no longer a limiting factor

CUDA Code:

Kernel Execution Configuration

17

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
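For context, a hedged host-side sketch around that configuration (the allocation and copy calls are standard CUDA runtime API; the host array names M, N, P are assumptions):

// Host-side setup and launch (sketch). Assumes Width is a multiple of TILE_WIDTH
// and M, N, P are Width x Width float arrays already populated on the host.
size_t bytes = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaMalloc((void**)&Md, bytes);
cudaMalloc((void**)&Nd, bytes);
cudaMalloc((void**)&Pd, bytes);
cudaMemcpy(Md, M, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(Nd, N, bytes, cudaMemcpyHostToDevice);

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

cudaMemcpy(P, Pd, bytes, cudaMemcpyDeviceToHost);
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);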

G80 Shared Memory and Threading

19

- Each SM in G80 has 16 KB of shared memory
  - SM size is implementation dependent!
  - For TILE_WIDTH = 16, each thread block uses 2 * 256 * 4 B = 2 KB of shared memory
  - So each SM can potentially have up to 8 thread blocks actively executing (see the occupancy query sketched below)
    - This allows up to 8 * 512 = 4,096 pending loads (2 per thread, 256 threads per block)
  - TILE_WIDTH = 32 would use 2 * 32 * 32 * 4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time per SM
- Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  - The 86.4 GB/s bandwidth can now support (86.4 / 4) * 16 = 345.6 GFLOPS!
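On current CUDA toolkits, this "blocks per SM" reasoning can also be asked of the runtime directly; a small sketch using the standard occupancy API against the tiled kernel assumed earlier:

#include <cstdio>

// Host-side helper (sketch): how many thread blocks of the tiled kernel can be
// resident on one SM, given its register and static shared-memory usage?
void report_occupancy()
{
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM,
        MatrixMulKernel,            // the tiled kernel sketched earlier
        TILE_WIDTH * TILE_WIDTH,    // threads per block (16 x 16 = 256)
        0);                         // no extra dynamic shared memory
    printf("Resident thread blocks per SM: %d\n", maxBlocksPerSM);
}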

Tiling Size Effects

[Chart: measured GFLOPS (0-100) for the untiled kernel and for tiled-only vs. tiled & unrolled versions at 4x4, 8x8, 12x12, and 16x16 tile sizes]

20