CS203 Advanced Computer Architecture (Lecture Notes)
Memory Hierarchy: Basics

Hung-Wei Tseng

Recap: von Neumann architecture

[Figure: a processor (program counter, instruction fetch, instruction decode, registers, ALUs, complex arithmetic (mul/div), branch/jump, and memory-operation units) connected to memory and storage. Memory holds both the machine-code instructions and the data of a program such as int main(){ printf("Hello, world!\n"); }; for example, the instruction bytes 4883ec... decode to sub $0x8,%rsp.]

By loading different programs into memory, your computer can perform different functions

The Pareto principle (https://en.wikipedia.org/wiki/Pareto_principle): the top 10% own 67% of the wealth in the U.S.; 80% of users use only 20% of the features.

You only need to know 2% of English words to understand 90% of conversations.

Modern DRAM performance

Source: https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies

Standard | Data Rate (MT/s) | Bandwidth (GB/s) | CAS (clk) | CAS Latency (ns) | Year introduced
---------|------------------|------------------|-----------|------------------|----------------
SDRAM    |  100             |  0.80            |  3        | 24.00            | 1992
SDRAM    |  133             |  1.07            |  3        | 22.50            |
DDR      |  400             |  3.20            |  5        | 25.00            | 1998
DDR      |  667             |  5.33            |  5        | 15.00            |
DDR      |  800             |  6.40            |  6        | 15.00            |
DDR2     |  400             |  3.20            |  5        | 25.00            | 2003
DDR2     |  667             |  5.33            |  5        | 15.00            |
DDR2     |  800             |  6.40            |  6        | 15.00            |
DDR3     |  800             |  6.40            |  6        | 15.00            | 2007
DDR3     | 1066             |  8.53            |  8        | 15.01            |
DDR3     | 1333             | 10.67            |  9        | 13.50            |
DDR3     | 1600             | 12.80            | 11        | 13.75            |
DDR3     | 1866             | 14.93            | 13        | 13.93            |
DDR3     | 2133             | 17.07            | 14        | 13.13            |
DDR4     | 1600             | 12.80            | 11        | 13.75            | 2014
DDR4     | 1866             | 14.93            | 13        | 13.92            |
DDR4     | 2133             | 17.07            | 15        | 14.06            |
DDR4     | 2400             | 19.20            | 17        | 14.17            |
DDR4     | 2666             | 21.33            | 19        | 14.25            |
DDR4     | 2933             | 23.46            | 21        | 14.32            |
DDR4     | 3200             | 25.60            | 22        | 13.75            |
DDR5     | 3200             | 25.60            | 22        | 13.75            | 2020
DDR5     | 3600             | 28.80            | 26        | 14.44            |
DDR5     | 4000             | 32.00            | 28        | 14.00            |
DDR5     | 4400             | 35.20            | 32        | 14.55            |
DDR5     | 4800             | 38.40            | 34        | 14.17            |
DDR5     | 5200             | 41.60            | 38        | 14.62            |
DDR5     | 5600             | 44.80            | 40        | 14.29            |
DDR5     | 6000             | 48.00            | 42        | 14.00            |
DDR5     | 6400             | 51.20            | 46        | 14.38            |
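As a sanity check on the latency column (this derivation is my note, not part of the slide): DDR memory's I/O clock runs at half the transfer rate, so the CAS latency in nanoseconds follows from the CAS clock count and the data rate:

t_{\text{CAS}}\,[\text{ns}] = \frac{\text{CAS}\,[\text{clk}]}{(\text{data rate}/2)\,[\text{MHz}]} \times 1000 = \frac{2000 \times \text{CAS}}{\text{data rate}\,[\text{MT/s}]}

For example, DDR5-4800 with CAS 34 gives 2000 × 34 / 4800 ≈ 14.17 ns.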

[Figure: the "latency" gap between CPU and DRAM. The plot tracks CPU latency (ns), DRAM latency (ns), and the DRAM/CPU latency ratio from 1992 to 2020, pairing CPU models (i486, Pentium II, Pentium 4, Core 2, Core i7-4790K, Core i5-10600K) against DRAM standards (SDRAM, DDR, DDR2, DDR3, DDR4, DDR5).]
  • Assume a processor running at 4 GHz and a program in which 20% of the instructions are loads/stores. If an instruction makes no memory access and has already been fetched, its CPI is just 1. Now suppose main memory is DDR5 and the program is well optimized so that precharge is never necessary; the memory access latency is 13.75 ns. What is the average CPI (pick the closest one)? A. 9 B. 12 C. 15 D. 56 E. 67
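One way to work this out, following the same model the later cache example uses, in which every instruction fetch and every load/store goes all the way to DRAM (this worked answer is my addition, not printed on the slide):

\text{CPU cycle time} = \frac{1}{4 \times 10^{9}\,\text{Hz}} = 0.25\,\text{ns}, \qquad \text{one DRAM access} = \frac{13.75\,\text{ns}}{0.25\,\text{ns}} = 55\,\text{cycles}

\text{CPI}_{\text{average}} = 1 + 100\% \times 55 + 20\% \times 55 = 1 + 55 + 11 = 67, \text{ which points to choice E.}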

The impact of “slow” memory

The memory-wall problem

[Figure: the same processor / memory / storage (von Neumann) diagram as above, now highlighting that every instruction fetch, and every data access made by a load/store, must go to memory.]

Fetching an instruction is 50x slower than other CPU operations! It is even worse when the instruction also needs to access data: another 50+ cycles.

And assuming only 20% loads/stores is an underestimate …

Recap: Speedup and Amdahl's Law

  • Definition of "speedup of Y over X", or "Y is n times faster than X":

    \text{Speedup}_{Y\text{ over }X} = n = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}

  • Amdahl's Law:

    \text{Speedup}_{\text{enhanced}}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}}

    • Corollary 1: each optimization has an upper bound,
      \text{Speedup}_{\max}(f, \infty) = \frac{1}{1 - f}
    • Corollary 2: make the common case (the most time-consuming case) fast!
    • Corollary 3: optimization has a moving target
    • Corollary 4: exploiting more parallelism from a program is the key to performance gain in modern architectures,
      \text{Speedup}_{\text{parallel}}(f_{\text{parallelizable}}, \infty) = \frac{1}{1 - f_{\text{parallelizable}}}
    • Corollary 5: single-core performance still matters,
      \text{Speedup}_{\text{enhanced}}(f, s, r) = \frac{1}{\frac{1 - f}{\text{perf}(r)} + \frac{f}{s}}
    • Corollary 6: don't hurt the non-common case too much
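A quick numeric illustration (my own numbers, not from the slides): if f = 90% of the execution time is sped up by s = 10x,

\text{Speedup}_{\text{enhanced}}(0.9, 10) = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26

and even with s → \infty, Corollary 1's upper bound applies:

\text{Speedup}_{\max}(0.9, \infty) = \frac{1}{1 - 0.9} = 10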

Alternatives?

Fast, but expensive $$$

Memory Hierarchy

[Figure: the memory hierarchy. Processor core registers: fastest, < 1 ns, 32 or 64 words. SRAM cache ($): a few ns, KBs to MBs. DRAM: tens of ns, GBs. Storage: tens of µs, TBs. Each level down is larger but slower.]
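The benefit of the hierarchy is usually quantified with the average memory access time (AMAT); the slide leaves the formula implicit, but the standard form, and the structure the calculation on the next slide uses, is:

\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}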

  • Assume a processor running at 4 GHz and a program in which 20% of the instructions are loads/stores. If an instruction makes no memory access, its CPI is just 1. In addition to DDR5, whose latency is 13.75 ns, we also have an SRAM cache with a latency of just 0.5 ns that can capture 90% of the desired data/instructions. What is the average CPI (pick the closest one)? A. 6 B. 8 C. 10 D. 12 E. 67

How can the "memory hierarchy" help performance?

\text{CPU cycle time} = \frac{1}{4 \times 10^{9}\,\text{Hz}} = 0.25\,\text{ns}

\text{Each \$ access} = \frac{0.5\,\text{ns}}{0.25\,\text{ns}} = 2\,\text{cycles}, \qquad \text{each DRAM access} = \frac{13.75\,\text{ns}}{0.25\,\text{ns}} = 55\,\text{cycles}

\text{CPI}_{\text{average}} = 1 + 100\% \times [2 + (1 - 90\%) \times 55] + 20\% \times [2 + (1 - 90\%) \times 55] = 10\,\text{cycles}
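A minimal C sketch of this average-CPI model (the constants are the slide's; the function name average_cpi and the code structure are my own, assuming one instruction fetch per instruction plus one data access per load/store):

#include <stdio.h>

/* Average CPI with a single cache level: every instruction pays one
 * instruction fetch, and loads/stores pay one extra data access; each
 * access costs the cache latency plus, on a miss, the DRAM latency.
 * Latencies are converted from nanoseconds to CPU cycles. */
static double average_cpi(double clock_ghz, double load_store_frac,
                          double cache_ns, double dram_ns, double hit_rate)
{
    double cycle_ns = 1.0 / clock_ghz;          /* 4 GHz    -> 0.25 ns   */
    double cache_cycles = cache_ns / cycle_ns;  /* 0.5 ns   -> 2 cycles  */
    double dram_cycles = dram_ns / cycle_ns;    /* 13.75 ns -> 55 cycles */
    double access = cache_cycles + (1.0 - hit_rate) * dram_cycles;
    return 1.0 + 1.0 * access + load_store_frac * access;
}

int main(void)
{
    /* Slide's numbers: 4 GHz, 20% loads/stores, 0.5 ns cache,
     * 13.75 ns DRAM, 90% hit rate -> prints "average CPI = 10.00" */
    printf("average CPI = %.2f\n", average_cpi(4.0, 0.2, 0.5, 13.75, 0.90));
    return 0;
}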

[Figure: CPU -> L1 $ -> DRAM; the fraction (1 − 90%) of accesses that miss in the L1 $ goes on to DRAM.]

L1? L2? L3?