CS203 Advanced Computer Architecture (Lecture Notes)
Memory Hierarchy: Basics

Hung-Wei Tseng

Recap: von Neumann architecture

[Figure: a processor (program counter, instruction fetch, instruction decode, registers, ALUs, complex arithmetic (mul/div), branch/jump, and memory-operation units) connected to memory and storage. Memory holds both the machine-code instructions and the data of a program such as int main(){ printf("Hello, world!\n"); }; for example, the instruction bytes 4883ec... decode to sub $0x8,%rsp.]

By loading different programs into memory, your computer can perform different functions

The Pareto principle (https://en.wikipedia.org/wiki/Pareto_principle): the top 10% own 67% of the wealth in the U.S.; 80% of users use only 20% of the features.

You only need to know 2% of English words to understand 90% of conversations.

Modern DRAM performance

Source: https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies

Standard | Data Rate (MT/s) | Bandwidth (GB/s) | CAS (clk) | CAS Latency (ns) | Year introduced
---------|------------------|------------------|-----------|------------------|----------------
SDRAM    |  100             |  0.80            |  3        | 24.00            | 1992
SDRAM    |  133             |  1.07            |  3        | 22.50            |
DDR      |  400             |  3.20            |  5        | 25.00            | 1998
DDR      |  667             |  5.33            |  5        | 15.00            |
DDR      |  800             |  6.40            |  6        | 15.00            |
DDR2     |  400             |  3.20            |  5        | 25.00            | 2003
DDR2     |  667             |  5.33            |  5        | 15.00            |
DDR2     |  800             |  6.40            |  6        | 15.00            |
DDR3     |  800             |  6.40            |  6        | 15.00            | 2007
DDR3     | 1066             |  8.53            |  8        | 15.01            |
DDR3     | 1333             | 10.67            |  9        | 13.50            |
DDR3     | 1600             | 12.80            | 11        | 13.75            |
DDR3     | 1866             | 14.93            | 13        | 13.93            |
DDR3     | 2133             | 17.07            | 14        | 13.13            |
DDR4     | 1600             | 12.80            | 11        | 13.75            | 2014
DDR4     | 1866             | 14.93            | 13        | 13.92            |
DDR4     | 2133             | 17.07            | 15        | 14.06            |
DDR4     | 2400             | 19.20            | 17        | 14.17            |
DDR4     | 2666             | 21.33            | 19        | 14.25            |
DDR4     | 2933             | 23.46            | 21        | 14.32            |
DDR4     | 3200             | 25.60            | 22        | 13.75            |
DDR5     | 3200             | 25.60            | 22        | 13.75            | 2020
DDR5     | 3600             | 28.80            | 26        | 14.44            |
DDR5     | 4000             | 32.00            | 28        | 14.00            |
DDR5     | 4400             | 35.20            | 32        | 14.55            |
DDR5     | 4800             | 38.40            | 34        | 14.17            |
DDR5     | 5200             | 41.60            | 38        | 14.62            |
DDR5     | 5600             | 44.80            | 40        | 14.29            |
DDR5     | 6000             | 48.00            | 42        | 14.00            |
DDR5     | 6400             | 51.20            | 46        | 14.38            |
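As a sanity check on the latency column (this derivation is my note, not part of the slide): DDR memory's I/O clock runs at half the transfer rate, so the CAS latency in nanoseconds follows from the CAS clock count and the data rate:

t_{\text{CAS}}\,[\text{ns}] = \frac{\text{CAS}\,[\text{clk}]}{(\text{data rate}/2)\,[\text{MHz}]} \times 1000 = \frac{2000 \times \text{CAS}}{\text{data rate}\,[\text{MT/s}]}

For example, DDR5-4800 with CAS 34 gives 2000 × 34 / 4800 ≈ 14.17 ns.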

[Figure: the "latency" gap between CPU and DRAM. The plot tracks CPU latency (ns), DRAM latency (ns), and the DRAM/CPU latency ratio from 1992 to 2020, pairing CPU models (i486, Pentium II, Pentium 4, Core 2, Core i7-4790K, Core i5-10600K) against DRAM standards (SDRAM, DDR, DDR2, DDR3, DDR4, DDR5).]
  • Assume a processor running at 4 GHz and a program in which 20% of the instructions are loads/stores. If an instruction makes no memory access and has already been fetched, its CPI is just 1. Now suppose main memory is DDR5 and the program is well optimized so that precharge is never necessary; the memory access latency is 13.75 ns. What is the average CPI (pick the closest one)? A. 9 B. 12 C. 15 D. 56 E. 67
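One way to work this out, following the same model the later cache example uses, in which every instruction fetch and every load/store goes all the way to DRAM (this worked answer is my addition, not printed on the slide):

\text{CPU cycle time} = \frac{1}{4 \times 10^{9}\,\text{Hz}} = 0.25\,\text{ns}, \qquad \text{one DRAM access} = \frac{13.75\,\text{ns}}{0.25\,\text{ns}} = 55\,\text{cycles}

\text{CPI}_{\text{average}} = 1 + 100\% \times 55 + 20\% \times 55 = 1 + 55 + 11 = 67, \text{ which points to choice E.}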

The impact of “slow” memory

The memory-wall problem

[Figure: the same processor / memory / storage (von Neumann) diagram as above, now highlighting that every instruction fetch, and every data access made by a load/store, must go to memory.]

Fetching an instruction is 50x slower than other CPU operations! It is even worse when the instruction also needs to access data: another 50+ cycles.

And assuming only 20% loads/stores is an underestimate …

Recap: Speedup and Amdahl's Law

  • Definition of "speedup of Y over X", or "Y is n times faster than X":

    \text{Speedup}_{Y\text{ over }X} = n = \frac{\text{Execution Time}_X}{\text{Execution Time}_Y}

  • Amdahl's Law:

    \text{Speedup}_{\text{enhanced}}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}}

    • Corollary 1: each optimization has an upper bound,
      \text{Speedup}_{\max}(f, \infty) = \frac{1}{1 - f}
    • Corollary 2: make the common case (the most time-consuming case) fast!
    • Corollary 3: optimization has a moving target
    • Corollary 4: exploiting more parallelism from a program is the key to performance gain in modern architectures,
      \text{Speedup}_{\text{parallel}}(f_{\text{parallelizable}}, \infty) = \frac{1}{1 - f_{\text{parallelizable}}}
    • Corollary 5: single-core performance still matters,
      \text{Speedup}_{\text{enhanced}}(f, s, r) = \frac{1}{\frac{1 - f}{\text{perf}(r)} + \frac{f}{s}}
    • Corollary 6: don't hurt the non-common case too much
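A quick numeric illustration (my own numbers, not from the slides): if f = 90% of the execution time is sped up by s = 10x,

\text{Speedup}_{\text{enhanced}}(0.9, 10) = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26

and even with s → \infty, Corollary 1's upper bound applies:

\text{Speedup}_{\max}(0.9, \infty) = \frac{1}{1 - 0.9} = 10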

Alternatives?

Fast, but expensive $$$

Memory Hierarchy

[Figure: the memory hierarchy. Processor core registers: fastest, < 1 ns, 32 or 64 words. SRAM cache ($): a few ns, KBs to MBs. DRAM: tens of ns, GBs. Storage: tens of µs, TBs. Each level down is larger but slower.]
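The benefit of the hierarchy is usually quantified with the average memory access time (AMAT); the slide leaves the formula implicit, but the standard form, and the structure the calculation on the next slide uses, is:

\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}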

  • Assume a processor running at 4 GHz and a program in which 20% of the instructions are loads/stores. If an instruction makes no memory access, its CPI is just 1. In addition to DDR5, whose latency is 13.75 ns, we also have an SRAM cache with a latency of just 0.5 ns that can capture 90% of the desired data/instructions. What is the average CPI (pick the closest one)? A. 6 B. 8 C. 10 D. 12 E. 67

How can the "memory hierarchy" help performance?

\text{CPU cycle time} = \frac{1}{4 \times 10^{9}\,\text{Hz}} = 0.25\,\text{ns}

\text{Each \$ access} = \frac{0.5\,\text{ns}}{0.25\,\text{ns}} = 2\,\text{cycles}, \qquad \text{each DRAM access} = \frac{13.75\,\text{ns}}{0.25\,\text{ns}} = 55\,\text{cycles}

\text{CPI}_{\text{average}} = 1 + 100\% \times [2 + (1 - 90\%) \times 55] + 20\% \times [2 + (1 - 90\%) \times 55] = 10\,\text{cycles}
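A minimal C sketch of this average-CPI model (the constants are the slide's; the function name average_cpi and the code structure are my own, assuming one instruction fetch per instruction plus one data access per load/store):

#include <stdio.h>

/* Average CPI with a single cache level: every instruction pays one
 * instruction fetch, and loads/stores pay one extra data access; each
 * access costs the cache latency plus, on a miss, the DRAM latency.
 * Latencies are converted from nanoseconds to CPU cycles. */
static double average_cpi(double clock_ghz, double load_store_frac,
                          double cache_ns, double dram_ns, double hit_rate)
{
    double cycle_ns = 1.0 / clock_ghz;          /* 4 GHz    -> 0.25 ns   */
    double cache_cycles = cache_ns / cycle_ns;  /* 0.5 ns   -> 2 cycles  */
    double dram_cycles = dram_ns / cycle_ns;    /* 13.75 ns -> 55 cycles */
    double access = cache_cycles + (1.0 - hit_rate) * dram_cycles;
    return 1.0 + 1.0 * access + load_store_frac * access;
}

int main(void)
{
    /* Slide's numbers: 4 GHz, 20% loads/stores, 0.5 ns cache,
     * 13.75 ns DRAM, 90% hit rate -> prints "average CPI = 10.00" */
    printf("average CPI = %.2f\n", average_cpi(4.0, 0.2, 0.5, 13.75, 0.90));
    return 0;
}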

[Figure: CPU -> L1 $ -> DRAM; the fraction (1 − 90%) of accesses that miss in the L1 $ goes on to DRAM.]

L1? L2? L3?