CS4/MSc Parallel Architectures - 2017-
Lect. 3: Superscalar Processors
▪ Pipelining: several instructions are simultaneously at different
stages of their execution
▪ Superscalar: several instructions are simultaneously at the same
stages of their execution
▪ Out-of-order execution: instructions can be executed in an order
different from that specified in the program
▪ Dependences between instructions:
- Data Dependence (a.k.a. Read after Write - RAW)
- Control dependence
▪ Speculative execution: tentative execution despite dependences
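The RAW dependence above can be made concrete with a small sketch. This is a hypothetical illustration (the instruction encoding as `(dest_reg, src_regs)` tuples and the function name are invented for this example): an instruction depends on an earlier one exactly when it reads a register the earlier one writes.

```python
# Hypothetical sketch: detecting a RAW (read-after-write) dependence
# between two instructions, each encoded as (dest_reg, [src_regs]).

def raw_dependent(producer, consumer):
    """True if 'consumer' reads the register that 'producer' writes."""
    dest, _ = producer
    _, srcs = consumer
    return dest in srcs

i1 = ("r1", ["r2", "r3"])   # r1 = r2 + r3
i2 = ("r4", ["r1", "r5"])   # r4 = r1 + r5  (reads r1 -> RAW on i1)
i3 = ("r6", ["r2", "r7"])   # independent of i1

print(raw_dependent(i1, i2))  # True  -> i2 cannot execute before i1
print(raw_dependent(i1, i3))  # False -> i3 may execute out of order
```

An out-of-order core may execute i3 before i2, but must respect the i1→i2 edge (or execute i2 speculatively and squash it if the speculation was wrong).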
A 5-stage Pipeline
[Figure: pipeline datapath — IF → ID → EXE → MEM → WB, with the
general-purpose register file and memory]
IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers
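The stage occupancy of this pipeline can be sketched with a tiny helper (a made-up function for illustration, assuming one instruction issued per cycle and no stalls): instruction *i*, issued in cycle *i+1*, occupies stage *c − 1 − i* in cycle *c*.

```python
# Minimal sketch of 5-stage pipeline occupancy (no stalls, 1 issue/cycle).
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def stage_of(instr_idx, cycle):
    """Stage occupied by instruction 'instr_idx' (0-based) in 'cycle'
    (1-based), or None if it is not in the pipeline that cycle."""
    s = cycle - 1 - instr_idx
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In cycle 3, instruction 0 is in EXE while instruction 2 is in IF:
print(stage_of(0, 3), stage_of(2, 3))  # EXE IF
```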
Multiple-issue Superscalar
▪ Start two instructions per clock cycle
Two instructions flow through each stage per cycle (no stalls):

cycle:   1      2      3      4      5       6
IF:    I1,I2  I3,I4  I5,I6  I7,I8  I9,I10  I11,I12
ID:           I1,I2  I3,I4  I5,I6  I7,I8   I9,I10
EXE:                 I1,I2  I3,I4  I5,I6   I7,I8
MEM:                        I1,I2  I3,I4   I5,I6
WB:                                I1,I2   I3,I4

CPI → 0.5; IPC → 2
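The steady-state IPC above follows from a simple cycle count. As a sketch (function names are invented; assumes an ideal machine with no stalls), an n-wide, d-stage pipeline needs d cycles for the first group of instructions plus one cycle per additional group:

```python
import math

def total_cycles(n_instr, width, depth=5):
    """Cycles to run n_instr on an ideal 'width'-issue, 'depth'-stage
    pipeline: 'depth' cycles for the first group, then one per group."""
    return depth + math.ceil(n_instr / width) - 1

def ipc(n_instr, width, depth=5):
    return n_instr / total_cycles(n_instr, width, depth)

print(total_cycles(12, 2))            # 10 cycles for 12 instructions, 2-wide
print(round(ipc(1_000_000, 2), 3))    # 2.0 -> IPC approaches the issue width
```

For long instruction streams the pipeline-fill cost is amortized, so IPC approaches the issue width (here 2, i.e., CPI → 0.5).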
Advanced Superscalar Execution
▪ Ideally: in an n-issue superscalar, n instructions are fetched,
decoded, executed, and committed per cycle
▪ In practice:
- Data, control, and structural hazards spoil issue flow
- Multi-cycle instructions spoil commit flow
▪ Buffers at issue (issue queue) and commit (reorder buffer)
decouple these stages from the rest of the pipeline and smooth
out breaks in the instruction flow
[Figure: decoupled pipeline — a fetch engine feeds an instruction
buffer into ID/EXE/MEM/WB, with a second instruction buffer before
commit; general-purpose registers and memory as in the basic pipeline]
Problems At Instruction Fetch
▪ Control flow
- e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per
cache line; 4-wide superscalar processor
- Branch prediction is required within the instruction fetch stage
- For wider-issue processors, multiple predictions per cycle are likely required
- In practice, most fetch units only fetch up to the first predicted-taken branch
Case 1: a single not-taken branch, so the full fetch range is useful
Case 2: a single taken branch whose target falls outside the fetch
range, in another cache line
Example Frequencies of Control Flow
benchmark   taken %   avg. BB size   inst. between taken branches
eqntott      86.2        4.20                4.
espresso     63.8        4.24                6.
xlisp        64.7        4.34                6.
gcc          67.6        4.65                6.
sc           70.2        4.71                6.
compress     60.9        5.39                8.
Data from Rotenberg et al. for the SPEC 92 integer benchmarks
▪ One branch about every 4 to 6 instructions
▪ One taken branch about every 5 to 9 instructions
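These frequencies bound fetch bandwidth. As a simplified sketch (the function is hypothetical and ignores alignment and misprediction effects), if the fetch unit stops at the first predicted-taken branch, its average useful width is capped by the taken-branch distance:

```python
def effective_fetch_width(fetch_width, insts_per_taken_branch):
    """Average useful instructions per fetch, assuming fetch stops at
    the first predicted-taken branch (simplified model)."""
    return min(fetch_width, insts_per_taken_branch)

# With a taken branch every ~5 instructions, even an 8-wide fetch
# unit averages only ~5 useful instructions per cycle:
print(effective_fetch_width(8, 5))  # 5
print(effective_fetch_width(4, 9))  # 4 -> here the fetch width is the limit
```

This is why mechanisms that fetch past taken branches (interleaved caches with alignment networks, or trace caches, below) matter for wide-issue machines.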
Example Advanced Fetch Unit
Figure from Rotenberg et al. Main components:
- 2-way interleaved I-cache
- Control-flow prediction units: (i) branch target buffer,
(ii) return address stack, (iii) branch predictor
- Mask to select instructions from each of the cache lines
- Final alignment unit
Trace Caches
▪ Traditional I-cache: instructions laid out in program order
▪ Dynamic execution order does not always follow program order
(e.g., taken branches) and the dynamic order also changes
▪ Idea:
- Store instructions in execution order (traces)
- Traces can start with any static instruction and are identified by the starting
instruction’s PC
- Traces are dynamically created as instructions are normally fetched and branches
are resolved
- Traces also contain the outcomes of the implicitly predicted branches
- When the same trace is encountered again (i.e., same starting instruction and same
branch predictions), instructions are obtained from the trace cache
- Note that multiple traces can be stored with the same starting instruction
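The lookup described above can be sketched as a map keyed by starting PC plus predicted branch outcomes. This is a hypothetical toy model (class and method names invented; a real trace cache is a set-associative hardware structure with fill logic tied to the commit stage):

```python
# Toy trace cache: traces indexed by (start PC, branch outcomes).
class TraceCache:
    def __init__(self):
        self.traces = {}   # (start_pc, outcomes) -> list of instruction PCs

    def fill(self, start_pc, outcomes, instrs):
        """Record a trace built as instructions commit."""
        self.traces[(start_pc, tuple(outcomes))] = list(instrs)

    def lookup(self, start_pc, predicted_outcomes):
        """Hit only if the starting PC *and* the predicted branch
        directions match a stored trace; None on a miss."""
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
# A trace starting at 0x400 whose first branch was taken (T) and
# second not taken (NT):
tc.fill(0x400, ["T", "NT"], [0x400, 0x404, 0x420, 0x424, 0x428])

print(tc.lookup(0x400, ["T", "NT"]))   # hit: the stored trace
print(tc.lookup(0x400, ["NT", "NT"]))  # miss (None): same start, other path
```

Note how two traces with the same starting PC but different branch outcomes would coexist as distinct entries, matching the last bullet above.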
Superscalar: Other Challenges
▪ Superscalar decode
– Replicate decoders (OK)
▪ Superscalar issue
– Number of dependence tests increases quadratically with
issue width (bad)
▪ Superscalar register read
– Number of register ports increases linearly (bad)
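The quadratic growth in dependence tests is easy to quantify. A small sketch (assuming each of the n instructions issued together must be checked against every earlier instruction in the group):

```python
def dependence_checks(width):
    """Cross-checks needed to issue 'width' instructions in one group:
    each later instruction is compared against every earlier one,
    giving width*(width-1)/2 comparator pairs."""
    return width * (width - 1) // 2

for w in (2, 4, 8):
    print(w, dependence_checks(w))   # 2->1, 4->6, 8->28
```

Going from 2-wide to 8-wide issue multiplies the comparator count by 28, which is why issue logic, like the bypass network, scales poorly with width.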
Superscalar: Other Challenges
▪ Superscalar execute
– Replicate functional units (not bad)
▪ Superscalar bypass/forwarding
– Increases quadratically (bad)
– Clustering mitigates this problem
▪ Superscalar register-writeback
– Increases linearly (bad)
▪ ILP uncovered
– Limited by the ILP inherent in the program
– Bigger instruction windows help expose more ILP
References and Further Reading
▪ Original hardware trace cache:
“Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E.
Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.
▪ Next trace prediction for trace caches:
“Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl.
Symp. on Microarchitecture, December 1997.
▪ A Software trace cache:
“Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M.
Valero, Intl. Conf. on Supercomputing, June 1999.
Pros/Cons of Trace Caches
+ Instructions come from a single trace cache line
+ Branches within a trace are implicitly predicted: the instruction that
follows a branch in the trace fixes that branch’s direction (taken or not taken)
+ I-cache still present, so no need to change cache hierarchy
+ In CISC ISAs (e.g., x86) the trace cache can keep decoded
instructions (e.g., Pentium 4)
- Wasted storage as instructions appear in both I-cache and trace
cache, and in possibly multiple trace cache lines
- Not very good when there are traces with common sub-paths
- Not very good at handling indirect jumps and returns (which
have multiple targets, instead of only taken/not taken)
Structure of a Trace Cache
Figure from Rotenberg et al.