
CS4/MSc Parallel Architectures - 2017-2018

Lect. 3: Superscalar Processors

▪ Pipelining: several instructions are simultaneously at different stages of their execution

▪ Superscalar: several instructions are simultaneously at the same stage of their execution

▪ Out-of-order execution: instructions can be executed in an order different from that specified in the program

▪ Dependences between instructions:
  • Data dependence (a.k.a. Read After Write - RAW)
  • Control dependence

▪ Speculative execution: tentative execution despite dependences


A 5-stage Pipeline

[Figure: the five stages IF → ID → EXE → MEM → WB, with the general registers read in ID and written in WB, and memory accessed in IF and MEM]

IF = instruction fetch (includes PC increment)

ID = instruction decode + fetching values from general purpose registers

EXE = arithmetic/logic operations or address computation

MEM = memory access or branch completion

WB = write back results to general purpose registers
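As an illustrative sketch (not from the slides), the overlap of instructions across the five stages can be tabulated in a few lines of Python, assuming one instruction enters IF per cycle and there are no stalls:

```python
# Sketch: occupancy of a classic 5-stage pipeline (IF, ID, EXE, MEM, WB).
# Instruction i enters IF at cycle i and advances one stage per cycle,
# assuming no stalls or hazards.
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def stage_of(instr, cycle):
    """Return the stage instruction `instr` occupies at `cycle`, or None."""
    s = cycle - instr          # 0-based stage index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# With 4 instructions (0..3), at cycle 3 every stage but WB is busy:
print([stage_of(i, 3) for i in range(4)])  # ['MEM', 'EXE', 'ID', 'IF']
```

This is the pipelining picture: four different instructions simultaneously at four different stages.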


Multiple-issue Superscalar

▪ Start two instructions per clock cycle

  cycle:   1      2      3      4      5       6
  IF     I1 I2  I3 I4  I5 I6  I7 I8  I9 I10  I11 I12
  ID            I1 I2  I3 I4  I5 I6  I7 I8   I9 I10
  EXE                  I1 I2  I3 I4  I5 I6   I7 I8
  MEM                         I1 I2  I3 I4   I5 I6
  WB                                 I1 I2   I3 I4

  (instruction flow: two instructions enter the pipeline per cycle)

CPI → 0.5; IPC → 2
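A minimal sketch (assumed, not from the slides) of why steady-state IPC approaches 2 for an ideal 2-issue pipeline:

```python
# Sketch: cycles needed to complete N instructions on an ideal w-wide
# superscalar pipeline of depth d (no hazards): the last group of w
# instructions enters IF at cycle ceil(N / w) and takes d - 1 more
# cycles to drain through the remaining stages.
from math import ceil

def cycles(n_instructions, width=2, depth=5):
    return ceil(n_instructions / width) + depth - 1

n = 1000
c = cycles(n)                  # 504 cycles for 1000 instructions
print(n / c)                   # IPC approaches 2 as n grows
```

The startup cost (depth − 1 cycles) is amortized over the run, so CPI tends to 1/width.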


Advanced Superscalar Execution

▪ Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle

▪ In practice:
  • Data, control, and structural hazards spoil issue flow
  • Multi-cycle instructions spoil commit flow

▪ Buffers at issue (issue queue) and commit (reorder buffer) decouple these stages from the rest of the pipeline and somewhat regularize breaks in the flow

[Figure: fetch engine decoupled by instruction buffers from the ID-EXE-MEM-WB pipeline, with the general registers and memory as in the base pipeline]
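The commit-side buffer can be sketched as follows (illustrative only, not the slides' design): a reorder buffer lets instructions finish execution out of order while still retiring them in program order.

```python
# Sketch: a reorder buffer (ROB). Instructions enter in program order at
# issue; execution may complete in any order; commit retires only the
# longest completed prefix, preserving program order.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()            # kept in program order

    def issue(self, tag):
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):              # may happen in any order
        for e in self.entries:
            if e["tag"] == tag:
                e["done"] = True

    def commit(self):
        """Retire the longest done prefix, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["tag"])
        return retired

rob = ReorderBuffer()
for t in ["i1", "i2", "i3"]:
    rob.issue(t)
rob.complete("i3")                # i3 finishes early...
print(rob.commit())               # [] -- i3 must wait for i1 and i2
rob.complete("i1"); rob.complete("i2")
print(rob.commit())               # ['i1', 'i2', 'i3']
```

A real ROB also holds results and exception state; this sketch only shows the in-order-commit discipline.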


Problems At Instruction Fetch

▪ Control flow
  • e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
  • Branch prediction is required within the instruction fetch stage
  • For wider-issue processors multiple predictions are likely required
  • In practice most fetch units only fetch up to the first predicted-taken branch

[Figure: Case 1 - a single not-taken branch within the fetch range; Case 2 - a single taken branch out of the fetch range and into another cache line]
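The slide's arithmetic, and the effect of stopping at the first predicted-taken branch, can be sketched as (illustrative only; the position argument is a hypothetical parameter):

```python
# Sketch: how many instructions one I-cache line supplies, and how a
# predicted-taken branch truncates the fetch group.
LINE_BYTES = 32
INSTR_BYTES = 4                        # 32-bit fixed-width instructions
per_line = LINE_BYTES // INSTR_BYTES
print(per_line)                        # 8 instructions per cache line

def fetched(width, taken_branch_pos):
    """Instructions fetched by a `width`-wide unit that stops at the
    first predicted-taken branch (0-based slot, or None if no branch
    in the group is predicted taken)."""
    if taken_branch_pos is None:
        return width
    return min(width, taken_branch_pos + 1)

print(fetched(4, None))   # 4: full width (Case 1: no taken branch)
print(fetched(4, 1))      # 2: taken branch in slot 1 cuts the group
```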


Example Frequencies of Control Flow

  benchmark   taken %   avg. BB size   avg. # inst. between taken branches
  eqntott      86.2        4.20              4.
  espresso     63.8        4.24              6.
  xlisp        64.7        4.34              6.
  gcc          67.6        4.65              6.
  sc           70.2        4.71              6.
  compress     60.9        5.39              8.

Data from Rotenberg et al. for SPEC 92 Int

▪ One branch about every 4 to 6 instructions

▪ One taken branch about every 5 to 9 instructions
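A consequence of these frequencies can be sketched as follows (an assumed simplification: fetch always stops at the first taken branch):

```python
# Sketch: effective fetch bandwidth when the fetch unit stops at the
# first predicted-taken branch. With one taken branch roughly every
# 6 instructions, even a very wide fetch unit averages at most about
# 6 useful instructions per cycle.
def effective_fetch(width, insts_per_taken_branch):
    return min(width, insts_per_taken_branch)

for width in (4, 8, 16):
    print(width, effective_fetch(width, 6))   # caps at 6 beyond 6-wide
```

This is the motivation for the advanced fetch units and trace caches on the following slides.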


Example Advanced Fetch Unit

[Figure from Rotenberg et al.: a 2-way interleaved I-cache feeding a final alignment unit, with masks to select instructions from each of the cache lines; control-flow prediction units: (i) Branch Target Buffer, (ii) Return Address Stack, (iii) Branch Predictor]


Trace Caches

▪ Traditional I-cache: instructions laid out in program order

▪ Dynamic execution order does not always follow program order (e.g., taken branches), and the dynamic order also changes

▪ Idea:
  • Store instructions in execution order (traces)
  • Traces can start with any static instruction and are identified by the starting instruction's PC
  • Traces are dynamically created as instructions are normally fetched and branches are resolved
  • Traces also contain the outcomes of the implicitly predicted branches
  • When the same trace is encountered again (i.e., same starting instruction and same branch predictions) instructions are obtained from the trace cache
  • Note that multiple traces can be stored with the same starting instruction
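The idea above can be sketched as a lookup structure (a deliberately simplified assumption, not the hardware organization of Rotenberg et al.): traces are keyed by the starting PC together with the branch outcomes embedded in the trace.

```python
# Sketch: a trace cache keyed by (starting PC, branch outcomes), so the
# same start address can hold multiple traces for different branch paths.
class TraceCache:
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, branch_outcomes, instructions):
        """Record a trace as instructions retire in execution order."""
        self.traces[(start_pc, tuple(branch_outcomes))] = list(instructions)

    def lookup(self, start_pc, predicted_outcomes):
        """Hit only if both the start PC and the predictions match."""
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
# A trace spanning two branches: first taken, second not taken.
tc.fill(0x400, (True, False), ["ld", "add", "beq", "sub", "bne", "st"])
print(tc.lookup(0x400, (True, False)))   # hit: whole trace in one fetch
print(tc.lookup(0x400, (False, False)))  # None: same PC, different path
```

A miss falls back to the conventional I-cache while a new trace is assembled.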

Superscalar: Other Challenges

▪ Superscalar decode
  – Replicate decoders (ok)

▪ Superscalar issue
  – Number of dependence tests increases quadratically (bad)

▪ Superscalar register read
  – Number of register ports increases linearly (bad)
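The quadratic growth in dependence tests can be sketched as follows (a counting argument, with the two-source-operand figure an assumption typical of RISC ISAs):

```python
# Sketch: RAW dependence tests needed to issue a group of `width`
# instructions in one cycle. Each instruction's source registers must be
# compared against the destinations of every earlier instruction in the
# group, so the count grows quadratically with issue width.
def dependence_tests(width, sources_per_instr=2):
    return sum(sources_per_instr * earlier for earlier in range(width))

for w in (2, 4, 8):
    print(w, dependence_tests(w))   # 2 -> 2, 4 -> 12, 8 -> 56
```

Doubling the issue width roughly quadruples the comparator count, which is why wide issue logic is expensive.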


Superscalar: Other Challenges

▪ Superscalar execute
  – Replicate functional units (not bad)

▪ Superscalar bypass/forwarding
  – Increases quadratically (bad)
  – Clustering mitigates this problem

▪ Superscalar register writeback
  – Increases linearly (bad)

▪ ILP uncovered
  – Limited by the ILP inherent in the program
  – Bigger instruction windows help expose more of it
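The bypass growth and the clustering mitigation can be sketched numerically (an assumed model: one forwarding path from every functional-unit output to every input operand):

```python
# Sketch: bypass-network complexity. With full forwarding, every
# functional-unit output needs a path to every input operand, so paths
# grow quadratically with the number of units. Clustering provides fast
# bypass only within a cluster, shrinking the network.
def bypass_paths(n_units, inputs_per_unit=2):
    return n_units * n_units * inputs_per_unit

def clustered_paths(n_units, n_clusters, inputs_per_unit=2):
    per_cluster = n_units // n_clusters
    return n_clusters * bypass_paths(per_cluster, inputs_per_unit)

print(bypass_paths(8))          # 128 paths, fully connected
print(clustered_paths(8, 2))    # 64 paths inside two 4-unit clusters
```

The price of clustering is an extra cycle (or more) when a value crosses clusters, so cluster assignment matters.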


References and Further Reading

▪ Original hardware trace cache:
  "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching", E. Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December

▪ Next trace prediction for trace caches:
  "Path-Based Next Trace Prediction", Q. Jacobson, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1997.

▪ A software trace cache:
  "Software Trace Cache", A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero, Intl. Conf. on Supercomputing, June 1999.


Pros/Cons of Trace Caches

+ Instructions come from a single trace cache line

+ Branches are implicitly predicted
  • The instruction that follows the branch is fixed in the trace and implies the branch's direction (taken or not taken)

+ I-cache still present, so no need to change the cache hierarchy

+ In CISC ISAs (e.g., x86) the trace cache can keep decoded instructions (e.g., Pentium 4)

– Wasted storage, as instructions appear in both the I-cache and the trace cache, and possibly in multiple trace cache lines

– Not very good when there are traces with common sub-paths

– Not very good at handling indirect jumps and returns (which have multiple targets, instead of only taken/not taken)


Structure of a Trace Cache

[Figure from Rotenberg et al.: overall structure of the trace cache]