CS4/MSc Parallel Architectures - 2017-
Lect. 3: Superscalar Processors
▪ Pipelining: several instructions are simultaneously at different
stages of their execution
▪ Superscalar: several instructions are simultaneously at the same
stages of their execution
▪ Out-of-order execution: instructions can be executed in an order
different from that specified in the program
▪ Dependences between instructions:
- Data Dependence (a.k.a. Read after Write - RAW)
- Control dependence
▪ Speculative execution: tentative execution despite dependences
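The RAW dependence above can be made concrete with a small sketch. This is a hypothetical illustration (the instruction encoding as `(dest_reg, src_regs)` tuples and the function name are invented for this example): an instruction depends on an earlier one exactly when it reads a register the earlier one writes.

```python
# Hypothetical sketch: detecting a RAW (read-after-write) dependence
# between two instructions, each encoded as (dest_reg, [src_regs]).

def raw_dependent(producer, consumer):
    """True if 'consumer' reads the register that 'producer' writes."""
    dest, _ = producer
    _, srcs = consumer
    return dest in srcs

i1 = ("r1", ["r2", "r3"])   # r1 = r2 + r3
i2 = ("r4", ["r1", "r5"])   # r4 = r1 + r5  (reads r1 -> RAW on i1)
i3 = ("r6", ["r2", "r7"])   # independent of i1

print(raw_dependent(i1, i2))  # True  -> i2 cannot execute before i1
print(raw_dependent(i1, i3))  # False -> i3 may execute out of order
```

An out-of-order core may execute i3 before i2, but must respect the i1→i2 edge (or execute i2 speculatively and squash it if the speculation was wrong).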
A 5-stage Pipeline
[Figure: pipeline datapath — IF → ID → EXE → MEM → WB, with the
general-purpose register file and memory]
IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers
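The stage occupancy of this pipeline can be sketched with a tiny helper (a made-up function for illustration, assuming one instruction issued per cycle and no stalls): instruction *i*, issued in cycle *i+1*, occupies stage *c − 1 − i* in cycle *c*.

```python
# Minimal sketch of 5-stage pipeline occupancy (no stalls, 1 issue/cycle).
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def stage_of(instr_idx, cycle):
    """Stage occupied by instruction 'instr_idx' (0-based) in 'cycle'
    (1-based), or None if it is not in the pipeline that cycle."""
    s = cycle - 1 - instr_idx
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In cycle 3, instruction 0 is in EXE while instruction 2 is in IF:
print(stage_of(0, 3), stage_of(2, 3))  # EXE IF
```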
Multiple-issue Superscalar
▪ Start two instructions per clock cycle
Two instructions flow through each stage per cycle (no stalls):

cycle:   1      2      3      4      5       6
IF:    I1,I2  I3,I4  I5,I6  I7,I8  I9,I10  I11,I12
ID:           I1,I2  I3,I4  I5,I6  I7,I8   I9,I10
EXE:                 I1,I2  I3,I4  I5,I6   I7,I8
MEM:                        I1,I2  I3,I4   I5,I6
WB:                                I1,I2   I3,I4

CPI → 0.5; IPC → 2
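The steady-state IPC above follows from a simple cycle count. As a sketch (function names are invented; assumes an ideal machine with no stalls), an n-wide, d-stage pipeline needs d cycles for the first group of instructions plus one cycle per additional group:

```python
import math

def total_cycles(n_instr, width, depth=5):
    """Cycles to run n_instr on an ideal 'width'-issue, 'depth'-stage
    pipeline: 'depth' cycles for the first group, then one per group."""
    return depth + math.ceil(n_instr / width) - 1

def ipc(n_instr, width, depth=5):
    return n_instr / total_cycles(n_instr, width, depth)

print(total_cycles(12, 2))            # 10 cycles for 12 instructions, 2-wide
print(round(ipc(1_000_000, 2), 3))    # 2.0 -> IPC approaches the issue width
```

For long instruction streams the pipeline-fill cost is amortized, so IPC approaches the issue width (here 2, i.e., CPI → 0.5).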
Advanced Superscalar Execution
▪ Ideally: in an n-issue superscalar, n instructions are fetched,
decoded, executed, and committed per cycle
▪ In practice:
- Data, control, and structural hazards spoil issue flow
- Multi-cycle instructions spoil commit flow
▪ Buffers at issue (issue queue) and commit (reorder buffer)
decouple these stages from the rest of the pipeline and smooth
out breaks in the instruction flow
[Figure: decoupled pipeline — a fetch engine feeds an instruction
buffer into ID/EXE/MEM/WB, with a second instruction buffer before
commit; general-purpose registers and memory as in the basic pipeline]
Problems At Instruction Fetch
▪ Control flow
- e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per
cache line; 4-wide superscalar processor
- Branch prediction is required within the instruction fetch stage
- For wider-issue processors, multiple predictions per cycle are likely required
- In practice, most fetch units only fetch up to the first predicted-taken branch
Case 1: a single not-taken branch, so the full fetch range is useful
Case 2: a single taken branch whose target falls outside the fetch
range, in another cache line
Example Frequencies of Control Flow
benchmark   taken %   avg. BB size   inst. between taken branches
eqntott      86.2        4.20                4.
espresso     63.8        4.24                6.
xlisp        64.7        4.34                6.
gcc          67.6        4.65                6.
sc           70.2        4.71                6.
compress     60.9        5.39                8.
Data from Rotenberg et al. for the SPEC 92 integer benchmarks
▪ One branch about every 4 to 6 instructions
▪ One taken branch about every 5 to 9 instructions
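These frequencies bound fetch bandwidth. As a simplified sketch (the function is hypothetical and ignores alignment and misprediction effects), if the fetch unit stops at the first predicted-taken branch, its average useful width is capped by the taken-branch distance:

```python
def effective_fetch_width(fetch_width, insts_per_taken_branch):
    """Average useful instructions per fetch, assuming fetch stops at
    the first predicted-taken branch (simplified model)."""
    return min(fetch_width, insts_per_taken_branch)

# With a taken branch every ~5 instructions, even an 8-wide fetch
# unit averages only ~5 useful instructions per cycle:
print(effective_fetch_width(8, 5))  # 5
print(effective_fetch_width(4, 9))  # 4 -> here the fetch width is the limit
```

This is why mechanisms that fetch past taken branches (interleaved caches with alignment networks, or trace caches, below) matter for wide-issue machines.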
Example Advanced Fetch Unit
Figure from Rotenberg et al. Main components:
- 2-way interleaved I-cache
- Control-flow prediction units: (i) branch target buffer,
(ii) return address stack, (iii) branch predictor
- Mask to select instructions from each of the cache lines
- Final alignment unit
Trace Caches
▪ Traditional I-cache: instructions laid out in program order
▪ Dynamic execution order does not always follow program order
(e.g., taken branches) and the dynamic order also changes
▪ Idea:
- Store instructions in execution order (traces)
- Traces can start with any static instruction and are identified by the starting
instruction’s PC
- Traces are dynamically created as instructions are normally fetched and branches
are resolved
- Traces also contain the outcomes of the implicitly predicted branches
- When the same trace is encountered again (i.e., same starting instruction and same
branch predictions), instructions are obtained from the trace cache
- Note that multiple traces can be stored with the same starting instruction
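The lookup described above can be sketched as a map keyed by starting PC plus predicted branch outcomes. This is a hypothetical toy model (class and method names invented; a real trace cache is a set-associative hardware structure with fill logic tied to the commit stage):

```python
# Toy trace cache: traces indexed by (start PC, branch outcomes).
class TraceCache:
    def __init__(self):
        self.traces = {}   # (start_pc, outcomes) -> list of instruction PCs

    def fill(self, start_pc, outcomes, instrs):
        """Record a trace built as instructions commit."""
        self.traces[(start_pc, tuple(outcomes))] = list(instrs)

    def lookup(self, start_pc, predicted_outcomes):
        """Hit only if the starting PC *and* the predicted branch
        directions match a stored trace; None on a miss."""
        return self.traces.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
# A trace starting at 0x400 whose first branch was taken (T) and
# second not taken (NT):
tc.fill(0x400, ["T", "NT"], [0x400, 0x404, 0x420, 0x424, 0x428])

print(tc.lookup(0x400, ["T", "NT"]))   # hit: the stored trace
print(tc.lookup(0x400, ["NT", "NT"]))  # miss (None): same start, other path
```

Note how two traces with the same starting PC but different branch outcomes would coexist as distinct entries, matching the last bullet above.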
Superscalar: Other Challenges
▪ Superscalar decode
– Replicate decoders (OK)
▪ Superscalar issue
– Number of dependence tests increases quadratically with
issue width (bad)
▪ Superscalar register read
– Number of register ports increases linearly (bad)
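The quadratic growth in dependence tests is easy to quantify. A small sketch (assuming each of the n instructions issued together must be checked against every earlier instruction in the group):

```python
def dependence_checks(width):
    """Cross-checks needed to issue 'width' instructions in one group:
    each later instruction is compared against every earlier one,
    giving width*(width-1)/2 comparator pairs."""
    return width * (width - 1) // 2

for w in (2, 4, 8):
    print(w, dependence_checks(w))   # 2->1, 4->6, 8->28
```

Going from 2-wide to 8-wide issue multiplies the comparator count by 28, which is why issue logic, like the bypass network, scales poorly with width.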
Superscalar: Other Challenges
▪ Superscalar execute
– Replicate functional units (not bad)
▪ Superscalar bypass/forwarding
– Increases quadratically (bad)
– Clustering mitigates this problem
▪ Superscalar register-writeback
– Increases linearly (bad)
▪ ILP uncovered
– Limited by the ILP inherent in the program
– Bigger instruction windows help expose more ILP
References and Further Reading
▪ Original hardware trace cache:
“Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E.
Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.
▪ Next trace prediction for trace caches:
“Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl.
Symp. on Microarchitecture, December 1997.
▪ A Software trace cache:
“Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M.
Valero, Intl. Conf. on Supercomputing, June 1999.
Pros/Cons of Trace Caches
+ Instructions come from a single trace cache line
+ Branches within a trace are implicitly predicted: the instruction that
follows a branch in the trace fixes that branch’s direction (taken or not taken)
+ I-cache still present, so no need to change cache hierarchy
+ In CISC ISAs (e.g., x86) the trace cache can keep decoded
instructions (e.g., Pentium 4)
- Wasted storage as instructions appear in both I-cache and trace
cache, and in possibly multiple trace cache lines
- Not very good when there are traces with common sub-paths
- Not very good at handling indirect jumps and returns (which
have multiple targets, instead of only taken/not taken)
Structure of a Trace Cache
Figure from Rotenberg et al.