Increasing Machine Throughput
Superscalar Processing
Multiprocessor Systems
Multiprocessing
• There are 3 generic ways to do multiple things “in parallel” (see the sketch after this list)
  - Instruction-level Parallelism (ILP)
    - Superscalar: doing multiple instructions (from a single program) simultaneously
  - Data-level Parallelism (DLP)
    - Doing a single operation over a larger chunk of data
    - Vector processing
    - “SIMD extensions” like MMX
  - Thread-level Parallelism (TLP)
    - Multiple processes
    - Can be separate programs…
    - …or a single program broken into separate threads
    - Usually used on multiple processors, but not required
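A minimal C sketch of all three, using the same element-wise add (the function names and the fixed size N are illustrative, not from the slides): the scalar loop's independent iterations are what a superscalar core overlaps on its own (ILP), the SSE version applies one instruction to 4 floats at a time (DLP), and the POSIX-threads version splits the loop across two threads of one program (TLP).

#include <pthread.h>
#include <xmmintrin.h>   /* SSE intrinsics: 4 floats per register */

#define N 1024
float a[N], b[N], c[N];

/* ILP: nothing special in the source; a superscalar core can issue
   several of these independent adds per cycle by itself. */
void add_scalar(void) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* DLP: one SIMD instruction operates on 4 floats at once. */
void add_simd(void) {
    for (int i = 0; i < N; i += 4)
        _mm_storeu_ps(&c[i],
            _mm_add_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i])));
}

/* TLP: one program broken into two threads, each doing half. */
static void *add_half(void *arg) {
    int lo = *(int *)arg;
    for (int i = lo; i < lo + N / 2; i++)
        c[i] = a[i] + b[i];
    return 0;
}

void add_threaded(void) {
    pthread_t t;
    int lo0 = 0, lo1 = N / 2;
    pthread_create(&t, 0, add_half, &lo1);  /* second half in a new thread */
    add_half(&lo0);                         /* first half in this thread   */
    pthread_join(t, 0);
}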
How to Fill Pipeline Slots
- We’ve got lots of room to execute – now how do we fill the slots?
- This process is called Scheduling
  - A schedule is created, telling instructions when they can execute
- 2 (very different) ways to do this:
  - Static Scheduling
    - The compiler (or coder) arranges instructions into an order that can be executed correctly (sketched below)
  - Dynamic Scheduling
    - Hardware in the processor reorders instructions at runtime to maximize the number executing in parallel
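A source-level sketch of static scheduling (the dot product and the two-way interleave are illustrative): the second version interleaves two independent accumulation chains so that work from one chain fills the load-use latency of the other, which is exactly what a static scheduler does at instruction granularity.

/* Naive order: each multiply-add must wait on the loads just above it. */
float dot_naive(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];           /* one dependent chain: little overlap */
    return s;
}

/* Statically scheduled: two independent chains interleaved, so the
   pipeline always has ready work while a load is still in flight.
   (Assumes n is even; a real compiler would emit cleanup code.) */
float dot_scheduled(const float *a, const float *b, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += a[i]     * b[i];      /* chain 0 */
        s1 += a[i + 1] * b[i + 1];  /* chain 1, independent of chain 0 */
    }
    return s0 + s1;
}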
Dynamic Pipeline Scheduling
- Allow the hardware to make scheduling decisions
  - In-order issue of instructions
  - Out-of-order execution of instructions
- When execution resources sit empty:
  - The hardware will look ahead in the instruction stream to see if there are any instructions that are OK to execute
- As they are fetched, instructions get placed in reservation stations – where they wait until their inputs are ready (sketched below)
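A rough C sketch of what one reservation-station entry has to track (the struct and field names are invented for illustration, loosely following Tomasulo's scheme, not any real design): each source operand holds either its value or a tag naming the unit that will produce it, and the instruction becomes ready only once both tags have been resolved.

#include <stdbool.h>
#include <stdint.h>

/* One reservation-station entry (Tomasulo-style sketch). */
struct rs_entry {
    bool     busy;       /* is this entry in use?                     */
    uint8_t  op;         /* operation to perform                      */
    uint32_t vj, vk;     /* source operand values, once known         */
    int      qj, qk;     /* producer tags; 0 means the value is known */
    int      dest_tag;   /* tag broadcast when this result completes  */
};

/* An instruction can issue into a free entry before its inputs exist;
   it may begin executing only once every producer has broadcast. */
static bool ready_to_execute(const struct rs_entry *e) {
    return e->busy && e->qj == 0 && e->qk == 0;
}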
Dynamic Scheduling
- 4 reservation stations for 4 separate pipelines
- Each pipeline may have a different depth
[Figure: a dynamically scheduled pipeline. An instruction fetch and decode unit issues in order into reservation stations; the functional units (two integer units, floating point, load/store, …) execute out of order; a commit unit retires results in order. © 1998 Morgan Kaufmann Publishers, Inc.]
Dynamic Scheduling Case Study
- Intel’s Pentium 4
  - Possible for 126 instructions to be “in-flight” at one time!
  - Intel has been moving “backwards” on this since 2003
Thread-level Parallelism (TLP)
– If you have multiple threads…
  • by having multiple programs running, or
  • by writing a multithreaded application
– …you can get higher performance by running these threads (see the sketch below):
  • On multiple processors, or
  • On a machine that has multithreading support
    – SMT (Simultaneous Multithreading, AKA “Hyperthreading”)
• Conceptually these are very similar
  – The hardware is very different
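A minimal POSIX-threads sketch of the second case above, a single program broken into separate threads (the two worker functions are invented for illustration); whether the threads land on separate processors or on two SMT contexts of one core is the operating system's decision, not the program's.

#include <pthread.h>
#include <stdio.h>

/* Two concurrent tasks from one program; the OS may place them on
   two different processors, or on two hardware threads of one core. */
static void *decode_audio(void *arg) { puts("decoding audio..."); return 0; }
static void *render_video(void *arg) { puts("rendering video..."); return 0; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, decode_audio, 0);
    pthread_create(&t2, 0, render_video, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}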
The Jigsaw Puzzle Analogy
Serial Computing
Suppose you want to do a jigsaw puzzle that has, say, a thousand pieces. We can imagine that it’ll take you a certain amount of time. Let’s say that you can put the puzzle together in an hour.
The More the Merrier?
Now suppose Alice sits across the table from you, and we put Bob and Charlie on the other two sides. Each of you can work on a part of the puzzle, but there’ll be a lot more contention for the shared resource (the pile of puzzle pieces) and a lot more communication at the interfaces. So the four of you will get noticeably less than a 4-to-1 speedup, but you’ll still have an improvement, maybe something like 3-to-1: you can get it done in 20 minutes instead of an hour.
Diminishing Returns
If we now put Dave and Ed and Frank and George on the corners of the table, there’s going to be a whole lot of contention for the shared resource, and a lot of communication at the many interfaces. So the speedup you’ll get will be much less than we’d like; you’ll be lucky to get 5-to-1. We can see that adding more and more workers onto a shared resource eventually brings diminishing returns.
More Distributed Processors
It’s a lot easier to add more processors in distributed parallelism. But you always have to be aware of the need to decompose the problem and to communicate among the processors. Also, as you add more processors, it may be harder to load balance the amount of work that each processor gets.
Load Balancing
Load balancing means ensuring that everyone completes their workload at roughly the same time.
For example, if the jigsaw puzzle is half grass and half sky, then you can do the grass and Alice can do the sky, and then you only have to communicate at the horizon – and the amount of work that each of you does on your own is roughly equal. So you’ll get pretty good speedup.
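The same idea as a tiny code sketch (the function and its parameters are invented for illustration): split n items among p workers so that no two shares differ by more than one item, which is the usual way to get everyone finishing at roughly the same time when items cost about the same. For a puzzle that is half grass and half sky, the analogous split is by region rather than by count, so the communication happens only along the horizon.

/* Give worker w (0 <= w < p) its half-open range [*lo, *hi) of n items,
   spreading the remainder so that shares differ by at most one. */
void balanced_range(int n, int p, int w, int *lo, int *hi) {
    int base  = n / p;   /* every worker gets at least this many   */
    int extra = n % p;   /* the first `extra` workers get one more */
    *lo = w * base + (w < extra ? w : extra);
    *hi = *lo + base + (w < extra ? 1 : 0);
}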