Superscalar Processing and Parallelism: Understanding Multiprocessor Systems

An overview of superscalar processing, multiple ways of achieving parallelism in computing systems, and the concepts of instruction-level, data-level, and thread-level parallelism. It also covers static and dynamic scheduling, pipelining, and cache coherence in the context of multiprocessor systems.


Increasing Machine Throughput

Superscalar Processing

Multiprocessor Systems

Multiprocessing

• There are 3 generic ways to do multiple things “in parallel”
  • Instruction-level Parallelism (ILP)
    • Superscalar: doing multiple instructions (from a single program) simultaneously
  • Data-level Parallelism (DLP)
    • Do a single operation over a larger chunk of data
      • Vector processing
      • “SIMD extensions” like MMX (a sketch follows after this list)
  • Thread-level Parallelism (TLP)
    • Multiple processes
      • Can be separate programs
      • …or a single program broken into separate threads
    • Usually used on multiple processors, but not required
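
To make the DLP idea concrete, here is a minimal sketch in C (the function and array names are made up for illustration): the source code is a plain scalar loop, but an optimizing compiler can often auto-vectorize it into SIMD instructions, so that each emitted instruction operates on several elements at once.

    #include <stdio.h>

    #define N 1024

    /* Scalar source: one add per loop iteration. With
       auto-vectorization (e.g. gcc -O3), the compiler can emit SIMD
       instructions that add 4 or 8 elements per instruction --
       a single operation applied to a larger chunk of data. */
    void add_arrays(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
        add_arrays(a, b, c, N);
        printf("c[10] = %f\n", c[10]);   /* expect 30.0 */
        return 0;
    }

Either way, the program computes the same results; DLP only changes how many elements each instruction touches.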


How to Fill Pipeline Slots

  • We’ve got lots of room to execute – now how do we fill the slots?
  • This process is called Scheduling
    • A schedule is created, telling instructions when they can execute
  • 2 (very different) ways to do this:
    • Static Scheduling
      • The compiler (or coder) arranges instructions into an order which can be executed correctly (see the sketch after this list)
    • Dynamic Scheduling
      • Hardware in the processor reorders instructions at runtime to maximize the number executing in parallel
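
To show what static scheduling looks like, here is a hedged sketch at the C source level (real static scheduling reorders machine instructions; the function names here are invented for illustration). Both functions compute the same results; the second is ordered so each load has time to complete before its value is first used, which is the kind of reordering a scheduling compiler or careful coder performs:

    /* Unscheduled: each multiply immediately uses the value loaded on
       the previous line, so a simple in-order pipeline stalls twice. */
    void scale_unscheduled(const int *a, int *out, int s) {
        int x0 = a[0];
        out[0] = x0 * s;     /* must wait for the load of a[0] */
        int x1 = a[1];
        out[1] = x1 * s;     /* must wait for the load of a[1] */
    }

    /* Statically scheduled: the independent loads are grouped, so each
       one can complete while other work issues. */
    void scale_scheduled(const int *a, int *out, int s) {
        int x0 = a[0];
        int x1 = a[1];       /* issues while the first load finishes */
        out[0] = x0 * s;
        out[1] = x1 * s;
    }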

Dynamic Pipeline Scheduling

  • Allow the hardware to make scheduling decisions
  • In-order issue of instructions
  • Out-of-order execution of instructions
  • In case of empty resources:
    • The hardware will look ahead in the instruction stream to see if there are any instructions that are OK to execute
  • As they are fetched, instructions get placed in reservation stations – where they wait until their inputs are ready (a toy model follows below)
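
Below is a deliberately tiny toy model of reservation-station behavior (my own illustration, not how real hardware is built): a made-up three-instruction program, one execution per cycle, and a pretend load into r1 that completes at the end of cycle 2. The youngest instruction, I2, has no pending inputs and so executes before the two older instructions still waiting on r1 – in-order issue, out-of-order execute.

    #include <stdio.h>
    #include <stdbool.h>

    #define NREGS 8

    typedef struct {
        int dst, src1, src2;      /* register numbers */
        bool done;
        const char *text;
    } Instr;

    int main(void) {
        /* All registers start ready except r1, which is waiting on a
           slow load that will complete at the end of cycle 2. */
        bool ready[NREGS];
        for (int r = 0; r < NREGS; r++) ready[r] = true;
        ready[1] = false;

        /* Issued in order into reservation stations: */
        Instr rs[] = {
            { 3, 1, 2, false, "I0: r3 = r1 + r2" },   /* waits on r1  */
            { 4, 3, 1, false, "I1: r4 = r3 * r1" },   /* waits on I0  */
            { 6, 5, 5, false, "I2: r6 = r5 + r5" },   /* independent! */
        };
        int n = 3, remaining = n;
        for (int i = 0; i < n; i++) ready[rs[i].dst] = false;

        for (int cycle = 1; remaining > 0 && cycle < 10; cycle++) {
            bool executed = false;
            /* Execute one waiting instruction whose operands are both
               ready, regardless of program order. */
            for (int i = 0; i < n; i++) {
                if (!rs[i].done && ready[rs[i].src1] && ready[rs[i].src2]) {
                    printf("cycle %d: execute %s\n", cycle, rs[i].text);
                    rs[i].done = true;
                    ready[rs[i].dst] = true;   /* wake up dependents */
                    remaining--;
                    executed = true;
                    break;
                }
            }
            if (!executed)
                printf("cycle %d: all stations waiting on operands\n", cycle);
            if (cycle == 2) {
                ready[1] = true;               /* the load returns */
                printf("cycle %d: load into r1 completes\n", cycle);
            }
        }
        return 0;
    }

Running it prints I2 executing in cycle 1, a stall in cycle 2, then I0 and I1 in cycles 3 and 4.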

Dynamic Scheduling

  • 4 reservation stations for 4 separate pipelines
    • Each pipeline may have a different depth

[Figure: dynamically scheduled pipeline. An instruction fetch and decode unit issues in order into four reservation stations; these feed the functional units (Load/Store, Floating point, and two Integer units), which execute out of order; a commit unit then commits results in order. © 1998 Morgan Kaufmann Publishers, Inc.]

Dynamic Scheduling Case Study

  • Intel’s Pentium 4
    • First appeared in 2000
  • Possible for 126 instructions to be “in-flight” at one time!
  • Processors have gone “backwards” on this since 2003

Thread-level Parallelism (TLP)

– If you have multiple threads…
  • by having multiple programs running, or
  • by writing a multithreaded application (see the sketch after this list)
– …you can get higher performance by running these threads:
  • on multiple processors, or
  • on a machine that has multithreading support – SMT (AKA “Hyperthreading”)
– Conceptually these two options are very similar
  • …but the hardware is very different
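
As a concrete TLP sketch, here is a minimal POSIX-threads program in which a single program is broken into two threads that each sum half of an array (the thread count, array size, and task are all illustrative). Whether the two threads actually run in parallel depends on the machine: separate processors or cores, or SMT contexts on one core.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 2

    static long data[N];

    typedef struct { int lo, hi; long sum; } Task;

    /* Each thread sums its own slice of the array. */
    static void *partial_sum(void *arg) {
        Task *t = (Task *)arg;
        t->sum = 0;
        for (int i = t->lo; i < t->hi; i++)
            t->sum += data[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = 1;

        pthread_t tid[NTHREADS];
        Task task[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            task[t].lo = t * (N / NTHREADS);
            task[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, partial_sum, &task[t]);
        }

        long total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);    /* wait, then combine */
            total += task[t].sum;
        }
        printf("total = %ld\n", total);    /* expect 1000000 */
        return 0;
    }

Build with cc -pthread.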

The Jigsaw Puzzle Analogy

Serial Computing

Suppose you want to do a jigsaw puzzle that has, say, a thousand pieces. We can imagine that it’ll take you a certain amount of time. Let’s say that you can put the puzzle together in an hour.

The More the Merrier?

Now suppose Alice sits across the table to help, and let’s also put Bob and Charlie on the other two sides of the table. Each of you can work on a part of the puzzle, but there’ll be a lot more contention for the shared resource (the pile of puzzle pieces) and a lot more communication at the interfaces. So you will get noticeably less than a 4-to-1 speedup, but you’ll still have an improvement, maybe something like 3-to-1: the four of you can get it done in 20 minutes instead of an hour.

Diminishing Returns

If we now put Dave and Ed and Frank and George on the corners of the table, there’s going to be a whole lot of contention for the shared resource, and a lot of communication at the many interfaces. So the speedup you’ll get will be much less than we’d like; you’ll be lucky to get 5-to-1.

So we can see that adding more and more workers onto a shared resource is eventually going to have diminishing returns.
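
To put numbers on the diminishing returns (using the speedups quoted above): with 4 workers and a 3-to-1 speedup, each worker is used at 3/4 = 75% efficiency; with 8 workers and a 5-to-1 speedup, that drops to 5/8 ≈ 63%. Each worker you add contributes less than the one before.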

More Distributed Processors

It’s a lot easier to add more processors in distributed parallelism. But you always have to be aware of the need to decompose the problem and to communicate among the processors. Also, as you add more processors, it may be harder to load balance the amount of work that each processor gets.

Load Balancing

Load balancing means ensuring that everyone completes their workload at roughly the same time.

For example, if the jigsaw puzzle is half grass and half sky, then you can do the grass and Alice can do the sky, and then you only have to communicate at the horizon – and the amount of work that each of you does on your own is roughly equal. So you’ll get pretty good speedup. (A sketch of one simple partitioning scheme follows below.)
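
As a concrete (and invented, not from the slides) illustration of one simple static load-balancing scheme: split n items across p workers so that the per-worker counts differ by at most one, with the first n mod p workers taking one extra item.

    #include <stdio.h>

    /* Give worker w its half-open range [start, end) when n items are
       split across p workers as evenly as possible: the first n % p
       workers each take one extra item. */
    static void balanced_range(int n, int p, int w, int *start, int *end) {
        int base = n / p, extra = n % p;
        *start = w * base + (w < extra ? w : extra);
        *end   = *start + base + (w < extra ? 1 : 0);
    }

    int main(void) {
        int n = 10, p = 4;   /* 10 puzzle pieces, 4 workers */
        for (int w = 0; w < p; w++) {
            int s, e;
            balanced_range(n, p, w, &s, &e);
            printf("worker %d: items %d..%d (%d items)\n", w, s, e - 1, e - s);
        }
        return 0;
    }

Of course, equal piece counts only balance the load if the pieces take equal effort – grass pieces versus sky pieces, in the analogy.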