
CS 258

Parallel Computer Architecture

Lecture 1

Introduction to Parallel Architecture

January 23, 2008

Prof John D. Kubiatowicz


Computer Architecture Is …

the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.

Amdahl, Blaauw, and Brooks, 1964


The Instruction Set: a Critical Interface

[Figure: the instruction set as the interface between software (above) and hardware (below)]

Properties of a good abstraction

Lasts through many generations (portability)

Used in many different ways (generality)

Provides convenient functionality to higher levels

Permits an efficient implementation at lower levels

Changes very slowly! (Although this is increasing)

Is there a solid interface for multiprocessors?

No standard hardware interface


What is Parallel Architecture?

A parallel computer is a collection of processing elements that cooperate to solve large problems

Some broad issues:

Models of computation: PRAM? BSP? Sequential Consistency?

Resource Allocation:

how large a collection?

how powerful are the elements?

how much memory?

Data access, Communication and Synchronization

how do the elements cooperate and communicate?

how are data transmitted between processors?

what are the abstractions and primitives for cooperation?

Performance and Scalability

how does it all translate into performance?

how does it scale?


Topologies of Parallel Machines

Symmetric Multiprocessor

Multiple processors in a box with shared-memory communication

Current multicore chips are like this

Every processor runs a copy of the OS

Non-uniform shared memory with separate I/O through host

Multiple processors

Each with local memory

General scalable network

Extremely lightweight "OS" on each node provides simple services

Scheduling/synchronization

Network-accessible host for I/O

Cluster

Many independent machines connected with a general network

Communication through messages

[Figures: bus-based SMP (four processors sharing a bus and memory); grid of processor/memory (P/M) nodes on a scalable network with a host for I/O]
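Not from the slides: a minimal Python sketch of the two cooperation styles above, with a hypothetical counting workload. Threads in one address space update shared state under a lock (shared-memory style), while separate processes exchange explicit messages on a queue (message-passing/cluster style).

    import threading
    import multiprocessing as mp

    # Shared memory (SMP-style): threads in one address space coordinate through a lock.
    def shared_memory_demo(n_threads=4, increments=10_000):
        total = [0]                                  # shared mutable state
        lock = threading.Lock()                      # synchronization primitive
        def worker():
            for _ in range(increments):
                with lock:                           # cooperate via shared memory
                    total[0] += 1
        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return total[0]

    # Message passing (cluster-style): separate address spaces, explicit messages on a queue.
    def mp_worker(increments, queue):
        queue.put(increments)                        # send a partial count as a message

    def message_passing_demo(n_procs=4, increments=10_000):
        queue = mp.Queue()
        procs = [mp.Process(target=mp_worker, args=(increments, queue)) for _ in range(n_procs)]
        for p in procs: p.start()
        partials = [queue.get() for _ in procs]      # receive one message per process
        for p in procs: p.join()
        return sum(partials)

    if __name__ == "__main__":
        print(shared_memory_demo(), message_passing_demo())   # both print 40000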


Conventional Wisdom (CW) in Computer Architecture

Old CW: Power is free, but transistors expensive

New CW is the "Power wall": Power is expensive, but transistors are "free"

Can put more transistors on a chip than we have the power to turn on

Old CW: Only concern is dynamic power

New CW: For desktops and servers, static power due to leakage is 40% of total power

Old CW: Monolithic uniprocessors are reliable internally, with errors occurring only at pins

New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates


Conventional Wisdom (CW) in Computer Architecture

Old CW: By building upon prior successes, continue raising level of abstraction and size of HW designs

New CW: Wire delay, noise, cross coupling, reliability, clock jitter, design validation, … stretch development time and cost of large designs at ≤ 65 nm

Old CW: Researchers demonstrate new architectures by building chips

New CW: Cost of 65 nm masks, cost of ECAD, and design time for GHz clocks ⇒ researchers no longer build believable chips

Old CW: Performance improves latency & bandwidth

New CW: BW improves > (latency improvement)²


Conventional Wisdom (CW) in Computer Architecture

Old CW: Multiplies slow, but loads and stores fast

New CW is the "Memory wall": Loads and stores are slow, but multiplies fast

200 clocks to DRAM, but even FP multiplies take only 4 clocks

Old CW: We can reveal more ILP via compilers and architecture innovation

Branch prediction, OOO execution, speculation, VLIW, …

New CW is the "ILP wall": Diminishing returns on finding more ILP

Old CW: 2X CPU performance every 18 months

New CW is: Power Wall + Memory Wall + ILP Wall = Brick Wall


Déjà vu all over again?

Multiprocessors imminent in 1970s, '80s, '90s, …

"… today's processors … are nearing an impasse as technologies approach the speed of light …"

David Mitchell, The Transputer: The Time Is Now (1989)

Transputer was premature ⇒ Custom multiprocessors strove to lead uniprocessors

Procrastination rewarded: 2X seq. perf. / 1.5 years

"We are dedicating all of our future product development to multicore designs. … This is a sea change in computing"

Paul Otellini, President, Intel (2004)

Difference is all microprocessor companies switch to multiprocessors (AMD, Intel, IBM, Sun; all new Apples 2 CPUs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs

Biggest programming challenge: 1 to 2 CPUs


CS258: Information

Instructor: Prof John D. Kubiatowicz

Office: 673 Soda Hall

Phone: 643-

Email: kubitron@cs.berkeley.edu

Office Hours: Wed 1:00 - 2:00 or by appt.

Class: Mon, Wed 2:30-4:00pm, 310 Soda Hall

Web page: http://www.cs/~kubitron/courses/cs258-S08/

Lectures available online before noon on the day of lecture

Email: cs258@kubi.cs.berkeley.edu

Clip signup link on web page (as soon as it is up)


Computer Architecture Topics (252+)

[Figure: course topic map]

Pipelining and Instruction Level Parallelism: instruction set architecture; pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation; addressing, protection, exception handling

Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; emerging technologies, interleaving, bus protocols; VLSI

Input/Output and Storage: disks, WORM, tape; RAID

Network: communication with other processors


Computer Architecture Topics (258)

[Figure: processors (P) and memories (M) connected by an interconnection network, with switches (S)]

Processor-Memory-Switch: multiprocessors; networks and interconnections; programming models / communication styles; reliability / fault tolerance; everything in the previous slide, but more so!

Interconnection Network: shared memory, message passing, data parallelism, transactional memory, checkpoint/restart, network interfaces, topologies, routing, bandwidth, latency, reliability


What will you get out of CS258?

In-depth understanding of the design and engineering of modern parallel computers

technology forces

Programming models

fundamental architectural issues

naming, replication, communication, synchronization

basic design techniques

cache coherence, protocols, networks, pipelining, …

methods of evaluation

from moderate to very large scale

across the hardware/software boundary

Study of REAL parallel processors

Research papers, white papers

Natural consequences??

Massive Parallelism

Reconfigurable computing?

Message Passing Machines

NOW

Peer-to-peer systems?


Will it be worthwhile?

Absolutely!

Now, more than ever, industry is trying to figure out how to build these new multicore chips….

The fundamental issues and solutions translate across a wide spectrum of systems.

Crisp solutions in the context of parallel machines.

Pioneered at the thin end of the platform pyramid on the most-demanding applications

migrate downward with time

Understand implications for software

Network-attached storage, MEMS, etc?

[Figure: platform pyramid — SuperServers, Departmental Servers, Workstations, Personal Computers]


Why Study Parallel Architecture?

Role of a computer architect:

To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.

Parallelism:

Provides alternative to faster clock for performance

Applies at all levels of system design

Is a fascinating perspective from which to view architecture

Is increasingly central in information processing

How is instruction-level parallelism related to coarse-grained parallelism??


Is Parallel Computing Inevitable?

This was certainly not clear just a few years ago. Today, however: YES!

Industry is desperate for solutions!

Application demands: Our insatiable need for computing cycles

Technology Trends: Easier to build

Architecture Trends: Better abstractions

Current trends:

Today's microprocessors are multiprocessors and/or have multiprocessor support

Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...


Textbook: Two leaders in the field

Text: Parallel Computer Architecture: A Hardware/Software Approach

By: David Culler & Jaswinder Singh

Covers a range of topics; we will not necessarily cover them in order.


How will grading work?

No TA this term!

Rough breakdown:

20% Paper Summaries/Presentations

30% One Midterm

40% Research Project (work in pairs)

meet 3 times with me to see progress

give oral presentation

give poster session

written report like conference paper

6 weeks of full-time work for 2 people

Opportunity to do "research in the small" to help make the transition from good student to research colleague

10% Class Participation


Application Trends

Application demand for performance fuels advances in hardware, which enables new applications, which...

Cycle drives exponential increase in microprocessor performance

Drives parallel architecture harder

most demanding applications

Programmers willing to work really hard to improve high-end applications

Need incremental scalability:

Need range of system performance with progressively increasing cost

[Figure: cycle — more performance enables new applications, which demand more performance]


Metrics of Performance

[Figure: levels of the system — application, programming language, compiler, ISA, datapath, control, function units, transistors/wires/pins — with the performance metric used at each level]

Metrics, from top to bottom: answers per month; operations per second; (millions) of instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s; megabytes per second; cycles per second (clock rate)

And What about: Programmability, Reliability, Energy?
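A hedged illustration (not part of the lecture) of how two of the rate metrics above are computed; the instruction count, FP-operation count, and run time below are made-up numbers.

    def mips(instruction_count, seconds):
        # (millions) of instructions per second
        return instruction_count / seconds / 1e6

    def mflops(fp_op_count, seconds):
        # (millions) of floating-point operations per second
        return fp_op_count / seconds / 1e6

    print(mips(3e9, 2.5), mflops(8e8, 2.5))   # 1200.0 MIPS, 320.0 MFLOP/s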


Speedup

Speedup(p processors) = Time(1 processor) / Time(p processors)

Common mistake:

Compare parallel program on 1 processor to parallel program on p processors

Wrong!:

Should compare uniprocessor program on 1 processor to parallel program on p processors

Why? Keeps you honest

It is easy to parallelize overhead.
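A small sketch, not from the slides, of the honest measurement the slide asks for: the baseline is the uniprocessor program, not the parallel program run on one processor. The kernel, names, and problem sizes are hypothetical.

    import time
    from multiprocessing import Pool

    def work(n):                                    # hypothetical kernel
        return sum(i * i for i in range(n))

    def sequential(chunks):                         # best uniprocessor program: no parallel overhead
        return [work(n) for n in chunks]

    def parallel(chunks, p):                        # parallel program on p processors
        with Pool(p) as pool:
            return pool.map(work, chunks)

    def timed(fn, *args):
        t0 = time.perf_counter()
        fn(*args)
        return time.perf_counter() - t0

    if __name__ == "__main__":
        chunks = [200_000] * 16
        t1 = timed(sequential, chunks)              # Time(1 processor): uniprocessor baseline
        tp = timed(parallel, chunks, 4)             # Time(p processors)
        print("speedup =", t1 / tp)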


Amdahl's Law

Speedup due to enhancement E:

Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
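For this setup the standard closed form is Speedup(E) = 1 / ((1 − F) + F/S); a tiny check with illustrative numbers (not from the slides):

    def amdahl_speedup(F, S):
        # Speedup(E) = 1 / ((1 - F) + F/S): fraction F of the task sped up by factor S
        return 1.0 / ((1.0 - F) + F / S)

    print(amdahl_speedup(0.8, 10))    # enhance 80% of the task by 10x -> ~3.57x overall, not 10x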


Amdahl's Law for parallel programs?

ExTime_parallel(p, stuff) = ExTime_ser × [ (1 − Fraction_par) + Fraction_par / p ] + Overhead_par(p, stuff)

Speedup_parallel(p, stuff) = ExTime_ser / ExTime_parallel(p, stuff)

Best you could ever hope to do:

Speedup_maximum = 1 / (1 − Fraction_par)

Worse: Overhead may kill your performance!
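A sketch of the reconstructed formula above as code; the parallel fraction, processor count, and overhead ratio are illustrative numbers, not from the slides.

    def parallel_speedup(fraction_par, p, overhead_ratio=0.0):
        # Denominator is the parallel execution time as a fraction of the serial time:
        # serial part + parallel part spread over p processors + overhead.
        new_time = (1.0 - fraction_par) + fraction_par / p + overhead_ratio
        return 1.0 / new_time

    print(parallel_speedup(0.95, 64))          # ~15.4x, far below 64x
    print(parallel_speedup(0.95, 64, 0.05))    # 5% overhead (of serial time) drops it to ~8.7x
    print(1.0 / (1.0 - 0.95))                  # Speedup_maximum = 20x as p -> infinity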


Where is Parallel Arch Going?

Old view: Divergent architectures, no predictable pattern of growth.

[Figure: application software and system software sitting atop divergent architectures — SIMD, message passing, shared memory, dataflow, systolic arrays]

Uncertainty of direction paralyzed parallel software development!


Granularity:

Is communication fine or coarse grained?

Small messages vs big messages

Is parallelism fine or coarse grained?

Small tasks (frequent synchronization) vs big tasks

If hardware handles fine-grained parallelism, then it is easier to get incremental scalability

Fine-grained communication and parallelism are harder than coarse-grained:

Harder to build with low overhead

Custom communication architectures often needed

Ultimate coarse-grained communication:

GIMPS (Great Internet Mersenne Prime Search)

Communication once a month
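An illustrative sketch (not from the slides) of the granularity trade-off: the same hypothetical work split into many tiny tasks or a few large chunks, where the tiny tasks pay the scheduling/communication overhead far more often.

    from multiprocessing import Pool

    def work(n):                               # hypothetical unit of work
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        items = [5_000] * 4_000
        with Pool(4) as pool:
            fine = pool.map(work, items, chunksize=1)      # fine-grained: one task per item
            coarse = pool.map(work, items, chunksize=500)  # coarse-grained: 500 items per task
        assert fine == coarse                  # same answer; the coarse version pays far less overhead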


Current Commercial Computing targets

Relies on parallelism for high end

Computational power determines scale of business that can be handled

Databases, online transaction processing, decision support, data mining, data warehousing ...

Google, Yahoo, ….

TPC benchmarks (TPC-C order entry, TPC-D decision support)

Explicit scaling criteria provided

Size of enterprise scales with size of system

Problem size not fixed as p increases.

Throughput is performance measure (transactions per minute or tpm)


Scientific Computing Demand


Engineering Computing Demand

Large parallel machines are a mainstay in many industries

Petroleum (reservoir analysis)

Automotive (crash simulation, drag analysis, combustion efficiency),

Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),

Computer-aided design

Pharmaceuticals (molecular modeling)

Visualization

in all of the above

entertainment (films like Toy Story)

architecture (walk-throughs and rendering)

Financial modeling (yield and derivative analysis)

etc.


Can anyone afford high-end MPPs???

ASCI (Accelerated Strategic Computing Initiative)

ASCI White: Built by IBM

12.3 TeraOps, 8192 processors (RS/6000)

6 TB of RAM, 160 TB of disk

2 basketball courts in size

Program it??? Message passing


Need new class of applications

Handheld devices with ManyCore processors!

Great potential, right?

Human interface applications very important: "The Laptop/Handheld is the Computer"

'07: HP number of laptops > desktops

1B+ cell phones/yr, increasing in function

Otellini demoed "Universal Communicator" (combination cell phone, PC, and video device)

Apple iPhone

User wants increasing performance, weeks or months of battery power


Applications: Speech and Image Processing

[Figure: processing requirements from 1 MIPS to 10 GIPS, 1980–1995 — 200-word sub-band speech coding, telephone number recognition, CELP speech coding, isolated speech recognition, speaker verification, ISDN-CD stereo receiver, 1,000-word and 5,000-word continuous speech recognition, CIF video, HDTV receiver]

Also CAD, Databases,...


Compelling Laptop/Handheld Apps

Meeting Diarist

Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting

Teleconference speaker identifier, speech helper

L/Hs used for teleconference, identifies who is speaking, "closed caption" hint of what is being said


Why Target 100+ Cores?

5-year research program aims 8+ years out

Multicore: 2X / 2 yrs ⇒ 64 cores in 8 years

Manycore: 8X to 16X multicore

[Figure: projected core counts, 1 to 512, from 2003 to 2015; automatic parallelization and thread-level speculation noted on the multicore curve]
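The arithmetic behind the projection, as a hedged sketch; the 4-core starting point below is an assumption, not stated on the slide.

    def projected_cores(base_cores, years, doubling_period_years=2):
        # Core count if it doubles every doubling_period_years.
        return base_cores * 2 ** (years // doubling_period_years)

    print(projected_cores(4, 8))        # hypothetical 4-core baseline -> 64 cores in 8 years
    print(projected_cores(4, 8) * 8)    # a manycore design at 8x that multicore count -> 512 cores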


4 Themes of View 2.0/ Par Lab

Applications

Compelling apps drive top-down research agenda

Identify Common Computational Patterns

Breaking through disciplinary boundaries

Developing Parallel Software with Productivity, Efficiency, and Correctness

2 Layers + Coordination & Composition Language + Autotuning

OS and Architecture

Composable primitives, not packaged solutions

Deconstruction, Fast barrier synchronization, Partitions


Par Lab Research Overview

Easy to write correct programs that run efficiently on manycore

[Figure: Par Lab research stack — Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) and Motifs/Dwarfs at the top; Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Sketching; Efficiency Layer: Efficiency Languages, Efficiency Language Compilers, Legacy Code, Schedulers, Communication & Synch. Primitives, Autotuners; Correctness: Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay; OS: OS Libraries & Services, Legacy OS, Hypervisor; Arch.: Multicore/GPGPU, RAMP Manycore]


How do compelling apps relate to 13 motifs/dwarfs?

[Figure: "Motifs" popularity heat map (red = hot, blue = cool) across application areas — Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser]

The 13 motifs/dwarfs:

1 Finite State Mach.
2 Combinational
3 Graph Traversal
4 Structured Grid
5 Dense Matrix
6 Sparse Matrix
7 Spectral (FFT)
8 Dynamic Prog
9 N-Body
10 MapReduce
11 Backtrack/B&B
12 Graphical Models
13 Unstructured Grid


Developing Parallel Software

2 types of programmers

2 layers

Efficiency Layer

(10% of today’s programmers)

Expert programmers build Frameworks & Libraries, Hypervisors, …

"Bare metal" efficiency possible at Efficiency Layer

Productivity Layer

(90% of today's programmers)

Domain experts / naïve programmers productively build parallel apps using frameworks & libraries

Frameworks & libraries composed to form app frameworks

Effective composition techniques allow the efficiency programmers to be highly leveraged

Create language for Composition and Coordination (C&C)


Architectural Trends

Greatest trend in VLSI generation is increase in parallelism

Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit

slows after 32 bit

adoption of 64-bit now under way, 128-bit far off (not a performance issue)

great inflection point when 32-bit micro and cache fit on a chip

Mid 80s to mid 90s: instruction-level parallelism

pipelining and simple instruction sets, + compiler advances (RISC)

on-chip caches and functional units => superscalar execution

greater sophistication: out of order execution, speculation, prediction

to deal with control transfer and latency problems

Next step: ManyCore.

Also: Thread level parallelism? Bit-level parallelism?


Can ILP go any farther?

[Figures: fraction of total cycles (%) vs. number of instructions issued; speedup vs. instructions issued per cycle]

Infinite resources and fetch bandwidth, perfect branch prediction and renaming

real caches and non-zero miss latencies


Thread-Level Parallelism "on board"

[Figure: number of processors in fully configured commercial shared-memory systems]

Micro on a chip makes it natural to connect many to shared memory

dominates server and enterprise market, moving down to desktop

Alternative: many PCs sharing one complicated pipe

Faster processors began to saturate bus, then bus technology advanced

today, range of sizes for bus-based systems, desktop to large servers

[Figure: bus-based multiprocessor — four processors (Proc) sharing one memory (MEM)]