









A parallel computer is a collection of processing elements that cooperate to solve large problems. Some broad issues are discussed below.
CS258 Parallel Computer Architecture, Lecture 1 (Kubiatowicz, ©UCB Spring 2008)
The instruction set: the software/hardware boundary
- Lasts through many generations (portability)
- Used in many different ways (generality)
- Provides convenient functionality to higher levels
- Permits an efficient implementation at lower levels
- Changes very slowly (although the pace of change is increasing)
For parallel machines, however, there is no standard hardware interface.
Models of computation: PRAM? BSP? Sequential Consistency?
Resource allocation:
- How large a collection?
- How powerful are the elements?
- How much memory?
Data access, communication, and synchronization:
- How do the elements cooperate and communicate?
- How are data transmitted between processors?
- What are the abstractions and primitives for cooperation?
Performance and scalability:
- How does it all translate into performance?
- How does it scale?
Shared-memory multiprocessor:
- Multiple processors in a box with shared-memory communication
- Current multicore chips are like this
- Every processor runs a copy of the OS
Message-passing multiprocessor:
- Multiple processors, each with local memory, connected by a general scalable network
- Extremely light “OS” on each node provides simple services (scheduling/synchronization)
- Network-accessible host for I/O
Cluster of independent machines:
- Many independent machines connected with a general network
- Communication through messages
(A small code sketch contrasting shared-memory and message-passing communication appears after the figure below.)
[Figures: a bus-based shared-memory multiprocessor (processors P sharing Memory over a Bus); a grid of processor/memory (P/M) nodes connected by a Network, with a Host attached for I/O]
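To make the contrast concrete, here is a minimal Python sketch (an illustration using the multiprocessing module, not code from the lecture; all names are invented): one pair of workers cooperates through a shared counter, while another exchanges data purely through messages.

```python
# Minimal sketch (not from the lecture): the two communication styles above,
# illustrated with Python's multiprocessing module.
from multiprocessing import Process, Value, Pipe

def shared_memory_worker(counter):
    # Shared-memory style: every worker updates the same location;
    # the lock bundled with Value supplies the synchronization.
    for _ in range(1000):
        with counter.get_lock():
            counter.value += 1

def message_passing_worker(conn):
    # Message-passing style: no shared state; receive a value,
    # work on a private copy, send the result back as a message.
    local = conn.recv()
    conn.send(local + 1000)

if __name__ == "__main__":
    # Shared memory: two processes increment one shared counter.
    counter = Value("i", 0)
    workers = [Process(target=shared_memory_worker, args=(counter,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared-memory result:", counter.value)        # 2000

    # Message passing: a "host" process sends work over a pipe and collects the reply.
    parent_end, child_end = Pipe()
    w = Process(target=message_passing_worker, args=(child_end,))
    w.start()
    parent_end.send(0)
    print("message-passing result:", parent_end.recv())  # 1000
    w.join()
```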
Conventional Wisdom (CW) in Computer Architecture
[Figure: computer architecture topics: Instruction Set Architecture; pipelining and instruction-level parallelism (hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation); addressing, protection, exception handling; memory hierarchy (L1 cache, L2 cache, interleaving, bus protocols, emerging technologies); input/output and storage (disks, WORM, tape); communication with other processors over an interconnection network (coherence, bandwidth, latency)]
- Technology forces
- Programming models
- Fundamental architectural issues: naming, replication, communication, synchronization
- Basic design techniques: cache coherence, protocols, networks, pipelining, …
- Methods of evaluation
- Research papers, white papers
- Massive parallelism
- Reconfigurable computing?
- Message Passing Machines
- Peer-to-peer systems?
Now, more than ever, industry is trying to figure out how to build these new multicore chips….
Crisp solutions in the context of parallel machines migrate downward with time: SuperServers -> Departmental Servers -> Workstations -> Personal Computers.
Today’s microprocessors are multiprocessors and/or have multiprocessor support.
Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...
- Meet 3 times with me to see progress
- Give an oral presentation
- Give a poster session
- Written report like a conference paper
- 6 weeks of full-time work for 2 people
- An opportunity to do “research in the small” to help make the transition from good student to research colleague
The cycle drives an exponential increase in microprocessor performance, and drives parallel architecture harder (the most demanding applications). Need a range of system performance with progressively increasing cost.
[Figure: the cycle between New Applications and More Performance]
[Figure: levels of the machine (Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors/Wires/Pins) and their metrics of performance: answers per month; operations per second; millions of instructions per second (MIPS); millions of floating-point operations per second (MFLOP/s); megabytes per second; cycles per second (clock rate)]
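As a quick illustration of how these rate metrics are computed (the counts and time below are made-up numbers, not measurements from the lecture):

```python
# Illustrative only: turning raw counts into the metrics named above.
instructions = 2.0e9   # total instructions executed (assumed)
fp_ops       = 5.0e8   # floating-point operations among them (assumed)
seconds      = 1.25    # measured execution time (assumed)

mips   = instructions / seconds / 1e6   # millions of instructions per second
mflops = fp_ops / seconds / 1e6         # millions of FP operations per second

print(f"{mips:.0f} MIPS, {mflops:.0f} MFLOP/s")   # 1600 MIPS, 400 MFLOP/s
```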
Comparing the parallel program on 1 processor to the parallel program on p processors is misleading; one should compare a uniprocessor program on 1 processor to the parallel program on p processors, because it is easy to parallelize overhead.

Speedup = Time(1 processor) / Time(p processors)
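A tiny numeric sketch of why the baseline matters (the timings below are hypothetical, for illustration only):

```python
# Hypothetical timings, for illustration only.
t_uniprocessor  = 10.0   # good uniprocessor program on 1 processor
t_parallel_on_1 = 14.0   # parallel program run on 1 processor (carries overhead)
t_parallel_on_8 = 2.0    # parallel program run on 8 processors

print("honest speedup:  ", t_uniprocessor / t_parallel_on_8)    # 5.0
print("inflated speedup:", t_parallel_on_1 / t_parallel_on_8)   # 7.0, counts parallelized overhead as useful work
```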
Speedup due to enhancement E:

Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected; then

Speedup(E) = 1 / ((1 - F) + F/S)
For a parallel program in which a fraction Fraction_par of the serial execution time ExTime_ser can be spread across p processors, with the rest serial plus an overhead term:

ExTime_parallel(p, stuff) = ExTime_ser * [ (1 - Fraction_par) + Fraction_par / p ] + ExTime_overhead(p, stuff)

Speedup_parallel(p, stuff) = ExTime_ser / ExTime_parallel(p, stuff)
                           = 1 / [ (1 - Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser ]

Speedup_maximum = 1 / (1 - Fraction_par)   (as p -> infinity, with no overhead)
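A minimal Python sketch of these formulas (the 90% parallel fraction and 5% overhead below are assumed values for illustration, not numbers from the lecture):

```python
# Amdahl's Law with a parallel fraction and a normalized overhead term,
# following the formulas above. Parameter values are assumptions.
def amdahl_speedup(fraction_par, p, overhead_ratio=0.0):
    """Speedup on p processors; overhead_ratio = ExTime_overhead / ExTime_ser."""
    return 1.0 / ((1.0 - fraction_par) + fraction_par / p + overhead_ratio)

for p in (8, 128):
    ideal = amdahl_speedup(0.90, p)          # no overhead
    real  = amdahl_speedup(0.90, p, 0.05)    # 5% of serial time spent on overhead
    print(f"p={p:3d}  ideal={ideal:.2f}x  with overhead={real:.2f}x")
# p=  8  ideal=4.71x  with overhead=3.81x
# p=128  ideal=9.34x  with overhead=6.37x
```

Even a 90%-parallel program tops out at 10x, and a modest fixed overhead cuts deeply into the achievable speedup at large p.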
- Small messages vs. big messages
- Small tasks (frequent synchronization) vs. big tasks
- Small messages and frequent synchronization are harder to support with low overhead; custom communication architectures are often needed
- At the other extreme, GIMPS (the Great Internet Mersenne Prime Search) communicates only about once a month
Current commercial computing targets:
- Computational power determines the scale of business that can be handled
- Explicit scaling criteria provided
- Size of enterprise scales with size of system
- Problem size is not fixed as p increases
- Throughput is the performance measure (transactions per minute, or tpm)
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis, combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization: in all of the above; entertainment (films like Toy Story); architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.
- 12.3 TeraOps, 8,192 processors (RS/6000)
- 6 TB of RAM, 160 TB of disk
- 2 basketball courts in size
- Program it??? Message passing
- Handheld devices with ManyCore processors!
- Human interface applications very important: “The Laptop/handheld is the Computer”
- Users want increasing performance and weeks or months of battery power
[Figure: processing requirements of speech and video applications, 1980–1995, on a scale from 1 MIPS to 10 GIPS: telephone number recognition, 200-word isolated speech recognition, sub-band speech coding, speaker verification, CELP speech coding, ISDN-CD stereo receiver, 1,000-word continuous speech recognition, 5,000-word continuous speech recognition, CIF video, HDTV receiver]
Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting.
Teleconference speaker identifier, speech helper.
[Figure: projected cores per chip versus year (2003–2015), scaling from 1–2 cores toward 256–512 (roughly doubling every two years); labeled region: automatic parallelization, thread-level speculation]
- Compelling apps drive the top-down research agenda
- Breaking through disciplinary boundaries
- 2 layers + Coordination & Composition Language + autotuning
- Composable primitives, not packaged solutions
- Deconstruction, fast barrier synchronization, partitions
[Figure: Par Lab software stack: applications (personal health, image retrieval, hearing/music, speech, parallel browser) expressed in terms of motifs/dwarfs; a productivity layer built on the Composition & Coordination Language (C&CL), its compiler/interpreter, parallel libraries, and parallel frameworks; correctness tools (static verification, type systems, dynamic checking, directed testing, debugging with replay); an efficiency layer with efficiency languages, efficiency language compilers, sketching, autotuners, legacy code, schedulers, and communication & synchronization primitives; running on a legacy OS, OS libraries & services, and a hypervisor over multicore/GPGPU and RAMP Manycore hardware]
“Motifs” popularity across application areas (Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser):
1. Finite State Machine
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid
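As a toy illustration of one motif from the list (MapReduce, #10), here is a word count sketched in Python; the example data are invented, not from the lecture:

```python
# Map: each document independently produces partial counts (the parallelizable part).
# Reduce: partial results are merged into a single result.
from collections import Counter
from functools import reduce

documents = ["the quick brown fox", "the lazy dog", "the fox"]

partial_counts = [Counter(doc.split()) for doc in documents]     # map step
total = reduce(lambda a, b: a + b, partial_counts, Counter())    # reduce step

print(total["the"], total["fox"], total["dog"])   # 3 2 1
```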
Developing Parallel Software
Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
- Slows after 32 bits
- Adoption of 64-bit now under way, 128-bit far off (not a performance issue)
- Great inflection point when a 32-bit micro and cache fit on a chip
Mid-80s to mid-90s: instruction-level parallelism
- Pipelining and simple instruction sets, plus compiler advances (RISC)
- On-chip caches and functional units => superscalar execution
- Greater sophistication: out-of-order execution, speculation, prediction to deal with control transfer and latency problems
Next step: ManyCore.
Also: thread-level parallelism? Bit-level parallelism?
[Figure: limits of instruction-level parallelism: one plot shows the fraction of total cycles (%) in which 0, 1, 2, 3, 4, 5, or 6+ instructions are issued; the other shows speedup versus instructions issued per cycle, assuming real caches and non-zero miss latencies]
Number of processors in fully configured commercial shared-memory systems:
- Shared memory dominates the server and enterprise market, moving down to the desktop
- Today, a range of sizes for bus-based systems, from desktop to large servers
[Figure: several processors (Proc) connected to a shared memory (MEM)]