lec13-multiprocessors.ppt


  • 8/9/2019 lec13-multiprocessors.ppt

    1/69

     

    Lecture 13: Multiprocessors

    Kai Bu, [email protected]

    http://list.zju.edu.cn/kaibu/comparch


    Assignment 4 due June 3

    Lab 5 demo due June 10

    Quiz June 3


    Textbook: Chapter 5.1–5.4


    From ILP to TLP:

    instruction-level parallelism →
    thread-level parallelism


    MIMD: multiple instruction streams,
    multiple data streams

    each processor fetches its own instructions
    and operates on its own data


    Multiprocessors: multiple instruction streams,
    multiple data streams

    computers consisting of tightly coupled processors

    coordination and usage are typically controlled by
    a single OS

    share memory through a shared address space


    Multiprocessors:
    computers consisting of tightly coupled processors

    Multicore:
    single-chip systems with multiple cores

    Multi-chip computers:
    each chip may be a multicore system


    Exploiting TLP

    two software models:

    • Parallel processing
      the execution of a tightly coupled set of
      threads collaborating on a single task

    • Request-level parallelism
      the execution of multiple, relatively
      independent processes that may
      originate from one or more users


    Outline

    • Multiprocessor Architecture

    • Centralized Shared-Memory Arch

    • Distributed shared memory and
      directory-based coherence


    Multiprocessor Architecture

    • according to memory organization and
      interconnect strategy

    • two classes:
      symmetric (centralized shared-memory)
      multiprocessors (SMP);
      distributed shared memory
      multiprocessors (DSM)


    Centralized shared-memory:
    eight or fewer cores


    share a single centralized memory;
    all processors have equal access to it


    all processors have uniform latency from memory:
    uniform memory access (UMA) multiprocessors


    Distributed shared memory:
    more processors,
    physically distributed memory

    distributing memory among the nodes
    increases bandwidth and reduces local-memory latency


    NUMA: nonuniform memory access;
    access time depends on the location of the
    data word in memory


    disadvantages:
    more complex inter-processor communication;
    more complex software to handle distributed memory


    Hurdles of Parallel Processing

    • limited parallelism available in programs

    • relatively high cost of communications



    Hurdles of Parallel Processing

    • limited parallelism affects speedup

    • Example:
      to achieve a speedup of 80 with 100
      processors, what fraction of the original
      computation can be sequential?

      Answer: by Amdahl's law,
      80 = 1 / (Fraction_parallel/100 + (1 − Fraction_parallel))
      Fraction_seq = 1 − Fraction_parallel = 0.25%
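The answer above can be sanity-checked in a few lines of Python (my own snippet, not part of the original slides; variable names are mine):

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f / n),
# where f is the parallelizable fraction and n the processor count.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Solving 80 = 1 / ((1 - f) + f / 100) for f gives
# f = (1 - 1/80) / (1 - 1/100).
n, target = 100, 80
f = (1 - 1 / target) / (1 - 1 / n)
print(f"parallel fraction   = {f:.4%}")              # about 99.75%
print(f"sequential fraction = {1 - f:.4%}")          # about 0.25%
print(f"speedup check       = {amdahl_speedup(f, n):.1f}")  # 80.0
```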


    Hurdles of Parallel Processing

    • limited parallelism available in programs

      makes it difficult to achieve good
      speedups in any parallel processor

      in practice, programs often use less
      than the full complement of the
      processors when running in parallel mode

    • relatively high cost of communications


    Hurdles of Parallel Processing

    • relatively high cost of communications

      involves the large latency of remote
      access in a parallel processor

    • Example:
      an app running on a 32-processor multiprocessor;
      200 ns for a reference to a remote memory;
      clock rate 2.0 GHz; base CPI 0.5

      Q: how much faster if no
      communication vs. if 0.2% of instructions
      involve a remote ref?


    Hurdles of Parallel Processing

    • Example:
      an app running on a 32-processor multiprocessor;
      200 ns for a reference to a remote memory;
      clock rate 2.0 GHz; base CPI 0.5

      Q: how much faster if no
      communication vs. if 0.2% of instructions
      involve a remote ref?

      Answer:
      remote ref cost = 200 ns / 0.5 ns per cycle = 400 cycles
      if 0.2% remote refs, CPI = 0.5 + 0.2% × 400 = 1.3
      no communication is 1.3/0.5 = 2.6 times faster
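The same arithmetic as a short Python sketch (variable names are mine; the constants are those decoded from the slide):

```python
# A 2.0 GHz clock means 0.5 ns per cycle, so a 200 ns remote
# reference costs 200 / 0.5 = 400 cycles.
clock_ghz = 2.0
remote_cycles = 200 * clock_ghz          # 400.0 cycles

base_cpi = 0.5
remote_rate = 0.002                      # 0.2% of instructions

cpi_with_comm = base_cpi + remote_rate * remote_cycles  # 0.5 + 0.8 = 1.3
speedup = cpi_with_comm / base_cpi                      # 2.6x
print(remote_cycles, cpi_with_comm, speedup)
```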


    Hurdles of Parallel Processing

    solutions:

    • insufficient parallelism

      new software algorithms that offer better
      parallel performance;
      software systems that maximize the
      amount of time spent executing with the
      full complement of processors

    • long-latency remote communication

      by architecture: caching shared data
      by programmer: multithreading, prefetching


    Outline

    • Multiprocessor Architecture

    • Centralized Shared-Memory Arch

    • Distributed shared memory and
      directory-based coherence


    Centralized Shared-Memory

    large, multilevel caches
    reduce memory bandwidth demands


    Centralized Shared-Memory

    cache private and shared data



    Centralized Shared-Memory

    shared data: used by multiple processors;
    may be replicated in multiple caches to reduce
    access latency, required memory bandwidth, contention

    without additional precautions,
    different processors can have different values
    for the same memory location


    Cache Coherence Problem

    (figure example: write-through cache)


    Cache Coherence Problem

    • A memory system is coherent if any
      read of a data item returns the most
      recently written value of that data item

    • two critical aspects:

      coherence: defines what values can
      be returned by a read

      consistency: determines when a
      written value will be returned by a read


    Coherence Property

    • A read by processor P to location X that
      follows a write by P to X, with no writes
      of X by another processor occurring
      between the write and the read by P,
      always returns the value written by P.

      → preserves program order


    Coherence Property

    • A read by a processor to location X that
      follows a write by another processor to X
      returns the written value if the read and
      the write are sufficiently separated in time
      and no other writes to X occur between
      the two accesses.


    Coherence Property

    • Write serialization:
      two writes to the same location by any
      two processors are seen in the same
      order by all processors


    Consistency

    • When a written value will be seen is
      important

    • For example, if a write of X on one
      processor precedes a read of X on
      another processor by a very small
      time, it may be impossible to ensure
      that the read returns the value of the
      data written, since the written data may
      not even have left the processor at that point


    Cache Coherence Protocols

    • Directory based:
      the sharing status of a particular block
      of physical memory is kept in one
      location, called the directory

    • Snooping:
      every cache that has a copy of the data
      from a block of physical memory could
      track the sharing status of the block


    Snooping Coherence Protocol

    • Write invalidation protocol:
      invalidates other copies on a write;
      exclusive access ensures that no other
      readable or writable copies of an item
      exist when the write occurs


    Snooping Coherence Protocol

    • Write invalidation protocol:
      invalidates other copies on a write
      (figure example: write-back cache)


    Snooping Coherence Protocol

    • Write update/broadcast protocol:
      update all cached copies of a data item
      when that item is written;
      consumes more bandwidth


    Write Invalidation Protocol

    • To perform an invalidate, the processor
      simply acquires bus access and
      broadcasts the address to be
      invalidated on the bus

    • All processors continuously snoop on
      the bus, watching the addresses

    • The processors check whether the
      address on the bus is in their cache;
      if so, the corresponding data in the
      cache is invalidated


    Write Invalidation Protocol

    three block states (MSI protocol):

    • Invalid

    • Shared
      indicates that the block in the private
      cache is potentially shared

    • Modified
      indicates that the block has been
      updated in the private cache;
      implies that the block is exclusive
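The three states and the invalidate-on-write rule above can be sketched as a toy snooping simulator in Python (illustrative only; the class and method names are mine, not the lecture's):

```python
# Toy snooping write-invalidate (MSI) sketch: per-address states
# 'M' (Modified), 'S' (Shared), 'I' (Invalid).
# The "bus" is simply the list of caches that snoop each other.
class Cache:
    def __init__(self, bus):
        self.state = {}              # addr -> 'M' / 'S' / 'I'
        self.bus = bus
        bus.append(self)

    def read(self, addr):
        if self.state.get(addr, 'I') == 'I':
            # Read miss: a Modified copy elsewhere supplies the data
            # and is downgraded to Shared (write-back not modeled).
            for c in self.bus:
                if c is not self and c.state.get(addr) == 'M':
                    c.state[addr] = 'S'
            self.state[addr] = 'S'

    def write(self, addr):
        # Write: broadcast the address; every other snooping cache
        # invalidates its copy, leaving this cache exclusive.
        for c in self.bus:
            if c is not self:
                c.state[addr] = 'I'
        self.state[addr] = 'M'

bus = []
p1, p2 = Cache(bus), Cache(bus)
p1.read(0x40); p2.read(0x40)   # both caches now hold the block Shared
p1.write(0x40)                 # invalidate broadcast on the bus
print(p1.state[0x40], p2.state[0x40])  # M I
```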


    MSI Extensions

    • MOESI

      Owned: indicates that the associated
      block is owned by that cache and
      out-of-date in memory

      Modified → Owned without writing the
      shared block to memory


    increase memory bandwidth
    through multi-bus and interconnection network
    and multi-bank cache


    Coherence Miss

    • True sharing miss

      first write by a processor to a shared
      cache block causes an invalidation to
      establish ownership of that block;
      another processor then reads a modified
      word in that cache block

    • False sharing miss


    Coherence Miss

    • True sharing miss

    • False sharing miss

      with a single valid bit per cache block:
      occurs when a block is invalidated (and
      a subsequent reference causes a miss)
      because some word in the block, other
      than the one being read, is written into


    Coherence Miss

    • Example:
      assume words x1 and x2 are in the
      same cache block, which is in shared
      state in the caches of both P1 and P2;
      access sequence:
      1. P1 writes x1
      2. P2 reads x2
      3. P1 writes x1
      4. P2 writes x2
      5. P1 reads x2

      identify each miss as a true sharing
      miss, a false sharing miss, or a hit


    Coherence Miss

    • Example:
      1. true sharing miss,
      since x1 was read by P2 and needs to
      be invalidated from P2


    Coherence Miss

    • Example:
      3. false sharing miss:
      since the block is in shared state, need
      to invalidate it to write;
      but P2 read x2 rather than x1


    Coherence Miss

    • Example:
      4. false sharing miss:
      need to invalidate the block,
      but P1 wrote x1 rather than x2


    Coherence Miss

    • Example:
      5. true sharing miss,
      since the value being read was written
      by P2 (invalid → shared)


    Outline

    • Multiprocessor Architecture

    • Centralized Shared-Memory Arch

    • Distributed shared memory and
      directory-based coherence

      a directory is added to each node


    Directory-based:

    each directory tracks the caches that share the
    memory addresses of the portion of memory in
    the node;
    no need to broadcast on every cache miss


    Directory-based Cache Coherence Protocol

    common cache states:

    • Shared
      one or more nodes have the block cached,
      and the value in memory is up to date (as
      well as in all the caches)

    • Uncached
      no node has a copy of the cache block

    • Modified
      exactly one node has a copy of the cache
      block, and it has written the block, so the
      memory copy is out of date
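The three directory states above can be sketched as a minimal per-block directory entry in Python (my own illustrative code; the class and method names are not from the lecture, and write-back of a Modified owner is not modeled):

```python
# Minimal directory entry for one memory block: one of the three
# states above plus the set of sharer node ids.
class DirectoryEntry:
    def __init__(self):
        self.state = 'Uncached'   # 'Uncached' / 'Shared' / 'Modified'
        self.sharers = set()      # ids of nodes holding a copy

    def read_miss(self, node):
        # The reader joins the sharer set; a Modified owner would
        # first write the block back to memory (not modeled here).
        self.state = 'Shared'
        self.sharers.add(node)

    def write_miss(self, node):
        # The writer becomes exclusive owner; all other copies are
        # invalidated, so the sharer set collapses to the writer.
        self.sharers = {node}
        self.state = 'Modified'

d = DirectoryEntry()
d.read_miss(0); d.read_miss(1)
print(d.state, sorted(d.sharers))   # Shared [0, 1]
d.write_miss(1)
print(d.state, sorted(d.sharers))   # Modified [1]
```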


    Directory Protocol

    state transition diagram
    for an individual cache block;
    requests from outside the node in gray


    Directory Protocol

    state transition diagram
    for the directory;
    all actions in gray
    because they're all externally caused
