Simple Instruction Pipelining · 2019-09-12 · Pipelining History 6.823 Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII) Usually overlap fetch of next

Asanovic/DevadasSpring 2002

6.823

Simple InstructionPipelining

Krste AsanovicLaboratory for Computer Science

Massachusetts Institute of Technology

Asanovic/Devadas Spring 2002Processor Performance 6.823

Equation Time = Instructions * Cycles * Time

Program Program Instruction Cycle

� Instructions per program depends on source code, compiler technology, and ISA

� Microcoded DLX from last lecture had cycles per instruction (CPI) of around 7 minimum

� Time per cycle for microcoded DLX fixed by microcode cycle time

— mostly ROM access + next µPC select logic


6.823Pipelined DLX

To pipeline DLX:

� First build unpipelined DLX with CPI=1

� Next, add pipeline registers to reduce cycle time while maintaining CPI=1


6.823A Simple Memory Model WriteEnable

Address ReadData

WriteData

MAGIC RAM

Clock

Reads and writes are always completed in one cycle� a Read can be done any time (i.e. combinational)� a Write is performed at the rising clock edge

if it is enabled ⇒ the write address, data, and enable

must be stable at the clock edge


6.823Datapath for ALU Instructions

OpCode

rd1

rs1 rs2

rd2

inst<25:21> inst<20:16>

GPRs z

ALU

inst<15:0>

inst<31:26> <5:0>

ExtSel OpSel

ALU Control

Imm Ext

BSrc

clk

inst<15:11>

RegWrite

ws wd

we

RegDst

0x4 Add

clk

addr inst

Inst. Memory

PC

0x4 Add

clk

addr inst

Inst. Memory

PC

rf2 / rf3 Reg / Imm

6 5 5 5 5 6 rf3 ← (rf1) func (rf2)

opcode rf1 rf2 immediate rf2 ← (rf1) op immediate 0 rf1 rf2 func 0 rf3

Asanovic/DevadasDatapath for Memory Spring 2002 6.823

Instructions Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory

at some level the two memories have to be the same

Princeton style: the same (von Neumann’s influence) - A Load or Store instruction requires

accessing the memory more than once during its execution

Asanovic/DevadasLoad/Store Instructions: Spring 2002 6.823

Harvard-Style Datapath

RegWrite MemWrite 0x4

Add

clk

addr inst

Inst. Memory

PC

WBSrc ALU / Mem

“base”

disp

ALU Control

z ALU

clk

addr

wdata

rdataData Memory

we

clk

rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

OpCode RegDst ExtSel OpSel BSrc

6 5 5 16 addressing mode (rf1) + displacementopcode rf1 rf2 ent displacem

31 26 25 21 20 16 15 0

rf1 is the base registerrf2 is the destination of a Load or the source for a Store

Asanovic/Devadas Spring 2002

Memory Hierarchy c.2002 6.823

Desktop & Small Server

Proc I$

D$

0.5-2ns 2-3 clk

8~64KB

On-chip Caches

L2

<10ns 5~15 clk 0.25-2MB

SRAM/ eDRAM

Interleaved Banks of DRAM

Hard DiskOff-chip L3 Cache

< 25ns 15~50 clk

1~8MB

~150ns100~300 clk64M~1GB

~10ms seek time~107 clk

20~100GB

Our memory model is a good approximation of the hierarchical memory system when we hit in the on-chip cache

Asanovic/Devadas Spring 2002Conditional Branches 6.823

PCSrc ( ~j / j ) RegWrite MemWrite WBSrc

0x4

clk

RegDst BSrcExtSelOpCode

Add

z

Add

OpSel

clk

zero?

clk

addr inst

Inst. Memory

PC rd1

GPRs

rs1 rs2 ws wd rd2

we

Imm Ext

addr

wdata

rdata Data Memory

we ALU

ALU Control

Asanovic/Devadas Spring 2002Register-Indirect Jumps 6.823

PCSrc ( ~j / j RInd / j PCR ) RegWrite MemWrite WBSrc

0x4

clk

Add

z

Add clk

Jump & Link?

clk

addr inst

Inst. Memory

PC rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdataData Memory

we ALU

ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

Asanovic/DevadasSpring 2002Jump & Link 6.823

0x4

clk

RegDst

RegWrite

BSrc zero?

WBSrc ALU / Mem / PC

31

ExtSelOpCode

Add

z

Add

OpSel

clk

MemWritePCSrc

clk

addr inst

Inst. Memory

PC rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdataData Memory

we ALU

ALU Control

rf3 / rf2 / R31

Asanovic/Devadas Spring 2002PC-Relative Jumps 6.823

RegWrite MemWrite WBSrc

0x4

clk

RegDst

31

OpCode

Add

z

Add

clk

PCSrc

clk

addr inst

Inst. Memory

PC rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdataData Memory

we ALU

No new datapath required

ALU Control

ExtSel OpSel BSrc zero?Ext16 / Ext26


6.823

Hardwired Control is pureCombinational Logic:

Unpipelined DLX

op code

zero?

combinational logic

ExtSel BSrc OpSel MemWrite WBSrc RegDst RegWrite PCSrc

Asanovic/Devadas Spring 2002ALU Control & Immediate 6.823

Extension

Inst<31:26> (Opcode)

Decode Map

Inst<5:0> (Func)

ALUop

0? +

OpSel ( Func, Op, +, 0? )

ExtSel ( sExt16, uExt16,sExt26, High16)

Asanovic/Devadas Spring 2002Hardwired Control worksheet 6.823

PCSrc RegWrite MemWrite WBSrc

0x4

clk

PCR / RInd / ~j ALU / Mem / PC

inst<25:21> inst<20:16>

inst<15:11>

inst<25:0>

inst<31:26><5:0>

31

0x4 Add

z

Add

RegDst

BSrc Reg / Imm

zero?

OpCode

clk

Add

clk

addr inst

Inst. Memory

PC rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdataData Memory

we ALU

ALU Control

ExtSel OpSelrf2 / rf3 / sExt16/uExt16/ Func/R31 sExt26/High16 Op/+ / 0?

Asanovic/DevadasHardwired Control Table Spring 2002 6.823

BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rf2 / rf3 / R31 PCSrc1 = j / ~j PCSrc2 = PCR / RInd

JR JALR

JAL J BEQZz=1

BEQZz=0

SW LW ALUiu ALUi ALUu ALU

PC Src

RegDest

WB Src

RegWrite

Mem Write

OpSel

BSrcExt Sel

Opcode

* * o yes RIndPC R31 RInd* * o no * *

PCRsExt26 * * no yes PC R31 PCRsExt26 * * no no * *

~jsExt16 * ? no no * * PCRsExt16 * ? no no * *

~jsExt16 Imm + yes no * *

~jImm Op no yes ALU rf2

~j* Reg Func no yes ALU rf3 ~j* Reg Func no yes ALU rf3

~jsExt16 Imm Op no yes ALU rf2

~jsExt16 Imm + no yes Mem rf2 uExt16

* n* n

00


6.823

Hardwired Unpipelined Machine

� Simple

� One instruction per cycle

� Why wasn’t this a popular machine style?


6.823Unpipelined DLX

Clock period must be sufficiently long for all of the following steps to be “completed”:

1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time

⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB

� At the rising edge of the following clock, the PC, the register file and the memory are updated

Asanovic/Devadas Spring 2002Pipelined DLX Datapath 6.823

addr

wdata

rdata Data Memory

we ALU

Imm Ext

PC

0x4 Add

addr rdata

Inst. Memory

rd1

GPRs

rs1 rs2

ws wd rd2

we

write -back phase

fetch phase

execute phase

decode & Reg-fetch phase

memory phase

IR PC

Clock period can be reduced by dividing the execution of an instruction into multiple cycles

tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably) However, CPI will increase unless instructions are pipelined

Asanovic/Devadas Spring 2002An Ideal Pipeline 6.823

stage stage stage 2 3 4

stage 1

� All objects go through the same stages

� No sharing of resources between any two stages

� Propagation delay through all pipeline stages is equal

� The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?


6.823Pipelining History

� Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII) � Usually overlap fetch of next instruction with current execution

� IBM Stretch first major “supercomputer” incorporating extensive pipelining, result bypassing, and branch prediction � project started in 1954, delivered in 1961 � didn’t meet initial performance goal of 100x faster with 10x

faster circuits � up to 11 macroinstructions in pipeline at same time � microcode engine highly pipelined also (up to 6

microinstructions in pipeline at same time) � Stretch was origin of 8-bit byte and lower case characters,

carried on into IBM 360


6.823How to divide the datapathinto stages

Suppose memory is significantly slower than other stages. In particular, suppose

tIM = tDM = 10 units tALU = 5 units tRF = tRW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance


6.823Minimizing Critical Path

write -back phase

fetch phase

decode & Reg-fetch phase

memory phase

addr

wdata

rdata Data Memory

we ALU

Imm Ext

PC

0 x4 Add

addr rdata

Inst. Memory

rd1

GPRs

rs1 rs2

ws wd rd2

we

IR

& execute

tC > max {tIM, tRF + tALU, tDM, tRW}

Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase

⇒ increase the critical path by 10%

Asanovic/DevadasMaximum Speedup by Spring 2002 6.823

PipeliningFor the 4-stage pipeline, given

tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit tC could be reduced from 27 units to 10 units

⇒ speedup = 2.7

However, if tIM = tDM = tALU = tRF = tRW = 5 unitsThe same 4-stage pipeline can reduce tC from 25 units to 10 units

⇒ speedup = 2.5

But, since tIM = tDM = tALU = tRF = tRW, it is possible to achieve higher speedup with more stages in the pipeline.

A 5-stage pipeline can reduce tC from 25 units to 5 units ⇒ speedup = 5


6.823

Technology Assumptions We will assume

• A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!).

It makes the following timing assumption valid

tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW

A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design

Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823

fetch

PC

0x4 Add

addr

wdata

rdata Memory

we

executedecode & Reg-fetch memory

IR rd1

GPRs

rs1 rs2 ws wd rd2

we

Imm Ext

addr

wdata

rdata Memory

we ALU

write -back

phase phase phase phase phase(IF) (ID) (EX) (MA) (WB)

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .instruction1 IF1 ID1 EX1 MA1 WB1

instruction2 IF2 ID2 EX2 MA2 WB2




Asanovic/Devadas5-Stage Pipelined Execution Spring 2002 6.823

Resource Usage Diagram

fetch

PC

0x4 Add

addr

wdata

rdata Memory

we

executedecode & Reg-fetch memory

IR rd1

GPRs

rs1 rs2 ws wd rd2

we

Imm Ext

addr

wdata

rdata Memory

we ALU

write -back

phase phase phase phase phase(IF) (ID) (EX) (MA) (WB)

time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5

ID I1 I2 I3 I4 I5

EX I1 I2 I3 I4 I5

MA I1 I2 I3 I4 I5

WB I1 I2 I3 I4 I5Res

ourc

es

Asanovic/DevadasPipelined Execution: Spring 2002 6.823

ALU Instructions

not quite correct!

31PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdata Data Memory

we ALU

Asanovic/DevadasPipelined Execution: Spring 2002 6.823

Need for Several IR’s

31 IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdata Data Memory

we ALU


6.823IRs and Control points

31 IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdata Data Memory

we ALU

Are control points connected properly? - Load/Store instructions - ALU instructions

Asanovic/DevadasSpring 2002Pipelined DLX Datapath 6.823

without jumps

31 IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR rd1

GPRs

rs1 rs2

ws wd rd2

we

Imm Ext

addr

wdata

rdata Data Memory

we ALU

ExtSel

OpSel

BSrc

RegDst

WBSrc MemWrite

RegWrite


6.823

How Instructions can Interact with each other in a Pipeline

• An instruction in the pipeline may need a resource being used by another instruction in the pipeline

structural hazard

• An instruction may produce data that is needed by a later instruction

data hazard

• In the extreme case, an instruction may determine the next instruction to be executed

control hazard (branches, interrupts,...)


6.823Feedback to Resolve Hazards

stage 1

stage 2

stage 3

stage 4

FB1 FB2 FB3 FB4

Controlling pipeline in this manner works providedthe instruction at stage i+1 can complete withoutany interference from instructions in stages 1 to i

(otherwise deadlocks may occur)

Feedback to previous stages is used to stall or kill instructions

Documents

Simple Instruction Pipelining · 2019-09-12 · Pipelining History 6.823 Some very early machines had limited pipelined execution (e.g., Zuse Z4, WWII) Usually overlap fetch of next