45
Latr: Lazy Translation Coherence Mohan Kumar * , Steffen Maass * , Sanidhya Kashyap, J´ an Vesel´ y , Zi Yan , Taesoo Kim, Abhishek Bhattacharjee , Tushar Krishna Georgia Institute of Technology Rutgers University * Co-First Authors March 28, 2018 Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 1 / 24

Latr: Lazy Translation Coherencesanidhya/pubs/2018/latr-slides.pdfLatr: Lazy Translation Coherence Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, J´an Vesel´y‡, Zi Yan‡, Taesoo

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Latr: Lazy Translation Coherence

Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Jan Vesely‡,Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna

Georgia Institute of Technology ‡Rutgers University

* Co-First Authors

March 28, 2018

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 1 / 24

Motivation

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24

Large NUMA machines

Motivation

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24

Large NUMA machines

Terabytes of memory

Motivation

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24

Large NUMA machines

Terabytes of memory

Microsecond latency

Motivation

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24

Large NUMA machines

Terabytes of memory

Microsecond latency

⇒ Problem of Microsecond Latency in System Services⇒ TLB Coherence is Contributor in Important Subset

Impact of TLB coherence on applications

Multi-core MapReduce application

Prior research: 10x increase in shootdown time with increasing corecounts

Web servers (e.g., Apache)

Prior research and our findings: ≈35% of time spent in TLBshootdown

Die-stacked Memory

Swapping between on-chip and off-chip memory

Disaggregated Memory

Swapping between local and remote memory

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 3 / 24

Impact of TLB coherence on applications

Multi-core MapReduce application

Prior research: 10x increase in shootdown time with increasing corecounts

Web servers (e.g., Apache)

Prior research and our findings: ≈35% of time spent in TLBshootdown

Die-stacked Memory

Swapping between on-chip and off-chip memory

Disaggregated Memory

Swapping between local and remote memory

⇒ Can we mitigate this costly TLB shootdown?

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 3 / 24

Table of contents

1 TLB Shootdown Background

2 Latr: Asynchronous TLB Shootdowns

3 Evaluation

4 Conclusion

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 4 / 24

Table of contents

1 TLB Shootdown Background

2 Latr: Asynchronous TLB Shootdowns

3 Evaluation

4 Conclusion

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 5 / 24

Translation lookaside buffer: Introduction

Cache for virtual → physical mapping, per-core structures

Accessed on every load/store

Unlike data caches (L3, etc.), coherence managed by OS

TLB coherence significantly impacts application performance

Virtual Address

PGD

PUD

PMD

PTETLB

Hit:PhysicalAddress

Miss:Page TableWalk Physical

Address

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 6 / 24

TLB coherence: Background

Hardware-based ApproachesProviding cache coherence to TLBsISA-level instruction support (ARM)Microcode-based approaches

Software-based ApproachesCurrent commodity OS design: Use Inter-Processor Interrupts (IPI)Optimization: Reduce number of shootdowns, better trackingMultikernel design: Use Message-Passing

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 7 / 24

TLB coherence: Background

Hardware-based ApproachesProviding cache coherence to TLBsISA-level instruction support (ARM)Microcode-based approaches

Software-based ApproachesCurrent commodity OS design: Use Inter-Processor Interrupts (IPI)Optimization: Reduce number of shootdowns, better trackingMultikernel design: Use Message-Passing

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 7 / 24

⇒ More Hardware Complexity

⇒ TLB shootdowns still significant

TLB shootdown internals in Linux

munmap() on core 1, application running on cores 1, 2, and 5:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ...

OS

TLBTLB TLB TLB TLBTLB TLB TLB

❶Timeline:

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

munmap() on core 1, application running on cores 1, 2, and 5:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ...

OS

TLBTLB TLB TLB TLBTLB TLB TLB

❶Timeline:

❶ munmap()

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

Context switch on core 1, local TLB shootdown:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

TLBTLB TLB TLB TLBTLB TLB TLB

❶ ❷Timeline:

❶ munmap()❷ Local Shootdown

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

Notify cores 2 and 5 via IPI, application blocked on core 1:

TLBTLB TLB TLB TLBTLB TLB TLB

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

❸Spin-wait

❸❶ ❷2.2µs

Timeline:

❶ munmap()❷ Local Shootdown❸ Send IPIs

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

Execute context switch and TLB shootdown on cores 2 and 5:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

TLBTLB TLB TLB TLBTLB TLB TLB

❹ ❹Spin-wait

❹❸❶ ❷2.2µs

Timeline:

❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

Cores 2 and 5 respond ACK via shared memory:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

TLBTLB TLB TLB TLBTLB TLB TLB

❺ ❺Spin-wait

❺❹❸❶ ❷2.2µs

Timeline:

❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown❺ IPI ACK

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

TLB shootdown internals in Linux

Control is returned on all cores, TLB shootdown completed:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

TLBTLB TLB TLB TLBTLB TLB TLB

❻❺❹❸❶ ❷2.2µs

Timeline:

5.9µs}Savings potential for asynchronousapproach with LATR

❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown❺ IPI ACK

munmap() complete❻Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24

Observation

Synchronous TLB shootdown is expensive:Up to 6µs delay with two sockets

Processing IPIs is expensive:Interrupt handler on remote coreLong wait time on initiating core

IPI send-and-wait delay:Unicast delivery of the IPIs (one at a time)

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 9 / 24

TLB shootdown: A necessary evil

Cost of a simple memory unmap operation (munmap()):

1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone

More expensive with more sockets:

0

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16

1 Socket

Latency

(µs)

Cores

munmap()

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24

TLB shootdown: A necessary evil

Cost of a simple memory unmap operation (munmap()):

1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone

More expensive with more sockets:

0

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16

1 Socket 2 Sockets

Latency

(µs)

Cores

munmap()

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24

TLB shootdown: A necessary evil

Cost of a simple memory unmap operation (munmap()):

1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone

More expensive with more sockets:

0

1

2

3

4

5

6

7

8

2 4 6 8 10 12 14 16

Latency

(µs)

Cores

munmap()TLB Shootdown

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24

Table of contents

1 TLB Shootdown Background

2 Latr: Asynchronous TLB Shootdowns

3 Evaluation

4 Conclusion

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 11 / 24

In this talk: Latr

Latr: Lazy Translation Coherence

Perform asynchronous TLB shootdownRemove remote shootdown from the critical pathTake advantage of change in ABI without affecting applications’correctness

Use shared memory instead of IPIEliminate send-and-wait delay of IPIs

Scope:free operations (in this talk)migration operations (see our paper)

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 12 / 24

In this talk: Latr

Latr: Lazy Translation Coherence

Perform asynchronous TLB shootdownRemove remote shootdown from the critical pathTake advantage of change in ABI without affecting applications’correctness

Use shared memory instead of IPIEliminate send-and-wait delay of IPIs

Scope:free operations (in this talk)migration operations (see our paper)

⇒ But: How to perform asynchronous shootdown?

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 12 / 24

Latr States

Store virtual addresses to be flushed

Remote cores shootdown local TLB during

OS context switchOS scheduler tick (upper bound: 1ms in Linux)

Core5 Core6 Core7 Core8

TLB TLB TLB TLB

LATRStates

LATRStates

LATRStates

LATRStates

...S1: start; end; mm; flags; Core list; active S2 S64

LATR States Core1

Cache Coherency

QPI

Core1 Core2 Core3 Core4

TLB TLB TLB TLB

LATRStates

LATRStates

LATRStates

LATRStates

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 13 / 24

Latr: Example

munmap() initiated on core 1:

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

❶Timeline:

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

munmap() initiated on core 1:

App1 Idle IdleApp2 App5 Idle IdleIdle

Application

Operating System

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

❶Timeline:

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

Set up Latr state (for cores 2 and 5), local shootdown:

App1 Idle IdleApp2 App5 Idle IdleIdle

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

start end0x01 0x0F

mm Core list active0x1234 {2, 5} True

Core1, LATR State1:

flags0x1

Timeline:

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()❷ Local Shootdown❸ Create LATR State

❶❶ ❷ ❸

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

Return control on core 1. Time taken: 2.3µs, 70% reduction:

App1 Idle IdleApp2 App5 Idle IdleIdle

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

❶Timeline:

start end0x01 0x0F

mm Core list active0x1234 {2, 5} True

Core1, LATR State1:

flags0x1

2.3µs

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete

❶ ❷ ❸ ❹

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

Scheduler tick on core 2, local shootdown, reset state:

App1 Idle IdleApp2 App5 Idle IdleIdle

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

start end0x01 0x0F

mm Core list active0x1234 {5} True

Core1, LATR State1:

flags0x1

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

Scheduler tick on core 5, local shootdown, reset state:

App1 Idle IdleApp2 App5 Idle IdleIdle

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

start end0x01 0x0F

mm Core list active0x1234 {} False

Core1, LATR State1:

flags0x1

❻Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2

❻ Shootdown Core5

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Latr: Example

Shootdown complete, Latr entry can be reused:

App1 Idle IdleApp2 App5 Idle IdleIdle

OS OS ... OS

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

LATRStates

start end0x01 0x0F

mm Core list active0x1234 {} False

Core1, LATR State1:

flags0x1

Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8

❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2

❻ Shootdown Core5❼ Shootdown complete

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24

Lazy TLB shootdown: Correctness

Same physical memory or virtual memory is reused

Leads to memory corruption

⇒ Avoid same physical/virtual page reuse

Upper bound for TLB shootdown with Latr is 1msOS physical/virtual memory reclamation delayed by two scheduler ticks(2ms)Memory overhead is bounded by 21MB

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 15 / 24

Lazy TLB shootdown: Incorrect accesses

Memory accesses before Latr shootdown:

Consequence of incorrect application: Use After FreeBefore Latr shootdown, access (reads and writes) allowedExists in the current OS implementationAfter Latr shootdown, access results in segmentation fault

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 16 / 24

Scope of Latr

ABI change for free operations

Support for operations limited to few, frequently used operations:

Classification OperationsLazy operation

possible

Freemunmap(): unmap address range ✓madvise(): free memory range ✓

MigrationAutoNUMA page migration (⇒ See paper) ✓Page swap: swap page to disk ✓

Permission mprotect(): change page permission -

Ownership CoW: Copy on Write -

Remap mremap(): change physical address -

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 17 / 24

Table of contents

1 TLB Shootdown Background

2 Latr: Asynchronous TLB Shootdowns

3 Evaluation

4 Conclusion

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 18 / 24

Evaluation: Questions

Latr prototype developed for Linux 4.10

Evaluation questions

What are Latr’s benefits with microbenchmarks?What are Latr’s benefits with real-world applications exhibiting manyTLB shootdowns?What is the cost for Latr?

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 19 / 24

Microbenchmark on eight sockets

Linux and Latr calling munmap() with one page on 120 cores:

0

20

40

60

80

100

120

140

20 40 60 80 100 120

Cost of munmap

20 40 60 80 100 1200

20

40

60

80

100

120

140Cost of TLB Shootdown

Latency

(µs)

Cores

LinuxLatr

Latency

(µs)

Cores

⇒ Up to 66.7% reduction for munmap()

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 20 / 24

Serving files with Apache

Linux, ABIS [ATC17], and Latr on 2 sockets:

0k

20k

40k

60k

80k

100k

120k

140k

160k

2 4 6 8 10 12

Apache Performance

2 4 6 8 10 120k

5k

10k

15k

20k

25k

30k

35kTLB Shootdowns per second

Req

uests

per

seco

nd

Cores

LinuxABISLatr

TLB

Shootdow

nsper

seco

nd

Cores

⇒ Up to 59.9% more requestssecond than Linux, 37.9% higher than ABIS.

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 21 / 24

Cost of Latr

Memory overhead is bounded by 21MB

Performance overheads for applications with few TLB shootdowns:

0.97

0.98

0.99

1.00

1.01

1.02

1.03

nginx 1

Apach

e 1

bodytrack

16

canneal 16

facesim

16

ferret

16

streamcluster

16

0

20

40

60

80

100

Overhead

TLB

Shootdow

nsper

secNormalized application performanceShootdowns per second

⇒ Latr shows small performance overheads of up to 1.7% due to addedoperations during scheduling.

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 22 / 24

Future work

Further applications of Latr in:

Disaggregated data centers

Heterogeneous memory

Applicability to PCID/ASID-based approaches

Impact on new features such as KPTI, . . . ?

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 23 / 24

Latr: Takeaways

The synchronous TLB shootdown is expensive

We propose a software-based asynchronous shootdown mechanism

Significant improvement in application performance with Latr

70% reduction for munmap(), for 16-core and 120-core machinesImproves Apache’s throughput by 60%

Asynchronous mechanism applicable to other services:

AutoNUMA (see our paper)

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 24 / 24

Latr: Takeaways

The synchronous TLB shootdown is expensive

We propose a software-based asynchronous shootdown mechanism

Significant improvement in application performance with Latr

70% reduction for munmap(), for 16-core and 120-core machinesImproves Apache’s throughput by 60%

Asynchronous mechanism applicable to other services:

AutoNUMA (see our paper)

Thanks!

Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 24 / 24