Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Latr: Lazy Translation Coherence
Mohan Kumar*, Steffen Maass*, Sanidhya Kashyap, Jan Vesely‡,Zi Yan‡, Taesoo Kim, Abhishek Bhattacharjee‡, Tushar Krishna
Georgia Institute of Technology ‡Rutgers University
* Co-First Authors
March 28, 2018
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 1 / 24
Motivation
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24
Large NUMA machines
Terabytes of memory
Motivation
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24
Large NUMA machines
Terabytes of memory
Microsecond latency
Motivation
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 2 / 24
Large NUMA machines
Terabytes of memory
Microsecond latency
⇒ Problem of Microsecond Latency in System Services⇒ TLB Coherence is Contributor in Important Subset
Impact of TLB coherence on applications
Multi-core MapReduce application
Prior research: 10x increase in shootdown time with increasing corecounts
Web servers (e.g., Apache)
Prior research and our findings: ≈35% of time spent in TLBshootdown
Die-stacked Memory
Swapping between on-chip and off-chip memory
Disaggregated Memory
Swapping between local and remote memory
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 3 / 24
Impact of TLB coherence on applications
Multi-core MapReduce application
Prior research: 10x increase in shootdown time with increasing corecounts
Web servers (e.g., Apache)
Prior research and our findings: ≈35% of time spent in TLBshootdown
Die-stacked Memory
Swapping between on-chip and off-chip memory
Disaggregated Memory
Swapping between local and remote memory
⇒ Can we mitigate this costly TLB shootdown?
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 3 / 24
Table of contents
1 TLB Shootdown Background
2 Latr: Asynchronous TLB Shootdowns
3 Evaluation
4 Conclusion
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 4 / 24
Table of contents
1 TLB Shootdown Background
2 Latr: Asynchronous TLB Shootdowns
3 Evaluation
4 Conclusion
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 5 / 24
Translation lookaside buffer: Introduction
Cache for virtual → physical mapping, per-core structures
Accessed on every load/store
Unlike data caches (L3, etc.), coherence managed by OS
TLB coherence significantly impacts application performance
Virtual Address
PGD
PUD
PMD
PTETLB
Hit:PhysicalAddress
Miss:Page TableWalk Physical
Address
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 6 / 24
TLB coherence: Background
Hardware-based ApproachesProviding cache coherence to TLBsISA-level instruction support (ARM)Microcode-based approaches
Software-based ApproachesCurrent commodity OS design: Use Inter-Processor Interrupts (IPI)Optimization: Reduce number of shootdowns, better trackingMultikernel design: Use Message-Passing
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 7 / 24
TLB coherence: Background
Hardware-based ApproachesProviding cache coherence to TLBsISA-level instruction support (ARM)Microcode-based approaches
Software-based ApproachesCurrent commodity OS design: Use Inter-Processor Interrupts (IPI)Optimization: Reduce number of shootdowns, better trackingMultikernel design: Use Message-Passing
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 7 / 24
⇒ More Hardware Complexity
⇒ TLB shootdowns still significant
TLB shootdown internals in Linux
munmap() on core 1, application running on cores 1, 2, and 5:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ...
❶
OS
TLBTLB TLB TLB TLBTLB TLB TLB
❶Timeline:
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
munmap() on core 1, application running on cores 1, 2, and 5:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ...
❶
OS
TLBTLB TLB TLB TLBTLB TLB TLB
❶Timeline:
❶ munmap()
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
Context switch on core 1, local TLB shootdown:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
TLBTLB TLB TLB TLBTLB TLB TLB
❷
❶ ❷Timeline:
❶ munmap()❷ Local Shootdown
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
Notify cores 2 and 5 via IPI, application blocked on core 1:
TLBTLB TLB TLB TLBTLB TLB TLB
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
❸Spin-wait
❸❶ ❷2.2µs
Timeline:
❶ munmap()❷ Local Shootdown❸ Send IPIs
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
Execute context switch and TLB shootdown on cores 2 and 5:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
TLBTLB TLB TLB TLBTLB TLB TLB
❹ ❹Spin-wait
❹❸❶ ❷2.2µs
Timeline:
❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
Cores 2 and 5 respond ACK via shared memory:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
TLBTLB TLB TLB TLBTLB TLB TLB
❺ ❺Spin-wait
❺❹❸❶ ❷2.2µs
Timeline:
❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown❺ IPI ACK
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
TLB shootdown internals in Linux
Control is returned on all cores, TLB shootdown completed:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
❻
TLBTLB TLB TLB TLBTLB TLB TLB
❻❺❹❸❶ ❷2.2µs
Timeline:
5.9µs}Savings potential for asynchronousapproach with LATR
❶ munmap()❷ Local Shootdown❸ Send IPIs❹ Remote Shootdown❺ IPI ACK
munmap() complete❻Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 8 / 24
Observation
Synchronous TLB shootdown is expensive:Up to 6µs delay with two sockets
Processing IPIs is expensive:Interrupt handler on remote coreLong wait time on initiating core
IPI send-and-wait delay:Unicast delivery of the IPIs (one at a time)
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 9 / 24
TLB shootdown: A necessary evil
Cost of a simple memory unmap operation (munmap()):
1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone
More expensive with more sockets:
0
1
2
3
4
5
6
7
8
2 4 6 8 10 12 14 16
1 Socket
Latency
(µs)
Cores
munmap()
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24
TLB shootdown: A necessary evil
Cost of a simple memory unmap operation (munmap()):
1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone
More expensive with more sockets:
0
1
2
3
4
5
6
7
8
2 4 6 8 10 12 14 16
1 Socket 2 Sockets
Latency
(µs)
Cores
munmap()
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24
TLB shootdown: A necessary evil
Cost of a simple memory unmap operation (munmap()):
1 page on 16 cores with 2 sockets: up to 8µs≈ 70% from TLB shootdown alone
More expensive with more sockets:
0
1
2
3
4
5
6
7
8
2 4 6 8 10 12 14 16
Latency
(µs)
Cores
munmap()TLB Shootdown
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 10 / 24
Table of contents
1 TLB Shootdown Background
2 Latr: Asynchronous TLB Shootdowns
3 Evaluation
4 Conclusion
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 11 / 24
In this talk: Latr
Latr: Lazy Translation Coherence
Perform asynchronous TLB shootdownRemove remote shootdown from the critical pathTake advantage of change in ABI without affecting applications’correctness
Use shared memory instead of IPIEliminate send-and-wait delay of IPIs
Scope:free operations (in this talk)migration operations (see our paper)
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 12 / 24
In this talk: Latr
Latr: Lazy Translation Coherence
Perform asynchronous TLB shootdownRemove remote shootdown from the critical pathTake advantage of change in ABI without affecting applications’correctness
Use shared memory instead of IPIEliminate send-and-wait delay of IPIs
Scope:free operations (in this talk)migration operations (see our paper)
⇒ But: How to perform asynchronous shootdown?
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 12 / 24
Latr States
Store virtual addresses to be flushed
Remote cores shootdown local TLB during
OS context switchOS scheduler tick (upper bound: 1ms in Linux)
Core5 Core6 Core7 Core8
TLB TLB TLB TLB
LATRStates
LATRStates
LATRStates
LATRStates
...S1: start; end; mm; flags; Core list; active S2 S64
LATR States Core1
Cache Coherency
QPI
Core1 Core2 Core3 Core4
TLB TLB TLB TLB
LATRStates
LATRStates
LATRStates
LATRStates
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 13 / 24
Latr: Example
munmap() initiated on core 1:
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
❶
❶Timeline:
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
munmap() initiated on core 1:
App1 Idle IdleApp2 App5 Idle IdleIdle
Application
Operating System
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
❶
❶Timeline:
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
Set up Latr state (for cores 2 and 5), local shootdown:
App1 Idle IdleApp2 App5 Idle IdleIdle
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
start end0x01 0x0F
mm Core list active0x1234 {2, 5} True
Core1, LATR State1:
flags0x1
Timeline:
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()❷ Local Shootdown❸ Create LATR State
❸
❶❶ ❷ ❸
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
Return control on core 1. Time taken: 2.3µs, 70% reduction:
App1 Idle IdleApp2 App5 Idle IdleIdle
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
❶Timeline:
start end0x01 0x0F
mm Core list active0x1234 {2, 5} True
Core1, LATR State1:
flags0x1
2.3µs
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete
❶ ❷ ❸ ❹
❹
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
Scheduler tick on core 2, local shootdown, reset state:
App1 Idle IdleApp2 App5 Idle IdleIdle
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
start end0x01 0x0F
mm Core list active0x1234 {5} True
Core1, LATR State1:
flags0x1
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2
❺
❺
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
Scheduler tick on core 5, local shootdown, reset state:
App1 Idle IdleApp2 App5 Idle IdleIdle
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
start end0x01 0x0F
mm Core list active0x1234 {} False
Core1, LATR State1:
flags0x1
❻Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2
❻ Shootdown Core5
❻
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Latr: Example
Shootdown complete, Latr entry can be reused:
App1 Idle IdleApp2 App5 Idle IdleIdle
OS OS ... OS
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
LATRStates
start end0x01 0x0F
mm Core list active0x1234 {} False
Core1, LATR State1:
flags0x1
Core1 Core2 Core3 Core4 Core5 Core6 Core7 Core8
❶ munmap()❷ Local Shootdown❸ Create LATR State❹ munmap() complete❺ Shootdown Core2
❻ Shootdown Core5❼ Shootdown complete
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 14 / 24
Lazy TLB shootdown: Correctness
Same physical memory or virtual memory is reused
Leads to memory corruption
⇒ Avoid same physical/virtual page reuse
Upper bound for TLB shootdown with Latr is 1msOS physical/virtual memory reclamation delayed by two scheduler ticks(2ms)Memory overhead is bounded by 21MB
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 15 / 24
Lazy TLB shootdown: Incorrect accesses
Memory accesses before Latr shootdown:
Consequence of incorrect application: Use After FreeBefore Latr shootdown, access (reads and writes) allowedExists in the current OS implementationAfter Latr shootdown, access results in segmentation fault
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 16 / 24
Scope of Latr
ABI change for free operations
Support for operations limited to few, frequently used operations:
Classification OperationsLazy operation
possible
Freemunmap(): unmap address range ✓madvise(): free memory range ✓
MigrationAutoNUMA page migration (⇒ See paper) ✓Page swap: swap page to disk ✓
Permission mprotect(): change page permission -
Ownership CoW: Copy on Write -
Remap mremap(): change physical address -
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 17 / 24
Table of contents
1 TLB Shootdown Background
2 Latr: Asynchronous TLB Shootdowns
3 Evaluation
4 Conclusion
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 18 / 24
Evaluation: Questions
Latr prototype developed for Linux 4.10
Evaluation questions
What are Latr’s benefits with microbenchmarks?What are Latr’s benefits with real-world applications exhibiting manyTLB shootdowns?What is the cost for Latr?
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 19 / 24
Microbenchmark on eight sockets
Linux and Latr calling munmap() with one page on 120 cores:
0
20
40
60
80
100
120
140
20 40 60 80 100 120
Cost of munmap
20 40 60 80 100 1200
20
40
60
80
100
120
140Cost of TLB Shootdown
Latency
(µs)
Cores
LinuxLatr
Latency
(µs)
Cores
⇒ Up to 66.7% reduction for munmap()
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 20 / 24
Serving files with Apache
Linux, ABIS [ATC17], and Latr on 2 sockets:
0k
20k
40k
60k
80k
100k
120k
140k
160k
2 4 6 8 10 12
Apache Performance
2 4 6 8 10 120k
5k
10k
15k
20k
25k
30k
35kTLB Shootdowns per second
Req
uests
per
seco
nd
Cores
LinuxABISLatr
TLB
Shootdow
nsper
seco
nd
Cores
⇒ Up to 59.9% more requestssecond than Linux, 37.9% higher than ABIS.
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 21 / 24
Cost of Latr
Memory overhead is bounded by 21MB
Performance overheads for applications with few TLB shootdowns:
0.97
0.98
0.99
1.00
1.01
1.02
1.03
nginx 1
Apach
e 1
bodytrack
16
canneal 16
facesim
16
ferret
16
streamcluster
16
0
20
40
60
80
100
Overhead
TLB
Shootdow
nsper
secNormalized application performanceShootdowns per second
⇒ Latr shows small performance overheads of up to 1.7% due to addedoperations during scheduling.
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 22 / 24
Future work
Further applications of Latr in:
Disaggregated data centers
Heterogeneous memory
Applicability to PCID/ASID-based approaches
Impact on new features such as KPTI, . . . ?
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 23 / 24
Latr: Takeaways
The synchronous TLB shootdown is expensive
We propose a software-based asynchronous shootdown mechanism
Significant improvement in application performance with Latr
70% reduction for munmap(), for 16-core and 120-core machinesImproves Apache’s throughput by 60%
Asynchronous mechanism applicable to other services:
AutoNUMA (see our paper)
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 24 / 24
Latr: Takeaways
The synchronous TLB shootdown is expensive
We propose a software-based asynchronous shootdown mechanism
Significant improvement in application performance with Latr
70% reduction for munmap(), for 16-core and 120-core machinesImproves Apache’s throughput by 60%
Asynchronous mechanism applicable to other services:
AutoNUMA (see our paper)
Thanks!
Mohan Kumar Latr: Lazy Translation Coherence March 28, 2018 24 / 24