
Reining in the Outliers in Map-Reduce Clusters using Mantri

Ganesh Ananthanarayanan†⋄  Srikanth Kandula†  Albert Greenberg†  Ion Stoica⋄  Yi Lu†  Bikas Saha‡  Edward Harris‡

† Microsoft Research  ⋄ UC Berkeley  ‡ Microsoft Bing

Abstract– Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.

1 Introduction

In a very short time, Map-Reduce has become the dominant paradigm for large data processing on compute clusters. Software frameworks based on Map-Reduce [, , ] have been deployed on tens of thousands of machines to implement a variety of applications, such as building search indices, optimizing advertisements, and mining social networks.

While highly successful, Map-Reduce clusters come with their own set of challenges. One such challenge is the often unpredictable performance of Map-Reduce jobs. A job consists of a set of tasks which are organized in phases. Tasks in a phase depend on the results computed by the tasks in the previous phase and can run in parallel. When a task takes longer to finish than other similar tasks, tasks in the subsequent phase are delayed. At key points in the job, a few such outlier tasks can prevent the rest of the job from making progress. As the size of the cluster and the size of the jobs grow, the impact of outliers increases dramatically. Addressing the outlier problem is critical to speed up job completion and improve cluster efficiency.

Even a few percent of improvement in the efficiency of a cluster consisting of tens of thousands of nodes can save millions of dollars a year. In addition, finishing production jobs quickly is a competitive advantage, and doing so predictably allows SLAs to be met. In iterative modify/debug/analyze development cycles, the ability to iterate faster improves programmer productivity.

In this paper, we characterize the impact and causes of outliers by measuring a large Map-Reduce production cluster. This cluster is up to two orders of magnitude larger than those in previous publications [, , ] and exhibits a high level of concurrency due to many jobs simultaneously running on the cluster and many tasks on a machine. We find that variation in completion times among functionally similar tasks is large and that outliers inflate the completion time of jobs by 34% at median.

We identify three categories of root causes for outliers that are induced by the interplay between storage, network and structure of Map-Reduce jobs. First, machine characteristics play a key role in the performance of tasks. These include static aspects such as hardware reliability (e.g., disk failures) and dynamic aspects such as contention for processor, memory and other resources. Second, network characteristics impact the data transfer rates of tasks. Datacenter networks are over-subscribed, leading to variance in congestion among different paths. Finally, the specifics of Map-Reduce lead to imbalance in work – partitioning data over a low-entropy key space often leads to a skew in the input sizes of tasks.

We present Mantri¹, a system that monitors tasks and culls outliers based on their causes. It uses the following techniques: (i) restarting outlier tasks cognizant of resource constraints and work imbalances, (ii) network-aware placement of tasks, and (iii) protecting output of tasks based on a cost-benefit analysis.

The detailed analysis and decision process employed by Mantri is a key departure from the state-of-the-art for outlier mitigation in Map-Reduce implementations [, , ]; these focus only on duplicating tasks. To our knowledge, none of them protect against data loss induced recomputations or network congestion induced outliers. Mantri places tasks based on the locations of their data sources as well as the current utilization of network links. On a task's completion, Mantri replicates its output if the benefit of not having to recompute outweighs the cost of replication.

¹ From Sanskrit, a minister who keeps the king's court in order.



Further, Mantri performs intelligent restarting of outliers. A task that runs for long because it has more work to do will not be restarted; if it lags due to reading data over a low-bandwidth path, it will be restarted only if a more advantageous network location becomes available. Unlike current approaches that duplicate tasks only at the end of a phase, Mantri uses real-time progress reports to act early. While early action on outliers frees up resources that could be used for pending tasks, doing so is nontrivial: a duplicate may finish faster than the original task but has the opportunity cost of consuming resources that other pending work could have used.

In summary, we make the following contributions. First, we provide an analysis of the causes of outliers in a large production Map-Reduce cluster. Second, we develop Mantri, which takes early actions based on understanding the causes and the opportunity cost of actions. Finally, we perform an extensive evaluation of Mantri and compare it to existing solutions.

Mantri has run live in all of Bing's production clusters since May 2010. Results from a deployment of Mantri on a production cluster of thousands of servers, and from replaying several thousand jobs collected on this cluster in a simulator, show that:
• Mantri reduces the completion time of jobs by 32% on average on the production clusters. Extensive simulations show that job phases are quicker by 21% and 42% at the 50th and 75th percentiles. Mantri's median reduction in completion time improves on the next best scheme by 3.1x while using fewer resources.

• By placing reduce tasks to avoid network hotspots, Mantri improves the completion times of the reduce phases by 60%.

• By preferentially replicating the output of tasks that are more likely to be lost or expensive to recompute, Mantri speeds up half of the jobs by at least 20% each while only increasing the network traffic by 1%.

2 Background

We monitored the cluster and software systems that support the Bing search engine for over twelve months. This is a cluster of tens of thousands of commodity servers managed by Cosmos [], a proprietary upgraded form of Dryad []. Despite a few differences, implementations of Map-Reduce [, , , ] are broadly similar.

Most of the jobs in the examined cluster are written in Scope [], a mash-up language that mixes SQL-like declarative statements with user code. The Scope compiler transforms a job into a workflow – a directed acyclic graph where each node is a phase and each edge joins a phase that produces data to another that uses it. A phase is a set of one or more tasks that run in parallel and perform the same computation on different parts of the input stream. Typical phases are map, reduce and join. The number of tasks in a phase is chosen at compile time. A task will read its input over the network if it is not available on the local disk, but outputs are written to the local disk. The eventual outputs of a job (as well as raw data) are stored in a reliable block storage system implemented on the same servers that do computation. Blocks are replicated n-ways for reliability. A run-time scheduler assigns tasks to machines based on data locations, dependence patterns and cluster-wide resource availability. The network layout provides more bandwidth within a rack than across racks.

We obtain detailed logs from the Scope compiler and the Cosmos scheduler. At each of the job, phase and task levels, we record the execution behavior as represented by begin and end times, the machine(s) involved, the sizes of input and output data, the fraction of data that was read across racks, and a code denoting the success or type of failure. We also record the workflow of jobs. Table 1 depicts the random subset of logs that we analyze here. Spanning eighteen days, this dataset is at least one order of magnitude larger than prior published data along many dimensions, e.g., number of jobs, cluster size.

3 The Outlier Problem

We begin with a first-principles approach to the outlier problem, then analyze data from the production cluster to quantify the problem and obtain a breakdown of the causes of outliers (§4). Beginning at first principles motivates a distinct approach (§5) which, as we show in §6, significantly improves on prior art.

3.1 Outliers in a Phase

Assume a phase consists of n tasks and has s slots. A slot is a virtual token, akin to a quota, for sharing cluster resources among multiple jobs. One task can run per slot at a time. On our cluster, the median ratio of n/s is 2.11 with a stdev of 12.37. The goal is to minimize the phase completion time, i.e., the time when the last task finishes.

Based on data from the production cluster, we model ti, the completion time of task i, as a function of the size of the data it processes, the code it runs, the resources available on the machine it executes on, and the bandwidth available on the network paths involved:

    ti = f(datasize, code, machine, network).    (1)

Large variation exists along each of the four variables, leading to considerable difference in task completion times. The amount of data processed by tasks in the same phase varies, sometimes widely, due to limitations in dividing work evenly. The code is the same for tasks in a phase, but differs significantly across phases (e.g., map and reduce).


Table 1: Details of the logs from a production cluster consisting of thousands of servers. Each row, one per month from May through January, lists the number of phases (x10^3), jobs, compute (years), data (PB) and network traffic (PB); the numeric entries did not survive extraction.

Placing a task on a machine that has other resource-hungry tasks inflates completion time, as does reading data across congested links.

In the ideal scenario, where every task takes the same amount of time, say T, scheduling is simple. Any work-conserving schedule would complete the phase in (⌈n/s⌉ × T). When the task completion time varies, however, a naive work-conserving scheduler can take up to (Σi ti / s + max ti). A large variation in ti increases the term max ti and manifests as outliers.

The goal of a scheduler is to minimize the phase completion time and make it closer to Σi ti / s. Sometimes, it can do even better: by placing tasks at less congested machines or network locations, the ti's themselves can be lowered. The challenge lies in recognizing the aspects that can be changed and scheduling accordingly.
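To make the two expressions concrete, here is a minimal sketch (ours, not from the paper) that computes them for a set of task durations:

    def ideal_completion(durations, slots):
        # Lower bound a scheduler strives for: sum(t_i) / s.
        return sum(durations) / slots

    def naive_worst_case(durations, slots):
        # Upper bound for a naive work-conserving schedule:
        # sum(t_i) / s + max(t_i); outliers inflate the max term.
        return sum(durations) / slots + max(durations)

    # Five tasks on two slots; one 5x outlier dominates the worst case.
    durations = [10, 10, 10, 10, 50]
    print(ideal_completion(durations, 2))   # 45.0
    print(naive_worst_case(durations, 2))   # 95.0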

3.2 Extending from a phase to a job

The phase structure of Map-Reduce jobs adds to the variability. An outlier in an early phase, by delaying when tasks that use its output may start, has cumulative effects on the job. At barriers in the workflow, where none of the tasks in successive phase(s) can begin until all of the tasks in the preceding phase(s) finish, even one outlier can bring the job to a standstill.² Barriers occur primarily due to reduce operations that are neither commutative nor associative [], for instance a reduce that computes the median of records that have the same key. In our cluster, the median job workflow has eight phases and eleven edges, some of which are barriers (the number of edges exceeds the number of phases due to table joins).

Dependency across phases also leads to outliers when task output is lost and needs to be recomputed. Data loss happens due to a combination of disk errors, software errors (e.g., bugs in garbage collectors) and timeouts due to machines going unresponsive at times of high load.

² There is a variant in implementation where a slot is reserved for a task before all its inputs are ready. This is either to amortize the latency of network transfer by moving data over the network as soon as it is generated [, ], or to compute partial results and present answers online even before the job is complete []. Regardless, pre-allocation of slots hogs resources for longer periods if the input task(s) straggle.

Figure 1: An example job from the production cluster. (a) Partial workflow with the number of tasks in each phase: Extract (22K) feeds Partition (13K), which feeds Aggregate (51K) across a barrier, with output to the file system. (b) Time lapse of task execution, plotting the number of running tasks per phase (normalized by the maximum in that phase) against time (normalized by job lifetime); R marks recomputes, B marks the barrier.

In fact, recomputes cause some of the longest waiting times observed on the production cluster. A recompute can cascade into earlier phases if the inputs for the recomputed task are no longer available and need to be regenerated.

3.3 Illustration of Outliers

Figure 1(a) shows the workflow for a job whose structure is typical of those in the cluster. The job reads a dataset of search usage and derives an index. It consists of two Map-Reduce operations and a join, but for clarity we only show the first Map-Reduce here. Phase names follow the Dryad [] convention: extract reads raw blocks, partition divides data on the key, and aggregate reduces items that share a key.

Figure 1(b) depicts a timeline of an execution of this workflow. It plots the number of tasks of each phase that are active, normalized by the maximum tasks active at any time in that phase, over the lifetime of the job. Tasks in the first two phases start in quick succession to each other, whereas the third starts after a barrier.

Some of the outliers are evident in the long lulls before a phase ends, when only a few of its tasks are active. The spike in the aggregate phase is due to outliers in the earlier phases holding on to the job's slots. At the barrier, just a few outliers hold back the job from making forward progress, and though most aggregate tasks finish early, the phase persists for much longer.

The worst cases of waiting immediately follow recomputations of lost intermediate data, marked by R. Recomputations manifest as tiny blips near the x-axis for phases that had finished earlier; for example, a phase sees recomputes long after it finished, and at one point aggregate almost stops due to a few recomputations.


We now quantify the magnitude of the outlier problem before presenting our solution in detail.

4 Quantifying the Outlier Problem

We characterize the prevalence and causes of outliers and their impact on job completion times and cluster resource usage. We will argue that three factors – dynamics, concurrency and scale – which are somewhat unique to large Map-Reduce clusters run for efficient and economic operation, lie at the core of the outlier problem. To our knowledge, we are the first to report detailed experiences from a large production Map-Reduce cluster.

4.1 Prevalence of Outliers

Figure 2(a) plots the fraction of high-runtime outliers and recomputes in a phase. For exposition, we arbitrarily say that a task has high runtime if its time to finish is longer than 1.5x the median task duration in its phase. By recomputes, we mean instances where a task output is lost and dependent tasks wait until the output is regenerated.
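As an illustrative sketch of this working definition (ours, using the 1.5x threshold above; Mantri's own detection instead relies on the estimators of §5.5):

    def high_runtime_outliers(durations, threshold=1.5):
        # Flag tasks that run longer than threshold times the median
        # task duration of their phase.
        ordered = sorted(durations)
        median = ordered[len(ordered) // 2]
        return [i for i, t in enumerate(durations) if t > threshold * median]

    print(high_runtime_outliers([4, 5, 5, 6, 30]))  # [4]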

We see in Figure 2(a) that 25% of phases have more than 15% of their tasks as outliers. The figure also shows that 99% of the phases see no recomputes. Though rare, recomputes have a widespread impact (§4.3). Two out of a thousand phases have over half of their tasks waiting for data to be recomputed.

How much longer do outliers run for? Figure 2(b) shows that 80% of the runtime outliers last less than 2.5 times the phase's median task duration, with a uniform probability of being delayed by between 1.5x and 2.5x. The tail is heavy and long – some take more than 10x the median duration. Ignoring these if they happen early in a phase, as current approaches do, appears wasteful.

Figure 2(b) also shows that most recomputations behave normally and are clustered about the median task, but a small fraction take over 10x longer.

4.2 Causes of Outliers

To tease apart the contributions of each cause, we first determine whether a task's runtime can be explained by the amount of data it processes or reads across the network.³ If yes, then the outlier is likely due to workload imbalance or poor placement. Otherwise, the outlier is likely due to resource contention or problematic machines.

Figure 3(a) shows that in a sizeable fraction of the phases (top right), all the tasks with high runtimes (i.e., over 1.5x the median task) are well explained by the amount of data they process or move on the network.

³ For each phase, we fit a linear regression model for task lifetime given the size of input and the volume of traffic moved across low-bandwidth links. When the residual error for a task is small, i.e., its run time is close to the time predicted by this model, we call it explainable.

Figure 2: Prevalence of outliers. (a) CDF over phases of the fraction of tasks that are outliers (high runtime or recompute). (b) CDF of the ratio of the straggler's duration to the duration of the median task in its phase (high runtime or recompute).

Duplicating these tasks would not make them run faster and will waste resources. At the other extreme, in a few of the phases (bottom left), none of the high-runtime tasks are explained by the data they process. Figure 3(b) shows tasks that take longer than they should, as predicted by the model, but do not take over 1.5x the median task in their phase. Such tasks present an opportunity for improvement: they may finish faster if run elsewhere, yet current schemes do nothing for them. A number of phases (on the top right) have many such improvable tasks.
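One way to realize this explainability test is sketched below (our code, under assumed details: a per-phase least-squares fit and a placeholder tolerance, since the paper's exact residual threshold was lost in extraction):

    import numpy as np

    def explainable(input_sizes, crossrack_bytes, runtimes, tol=0.2):
        # Fit runtime ~ a*input + b*crossrack_traffic + c for one phase,
        # then call a task explainable if its runtime is within
        # (1 +/- tol) of the model's prediction.
        X = np.column_stack([input_sizes, crossrack_bytes,
                             np.ones(len(runtimes))])
        y = np.asarray(runtimes, dtype=float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ratio = y / (X @ coef)
        return (ratio > 1 - tol) & (ratio < 1 + tol)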

Data Skew: It is natural to ask why data size varies across tasks in a phase. Across phases, the coefficient of variation (stdev/mean) in data size is considerable at the median and grows large in the tail. From experience, dividing work evenly is non-trivial for a few reasons. First, scheduling each additional task has overhead at the job manager. Network bandwidth is another reason: there might be too much data on a machine for a task to process, but it may be worse to split the work into multiple tasks and move data over the network. A third reason is poor coding practice. If the data is partitioned on a key space that has too little entropy, i.e., a few keys correspond to a lot of data, then the partitions will differ in size. Some reduce tasks are not amenable to splitting (neither commutative nor associative []), and hence each partition has to be processed by one task. Some joins and sorts are similarly constrained. Duplicating tasks that run for long because they have a lot of work to do is counter-productive.

Cross-rack Traffic: Reduce phases contribute most of the cross-rack traffic in the cluster, while the bulk of the rest is due to joins. We focus on cross-rack traffic because the links upstream of the racks have less bandwidth than the cumulative capacity of the servers in the rack.

We find that cross-rack traffic leads to outliers in two ways. First, in phases where moving data across racks is avoidable (through locality constraints), a task that ends up in a disadvantageous network location runs slower than others. Second, in phases where moving data across racks is unavoidable, not accounting for the competition among tasks within the phase (self-interference) leads to outliers. In a reduce phase, for example, each task reads from every map task.


Figure 3: Contribution of data size to task runtime (see §4.2). (a) CDF over phases of the percentage of tasks that have high runtime but are explainable. (b) CDF over phases of the percentage of tasks that are unexplainably long, but not long per se.

Figure 4: For reduce phases, the CDF of the ideal reduction (%) in completion time over the current placement by placing tasks in a network-aware fashion; the annotated median is 62.8%.

Figure 5: The ratio of processor and memory usage when recomputations happen to the average at that machine (left y-axis), and the cumulative percentage of recomputations across machines (right y-axis), against the fraction of the cluster (%).

Since the maps are spread across the cluster, regardless of where a reduce task is placed, it will read a lot of data from other racks. Current implementations place reduce tasks on any machine with spare slots. A rack that has too many reduce tasks will be congested on its downlink, leading to outliers.

Figure 4 compares the current placement with an ideal one that minimizes the impact of network transfer. When possible it avoids reading data across racks and, if not, places tasks such that their competition for bandwidth does not result in hotspots. In a large share of the jobs, reduce phases account for a substantial fraction of the job's lifetime. For the reduce phases, the figure shows that the median phase takes considerably longer under the current placement; network-aware placement would reduce its completion time by 62.8%.

Bad and Busy Machines: We rarely find machines that persistently inflate runtimes. Recomputations, however, are more localized: half of them happen on a small fraction of the machines in the cluster. Figure 5 plots the cumulative share of recomputes across machines on the right axis. The figure also plots the ratio of processor and memory utilization during recomputes to the overall average on that machine. The occurrence of recomputes is correlated with increased use of resources.

Figure 6: Clustering of recomputations and outliers, plotted as time (minutes) against machine id for a subset of machines; runtime outliers cluster by time while recomputes cluster by machine.

The subset of machines that triggers most of the recomputes is steady over days but varies over weeks, likely indicative of changing hotspots in data popularity or corruption in disks [].

Figure 6 investigates the occurrence of "spikes" in outliers. For legibility, we only plot a subset of the machines. We find that runtime outliers (shown as stars) cluster by time. If outliers were happening at random, there should not be any horizontal bands. Rather, it appears that jobs contend for resources at some times. Even at these busy times, other lightly loaded machines exist. Recomputations (shown as circles) cluster by machine. When a machine loses the output of a task, it has a higher chance of losing the output of other tasks.

Rarely does an entire rack of servers experience the same anomaly. When an anomaly happens, the fraction of other machines within the rack that see the same anomaly is less than 1/20 for recomputes, and 4/20 for runtime, with high probability. So it is possible to restart a task, or replicate output to protect against loss, on another machine within the same rack as the original machine.

4.3 Impact of Outliers

We now examine the impact of outliers on job completion times and cluster usage. Figure 7 plots the CDF of the ratio of job completion times, with different types of outliers included, to an ideal execution that neither has skewed run times nor loses intermediate data. The y-axis weighs each job by the total cluster time its tasks take to run. The hypothetical scenarios, with some combination of outliers present but not the others, do not exist in practice, so we replayed the logs in a trace-driven simulator that retains the structure of the job, the observed task durations and the probabilities of the various anomalies (details in §6.1). The figure shows that at median, the job completion time would be noticeably lower if runtime outliers did not happen, and lower by more than a third when none of the outliers happen. Recomputations impact fewer jobs than runtime outliers, but when they do, they delay completion time by a larger amount.


Figure 7: Percentage speed-up of job completion time in the ideal case when (some combination of) outliers do not occur. Curves show No Stragglers, No Recomputes and Neither; the annotated median is 34.7%.

Figure 8: The Outlier Problem: Causes and Solutions.

Task operation | Outlier cause | Solution
read input | input becomes unavailable | replicate output; pre-compute
read input | paths have different capacity | network-aware placement
execute | contention for resources | duplicate; kill and restart
execute | unequal work division | start tasks that do more first

By inducing high variability in repeat runs of the same job, outliers make it hard to meet SLAs. At median, the ratio of stdev to mean in job completion time is high, i.e., jobs have a non-trivial probability of taking twice as long or finishing half as quickly.

To summarize, we take the following lessons from our experience.
• High running times of tasks do not necessarily indicate slow execution – there are multiple reasons for legitimate variation in durations of tasks.
• Every job is guaranteed some slots, as determined by cluster policy, but can use idle slots of other jobs. Hence, judicious usage of resources while mitigating outliers has collateral benefit.
• Recomputations affect jobs disproportionately. They manifest on select faulty machines and during times of heavy resource usage. Nonetheless, there are no indications of faulty racks.

5 Mantri Design

Mantri identifies points at which tasks are unable to make progress at the normal rate and implements targeted solutions. The guiding principles that distinguish Mantri from prior outlier mitigation schemes are cause awareness and resource cognizance.

Distinct actions are required for different causes. Figure 8 specifies the actions Mantri takes for each cause. If a task straggles due to contention for resources on the machine, restarting or duplicating it elsewhere can speed it up (§5.1). However, not moving data over the low-bandwidth cross-rack links and, when unavoidable, doing so while avoiding hotspots, requires systematic placement (§5.2). To speed up tasks that wait for lost input to be recomputed, we find ways to protect task output (§5.3). Finally, for tasks with a work imbalance, we schedule the large tasks before the others to avoid being stuck with the large ones near completion (§5.4).

Figure 9: A stylized example to illustrate our main ideas. A two-slot phase with normal tasks of length t or 2t and one 5t outlier; timelines contrast the baseline (and acting without early detection) with kill-and-restart and with duplication. Tasks that are eventually killed are filled with stripes; repeat instances of a task are filled with a lighter mesh.


There is a subtle point in outlier mitigation: reducing the completion time of a task may in fact increase the job completion time. For example, replicating the output of every task will drastically reduce recomputations, since both copies are unlikely to be lost at the same time, but can slow down the job because the time and bandwidth used up for replication deny resources to other tasks that are waiting to run. Similarly, addressing outliers early in a phase vacates slots for outstanding tasks and can speed up completion, but potentially uses more resources per task. Unlike Mantri, none of the existing approaches act early or replicate output. Further, naively extending current schemes to act early without being cognizant of the cost of resources, as we show in §6, leads to worse performance.

Closed-loop action allows Mantri to act optimistically while bounding the cost when probabilistic predictions go awry. For example, even when Mantri cannot ascertain the cause of an outlier, it experimentally starts copies. If the cause does not repeatedly impact the task, the copy can finish faster. To handle the contrary case, Mantri continuously monitors running copies and kills those whose cost exceeds the benefit.

Based on task progress reports, Mantri estimates for each task the remaining time to finish, trem, and the predicted completion time of a new copy of the task, tnew. Tasks report progress once every 10s or ten times in their lifetime, whichever is smaller. We use ∆ to refer to this period. We defer details of the estimation to §5.5 and proceed to describe the algorithms for mitigating each of the main causes of outliers. All that matters is that trem be an accurate estimate and that the predicted distribution of tnew account for the underlying work that the task has to do, the appropriateness of the network location and any persistent slowness of the new machine.

5.1 Resource-aware Restart

We begin with a simple example to aid exposition. Figure 9 shows a phase that has seven tasks and two slots.


1:  let ∆ = period of progress reports
2:  let c = number of copies of a task
3:  periodically, for each running task, kill all but the fastest α copies after ∆ time has passed since begin
4:  while slots are available do
5:      if tasks are waiting for slots then
6:          kill, restart task if trem > E(tnew) + ∆, stop at γ restarts
7:          duplicate if P(trem > tnew · (c+1)/c) > δ
8:          start the waiting task that has the largest data to read
9:      else                          ⊲ all tasks have begun
10:         duplicate iff E(trem − tnew) > ρ∆
11:     end if
12: end while

Pseudocode 1: Algorithm for resource-aware restarts (simplified).

Normal tasks run for times t and 2t. One outlier has a runtime of 5t. Time increases along the x-axis. The timeline at the top shows a baseline which ignores outliers and finishes at 7t. Prior approaches that only address outliers at the end of the phase also finish at 7t.

Note that if this outlier has a large amount of data to process, letting the straggling task be is better than killing or duplicating it, both of which waste resources. If however the outlier was slowed down by its location, the second and third timelines compare duplication to a restart that kills the original copy. After a short time to identify the outlier, the scheduler can duplicate it at the next available slot (the middle timeline) or restart it in place (the bottom timeline). If prediction is accurate, restarting is strictly better. However, if slots are going idle, it may be worthwhile to duplicate rather than incur the risk of losing work by killing. Duplicating the outlier costs a total of 3t in resources (2t before the original task is killed and t for the duplicate), which may be wasteful if the outlier were to finish sooner than 3t by itself.

Restart Algorithm: Mantri uses two variants of restart: the first kills a running task and restarts it elsewhere, the second schedules a duplicate copy. In either method, Mantri restarts only when the probability of success, i.e., P(tnew < trem), is high. Since tnew accounts for the systematic differences and the expected dynamic variation, Mantri does not restart tasks that are normal (e.g., runtime proportional to work). Pseudocode 1 summarizes the algorithm. Mantri kills and restarts a task if its remaining time is so large that there is a more than even chance that a restart would finish sooner; in particular, Mantri does so when trem > E(tnew) + ∆.⁴ To not thrash on inaccurate estimates, Mantri kills a task no more than γ = 3 times. The "kill and restart" scheme drastically improves the job completion time without requiring extra slots, as we show analytically in []. However, the current job scheduler incurs a queueing delay before restarting a task that can be large and highly variant. Hence, we consider scheduling duplicates.

⁴ Since the median of the heavy-tailed task completion time distribution is smaller than the mean, this check implies that P(tnew < trem) > P(tnew < E(tnew)) ≥ 0.5.

Figure 10: Three reduce tasks (rhombus boxes) are to be placed across three racks. The rectangles indicate their input; the type of the rectangle indicates the map that produced the data. Each reduce task has to process one shard of each type. (a) The ad-hoc placement creates network bottlenecks on the cross-rack links (highlighted); tasks in such racks will straggle. (b) If the network has no other traffic, the even placement avoids hotspots.


Scheduling a duplicate results in the minimum completion time of the two copies and provides a safety net when estimates are noisy or the queueing delay is large. However, it requires an extra slot and, if allowed to run to finish, consumes extra computation resources that will increase the job completion time if outstanding tasks are prevented from starting. Hence, when there are outstanding tasks and no spare slots, we schedule a duplicate only if the total amount of computation resource consumed decreases. In particular, if c copies of the task are currently running, a duplicate is scheduled only if P(trem > tnew · (c+1)/c) > δ. By default, δ = 0.25. For example, a task with one running copy is duplicated only if tnew is less than half of trem. For stability, Mantri does not re-duplicate a task for which it launched a copy recently. Any copy that has run for some time and is slower than the second fastest copy of the task will be killed to conserve resources. Hence, there are never more than three running copies of a task.⁵ When spare slots are available, as happens towards the end of the job, Mantri schedules duplicates more aggressively, i.e., whenever the expected reduction in the job completion time exceeds the start-up time, E(trem − tnew) > ρ∆. By default, ρ = 3. Note that in all the above cases, if more than one task satisfies the necessary conditions, Mantri breaks ties in favor of the task that will benefit the most.

Mantri's restart algorithm is independent of the values for its parameters. Setting γ to a larger value and ρ, δ to smaller values trades off the risk of wasteful restarts for the reward of a larger speed-up. The default values specified here err on the side of caution.

By scheduling duplicates conservatively and pruning aggressively, Mantri has a high success rate for its restarts. As a result, it reduces completion time and conserves resources (§6.3).
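The decision logic of Pseudocode 1 can be sketched as follows (our rendering; the production module is C++, and we reduce the tnew distribution to a list of samples):

    DELTA = 10.0    # progress-report period (10s per the text)
    GAMMA = 3       # max kill-restarts per task
    DELTA_P = 0.25  # delta: required success probability for a duplicate
    RHO = 3         # slack factor used when spare slots exist

    def decide(t_rem, t_new_samples, copies, restarts, tasks_waiting):
        # Returns 'kill-restart', 'duplicate' or None for one running task;
        # t_rem is the estimated remaining time, t_new_samples are samples
        # from the predicted runtime distribution of a fresh copy.
        e_new = sum(t_new_samples) / len(t_new_samples)
        if tasks_waiting:
            # Kill and restart only when a fresh copy very likely wins.
            if t_rem > e_new + DELTA and restarts < GAMMA:
                return "kill-restart"
            # Duplicate only if total resource usage is expected to drop:
            # P(t_rem > t_new * (c+1)/c) > delta.
            hits = [t_rem > t * (copies + 1) / copies for t in t_new_samples]
            if sum(hits) / len(hits) > DELTA_P:
                return "duplicate"
        elif t_rem - e_new > RHO * DELTA:
            # Spare slots: duplicate whenever the expected reduction in
            # completion time exceeds the start-up cost.
            return "duplicate"
        return None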

⁵ The two fastest copies and the copy that has recently started.


5.2 Network-Aware Placement

Reduce tasks, as noted before (§4.2), have to read data across racks. A rack with too many reduce tasks is congested on its downlink and such tasks will straggle. Figure 10 illustrates such a scenario.

Given the utilization of all the network links and the locations of inputs for all the tasks (and jobs) that are waiting to run, optimally placing the tasks to minimize job completion time is a form of the centralized traffic engineering problem [, ]. However, achieving up-to-date information on network state and centralized coordination across all jobs in the cluster are both challenging. Instead, Mantri approximates the optimal placement with a local algorithm that neither tracks bandwidth changes nor requires coordination across jobs.

With Mantri, each job manager places tasks so as to minimize the load on the network and avoid self-interference among its tasks. If every job manager takes this independent action, network hotspots will not cause outliers. Note that the sizes of the map outputs in each rack are known to the job manager prior to placing the tasks of the subsequent reduce phase. For a reduce phase with n tasks running on a cluster with r racks, let its input matrix I(n, r) specify the size of input in each rack for each of the tasks.⁶ For any placement of reduce tasks to racks, let the data to be moved out (on the uplink) and read in (on the downlink) of the i-th rack be d_u^i and d_v^i, and the corresponding available bandwidths be b_u^i and b_d^i respectively. For each rack, we compute two terms: c_{2i-1} = d_u^i / b_u^i and c_{2i} = d_v^i / b_d^i. The first term is the ratio of outgoing traffic to available uplink bandwidth, and the second term is the ratio of incoming traffic to available downlink bandwidth. The algorithm computes the optimal value over all placement permutations, i.e., the rack location for each task that minimizes the maximum data transfer time: arg min max_j c_j, j = 1, ..., 2r.

Rather than track the available bandwidths b_u^i and b_d^i as they change with time and as a function of other jobs in the cluster, Mantri uses estimates. Reduce phases with a small amount of data finish quickly, and the bandwidths can be assumed to be constant throughout the execution of the phase. For phases with a large amount of data, the bandwidth averaged over their long lifetime can be assumed to be equal for all links. We see that with these estimates Mantri's placement comes close to the ideal in our experiments (see §6.2).
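For intuition, the arg min max_j c_j objective can be evaluated by brute force when n and r are small (a sketch; the real system approximates rather than enumerates, and the bandwidths are the estimates above):

    from itertools import product

    def best_placement(I, b_up, b_down):
        # I[t][r] = bytes of reduce task t's input residing in rack r.
        # Pick a rack per task minimizing the max per-rack transfer time.
        n, r = len(I), len(I[0])
        best, best_cost = None, float("inf")
        for assign in product(range(r), repeat=n):
            out = [0.0] * r   # d_u: data each rack must send out
            inn = [0.0] * r   # d_v: data each rack must pull in
            for t, rack in enumerate(assign):
                for src in range(r):
                    if src != rack:
                        out[src] += I[t][src]
                        inn[rack] += I[t][src]
            cost = max(max(o / b for o, b in zip(out, b_up)),
                       max(i / b for i, b in zip(inn, b_down)))
            if cost < best_cost:
                best, best_cost = assign, cost
        return best, best_cost

    # Three reduce tasks across two racks with equal bandwidths.
    I = [[4.0, 2.0], [1.0, 5.0], [3.0, 3.0]]
    print(best_placement(I, b_up=[1.0, 1.0], b_down=[1.0, 1.0]))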

For phases other than reduce, Mantri complements the Cosmos policy of placing a task close to its data []. By accounting for the cost of moving data over low-bandwidth links in tnew, Mantri ensures that no copy is started at a location where it has little chance of finishing earlier, thereby not wasting resources.

⁶ In I, the row sum indicates the data to be read by the task, whereas the column sum indicates the total input present in that rack.

Figure 11: Avoiding costly recomputations. (a) The cost to redo a task includes the recursive probability of predecessor tasks having to be redone, e.g., t_redo^2 = r2 · (t2 + t_redo^1) and t_redo^0 = r0 · t0. (b) Replicating output reduces the effective probability of loss: r1 = r1_replica · r1_orig. (c) Tasks with many-to-one input patterns have high recomputation cost, t_redo^1 = max(t_redo^11, ..., t_redo^1n), and are more valuable. (d) Decision process: replicate when t_redo exceeds t_rep and the replication budget permits.


5.3 Avoiding Recomputation

To mitigate costly recomputations that stall a job, Mantri protects against interim data loss by replicating task output. It acts early by replicating those outputs whose cost to recompute exceeds the cost to replicate. Mantri estimates the cost to recompute as the product of the probability that the output will be lost and the time to repeat the task. The probability of loss is estimated for a machine over a long period of time. The time to repeat the task is t_redo, with a recursive adjustment that accounts for the task's inputs also being lost. Figure 11 illustrates the calculation of t_redo based on the data loss probabilities (the r_i's) and the times taken by the tasks (the t_i's), looking recursively at prior phases. Replicating the output reduces the likelihood of recomputation to the case when all replicas are unavailable. If a task reads input from many tasks (e.g., a reduce), t_redo is higher since any of the inputs needing to be recomputed will stall the task's recomputation.⁷ The cost to replicate, t_rep, is the time to move the data to another machine in the rack.

In effect, the algorithm replicates tasks at key places in a job's workflow – when the cumulative cost of not replicating many successive tasks builds up, when tasks ran on very flaky machines (high r_i), or when the output is so small that replicating it would cost little (low t_rep).
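A minimal sketch of the estimate (our rendering of Figure 11; the Task fields are illustrative):

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        t: float        # time the task takes to run
        r: float        # probability its machine loses the output
                        # (r_replica * r_orig once the output is replicated)
        inputs: list = field(default_factory=list)  # predecessor tasks

    def t_redo(task):
        # Expected time to regenerate lost output; if inputs are also
        # lost, the slowest input's regeneration stalls the recompute.
        stall = max((t_redo(p) for p in task.inputs), default=0.0)
        return task.r * (task.t + stall)

    def should_replicate(task, t_rep, tokens_available):
        # Replicate when recomputing costs more than replicating and the
        # replication budget (tokens) permits, as in Figure 11(d).
        return t_redo(task) > t_rep and tokens_available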

Further, to avoid excessive replication, Mantri limits the amount of data replicated to 10% of the data processed by the job. This limit is implemented by granting tokens proportional to the amount of data processed by each task. Task output that satisfies the above cost-benefit check is replicated only if an equal number of tokens are available; tokens are deducted on replication.

⁷ In Figure 11(c), we assume that if multiple inputs are lost, they are recomputed in parallel and the task is stalled by the longest input. Since recomputes are rare (Figure 2(a)), this is a fair approximation of practice.



Mantri proactively recomputes tasks whose output and replicas, if any, have been lost. From §4.2, we see that recomputations on a machine cluster by time; hence Mantri considers a recompute to be the onset of a temporal problem which will cause future requests for data on this machine to fail, and it pre-computes such output. Doing so decreases the time that a dependent task will have to wait for lost input to be regenerated. As before, Mantri imposes a budget on the extra cluster cycles used for pre-computation. Together, probabilistic replication and pre-computation approximate the ideal scheme in our evaluation (§6).

5.4 Data-aware Task Ordering

Workload imbalance causes tasks to straggle. Mantri does not restart outliers that take a long time to run because they have more work to do. Instead, Mantri improves job completion time by scheduling tasks in a phase in descending order of their input size. Given n tasks, s slots and input sizes d[1..n], if the optimal completion time is T_O, scheduling tasks in descending order of their input sizes takes time T, where T/T_O ≤ 4/3 − 1/(3s) []. This means that scheduling tasks with the longest processing time first is at most 33% worse than the optimal schedule; computing the optimal is NP-hard [].
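A sketch of this ordering and the resulting makespan (ours; input size stands in for processing time, as in the text):

    import heapq

    def lpt_makespan(input_sizes, slots):
        # Longest-processing-time-first: start tasks in descending order
        # of size, always on the least-loaded slot; the makespan is within
        # 4/3 - 1/(3s) of optimal.
        loads = [0.0] * slots
        for d in sorted(input_sizes, reverse=True):
            heapq.heappush(loads, heapq.heappop(loads) + d)
        return max(loads)

    print(lpt_makespan([5, 3, 3, 2, 2, 1], slots=2))  # 8.0 (optimal here)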

5.5 Estimation of trem and tnew

Periodically, every running task informs the job scheduler of its status, including how many bytes it has read, dread, thus far. Mantri combines the progress reports with the size of the input data that each task has to process, d, and predicts how much longer the task would take to finish using this model:

    trem = telapsed × d / dread + twrapup.    (2)

The first term captures the remaining time to process data. The second term is the time to compute after all the input has been read and is estimated from the behavior of earlier tasks in the phase. Tasks may speed up or slow down and hence, rather than extrapolating from each progress report, Mantri uses a moving average. To be robust against lost progress reports, when a task hasn't reported for a while, Mantri increases trem by assuming that the task has not progressed since its last report. This linear model for estimating the remaining time of a task is well suited for data-intensive computations like Map-Reduce, where a task spends most of its time reading the input data. We seldom see variance in computation time among tasks that read equal amounts of data [].

Mantri estimates tnew, the distribution over time that a new copy of the task will take to run, as follows:

    tnew = processRate × locationFactor × d + schedLag.    (3)

The first term is a distribution of the process rate, i.e., ∆time/∆data, of all the tasks in this phase. The second term is a relative factor that accounts for whether the candidate machine for running this task is persistently slower (or faster) than other machines, or has smaller (or larger) capacity on the network path to where the task's inputs are located. The third term, as before, is the amount of data the task has to process. The last term is the average delay between a task being scheduled and when it gets to run. We show in §6.2 that these estimates of trem and tnew are sufficiently accurate for Mantri's functioning.
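The two estimators can be sketched as follows (our code; representing tnew by one sample per observed process rate is an assumption):

    def estimate_t_rem(t_elapsed, d, d_read, t_wrapup):
        # Eq. (2): linear in the data still to be read.
        return t_elapsed * d / max(d_read, 1.0) + t_wrapup

    def estimate_t_new(phase_rates, location_factor, d, sched_lag):
        # Eq. (3): one predicted runtime per observed process rate
        # (delta-time / delta-data) seen in this phase.
        return [rate * location_factor * d + sched_lag
                for rate in phase_rates]

    # A task that has read 200 of 800 MB in 60s, with 10s of wrap-up:
    print(estimate_t_rem(60.0, 800.0, 200.0, 10.0))  # 250.0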

6 Evaluation

We deployed and evaluated Mantri on Bing's production cluster consisting of thousands of servers. Mantri has been running as the outlier mitigation module for all the jobs in Bing's clusters since May 2010. To compare against a wider set of alternate techniques, we built a trace-driven simulator that replays logs from production.

6.1 Setup

Clusters: The production cluster consists of thousands of server-class multi-core machines with tens of GBs of RAM, spread with tens of servers to a rack. This cluster is used by Bing product groups. The data we analyzed earlier is from this cluster, so the observations from §4 hold here.

Workload: Mantri is the default outlier mitigation solution for the production cluster. The jobs submitted to this cluster are independent of us, enabling us to evaluate Mantri's performance in a live cluster across a variety of production jobs. We compare Mantri's performance on all jobs in the month of June 2010 with prior runs of the same jobs in April-May 2010 that ran with the earlier build of Cosmos.

In addition, we also evaluate Mantri on four hand-picked applications that represent common building blocks. Word Count calculates the number of unique words in the input. Table Join inner-joins two tables, each with three columns of data, on one of the columns. Group By counts the number of occurrences of each word in the file. Finally, Grep searches for string patterns in the input. We vary input sizes from GBs to hundreds of GBs.

Prototype: Mantri builds on the Cosmos job scheduler and is implemented in C++. To compute trem, Mantri maintains an execution record for each of the running tasks that is updated when the task reports progress.


A phase-wide data structure stores the necessary statistics to compute tnew. When slots become available, Mantri runs Pseudocode 1 and restarts or duplicates the task that would benefit the most, or starts new tasks in descending order of data size. To place tasks appropriately, Mantri builds on the per-task affinity list, a preferred set of machines and racks that the task can run on. At run-time the job manager attempts to place the task at its preferred locations in random order, and when none of them are available, runs the task at the first available slot. The affinity list for map tasks has machines that have replicas of the input blocks. For reduce tasks, to obtain the desired proportional spread across racks (see §5.2), we populate the affinity list with a proportional number of machines in those racks.

Trace-driven Simulator: The simulator replays the logs shown in Table 1. For each phase, it faithfully repeats the observed distributions of task completion times, data read by each task, size and location of inputs, probability of failures and recomputations, and fairness-based evictions. Restarted tasks have their execution times and failure probabilities sampled from the same distribution as tasks in their phase. The simulator also mimics the job workflow, including semantics like barriers before phases, the permissible concurrent slots per phase and the input/output relationships between phases. It mimics cluster characteristics like machine failures, network congestion and availability of computation slots. For the network, it uses a fluid model rather than simulating individual packets. Doing the latter, at petabyte scale, is out of scope for this work.

Compared Schemes: Our results on the production cluster use the current Dryad implementation as the baseline (§6.2). It contains state-of-the-art outlier mitigation strategies and runs thousands of jobs daily.

Our simulator performs a wider and more detailed comparison. It compares Mantri with the outlier mitigation strategies in Hadoop [], Dryad [], Map-Reduce [], LATE [], and a modified form of LATE that acts on stragglers early in the phase. As the current Dryad build already has modules for straggler mitigation, we compare all of these schemes to a baseline that does not mitigate any stragglers (§6.3). On the other hand, since these schemes do not do network-aware placement or recompute mitigation, we use the current Dryad implementation itself as their baseline (§6.4 and §6.5).

We also compare Mantri against some ideal benchmarks. NoSkew mimics the case when all tasks in a phase take the same amount of time, set to the average over the observed task durations. NoSkew+ChopTail goes even further: it removes the worst quartile of the observed durations, and sets every task to the average of the remaining durations. IdealReduce assumes perfect, up-to-date knowledge of available bandwidths and places reduce tasks accordingly. IdealRecompute uses future knowledge of which tasks will have their inputs recomputed and protects those inputs.

Figure 12: Evaluation of Mantri as the default build for all jobs on the production cluster for twenty-five days. (a) CDF, weighted by job time, of the % reduction in completion time (annotated median 32.1%). (b) CDF of the % reduction in job resources.


Metrics: As our primary metrics, we use the reduction in completion time and resource usage⁸, where

    Reduction = (Current − Modified) / Current.    (4)
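In code, the metric is simply (a trivial helper of ours):

    def reduction(current, modified):
        # Eq. (4): positive values are improvements, negative regressions.
        return (current - modified) / current

    print(reduction(100.0, 68.0))  # 0.32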

Summary: Our results are summarized as follows:
• In live deployment in the production cluster, Mantri sped up the median job by 32.1%. Most of the jobs experienced a net reduction in resources used. Further, Mantri's network-aware placement reduced the completion times of typical reduce phases substantially.
• Simulations driven from production logs show that Mantri's restart strategy reduces the completion time of phases by 21% (and 42%) at the 50th (and 75th) percentile. Here, Mantri's reduction in completion time improves on Hadoop by 3.1x while using fewer resources than Map-Reduce, each of which are the current best on those respective metrics.
• Mantri's network-aware placement of tasks speeds up half of the reduce phases by at least 60% each.
• Mantri reduces the completion times due to recomputations for a sizeable share of the workload while consuming negligible extra resources.

6.2 Deployment Results

Jobs in the Wild: We compare one month of jobs in the Bing production cluster that ran after Mantri was turned live with runs of the same job(s) on earlier builds. We use only those recurring jobs that have roughly similar amounts of input and output across runs. Figure 12(a) plots the CDF of the improvement in completion time. The y-axis weighs each job by the total time its tasks take to run, since improvement on larger jobs adds more value to the cluster.

⁸ A reduction of 50% implies that the property in question, completion time or resources used, decreases by half. Negative values of reduction imply that the modification uses more resources or takes longer.


[Figure : Comparing Mantri's straggler mitigation with the baseline implementation on a production cluster of thousands of servers for the four representative jobs. (a) Completion Time: % reductions of 21.7, 31.0, 26.2 and 21.4 for Word Count, Table Join, Group By and Grep respectively. (b) Resource Usage: % reductions of 2.5, 13.4, 7.6 and 9.5 for the same jobs.]

            reduction in completion time (%)
              avg     min     max
    Phase      .       .       .
    Job        .       .       .

Table : Comparing Mantri's network-aware spread of tasks with the baseline implementation on a production cluster of thousands of servers.

Jobs that occupy the cluster for half the time sped up by at least 32.1%. Figure (b) shows that of jobs see a reduction in resource consumption while the others use up a few extra resources. These gains are due to Mantri's ability to detect outliers early and accurately. The success rate of Mantri's copies, i.e., the fraction of time they finish before the original copy, improves by .x over the earlier build. At the same time, Mantri expends fewer resources: it starts .x fewer copies. Further, Mantri acts early; over of its copies are started before the original task has completed of its work, as opposed to with the earlier build.

Straggler Mitigation: To cross-check the above results on standard jobs, we ran four prototypical jobs with and without Mantri twenty times each. Figure shows that job completion times improve by roughly and resource usage falls by roughly . The histograms plot the average reduction; error bars are the th and th percentiles of samples. Further, we logged all the progress reports for these jobs. We find that Mantri's predictor, based on reports from the recent past, estimates trem, the time remaining for a task, to within a . error of the actual completion time.
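A minimal sketch of such a progress-based estimate of trem follows; this is our illustration of the idea, not the deployed predictor's exact model.

    def estimate_t_rem(bytes_total, bytes_processed, elapsed_secs):
        # Estimate a task's remaining time from its recent progress
        # reports, assuming it keeps processing data at the rate
        # observed so far.
        if bytes_processed == 0:
            return float("inf")            # no progress report yet
        rate = bytes_processed / elapsed_secs
        return (bytes_total - bytes_processed) / rate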

Placement of Tasks: To evaluate Mantri's network-aware spreading of reduce tasks, we ran Group By, a job with a long-running reduce phase, ten times on the production cluster. Table shows that the reduce phase's completion time reduces by . on average, causing the job to speed up by an average of .. To understand why, we measure the spread of tasks, i.e., the ratio of the number of concurrent reduce tasks to the number of racks they ran in. High spread implies that some racks have more tasks, which interfere with each other, while other racks are idle (a one-line computation, sketched below). Mantri's spread is . compared to . for the earlier build.
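The spread metric reduces to the following computation (illustrative sketch):

    def spread(task_rack_ids):
        # Spread: number of concurrent reduce tasks divided by the number
        # of racks hosting them. Near 1 means one task per rack; higher
        # values mean some racks host several mutually interfering tasks.
        return len(task_rack_ids) / len(set(task_rack_ids))

    # e.g. spread(["r1", "r1", "r2", "r3"]) == 4 / 3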

To compare against alternative schemes and to piece apart gains from the various algorithms in Mantri, we present results from the trace-driven simulator.

[Figure : Comparing straggler mitigation strategies (Dryad, Hadoop, LATE, MapReduce, Mantri). Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes. (a) Change in Completion Time: CDF, over phase duration, of the % reduction in completion time. (b) Change in Resource Usage: CDF, over phase duration, of the % reduction in resource usage.]

Can Mantri mitigate stragglers?

Figure compares straggler mitigation strategies in their impact on completion time and resource usage. The y-axes weight phases by their lifetime, since improving the longer phases improves cluster efficiency. The figures plot the cumulative reduction in these metrics for the K phases in Table , with each repeated thrice. For this section, our common baseline is the scheduler that takes no action on outliers. Recall from §. that the simulator replays the task durations and the anomalies observed in production.

Figures (a) and (b) show that Mantri improves completion time by and at the th and th percentiles and reduces resource usage by and at these percentiles. From Figure (a), at the th percentile, Mantri sped up phases by an additional .x over the . improvement of Hadoop, the next best scheme. To achieve this smaller improvement, Hadoop uses . more resources (Fig. (b)). Map-Reduce and Dryad have no positive impact for and of the phases respectively. Up to the th percentile, Dryad increases the completion time of phases. LATE is similar in its time improvement to Hadoop but uses fewer resources.

The reason for their poor performance is that they miss outliers that happen early in the phase and, by not knowing the true causes of outliers, the duplicates they schedule are mostly not useful. Mantri and Dryad schedule . restarts per task for the average phase (. and . for LATE and Hadoop). But Mantri's restarts have a success rate of compared to the for LATE. The other schemes have lower success rates.

[Figure : Extending LATE to speculate early results in worse performance. (a) Time and (b) Resources: CDFs, over phase duration, of the % reduction in completion time and in resource usage for LATE and LATE + Early.]

[Figure : Mantri is on par with an ideal NoSkew benchmark and slightly worse than NoSkew+ChopTail (see end of §.). (a) Time and (b) Resources: CDFs, over phase duration, for NoSkew, NoSkew + ChopTail and Mantri.]

[Figure : By being network aware, Mantri speeds up the median reduce phase by 59.1% over the current placement. CDF, over phase duration, of the % reduction in completion time for Start, Equal, Mantri and IdealReduce.]

While the insight of early action on stragglers is valuable, acting on it is nonetheless non-trivial. We evaluate this in Figures (a) and (b), which present a form of LATE that is identical in all ways except that it addresses stragglers early. We see that addressing stragglers early increases completion time up to the th percentile, uses more resources, and is worse than vanilla LATE. Being resource aware is crucial to get the best out of early action (§.).

Finally, Fig. shows that Mantri is on par with the ideal benchmark that has no variation in tasks, NoSkew, and is slightly worse than the variant that removes all durations in the top quartile, NoSkew+ChopTail. The reason is that Mantri's ability to substitute long-running tasks with their faster copies makes up for its inability to act with perfect future knowledge of which tasks straggle.

Does Mantri improve placement?

Figure plots the reduction in completion time due to Mantri's placement of reduce tasks as a CDF over all reduce phases in the dataset in Table . As before, the y-axis weights phases by their lifetime. The figure shows that Mantri provides a median speed-up of 59.1%, or a .x improvement over the current implementation.

The figure also compares Mantri against strategies that estimate available bandwidths differently. The IdealReduce strategy perfectly tracks the changes in available bandwidth of links due to the other jobs in the cluster. The Equal strategy assumes that the available bandwidths are equal across all links, whereas Start assumes that the available bandwidths are the same as at the start of the phase. We see a partial order between Start and Equal (the two solid lines). Short phases are impacted by transient differences in the available bandwidths, and Start is a good choice for these phases. However, these differences even out over the lifetime of long phases, for which Equal works better. Mantri is a hybrid of Start and Equal. It achieves a good approximation of IdealReduce without re-sampling available bandwidths; one plausible construction is sketched below.
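One way to realize such a hybrid is to blend the two estimates by expected phase length. The sketch below is our illustration under that assumption; the blend weight, the horizon parameter and the function names are ours, since the paper does not spell out Mantri's exact formula.

    def rack_task_fractions(start_bw, expected_phase_secs, horizon_secs=60.0):
        # Place a reduce phase's tasks across racks in proportion to
        # estimated available bandwidth. Short phases lean on the
        # start-of-phase measurement; long phases lean on equal shares,
        # since transient congestion evens out over their lifetime.
        n = len(start_bw)
        w = min(1.0, horizon_secs / expected_phase_secs)
        total = sum(start_bw)
        return [w * (b / total) + (1.0 - w) / n for b in start_bw]

    # Short phase: follows measured bandwidths; long phase: near-equal split.
    # e.g. rack_task_fractions([10.0, 1.0], expected_phase_secs=30.0)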

To capture how Mantri's placement differs from Dryad, Figure plots the ratio of the throughput obtained by the median task in each reduce phase to that obtained by the slowest task. With Mantri, this ratio is . at median and never larger than . In contrast, with Dryad's policy of placing tasks at the first available slot, this ratio is . (or .) at the th (or th) percentile. Note that duplicating tasks that are delayed due to network congestion, without considering the available bandwidths or where other tasks are located, would be wasteful.

Does Mantri help with recomputations?

The best possible protection against loss of output would (a) eliminate all the increase in job completion time due to tasks waiting for their inputs to be recomputed and (b) do so with little additional cost. Mantri approximates both goals. Fig. shows that Mantri achieves parity with IdealRecompute. Recall that IdealRecompute has perfect future knowledge of loss. The improvement in job completion time is () at the th (th) percentile.

The reason is that Mantri's policy of selective replication is both accurate and biased towards the more expensive recomputations. The probability that replicated task output will be used because the original data becomes unavailable is . Similarly, the probability that a pre-computation becomes useful is , which increases to if pre-computations are triggered only when two recomputations happen at a machine in quick succession. Figure shows the complementary contributions from replication and pre-computation: each contributes roughly and to the total. Cumulatively, the figure shows that Mantri eliminates of recomputations for the median job. We note that Mantri ignores of the recomputations in the bottom quartile of jobs, since their impact on job completion time is small.
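A minimal sketch of the two decisions described above follows: the cost-benefit test for replication and the "two recomputations in quick succession" trigger for pre-computation. The window length and all names are our illustrative assumptions.

    def should_replicate(p_loss, t_recompute, t_replicate):
        # Replicate a task's output when the expected cost of recomputing
        # it (probability of loss times recomputation time, including the
        # recursive cost of recomputing its inputs) exceeds the certain
        # cost of copying it elsewhere.
        return p_loss * t_recompute > t_replicate

    def should_precompute(recompute_timestamps, now, window_secs=300.0):
        # Trigger pre-computation when a machine has seen two
        # recomputations in quick succession; the window length is an
        # assumed parameter.
        recent = [t for t in recompute_timestamps if now - t <= window_secs]
        return len(recent) >= 2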

[Figure : Unlike Dryad, Mantri's placement provides more consistent throughput to tasks in reduce phases. CDF, over phase duration, of the ratio of median to slowest read throughputs of tasks, for Dryad and Mantri.]

[Figure : By probabilistically replicating task output and recomputing lost data before it is needed, Mantri speeds up jobs by an amount equal to the ideal case of no data loss. CDF, over job duration, of the % reduction in completion time for Mantri and IdealRecompute.]

[Figure : Fraction of recomputations that are eliminated by Mantri's recomputation mitigation strategy, along with individual contributions from replication and pre-computation. CDF, over job duration, of the % of recomputes eliminated, with 25%, 53% and 78% marked on the curves.]

Fig. (a) shows that the extra network traffic due to replication is (overall negligible and) comparable to IdealRecompute. Mantri sometimes replicates more data than the ideal, and at other times misses some tasks that should be replicated. Fig. (b) shows that pre-computations take only a few percentage points of extra resources.

[Figure : The cost to protect against recomputes is less than a few percentage points in both the extra traffic on the network and the cluster time for pre-computation. (a) Cost: Network Traffic, CDF, over total traffic, of the % increase in traffic for Replication and IdealRecompute. (b) Cost: Cluster Time, CDF, over cluster resources, of the % increase in cluster resources.]

Related Work

Much recent work focuses on large-scale data-parallel computing. Following on the Map-Reduce [] paper, there has been work in improving workflows [, ], language design [, ], and fair schedulers []. Our work here takes the next step of understanding how such production clusters behave and how they can be improved.


Run-time stragglers have been identified by past work [, ]. However, we are the first to characterize the prevalence of stragglers in production and their causes. By understanding the causes, addressing stragglers early, and scheduling duplicates only when there is a fair chance that the speculation saves both time and resources, our approach provides a greater reduction in job completion time while using fewer resources than prior strategies that duplicate tasks towards the end of a phase. Also, we uniquely avoid network hotspots and protect against loss of task output, two further causes of outliers.

By only acting at the end of a phase, current schemes [, , ] miss early outliers. They vary in the choice of which tasks to duplicate. After a threshold number of tasks have finished, Map-Reduce [] duplicates all the tasks that remain. Dryad [] duplicates those that have been running for longer than the th percentile of task durations. After all tasks have started, Hadoop [] uses slots that free up to duplicate any task that has read less data than the others, while LATE [] duplicates only those reading at a slow rate.

Though some recent proposals do away with capacity over-subscription in data centers [, ], today's networks remain over-subscribed, albeit at smaller levels. It is common to place tasks near their input (same machine, rack, etc.) for map and at the first free slot for reduce [, , ]. Our approach to eliminate outliers by network-aware placement is orthogonal to recent work that packs tasks requiring different resources onto a machine [], or trades off fairness with efficiency []. Quincy accounts for capacity but not for runtime variations in bandwidth due to competition from other tasks.

ISS [] protects intermediate data by replicating locally-consumed data. In particular, this does not include map output, since Hadoop transfers map output to reduce tasks as it is produced. ISS's replication strategy runs the risk of being both wasteful (when very few machines are error-prone) and insufficient (when the transfer of map output fails). In contrast, Mantri presents a broader solution that (a) replicates task output based on the probability of data loss and the recursive cost of recomputing inputs and (b) pre-computes lost data.

The Map-Reduce paradigm is similar to parallel databases in its goal of analyzing large data [] and to dedicated HPC clusters and parallel programs [] in presenting similar optimization opportunities. In the context of multiple processors, studies have been done on the classic problem of dynamic task scheduling [, ] as well as task duplication []. Star-MPI [] adapts parameters like network topology between a set of communicating processors by observing performance over time. Prior work has also focused on modeling and optimizing the communication in parallel programs [, , ] that have one-to-all or all-to-all traffic, i.e., where every receiver processes all of the output of tasks in earlier stages. In the context of the many-to-many traffic typical of Map-Reduce, we present practical techniques for bandwidth estimation and task placement that realize near-optimal performance.

Conclusion

Mantri delivers effective mitigation of outliers in Map-Reduce networks. It is motivated by, what we believe is, the first study of a large production Map-Reduce cluster. The root of Mantri's advantage lies in integrating static knowledge of job structure and dynamically available progress reports into a unified framework that identifies outliers early, applies cause-specific mitigation and does so only if the benefit is higher than the cost. In our implementation on a cluster of thousands of servers, we find Mantri to be highly effective.

Outliers are an inevitable side-effect of parallelizing work. They hurt Map-Reduce networks more due to the structure of jobs as graphs of dependent phases that pass data from one to the other. Their many causes reflect the interplay between the network, storage and computation in Map-Reduce. Current systems shirk this complexity and assume that a duplicate would speed things up. Mantri embraces it to mitigate a broad set of outliers.

Acknowledgments– For feedback on drafts, we thank members of the RAD lab, the Cosmos product group and the OSDI reviewers. Alexei Polkhanov and Juhan Lee were invaluable in taking Mantri to production clusters.

References

[] Hadoop distributed filesystem. http://hadoop.apache.org.
[] A. Faraj, X. Yuan, and D. Lowenthal. STAR-MPI: Self Tuned Adaptive Routines for MPI Collective Operations. In SC.
[] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM.
[] I. Ahmad and M. K. Dhodhi. Semi-distributed load balancing for massively parallel multicomputer systems. IEEE TSE.
[] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, and Y. Lu. Reining in the outliers in map-reduce clusters. Technical Report MSR-TR-, Microsoft Research.
[] B. Ucar, C. Aykanat, K. Kaya, and M. Ikinci. Task assignment in heterogeneous computing systems. JPDC.
[] L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. An analysis of data corruption in the storage stack. In FAST.
[] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Datasets. In VLDB.
[] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In NSDI.
[] D. Culler et al. LogP: Towards a Realistic Model of Parallel Computation. In SIGPLAN PPoPP.
[] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI.
[] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics.
[] M. Isard et al. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys.
[] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the Tightrope: Responsive Yet Stable Traffic Engineering. In SIGCOMM.
[] S. Ko, I. Hoque, B. Cho, and I. Gupta. Making cloud intermediate data fault-tolerant. In SoCC.
[] A. Krishnamurthy and K. Yelick. Analysis and optimizations for shared address space programs. JPDC.
[] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM.
[] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In SOSP.
[] M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. JPDC.
[] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI.
[] P. Patarasuk, A. Faraj, and X. Yuan. Pipelined Broadcast on Ethernet Switched Clusters. In IEEE IPDPS.
[] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. R. Madden, and M. Stonebraker. A comparison of approaches to large scale data analysis. In SIGMOD.
[] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Datacenter Traffic: Measurements and Analysis. In IMC.
[] S. Manoharan. Effect of task duplication on assignment of dependency graphs. Parallel Computing.
[] T. Sandholm and K. Lai. MapReduce optimization using regulated dynamic prioritization. In SIGMETRICS.
[] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions. In SoCC.
[] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A System for General-Purpose Data-Parallel Computing Using a High-Level Language. In OSDI.
[] Y. Yu, P. K. Gunda, and M. Isard. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations. In SOSP.