25
1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et Technique CERIST Laboratoire d’Informatique Fondamentale de Lille A.Bendjoudi [email protected] N. Melab and E-G. Talbi {melab,talbi}@lifl.fr

1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

Embed Size (px)

Citation preview

Page 1: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

1

Fault-Tolerant Mechanism for Hierarchical Branch and Bound

Algorithm

Université A/Mira de Béjaïa

CEntre de Recherche sur l’Information

Scientifique et Technique CERIST

Laboratoire d’InformatiqueFondamentale de Lille

[email protected]

N. Melab and E-G. Talbi{melab,talbi}@lifl.fr

Page 2: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

2

Outline

Motivations FTH-B&B: Fault Tolerant hierarchical

Branch and Bound algorithm Performance Evaluation Conclusion and Perspectives

Page 3: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

3

Motivations Combinatorial Optimization Problems are Np-Hard and CPU-

Intensive Their exact resolution requires a huge amount of computing

resources Branch and Bound B&B algorithms are the best known methods

Implicit enumeration of the search space They use a pruning technique to reduce the search space

however they remain not sufficient for very large instances High performance computing (Computational Grids)

Computational Grids offer large amount of computing resources Issue: Scalability Traditional Master/Worker-based B&B (MW-B&B) algorithms are

inefficient Master process becomes rapidly bottleneck

Solution: Hierarchical Master/Worker-based B&B (HMW-B&B)8

Page 4: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

4

Branch and Bound Algorithm

Branching Original problem is partitioned into smaller sub problems.

Bounding Compute the estimated optimal solution (lower bound) of the considered problem and compare to the best known solution (upper bound).

Elimination Identify nodes which don’t lead to The best solution and eliminate them

Selection Exploration strategy(best first search, depth first search, best bound,…)

Problem to be solved

Sub-problem

Solutions

P0

8Optimal solution

Branching

Elimination

Selection

Page 5: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

5

Motivations

Resources of computational grids are highly unreliable and volatile Issue: Fault-tolerance

Fault tolerance at middleware level: ProActive, XtremWeb, Condor, …

They are costly in terms of execution time

Solution: Application level fault-tolerant algorithm Fault detection Task recovery Minimization of Redundant work Maintenance of the hierarchy

Page 6: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

6

Outline

Motivations FTH-B&B: Fault Tolerant Hierarchical Branch and Bound

algorithm Architecture Work management with task recovery Maintenance of the hierarchy Distributed checkpointing

Performance Evaluation Conclusion et Perspectives

Page 7: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

7

Architecture Based on HMW paradigm dealing with Scalability & Fault

tolerance Several FT MW-B&Bs organized hierarchically into groups Each FT MW-B&B is composed of:

Single Master (on the top and inner nodes of the hierarchy): parallel recursive branching

Workers (Leaves): parallel exploration or

Masters (inner nodes): parallel branching Each master owns a single work pool

Multiple work pools

Collegial strategy

Workers

Masters

Fault Tolerant M/W-based B&Bs

Multiple work pools

Group of mastersand workers

Page 8: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

8

Fault detection

A master heartbeats its workers Workers heartbeat their master

Master

Heartbeats

Workers

Page 9: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

9

Work management & task recovery

Each master manages: A list of assigned sub-problems

(mapping between the assigned sub-problem and the child process assigned to it)

A list of branched sub-problems LBS (sub-problems being explored by its children)

When a failure is detected only the unexplored sub-problems of the lost root problem in the LBS are rescheduled to another safe process

R

AB

PA PB PC PD

PA

Work pool

List of branched sub-problems LBS

PA1PA2

Page 10: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

10

3-phase communication mechanism

Phase 1 (between a master and its children): A master assigns a problem to its children It receives back the branched sub-problems

Phase 2 (between a master, its children, and its parent):

A master updates explored sub-problems in the LBS each time a child finishes the exploration of its sub-problem.

the master knows at any time the unexplored parts of a given problem.

Phase 3 (between a master and a new free process):

When a process fails The parent of the failed process detects its failure and saves the unexplored part of its sub-problem

When a new safe process connects, the parent reschedules it the unexplored part of the sub-problem

R

AB

PA PB PC PD

PA

PA1PA2

Connection of a new worker

Page 11: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

11

Maintenance of the hierarchy

Motivations Failures can cause a loss of data and/or computing power Master’s failures can isolate some parts of processes from

the rest of the hierarchy and cause orphan branches Loss of processes can cause unsafe execution of the

algorithm (non termination, redundant work, ...) Failures can unbalance the hierarchy if only the parent of

the failed processes is involved in the fault recovery process

New bottlenecks can be created on parts of the hierarchy Proposed techniques

Simple connection to Ascendants SCA Master Election ME Balanced Hierarchy BH

Page 12: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

12

Simple connection to Ascendants (SCA)

Children of the failed master connect to the closest safe ascendant

Each process holds the list of its ascendants

No process reassignment When a process receives a

connection from an orphan process, it considers it as a new child

Drawback: Risk of creation of new bottlenecks If f children fail, the safe master

receives gxf connections from its grandchildren

If l levels of masters fail, it receives gl+1 connections

Convergence to MW-B&B

Page 13: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

13

Master Election (ME)

Orphan processes elect a new master among them using the bully algorithm

The new selected master considers the other electors as its children

It connects to its closest safe ascendant using SCA

It is informed by its new parent about its neighbors

Page 14: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

14

Balanced Hierarchy (BH)

Orphan processes migrate to a safe sub-B&B

The migration strategy respects the maximum group size threshold

Assigns orphan processes to a non-full sub-B&B

A master receiving orphan processes:

Holds them only if the group it manages is non-full

Otherwise, it dispatches them among its children

Avoids the convergence to a MW

Page 15: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

15

Distributed Checkpointing

Several levels of checkpointing Each sub-MW-B&B performs distributed checkpointing

independently There is no global backup file and each master handles its own local

backup file A master saves unexplored sub-problems and updates its file each

time a new sub-problem is explored.

Clusters

Backup FilesWorkers

Masters

A new considered master checks the buck-up file

Reconstitutes the unexplored sub-problems

Carries on the last consistent global state.

Page 16: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

16

Performance Evaluation (Objectives)

Evaluate the ability of FTH-B&B to deal with Fault Tolerance issue

Measure the efficiency of FTH-B&B in terms of execution time

Measure the benefit of task recovery on FTH-B&B in terms of minimizing the redundant work

Measure the impact of failures on the hierarchy of FTH-B&B using (SCA, ME, and BH)

Page 17: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

17

Performance Evaluation (Environment)

FTH-B&B has been experimented on the Flowshop Scheduling Problem FSP (Taillard instances)

Application developed on top of the ProActive middleware

GRID’5000 : https://www.grid5000.fr Between 1900 and 8900 processes have been deployed To obtain deeper hierarchy, several processes have been

deployed on a same host and the size of the sub-B&Bs is fixed to 10

Failures are obtained killing processes chosen uniformly during their lifetime

Page 18: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

18

N jobs to be scheduled on M machines Each machine can not be simultaneously assigned to two

jobs (colors) Jobs (colors) must be scheduled in the same order on all

machines One objective must be minimized

Cmax: Makespan (Total completion time)

M1

M2

M3

The Flow-Shop Scheduling Problem

4 jobs on 3 machines

Page 19: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

19

Efficiency of FTH-B&B

Efficiency = ratio between the exploration time and the total execution time including Idle times.

Idle time = communication time + 3-phase time + management time

3-phase mechanism is not costly in terms of execution time 3-phase consumes between 0,20% – 0,40% of the total execution time

The workers of FTH-B&B spend on average 98,92% of their time solving sub-problems

Page 20: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

20

Minimizing Redundant work

The masters of FTH-B&B reschedule less sub-problems when using 3-phase mechanism less redundant work

On average when using the 3-phase mechanism the algorithm is 6,36 times more efficient.

Page 21: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

21

Maintenance of the hierarchy using SCA

The figure presents the four most loaded masters in the hierarchy

Degree of a process = number of children

x-axis = Time y-axis = degree of masters Degrees increase with the number of failed processes The root-master process rapidly becomes a bottleneck (passes

from 10 children at the beginning of the execution to 1368 children at the end)

Convergence to MW paradigm

Page 22: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

22

Maintenance of the hierarchy using ME

The masters are less loaded than compared with SCA When a new master is designated, it receives the highest identifier

it will not be designated in the next election session. Nevertheless, the average number of children has doubled after the

failure of 25% of processes.

Page 23: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

23

Maintenance of the hierarchy using BH

The figure shows the average degree masters in each level of the hierarchy

Up to 8 masters are killed every 1 second

The whole masters resist to the failures and do not become bottlenecks

The root masters maintains the same degree during all the lifetime

The degrees decreases softly for the 3 first levels and more rapidly for the last level caused by the new converted masters that have only one worker as a children

Page 24: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

24

Conclusion FTH-B&B: Application level Fault tolerant algorithm Several FT-MW-based B&Bs are launched hierarchically 3-phase communication mechanism (between a master, its

children, and its grand-children) Distribute, store, and recover work units in case of failures Minimizing redundant work Efficiency = 98% 6,36 times more efficient than without using 3-Phase mechanism

3 strategies to maintain the hierarchy SCA, ME, and BH BH has proved its efficiency in terms of balancing the hierarchy

Distributed checkpointing mechanism Several checkpointing levels

Page 25: 1 Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et

25

Perspectives

Take into account best effort exploit the maximum number of computing resources

Improve the realism of FTH-B&B use a realistic model of failures based on real statistics of grids