13 May 2014: Supervision Workshop (Atelier Supervision)

Morning: outgoing messages
• IPSL: general introduction: supervision of the computing and post-processing chain at TGCC and IDRIS
• IPSL: needs and draft solutions for sending information messages from the computing centres to IPSL
• IDRIS: existing tools and constraints
• TGCC: existing tools and constraints
• Discussion
• Action plan

Afternoon: incoming jobs
• IPSL: needs for relaunching computing and/or post-processing jobs from IPSL
• IDRIS: job submission: which solutions? which APIs? which constraints?
• TGCC: job submission: which solutions? which APIs? which constraints?
• Discussion
• Action plan


T0 : management

T1 : platform

T2 : towards a high-resolution coupled model

T3 : runtime environments

T4 : Big Data management and analytics of Climate Simulations

T5 : CliMAF: a framework for climate models evaluation and analysis

• ensemble of tools
• different configurations
• different resolution
• set of simulations
• set of diagnostics
• assessment

• Improving coupled model parallelism in terms of computing and memory
• Managing efficiently input and restart files
• Integrating parallel interpolation mechanisms in XIOS
• Parallel component coupling

• Process assignment
• Optimization, load balancing
• Climate Simulations Supervision

• XIOS implemented within project models
• XIOS, a bridge towards standardisation
• Data and metadata services
• Big Data Analytics

• General driver and upstream user interface
• Services layer
• Visualization tools
• Evaluation and monitoring diagnostics

IPSL implementation

GAME-CERFACS implementation

ANR MN2013 CONVERGENCE

Task 3.3 : Runtime Environment

Leaders : Arnaud Caubel and Marie-Alice Foujols
Contributors : IPSL, CERFACS, IDRIS, CNRS-GAME
Help and expertise : TGCC, MDLS

Task | IPSL | CERFACS | CNRS-GAME | IDRIS | TGCC | MDLS
3.1 Process assignment | X X X
3.2 Optimisation, load balancing | X X X
3.3 Climate Simulations Supervision | X X X

Task 3.3 : Climate Simulations Supervision

[Diagram labels: Launch a simulation with libIGCM; IDRIS or TGCC; Computing; Post-processing; Supervisor; User; Jussieu?; Commands]

Objective : libIGCM self-healing application : more reliability, less human intervention

Task 3.3 : Climate Simulations Supervision

Context
- one simulation
  • 3 weeks running, 100 000 files, 25 TB, 1000 jobs : 40 computing and post-processing
- static workflow vs dynamic workflow

Development of a supervisor agent
- detect and understand failure events
- understand the ultimate goals of the workflow
- re-plan, re-schedule, re-map the workflow

Tasks for the supervisor
- event log kept as a comprehensive call tree (job submission, work to be done, each cp, ...)
- reliable lightweight communication channel between client agents and server agents (RabbitMQ implementation of AMQP)
- call tree traversal capabilities to determine the checkpoint to restart from
- autonomous rescheduling of necessary jobs
- monitoring capabilities : coloured graphs with all jobs and their status
- regression test handling capabilities
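The call tree traversal used to pick a restart point can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Node` schema, the status strings and the `is_checkpoint` flag are hypothetical and do not describe the supervisor's actual design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One logged event in the call tree (hypothetical schema)."""
    name: str                     # label of the logged job or step
    status: str                   # "done", "failed" or "pending"
    is_checkpoint: bool = False   # True if a restart can begin after this node
    children: List["Node"] = field(default_factory=list)


def find_restart_point(root: Node) -> Optional[Node]:
    """Walk the tree depth-first in submission order.

    Returns the last completed checkpoint seen before the first failed
    node, the root if a failure occurs before any checkpoint, or None
    if no node has failed.
    """
    last_checkpoint: Optional[Node] = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.status == "failed":
            return last_checkpoint if last_checkpoint is not None else root
        if node.status == "done" and node.is_checkpoint:
            last_checkpoint = node
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))
    return None


tree = Node("simulation", "pending", children=[
    Node("month-1", "done", is_checkpoint=True),
    Node("month-2", "done", is_checkpoint=True),
    Node("month-3", "failed"),
    Node("month-4", "pending"),
])
restart = find_restart_point(tree)
print(restart.name if restart else "nothing to restart")
```

In this toy tree the third step failed, so the traversal proposes restarting after the second one. A real supervisor would also need to decide which downstream post-processing jobs to reschedule.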

Task 3.3 : Climate Simulations Supervision

Milestone | Date | Description | Partners (IPSL, CERFACS, CNRS-GAME, IDRIS, TGCC, MDLS)
MS3.3a | M12 : 10/2014 | Supervisor agent architectural design | X X X x
MS3.3b | M24 : 10/2015 | Supervisor agent release candidate : enabling control channel, full event logs, call tree traversal capabilities and regression test handling | X X X x
MS3.3c | M48 : 10/2017 | Supervisor agent final release : successful rescheduling for known failures | X X X x

Task 3.3 : Climate Simulations Supervision

Additional manpower :
• CDD (fixed-term contract) 21 pm IPSL (tasks 3.1, 3.2, 3.3) + CDD/IDRIS 6 pm
• Subcontractor IPSL 42 pm (task 3.3)
• TGCC/CEA : service contract?

Success criteria :
• A significant number of "standard" (i.e. non-expert) users of Earth System models launch typical climate simulations (including developments done in this WP) using the libIGCM runtime environment on HPC centres (IDRIS and TGCC)

Identified risks :
• if the supervisor agent cannot be installed : lighter installation that issues warnings instead of corrections
• the supervisor must be as transparent as possible : lighter usage, i.e. de/activation of main tasks/secondary tasks

Planning for next 6 months :
• Meeting/workshop to plan, to discuss "Supervisor Design" (task 3.3)

[Diagram: generic job AA_Job. Labels: Computing job (PeriodLength, PeriodNb); Post-processing jobs (RebuildFrequency, PackFrequency, SeasonalFrequency, TimeSeriesFrequency)]
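How the frequency parameters could drive post-processing can be sketched as below. This is a simplified illustration, not libIGCM's actual logic. It assumes frequencies are written as a number followed by `M` (months) or `Y` (years), and that a post-processing step is due whenever the number of simulated months is a multiple of its frequency. The helper names `to_months` and `due_steps` are hypothetical, and the frequency values are example settings.

```python
def to_months(freq: str) -> int:
    """Convert a frequency such as '1M' or '10Y' to a number of months."""
    value, unit = int(freq[:-1]), freq[-1]
    return value * {"M": 1, "Y": 12}[unit]


def due_steps(months_done: int, config: dict) -> list:
    """Return the frequency parameters whose period has just elapsed.

    months_done is the number of simulated months completed so far (> 0).
    """
    return [name for name, freq in config.items()
            if months_done % to_months(freq) == 0]


# Example settings (assumed values for illustration).
config = {
    "RebuildFrequency": "1Y",
    "PackFrequency": "10Y",
    "SeasonalFrequency": "50Y",
    "TimeSeriesFrequency": "10Y",
}

for months in (7, 12, 120, 600):
    print(months, due_steps(months, config))
```

With these settings, nothing is due after 7 months, only the rebuild step after 1 year, rebuild, pack and time series after 10 years, and all four after 50 years.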

TGCC computers and file system in a nutshell (October 2013)

[Diagram; recoverable labels follow]

Computers :
• curie front-end
• curie thin nodes : -q standard
• curie hybrid nodes : -q hybrid
• curie large nodes : -q xlarge
• airain front-end, airain nodes

File system :
• $HOME : sources, small results
• $CCCWORKDIR : IGCM_OUT : MONITORING/ATLAS; dods/work
• $SCRATCHDIR : temporary REBUILD; IGCM_OUT : files to be packed, outputs of post-proc jobs
• $CCCSTOREDIR : IGCM_OUT : packed results (Output, Analyse SE and TS); dods/store
• HPSS : robotic tapes

Transfer commands : cp, dods_cp, ccc_hsm get

Legend : temporary space; saved space; non-saved space; space on tapes; small precious files; visible from www; quotas; login; compute

TGCC : computing and post-processing jobs

[Diagram; recoverable content follows]
• Compute (curie) : Job_EXP00, one run per PeriodLength; writes to $SCRATCHDIR/IGCM_OUT/XXX/Output, $SCRATCHDIR/IGCM_OUT/XXX/Restart Debug and $SCRATCHDIR/IGCM_OUT/.../REBUILD
• Post (curie), RebuildFrequency : rebuild
• Post (curie), PackFrequency : pack_output (ncrcat) to $CCCSTOREDIR/IGCM_OUT/XXX/Output; pack_restart and pack_debug (tar) to $CCCSTOREDIR/IGCM_OUT/.../RESTART DEBUG
• Post (curie), TimeSeriesFrequency : create_ts, monitoring
• Post (curie), SeasonalFrequency : create_se, atlas
• TS and SE : $CCCSTOREDIR/IGCM_OUT/… dods/store
• MONITORING and ATLAS : $CCCWORKDIR dods/work
• DodsCopy=TRUE/FALSE

IDRIS computers and file system in a nutshell (October 2013)

[Diagram; recoverable labels follow]

Computers :
• ada compute
• adapp compute
• adapp front-end
• turing front-end
• turing compute

File system :
• $HOME : sources, small results
• $WORKDIR : temporary REBUILD; IGCM_OUT : files to be packed, outputs of post-proc jobs
• $TMPDIR
• gaya (robotic tapes) : IGCM_OUT : Output, Analyse, MONITORING/ATLAS
• dods

Transfer commands : mfput/mfget, dmput/dmget, dods_cp

Legend : temporary space; saved space; non-saved space; space on tapes; small precious files; visible from www; login; compute

IDRIS : computing and post-processing jobs

[Diagram; recoverable content follows]
• Compute (ada) : Job_EXP00, one run per PeriodLength; writes to $WORKDIR/IGCM_OUT/XXX/Output, $WORKDIR/IGCM_OUT/XXX/Restart Debug and $WORKDIR/IGCM_OUT/.../REBUILD
• Post (adapp), RebuildFrequency : rebuild
• Post (adapp), PackFrequency : pack_output (ncrcat) to gaya:IGCM_OUT/XXX/Output; pack_restart and pack_debug (tar) to gaya:IGCM_OUT/.../RESTART DEBUG
• Post (adapp), TimeSeriesFrequency : create_ts, monitoring
• Post (adapp), SeasonalFrequency : create_se, atlas
• gaya:IGCM_OUT/… dods.idris.fr
• DodsCopy=TRUE/FALSE

CM5AEH01 : 1850-2350

CM5AEH01, 500 years : 1850-2349
• PeriodLength=1M : 240 computing jobs (PeriodNb=12, 60)
• RebuildFrequency=1Y : 432 rebuild
• PackFrequency=10Y : 43 pack_debug, 43 pack_restart, 43 pack_output
• SeasonalFrequency=50Y : 8 create_se and 32 atlas
• TimeSeriesFrequency=10Y : 757 create_ts and 43 monitoring

• 12 manual interventions : 12/1641 = 0.73%
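The 1641 total and the 0.73% rate can be checked from the job counts listed above. The last part of the sketch rests on an assumption that the slide does not state: that "PeriodNb=12, 60" means each computing job ran either 12 or 60 one-month periods. Under that reading, the 240 jobs covering 6000 months split in exactly one way.

```python
# Job counts as listed for CM5AEH01.
jobs = {
    "computing": 240,
    "rebuild": 432,
    "pack_debug": 43,
    "pack_restart": 43,
    "pack_output": 43,
    "create_se": 8,
    "atlas": 32,
    "create_ts": 757,
    "monitoring": 43,
}
total_jobs = sum(jobs.values())
manual_interventions = 12
rate = manual_interventions / total_jobs
print(total_jobs, f"{rate:.2%}")

# Assumption: every computing job ran either 12 or 60 monthly periods.
# Solve n12 + n60 = 240 and 12*n12 + 60*n60 = 500 years * 12 months.
total_months = 500 * 12
n60 = (total_months - 12 * jobs["computing"]) // (60 - 12)
n12 = jobs["computing"] - n60
print(n12, n60)
```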

Incident | Detection | Remedy | Supervision
1. Fatal error in computing job | One e-mail | Clean_month and relaunch | 3 attempts
2. Fatal error in post-processing, then fatal error in computing | Two e-mails | Clean_year and relaunch | 3 attempts
3. Computing job missing | Manual : log in and run RunChecker.job | Clean_month and relaunch | heartbeat
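The two supervision mechanisms named in the table, a bounded number of relaunch attempts and a heartbeat to spot a missing job, might look like the sketch below. The function names, the callables and the timeout value are all hypothetical. In practice `run_job` and `cleanup` would wrap the job submission and the clean_month or clean_year step.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # matches the "3 attempts" in the table


def relaunch_with_cleanup(run_job: Callable[[], bool],
                          cleanup: Callable[[], None],
                          max_attempts: int = MAX_ATTEMPTS) -> int:
    """Run the job; after each failure, clean up and try again.

    Returns the number of attempts used, or raises RuntimeError once
    max_attempts runs have all failed (manual intervention needed).
    """
    for attempt in range(1, max_attempts + 1):
        if run_job():
            return attempt
        cleanup()  # leave a clean state after every failed run
    raise RuntimeError(f"job still failing after {max_attempts} attempts")


def is_job_missing(last_heartbeat: float, now: float, timeout_s: float) -> bool:
    """A job is considered missing if no heartbeat arrived within timeout_s."""
    return now - last_heartbeat > timeout_s


# Toy run: the job fails twice, then succeeds on the third attempt.
outcomes = iter([False, False, True])
cleanups = []
attempts = relaunch_with_cleanup(lambda: next(outcomes),
                                 lambda: cleanups.append("clean_month"))
print(attempts, cleanups)
print(is_job_missing(last_heartbeat=0.0, now=7200.0, timeout_s=3600.0))
```

Raising an error after the last attempt keeps the human in the loop for failures the supervisor does not know how to fix.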

CM5AEH01 : RunChecker.job

CM5AEH01 : errors encountered

Computing job errors :
• 2123-11 and 2130-07 : Fatal : error writing restartphy, job stuck for a few hours; clean_month and relaunch
• 2206-04 : Fatal : SLURM error; clean_month and relaunch
• 2249-03 : Fatal : stuck for 3 hours, killed; clean_month and relaunch

Post-processing job errors :
• 1999, 2000, 2118 and 2127 : pack_restart (1999) and rebuild hit the time limit; pack_r and rebuild relaunched and, where needed, pack_output (2119, 2129)
• 2166 and 2174 : rebuild failed : IGCM_sys_rebuild[1860]: /ccc/cont003/home/dsm/p86ipsl/X64_CURIE/bin/rebuild: cannot execute [Permission denied]; rebuild relaunched
• 2059, 2079, 2119 and 2129 : pack_output launched too early; pack_output relaunched

Other errors :
• 13 monitoring failures : unstable environment problems (nco) between 10/3 and 30/31
• 1 rebuild submission failed, resource temporarily unav : 3 attempts in libIGCM v2.2
• IDRIS : all jobs disappeared
