
Data Management for Large-Scale Scientific Computations in High Performance Distributed Systems

A. Choudhary, M. Kandemir, J. No, G. Memik, X. Shen, W. Liao, H. Nagesh,
S. More, V. Taylor, R. Thakur, and R. Stevens

Center for Parallel and Distributed Computing
Department of Electrical and Computer Engineering
Northwestern University
Evanston, IL 60208
{choudhar,memik,xhshen,wkliao,harsha,ssmore,taylor}@ece.nwu.edu

October 25, 1999

Abstract

With the increasing number of scientific applications manipulating huge amounts of data, effective high-level data management is an increasingly important problem. Unfortunately, so far the solutions to the high-level data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file storage systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs).

In this paper we present a novel application development environment which is built around an active meta-data management system (MDMS) to handle high-level data in an effective manner. The key components of our three-tiered architecture are the user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability. The high levels of performance are achieved by the MDMS, with the aid of user-specified, performance-oriented directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques for the application at hand to the MDMS. We discuss the importance of an active MDMS and show how the three components of our environment, namely the application, the MDMS, and the HSS, fit together. We also report performance numbers from our ongoing implementation and illustrate that significant improvements are made possible without undue programming effort.

1 Introduction

Many large-scale scientific applications are data intensive, processing large data sets ranging from megabytes to terabytes. These include applications from data analysis and mining, image processing, large archive maintenance, real-time processing, and so on. As an example, consider a typical computational science analysis cycle, shown in Figure 1. There are several steps involved in this cycle, including mesh generation, domain decomposition, simulation, visualization and interpretation of results, archival of data and results for post-processing and check-pointing, and adjustment of parameters.

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, [email protected].

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, e-mail: {jano,thakur,stevens}@mcs.anl.gov.


Figure 1: A typical computational science analysis cycle.

Thus, it is not sufficient to consider simulation alone when determining how to store or access data sets, because the data sets in question are used in other phases as well. In addition, these steps may need to be performed in a heterogeneous distributed environment. These considerations require storing visualization and checkpoint data (which can run into the 100s of megabytes to terabytes range) in a canonical form so that other tools can use them easily without having to re-organize the data. Furthermore, re-starting a computation with a different number of processors requires that data storage be independent of the number of processors that produced it. Such requirements present challenging I/O-intensive problems, and an application programmer may be overwhelmed if required to solve these problems.

Designing efficient I/O schemes for such I/O-intensive problems demands expert knowledge and is not a suitable task for a computational scientist. Consequently, with the increasing number of applications that manipulate huge amounts of data, the effective data management problem is becoming increasingly important. Although this problem has been addressed throughout the years at different levels, there is still little consensus on how to balance ease-of-use and efficiency.

At one extreme of the spectrum there are high-performance parallel file systems (e.g., Intel's PFS [25] and IBM's Vesta [8]) that have been built to exploit the parallel I/O capabilities provided by modern architectures. They achieve this goal by adopting smart I/O optimization techniques such as prefetching [17], caching [22, 5], and parallel I/O [15, 10]. However, there are serious obstacles preventing file systems from becoming a real solution to the high-level data management problem. First of all, the user interfaces of these file systems are low-level [21]. They force users to express the access patterns of their codes using file pointers, byte offsets, etc., which do not directly match the applications' data structures, which are large multi-dimensional arrays, images, and so forth. Second, every file system comes with its own set of I/O calls, which renders ensuring program portability a very difficult task. The third problem with file systems is that file system policies and related optimizations are in general hard-coded and tuned to work well only for a few commonly occurring cases. As noted by Karpovich et al. [19], among others, even if the programmer has full knowledge of the access patterns of her code, it is difficult to convey this information to the file system in a convenient way. Overall, high levels of I/O performance from parallel file systems are possible only if significant sacrifices are made in portability, code reuse, and ease of programming. Notice also that although the parallel I/O libraries (e.g., [6]) built on top of parallel file systems have the potential to provide both ease-of-use and high performance through the use of advanced I/O optimization techniques such as collective I/O [14, 20] and array chunking [26], their extensibility is severely limited by the design principles and the programming language used in their implementation [19].

At the other end of the spectrum are database management systems (DBMSs). They provide a layer on top of file systems which is portable, extensible, easy to use and maintain, and which allows a clear and natural interaction with the applications by abstracting out the file names and file offsets.


Figure 2: Three-tiered architecture (the three sites in this figure, Argonne National Lab, San Diego Supercomputer Center, and Northwestern University, illustrate the current experimental setup used to evaluate the architecture).

However, these advantages do not come for free. Since their main target is to be general purpose, they cannot provide high performance on a specific platform. Additionally, the data consistency and integrity semantics provided by almost all DBMSs put an added obstacle to high performance. Applications that process large amounts of read-only data suffer unnecessarily as a result of these integrity constraints [19]. And finally, most DBMSs support only a very limited set of data types and data manipulation models.

In this paper we present a novel approach to the high-level data management problem. Our approach tries to combine the advantages of file systems and databases, while avoiding their respective disadvantages. It provides a user-friendly programming environment which allows easy application development, code reuse, and portability; at the same time, it extracts high I/O performance from the underlying parallel I/O architecture by employing advanced I/O optimization techniques like data sieving and collective I/O. It achieves these goals by using an active meta-data management system (MDMS) that interacts with the parallel application in question as well as with the underlying hierarchical storage environment.

The proposed programming environment has three key components: (1) the user program; (2) the meta-data management system (MDMS); and (3) the hierarchical storage system (HSS). These three components can exist at the same site or can be fully distributed across distant sites. For example, as part of our experiments we run a parallel volume rendering application on the SP-2 at Argonne National Lab that interacts with the MDMS located at Northwestern University and accesses its data files (currently using TCP/IP) stored on the HPSS (High Performance Storage System) [11] installed at the San Diego Supercomputer Center. The experimental configuration is depicted in Figure 2. The functionalities of the thick double-arrows will be explained in subsequent sections.

The remainder of this paper is organized as follows. In Section 2 we present the details of the MDMS, which is built using Object-Relational DBMS (OR-DBMS) technology [27]. In Section 3 we discuss the HSS, focusing in particular on the advanced I/O optimizations and on how they are activated using user directives. In Section 4 we discuss the basic I/O commands used in applications (i.e., the user interface of our architecture) and make a case for a simpler set of well-implemented I/O functionalities. In Section 5 we present performance numbers from our initial implementation using several applications. In Section 6 we briefly discuss related work on high-level scientific data management and I/O optimizations.


In Section 7 we conclude the paper and briefly discuss ongoing work.

2 Meta-data Management System (MDMS)

Our MDMS is an active middleware currently being built at Northwestern University with the aim of providing a uniform interface to data-intensive applications and hierarchical storage systems. Applications and the HSS can communicate with the MDMS to obtain high-performance I/O from the underlying architecture. The main functions fulfilled by the MDMS can be summarized as follows:

• It stores information about the abstract storage devices (ASDs) that can be accessed by applications. By querying the MDMS, the applications can learn where in the HSS their data reside (i.e., in what part of the storage hierarchy) without the need to specify file names. They can also access the performance characteristics (e.g., speed) of the ASDs and select a suitable ASD (e.g., a disk sub-system consisting of eight separate disk arrays) to store their data sets.

• It stores information about the storage patterns and access patterns of data sets. For example, a specific multi-dimensional array that is striped across four disk devices in round-robin manner will have an entry in the MDMS. The MDMS utilizes this information in a number of ways. The most important usage of this information, however, is to decide on a parallel I/O method based on the access patterns (hints) provided by the application. By comparing the storage pattern and access pattern of a data set, the MDMS can, for example, advise the HSS to perform collective I/O [14] or prefetching [17] for the data set in question.

• It stores information about pending access patterns. It utilizes this information in taking some global decisions, possibly involving data sets from multiple applications (e.g., staging a number of related files from the tape sub-system to the disk sub-system, or migrating a number of files from the disk sub-system to the tape sub-system [11]).

• It stores information about the users and applications. A (user, application) pair is called a context. When a user starts to run an application, the MDMS determines the portions of its meta-data that she is allowed to see. As the application runs, the MDMS accumulates access statistics about the user as well as her data sets and utilizes this information in successive runs to enhance the I/O performance further. In other words, the MDMS also keeps meta-data for specifying access history and trail of navigation, though they are not covered in this paper.

Notice that the MDMS is not merely a data repository but also an active component in the overall data management process. It communicates with applications as well as with the HSS and can influence the decisions taken by both.
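As a small, purely illustrative sketch of the first function listed above (an application learning where its data reside and how they are stored by querying the MDMS), the meta-data query directive of Table 1 might be used through a string-based C helper as follows; the helper name mdms_query() and its signature are our own assumption, not the actual interface.

    /* Illustrative only: an application asking the MDMS where data set "P"
     * resides and how it is stored, via the meta-data query directive of
     * Table 1.  mdms_query() is a hypothetical string-based helper.        */
    #include <stddef.h>

    int mdms_query(const char *entity, const char *name,
                   const char *attribute, char *result, size_t len);

    void locate_dataset(void)
    {
        char location[64], pattern[64];
        mdms_query("dataset", "P", "location",        location, sizeof location);
        mdms_query("dataset", "P", "storage-pattern", pattern,  sizeof pattern);
        /* e.g., location -> "disk4", pattern -> "(BLOCK,*)" */
    }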

2.1 Directive Categories

The MDMS is built using OR-DBMS technology [27], which allows high expressiveness and extensibility. It keeps both system-level [4, 7] and user-level meta-data [16] about the data sets, data files, ASDs, physical storage devices (PSDs), access patterns, and users. Its communication with the user application is through so-called user directives. Although these directives come in a variety of flavors, there are two important groups:

• Layout Directives: The application can have some control over how its data are laid out in the HSS. These directives are strong in the sense that (unless they are inconsistent with each other) the MDMS should take care of them and should advise the HSS to take the necessary steps. An example would be a (storage pattern) directive that tells the MDMS that the application wants a specific data set to be stored in a particular fashion. Another example would be a (usage pattern) directive that tells the MDMS to advise the HSS to migrate a specific data set from disk to tape, probably because it will not be used in the remainder of the application. These directives are important indications about particular uses of data in subsequent computations, and in most cases they either tell how a data set should be created in the HSS (if the data set does not exist) or how a data set should be re-organized, e.g., re-striped or re-distributed (if it already exists).


• Access Pattern Directives: These directives are used as hints which indicate that the user's application is about to start a specified sequence of I/O operations on the HSS. In response to such a directive the MDMS can, for example, advise the HSS to perform a specific I/O optimization in accessing the relevant data. These optimizations include prefetching, caching, collective I/O, etc., and are mild in the sense that they do not imply a major data re-organization on the HSS side but rather enable a specific I/O optimization to be performed in order to reduce the I/O bottleneck.

Currently both types of directives are being implemented using embedded SQL (E-SQL) functions. It should be mentioned that a directive (whether strong or mild) may be rejected at either of two points. First, the MDMS may decline to take the necessary action due to, for example, the fact that the directives used together are not consistent with each other. Second, even if the MDMS advises an I/O optimization to the HSS, the HSS may reject it due to its current state (e.g., overloaded). It should be stressed that the HSS is the only component in the proposed architecture that knows the details about its physical I/O resources and is free to take any action if doing so helps to improve the overall I/O performance.

One might wonder at this point why the application code does not negotiate directly with the HSS instead of using an MDMS (and incurring its overhead). This is not a reasonable solution for at least two reasons. First, the application program does not need to know the details of the HSS; without those details it would be very difficult to decide on appropriate I/O optimizations. In the proposed architecture the user does not need to know where her data sets reside on the HSS and what their storage patterns are, though she can obtain this information by querying the MDMS. The second factor that prevents the user code from communicating directly with the HSS is the fact that a user's code in general cannot have global information about the other applications concurrently using the HSS. Managing the overall I/O activity effectively requires a global knowledge of all users' access patterns and I/O resources; that information is available as meta-data in the MDMS. It should be noted, however, that the actual data transfer always occurs directly between the application and the HSS once the appropriate I/O method has been decided. It should also be mentioned that it is a viable alternative to implement the MDMS as a part of (or on top of) the HSS as long as the semantics of their interaction are preserved.

2.2 Individual Directives

Using directives, an application can convey information about its expected I/O activity to the MDMS. As a minimum, we expect the application's user to know how her data will be used by parallel processors (henceforth we call this information the access pattern). However, in general, the more information is provided to the MDMS, the better the I/O optimizations that will be enabled. Table 1 shows the types of information that can be provided by the application to the MDMS using directives. These directives can be combined in meaningful ways and can be applied to a number of data sets simultaneously, as explained below.

In this environment, an access pattern for a data set is specified by indicating how the data set is to be divided and accessed by parallel processors. For example, an access pattern such as (BLOCK,*) says that the data set in question is divided (logically) into groups of rows and each group of rows will be mostly accessed by a single processor. The number of row-groups can also be specified using another directive. Notice that this access pattern information does not have to be very accurate, as the processors may occasionally access data elements outside their assigned portions, and the exact set of elements accessed will be specified in the actual read/write I/O calls. A few frequently used access patterns are depicted in Figure 3(a) for a four-processor case. Each processor's portion is shaded using a different style. A (BLOCK,BLOCK) access pattern indicates that each processor will mostly access a rectangular block, and a (*,CYCLIC) pattern involving p processors implies that each processor will mostly access every p-th column of the data set.


Table 1: User directives.

directive and usage: explanation

storage pattern directive, storage A(<ptrn>, <ptrn>, ...): declares a storage pattern for the HSS-resident data set A; each <ptrn> can be BLOCK, CYCLIC, BLOCK-CYCLIC(b), or *.
access pattern directive, access A(<ptrn>, <ptrn>, ...): declares an access pattern for the HSS-resident data set A; each <ptrn> can be BLOCK, CYCLIC, BLOCK-CYCLIC(b), or *.
abstract storage directive, asd D(<np>, <np>, ...): declares an abstract storage device (ASD) and the number of processors involved; e.g., DISK(4,4) indicates that a 4 × 4 processor array will access a disk storage.
I/O type directive, iotype A(<type>): declares the type of I/O that will be performed on data set A; <type> can be read-only (RO), write-only (WO), or read-write mix (RW).
sequentiality directive, seqinfo A(<seq,b>, <seq,b>, ...): informs about the access pattern for each dimension of data set A; each <seq,b> can be (sequential,*), (strided,B), or (variable,*), where B is the stride.
repetition directive, repeat A(<rep>): informs about how many times data set A will be accessed; <rep> can be only-once (OT) or multiple-times (MT).
usage pattern directive, usage A(<usg>): informs about what to do with data set A after this point in the program; <usg> can be purge (PG) or migrate (MG).
association (abstract data set space) directive, associate (A,B,C,...) with T: declares that data sets A, B, C, ... are associated with T; an association implies that the concerned data sets will be treated similarly.
data set size directive, dsize A(<size-in-bytes>): declares an approximate size for data set A; e.g., dsize A(16,777,216) indicates that data set A is approximately 16 MB.
request size directive, rsize A(<rs>): declares an approximate request size for data set A; <rs> can be small request (SR), large request (LR), or variable request (VR).
meta-data query directive, query (entity, name, parameter): queries a parameter of an entity (data set, ASD, association, etc.); e.g., query(dataset, A, storage-pattern) returns the storage pattern for data set A.


A star '*', on the other hand, indicates that the dimension in question is not partitioned across processors; depending on the parameters of the accompanying asd directive, such an access pattern may imply either replication (if multiple processors are involved) or exclusive access (if a single processor is involved).

In our framework, these patterns are also used as storage patterns. For example, a (BLOCK,*) storage pattern corresponds to row-major storage layout (as in C), a (*,BLOCK) storage pattern corresponds to column-major storage layout (as in Fortran), and a (BLOCK,BLOCK) storage pattern corresponds to blocked storage layout, which might be useful for large-scale linear algebra applications whose data sets are amenable to blocking [30].

As an example, consider the following scenario. An I/O-intensive application executes in three steps, manipulating five two-dimensional data sets (arrays) P, Q1, Q2, R1 and R2 whose default disk layouts are assumed to be row-major (BLOCK,*). Step (1): a single processor reads the data set P and broadcasts its contents to the other processors. Step (2): the data set Q1 is created by four processors collectively in row-major order (BLOCK,*) on the disk sub-system; also, the data set Q2 is created by the same processors in (BLOCK,BLOCK) manner. Finally, Step (3): two data sets, R1 and R2, are read by four processors collectively in row-major order from the disk sub-system, and then the application does some computation and terminates. In the following we show how these I/O activities can be specified using our user-level directives.

For Step (1), the I/O activity can be captured by the directive:

    access P(BLOCK,*)

Here it is assumed that the data set P is accessed row-wise; also, since no asd directive appears, a single processor is assumed. For the I/O activity of Step (2), we can use

    access Q1(BLOCK,*) asd DISK(4)
    access Q2(BLOCK,BLOCK) asd DISK(2,2)

The user indicates that four processors will (most probably) write onto four disjoint parts of Q1 (i.e., four row-blocks). The system now has several options in response to this directive. It can, for example, use four disks and store each processor's portion on a separate disk (as shown in Figure 3(b)). This will allow each processor to do I/O independently from the others, maximizing the I/O parallelism. Alternatively, the whole data set Q1 (actually the file that contains it) can be striped over four disks (as shown in Figure 3(c), assuming that each processor's portion contains four stripe units of data). Although this storage style does not necessarily lead to conflict-free disk accesses, in most practical cases it allows sufficient I/O parallelism. Also, more intelligent striping methods such as the one shown in Figure 3(d) can eliminate the potential disk conflicts.

As for the directive that involves Q2, the system again has a number of options. The most interesting one, however, is the one that uses collective I/O [14]. Since the default (i.e., final) disk layout is row-major (BLOCK,*) and the processors will write data in (BLOCK,BLOCK) fashion, a data re-organization between processors might be necessary for high performance. We will elaborate on this issue later in this paper. Finally, for Step (3) the following directives can be used:

    associate (R1,R2) with T
    access T(BLOCK,*) asd DISK(4)

First, the associate directive indicates that R1 and R2 will be treated together (i.e., accessed, staged, and migrated in similar fashions), as shown in Figure 3(e). Then the second directive conveys the access pattern. Note that since the storage pattern and the access pattern are the same, no collective I/O is required. An abstract data set space (T, in our example) is a dummy data set variable which helps to specify the desired access pattern of a data set by describing its relationship to other data sets. For example, 'associate (A,B,C) with T' implies that whenever the data set A is accessed, the data sets B and C will also likely be accessed. This information can then be used to pre-stage the data sets B and C from tape to disk (if they are not already on disk) whenever A is accessed. The associate directive also provides convenience in specifying the access pattern of data sets with respect to each other, as in Step (3) of the scenario discussed above.


Figure 3: (a) Different access patterns. (b-d) Non-striping vs. striping. (e) Abstract data set space. (f) Contention on disks.

It implies that the corresponding portions of R1 and R2 will be used (mostly) by the same processor, and therefore should preferably be stored on the same disk.

It should be emphasized that the main idea behind using directives is to help the system match the access pattern (i.e., how data is accessed) and the storage pattern (i.e., how data is stored). When, for example, a DISK(4) directive is sent to the MDMS, the application program indicates that four processors will access the data set in question in parallel from disk(s). While the best I/O parallelism can be obtained by allocating each processor's portion on a separate disk, the system does not necessarily have to do so. It should be noticed, though, that if fewer than four disks (or I/O nodes) are used, the I/O parallelism will suffer because of possible contention on the disks (as shown in Figure 3(f)). Thus, a DISK(4) directive essentially reveals to the system that for maximum I/O parallelism at least four disk devices are needed. It should also be noted that the directives explained above are high-level and constructed using the names of the data sets used in the application, which are intuitive to the user. Contrast this with a classical file system interface that boils everything down to linear streams of bytes and offsets.
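To make the scenario above concrete from the application side, the sketch below shows how its three steps might look if the directives were issued through a C binding. The function names mdms_directive(), hss_read(), and hss_write() and their signatures are hypothetical, introduced only for illustration; the actual user interface is discussed in Section 4.

    /* A minimal sketch of the three-step scenario of Section 2.2, assuming a
     * hypothetical C binding in which each directive is passed to the MDMS as
     * a string and the actual I/O goes through separate read/write calls.    */
    void mdms_directive(const char *directive);                 /* hypothetical */
    int  hss_read(const char *dataset, void *buf, const char *portion);
    int  hss_write(const char *dataset, const void *buf, const char *portion);

    void analysis_steps(double *P, double *Q1, double *Q2, double *R1, double *R2)
    {
        /* Step (1): one processor reads P; no asd directive => single processor */
        mdms_directive("access P(BLOCK,*)");
        hss_read("P", P, "*");               /* P is then broadcast to the others */

        /* Step (2): four processors create Q1 and Q2 */
        mdms_directive("access Q1(BLOCK,*) asd DISK(4)");
        mdms_directive("access Q2(BLOCK,BLOCK) asd DISK(2,2)");
        hss_write("Q1", Q1, "*");            /* access == storage: independent I/O */
        hss_write("Q2", Q2, "*");            /* access != storage: the MDMS may    */
                                             /* advise the HSS to use collective I/O */

        /* Step (3): R1 and R2 are read together through the abstract space T */
        mdms_directive("associate (R1,R2) with T");
        mdms_directive("access T(BLOCK,*) asd DISK(4)");
        hss_read("R1", R1, "*");
        hss_read("R2", R2, "*");
    }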

2.3 Shorthand Notations

In many cases there are a number of data sets that contain the same type of data, differing only in the date on which the data have been collected. As an example, a broadcast agency can use similar data sets for the broadcast data collected on different days. It can, for example, use names like BC001, BC002, ..., BC365 as data set names, where BC001 is the data set for the first day of the year, and so on. In cases such as this it might be very convenient to have some wild-card notations in order to relieve the user from entering the same directive for each data set. Our current notation closely follows the one proposed by Musick et al.:

'*' The single star notation, as in 'BC*', indicates that the operation in question should be applied to all data sets with prefix 'BC'. This is exactly the same way that the star is used in the UNIX file system.


Figure 4: Internal representation in the MDMS.

For example, associate R* with T will cause all the data sets whose names start with prefix R to be associated with the abstract data set space T.

'**' The double star notation connects a number of names in the same directive. For example, associate (R**,S**) with T** means that we want to apply associations to the instances of data sets whose names start with R and S, with matching suffixes. If we have the data sets with names R1, S1, R2, and S2, then the effect of the associate directive given above will be as if we entered associate (R1,S1) with T1 followed by associate (R2,S2) with T2.

2.4 Implementation of User Directives

Internally, the MDMS keeps its meta-data in the form of database tables (relations). Figure 4 shows the most important parts of each table in our ongoing implementation. Using an OR-DBMS [27] instead of a pure relational DBMS brings the advantage of using pointers (hence avoiding duplication of meta-data in different tables) as well as of extending meta-data as the need arises (using inheritance and/or collection data types [27], for instance). Notice that almost every table has a field (attribute) called user-level meta-data. Such a field contains a pointer to a separate table where the MDMS stores the user-level meta-data (i.e., the meta-data that help the user find her data or obtain information about the storage sub-systems currently available in the HSS). For example, the user-level meta-data field for a file entity can contain information (meta-data) on who created the file, when it was created, what its current size is, and when it was last modified.

The example meta-data entries shown in Figure 4 indicate that two data sets, P and Q, are associated with an abstract data set space T and are stored in (BLOCK,*) fashion (i.e., row-major) in files file-P and file-Q, respectively. These files reside on an ASD called disk4 (which, in turn, can be implemented using 4 physical disk devices). Also, there are two pending access patterns of style (BLOCK,BLOCK) on these data sets that have been initiated by a user whose identity is id9. It should be mentioned that the meta-data shown in Figure 4 do not contain all the data required, but rather an important subset for illustrative purposes.
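For illustration only, the sketch below mirrors the example entries described above as C structures. The field names are our own reconstruction of the kind of information kept in the tables of Figure 4, not the actual OR-DBMS schema (which in the implementation lives inside the database, not in application code).

    /* Hypothetical, simplified view of the MDMS relations described above.
     * In the real system these are OR-DBMS tables; plain C structs are used
     * here only to show what kind of meta-data is kept and how it is linked. */
    struct user_meta   { char owner[32]; char created[32]; long size; char modified[32]; };
    struct asd_entry   { char name[32];  int num_physical_devices; struct user_meta *ulm; };
    struct file_entry  { char name[64];  struct asd_entry *resides_on; struct user_meta *ulm; };
    struct dataset     { char name[32];
                         char storage_pattern[32];          /* e.g., "(BLOCK,*)" */
                         struct file_entry *stored_in;      /* e.g., file-P      */
                         char associated_space[32];         /* e.g., "T"         */
                         struct user_meta *ulm; };
    struct pending_acc { char dataset_name[32]; char access_pattern[32]; char user_id[16]; };

    /* Example entries corresponding to Figure 4 (values taken from the text above). */
    static struct asd_entry   disk4  = { "disk4", 4, 0 };
    static struct file_entry  file_P = { "file-P", &disk4, 0 };
    static struct dataset     P      = { "P", "(BLOCK,*)", &file_P, "T", 0 };
    static struct pending_acc pend_P = { "P", "(BLOCK,BLOCK)", "id9" };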

2.5 Common Access (Sharing) Patterns

We studied several I/O-intensive parallel programs to identify commonly occurring data access patterns, paying special attention to selecting a set of programs that clearly reflects the access behaviors when no special attention is paid to improving the high-level data accesses [18].


Table 2: Commonly occurring access patterns.

pattern: explanation and directive combination

read-parallel-mostly: this data set is repeatedly read by multiple processors significantly more frequently than it is written.
  access A(...) + asd (...) + iotype A(RO) + seqinfo A(...) + repeat A(MT)
read-serial-mostly: this data set is read by a single processor significantly more frequently than it is written.
  access A(...) + iotype A(RO) + seqinfo A(...) + repeat A(...) + usage A(...)
private (exclusively accessed): this data set is accessed (read and/or written) exclusively by a single processor.
  access A(...) + iotype A(...) + seqinfo A(...) + repeat A(...) + usage A(...)
write-parallel-mostly: this data set is repeatedly written by multiple processors significantly more frequently than it is read.
  access A(...) + asd (...) + iotype A(WO) + seqinfo A(...) + repeat A(MT)
write-serial-mostly: this data set is written by a single processor significantly more frequently than it is read.
  access A(...) + iotype A(WO) + seqinfo A(...) + repeat A(...) + usage A(...)
producer-consumer: this data set is written once by a single processor and then read by multiple processors.
  in producer part: access A(...) + iotype A(WO) + seqinfo A(...) + repeat A(OT) + usage A(...)
  in consumer part: access A(...) + asd (...) + iotype A(RO) + seqinfo A(...) + repeat A(MT)
read-write (mix-shared): this data set is accessed (read and/or written) multiple times by multiple processors.
  access A(...) + asd (...) + iotype A(RW) + seqinfo A(...) + repeat A(MT)

Although, as can be expected, there are great variances in the access patterns of data sets used in I/O-intensive codes, we have identified a limited set of commonly occurring sharing patterns.

Table 2 shows these patterns and how they can be specified by the users through our directives. A (...) means that a suitable parameter should be entered. If a directive is missing, it should not be entered. Although our applications may not reflect all possible access patterns that may be exhibited by scientific computations, we believe that it would be relatively straightforward to capture any uncommon pattern as well, using appropriate combinations of the directives given in Table 1.

3 HSS and Utilization of Meta-data

Although in our future experiments we intend to employ HPSS [11] as our primary HSS, any hierarchical storage system with a suitable API can be used for that purpose. Currently, we also use parallel file systems such as PFS and PIOFS to conduct experiments with disk-resident data sets. Basically, the HSS in our environment has two main tasks:

• It keeps the storage-related meta-data in the MDMS updated. This is important in order to present users with accurate information about the available I/O resources. In a way, the responsibility of updating the meta-data in the MDMS is divided between the application and the HSS. Any data re-organization performed independently by the HSS should be reflected in the MDMS.

• It honors the I/O optimization requests from the MDMS and the I/O requests from the user application, and returns results to the application.

(In the future, we intend to use the Datalinks software [13] from IBM. This will relieve us from the responsibility of updating the MDMS for independent HSS data re-organizations, as all the I/O activity on the data sets registered with DB2 will be intercepted and checked for security and consistency.)


Figure 5: Proposed architecture (a) and different scenarios (b) through (f).

3.1 Example Scenarios

The sketch of the proposed architecture is shown in Figure 5(a). In this three-tiered architecture an application can query meta-data and send I/O directives to the MDMS. The MDMS, in turn, can send the application the query results and can send I/O hints to the HSS (after evaluating the directives it takes from the application). The HSS honors the I/O requests from the application and the I/O hints from the MDMS and sends I/O results to the application (when required). It can also update the dynamic storage information kept in the MDMS (when it is necessary to do so). Figures 5(b) through (f) show five example scenarios. In Figure 5(b) the MDMS obtains an access pattern from the application and decides (after comparing it with the storage pattern information) to send a collective I/O hint to the HSS. Then the actual data transfer (which is a read call in this case) occurs directly between the application code and the HSS. Figure 5(c) depicts a similar situation, only this time the MDMS decides to send a prefetch hint (probably because the storage pattern and the access pattern in question are identical). Figure 5(d) shows the so-called independent mode of operation, where the application in question negotiates with the MDMS and obtains the file names and their locations in the HSS. Then it accesses the HSS to perform its I/O; no negotiation occurs between the MDMS and the HSS. In Figure 5(e) the application demands a new storage pattern for one of its data sets, and the MDMS sends the required data re-distribution hint to the HSS. And finally, in Figure 5(f) the application sends a usage pattern which in turn causes the MDMS to instruct the HSS to migrate the data set in question from the disk sub-system to the tape sub-system. Note that all these interactions between the application and the MDMS can be specified using the directives in Table 1.

3.2 Advanced I/O Optimizations

In order to exploit the capabilities of modern parallel I/O architectures, it is imperative to use advanced I/O techniques [15]. In principle, these techniques have two main objectives:

• enhancing I/O parallelism; that is, maximizing the number of storage units (e.g., disks) that can be kept busy at any given time interval, and

• improving the locality of I/O accesses; that is, accessing as much consecutive data as possible using as few I/O calls as possible (spatial locality), or maximizing the number of data accesses that can be satisfied from the fast components (i.e., higher levels) of the storage hierarchy (temporal locality).

Notice that these objectives can only be realized by careful data placement across storage devices and by careful computation decomposition across processors. Throughout the years, several I/O techniques have been designed and implemented [15]. Table 3 briefly summarizes the optimization techniques currently employed by our proposed framework.


Table 3: I/O optimization techniques and the corresponding trigger rules.

optimization: brief explanation [suggested if]

parallel I/O: performing I/O using a number of processors in parallel to improve the bandwidth [access ptrn ≠ (*,*,...,*) ∧ p ≠ 1]
collective I/O: distributing the I/O requests of different processors among them so that each processor accesses as much consecutive data as possible; it involves some extra communication between processors [access ptrn ≠ storage ptrn ∧ p ≠ 1]
sequential prefetching: bringing consecutive data into higher levels of the storage hierarchy before it is needed; it helps to overlap the I/O time and computation time, thus hiding the I/O latency [SQ ∧ RO ∧ SR]
strided prefetching: same as sequential prefetching except that data is brought in fixed strides (some elements are skipped) [ST ∧ RO ∧ SR]
caching & replacement policy: keeping the data to be used in the near future in the higher levels of the storage hierarchy (currently two replacement policies are used: LRU, least recently used, and MRU, most recently used) [SQ ∧ RO ∧ SR ∧ MT → LRU; SQ ∧ WO ∧ SR ∧ MT → MRU]
setting striping unit: selecting a striping unit such that as many storage devices (e.g., disks) as possible will be utilized [dsize used]
data migration: migrating data from higher levels of the storage hierarchy (e.g., disks) to lower levels of the storage hierarchy (e.g., tapes) [MG ∨ ¬MT ∨ OT]
data purging: removing data from the storage hierarchy; useful for temporary files whose lifetime is over [PG ∨ ¬MT ∨ OT]
pre-staging: fetching data from the tape sub-system to the disk sub-system before it is required [associate used]
disabling cache & prefetch: used when the benefit of caching or prefetching is not clear [LR]

More detailed descriptions can be found in the respective references cited at the end of this paper; here we only discuss collective I/O.

In many parallel applications, the storage pattern of a data set is in general different from its access pattern. The problem here is that if each processor attempts to read its portion of the data (specified in the access pattern), it may need to issue a large number of I/O calls. Suppose that four processors want to access (read) a two-dimensional array in (BLOCK,BLOCK) fashion. Assuming that the array's storage pattern is (BLOCK,*), each processor will have to issue many I/O calls (to be specific, n/2 read calls, each for n/2 consecutive data items, if we assume that the array is n × n). What collective I/O does, instead, is to read the data in (BLOCK,*) fashion (i.e., using the minimum number of I/O calls) and then re-distribute the data across the processors' memories to obtain the desired (BLOCK,BLOCK) pattern. That is, taking advantage of knowing the access and storage patterns of the array, we can realize the desired access pattern in two phases. In the first phase, the processors access the data in a layout-conformant way (i.e., (BLOCK,*) in our example), and in the second phase they re-distribute the data in memory among themselves such that the desired access pattern is obtained. Considering the fact that I/O in large-scale computations is much more costly than communication, huge performance improvements can be achieved through collective I/O.
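As a rough illustration of the kind of MPI-IO code this corresponds to (our high-level calls are being implemented on top of MPI-IO, see Section 4.1), the sketch below performs a collective read of a (BLOCK,BLOCK) portion from a row-major, (BLOCK,*)-stored array; MPI-IO's collective machinery carries out the two-phase optimization internally. The file name, array dimension, and 2 × 2 process grid are chosen here purely for illustration and are not taken from the paper.

    /* Sketch: each of 4 processes reads its (BLOCK,BLOCK) portion of an n x n
     * array of doubles stored row-major ((BLOCK,*)) in a single file, using a
     * collective MPI-IO read.                                                 */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024                      /* assumed global array dimension */

    int main(int argc, char **argv)
    {
        int rank, sizes[2] = {N, N}, subsizes[2] = {N/2, N/2}, starts[2];
        MPI_Datatype filetype;
        MPI_File fh;
        double *buf = malloc((N/2) * (N/2) * sizeof(double));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* place this process on a 2 x 2 logical grid */
        starts[0] = (rank / 2) * (N/2);          /* row offset of my block    */
        starts[1] = (rank % 2) * (N/2);          /* column offset of my block */

        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, "Q2.dat",  /* hypothetical file name */
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, buf, (N/2) * (N/2), MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }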

The last column in Table 3 gives the conditions under which the respective optimizations will be suggested by the MDMS to the HSS. For example, collective I/O is considered only if the access pattern ≠ the storage pattern and multiple processors are involved (p denotes the number of processors). The symbols ∨ and ∧ are used for logical or and logical and operations, respectively. SQ denotes 'sequential' and ST means 'strided'; the other abbreviations are from Table 1.
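Purely as an illustration of how such trigger rules can be evaluated against the directive values a data set has accumulated (in the actual system this logic resides in the MDMS, close to the database layer), a short sketch with hypothetical field names follows.

    /* Illustrative only: evaluating two of the Table 3 trigger rules in C.
     * The structure fields mirror the directive values of Table 1; they are
     * hypothetical names, not the MDMS schema.                               */
    #include <string.h>

    struct ds_hints {
        char access_ptrn[32];    /* e.g., "(BLOCK,BLOCK)"            */
        char storage_ptrn[32];   /* e.g., "(BLOCK,*)"                */
        int  nproc;              /* p, from the asd directive        */
        char iotype[4];          /* "RO", "WO", or "RW"              */
        char seq[4];             /* "SQ" (sequential) or "ST"        */
        char reqsize[4];         /* "SR", "LR", or "VR"              */
    };

    /* collective I/O: access ptrn != storage ptrn  AND  p != 1 */
    int suggest_collective_io(const struct ds_hints *h)
    {
        return strcmp(h->access_ptrn, h->storage_ptrn) != 0 && h->nproc != 1;
    }

    /* sequential prefetching: SQ AND RO AND SR */
    int suggest_seq_prefetch(const struct ds_hints *h)
    {
        return strcmp(h->seq, "SQ") == 0 && strcmp(h->iotype, "RO") == 0 &&
               strcmp(h->reqsize, "SR") == 0;
    }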


3.3 Tape-Specific I/O Optimizations

Another important constituent of our I/O optimization framework is a run-time library that can be used to facilitate the explicit control of data flow for tape-resident data. Our library is activated automatically when the user invokes an I/O call that involves tape-resident data sets (see the next section). Alternatively, this run-time library can be used directly by application programmers and optimizing compilers that manipulate large-scale tape-resident data. The objective is to allow programmers to access data located on tape via a convenient interface expressed in terms of arrays and array portions (regions) rather than files and offsets. The library implements a data storage model on tapes that enables our architecture to access portions of multi-dimensional data in a fast and simple way. In order to eliminate most of the latency in accessing tape-resident data, we employ a sub-filing strategy [23] in which a large multi-dimensional tape-resident global array is stored not as a single file but as a number of smaller sub-files, whose existence is transparent to the programmer. The main advantage of doing this is that data requests for relatively small portions of the global array can be satisfied without transferring the entire global array from the tape sub-system to the disk sub-system in the HSS, as is customary in many hierarchical storage management systems. In addition to read/write access routines, the library also supports pre-staging and migration capabilities, which can prove very useful in environments where the data access patterns are predictable and the amount of disk space is limited.

Within the library, each global tape-resident array is divided into chunks, each of which is stored in a separate sub-file on tape. The chunks are of equal sizes in most cases. A typical access pattern might access a small portion of a very large tape-resident data set. On receiving such a request, the library performs three important tasks:

- determining the sub-files that collectively contain the requested portion,
- transferring the sub-files that are not already on disk from tape to disk, and
- extracting the required data items (array elements) from the relevant sub-files on disk and copying the requested portion to a buffer in memory provided by the user call.

In the first step, the set of sub-files that collectively contain the requested portion is called the cover [23]. Assuming for now that all of the sub-files that make up the cover are currently residing on tape, in the second step the library brings these sub-files to disk. In the third step, the required portion is extracted from each sub-file and returned to the user buffer. Note that the last step involves some computational overhead incurred for each sub-file. Had we instead used just one file per global array, this computational overhead would be incurred only once. Therefore, the performance gain obtained by dividing the global array into sub-files should be carefully weighed against the extra computational overhead incurred in extracting the requested portions from each sub-file. Our experiments show that this computational overhead is modest.
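To make the first step concrete, the fragment below sketches, under assumed names, how the cover of a requested two-dimensional region could be enumerated when the global array is chunked into fixed-size tiles; the bookkeeping that the actual library [23] performs for sub-files already resident on disk is omitted here.

/* Sketch: enumerate the sub-files (chunks) whose union covers the requested
 * region [r0..r1] x [c0..c1] of a 2-D global array chunked into tiles of
 * chunk_rows x chunk_cols elements. The row-major chunk numbering and the
 * function name are illustrative assumptions. */
#include <stdio.h>

void cover(long r0, long r1, long c0, long c1,
           long chunk_rows, long chunk_cols, long chunks_per_row)
{
    long first_cr = r0 / chunk_rows, last_cr = r1 / chunk_rows;
    long first_cc = c0 / chunk_cols, last_cc = c1 / chunk_cols;

    for (long cr = first_cr; cr <= last_cr; cr++) {
        for (long cc = first_cc; cc <= last_cc; cc++) {
            long subfile_id = cr * chunks_per_row + cc;   /* row-major chunk id */
            /* In the real library this is where the sub-file would be staged
             * from tape to disk (if absent) and the overlapping portion copied
             * into the user buffer. */
            printf("need sub-file %ld\n", subfile_id);
        }
    }
}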

4 User Interface

One of the main problems with current parallel file systems and parallel I/O libraries is the excessive number of functions (calls) presented to the user. It then becomes the task of the user to choose the I/O functions that express her access patterns as closely as possible. For example, the latest MPI-IO standard [9] contains over 30 read/write calls alone, which renders the job of selecting the right ones a daunting task. We believe that a majority of these calls can be eliminated if the user is allowed to express access patterns using directives at a higher level.


4.1 Supported I/O Functions

In our initial implementation (which targets only scientific codes that use large multi-dimensional arrays) we support the functions shown in Table 4. Notice that these commands are different from user directives and are the only commands that can be sent to the HSS directly from the application code. Queries about data sets and storage devices are performed by negotiating with the MDMS through directives. Notice, however, that the use of directives is optional. Contrast this with current file system and I/O library interfaces, which demand that each and every parameter in the parameter list of a command be supplied by the user. In Table 4, name can be a data set name or the name of an abstract data set space, in which case all the associated data sets are opened. The data in memory is specified by buffer, which can be either a pointer or a multi-dimensional memory region (e.g., an array). It is assumed that each involved processor will have enough space in its respective buffer area to hold its portion of the data accessed. The portion parameter, on the other hand, denotes the region of the data set to be accessed in the global name space; the '*' symbol is used to denote the whole data set.
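A plain-C rendering of the four calls in Table 4 might look like the sketch below; since the paper specifies only the parameter lists, the region type and its fields are our own illustrative assumptions.

/* Sketch of possible C prototypes for the interface of Table 4.
 * The region_t type (global start/end coordinates per dimension, with a
 * convention for '*', the whole data set) is assumed for illustration. */
#define MAX_DIMS 8

typedef struct {
    int  ndims;
    long start[MAX_DIMS];    /* global start coordinates               */
    long end[MAX_DIMS];      /* global end coordinates                 */
    int  whole;              /* nonzero means '*' (the whole data set) */
} region_t;

typedef struct opt_info opt_info_t;   /* filled in by the MDMS (see the sketch below) */

int open_dataset (const char *name, opt_info_t *opt);
int close_dataset(const char *name, opt_info_t *opt);
int read_dataset (const char *name, void *buffer, const region_t *portion,
                  opt_info_t *opt);
int write_dataset(const char *name, const void *buffer, const region_t *portion,
                  opt_info_t *opt);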

The opt parameter is the optimization pointer that is set mainly by the MDMS depending on the directives collected so far from the application. It points to a structure that contains sufficient information to carry out an I/O optimization. Currently we are in the process of implementing these high-level functions on top of MPI-IO [9] and SRB (Storage Resource Broker) [1] from the San Diego Supercomputer Center (SDSC). The Storage Resource Broker (SRB) is middle-ware that provides distributed clients with uniform access to diverse storage resources in a distributed heterogeneous computing environment. We are experimenting with MPI-IO to evaluate the optimizations involving mainly disk-resident data in parallel file systems, and with the SRB to evaluate the optimizations involving tape-resident data.
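The paper does not spell out the layout of the structure behind opt; a hedged guess at its contents, consistent with the optimizations discussed so far, is sketched below. All field names are illustrative assumptions.

/* Sketch: what the structure pointed to by 'opt' might carry from the MDMS
 * to the HSS. Field names are assumptions, not the actual implementation. */
struct opt_info {
    int collective_io;       /* 1 if collective I/O is advised              */
    int prefetch;            /* 1 if disk-to-memory prefetching is advised  */
    int prestage;            /* 1 if tape-to-disk pre-staging is advised    */
    int nprocs;              /* number of processors declared in directives */
    int access_pattern;      /* declared access pattern, encoded            */
    int storage_pattern;     /* current storage layout of the data set      */
};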

4.2 Examples for Use of Directives and I/O Calls

Consider the following example.

open_dataset(P, opt)
access P(*,BLOCK) on DISK(8)
read_dataset(P, bf, *, opt)

In this example, the application first opens the data set (here the array P). It also supplies an optimization pointer opt to be used later. Then it sends an access directive to the MDMS which declares that 8 processors will access the data set in a column-major fashion. The MDMS, in turn, compares the storage pattern (the default is (BLOCK,*)) with this access pattern and decides that collective I/O needs to be performed. It passes this advice to the HSS by filling out the relevant entries of the data structure pointed to by opt. When, later, the application issues the read_dataset command, a collective read operation might be performed (considering the contents of the structure pointed to by opt and the exact parameters of the read_dataset command). Now suppose that the directive was instead access P(BLOCK,*) on DISK(8). In that case, since the access pattern and the storage pattern are the same, the MDMS may advise prefetching to the HSS by setting the appropriate entries of the structure pointed to by opt. Notice that in either case the syntax of the actual read call does not change. The only difference is the contents of the data structure pointed to by opt. Considering the fact that a typical large-scale application will have only a few directives, the function performed by the application in question can be changed by modifying only a few program lines. This helps readability, reusability, and maintainability.

Let us now consider the following example fragment.

associate (P,Q) with T
access T(BLOCK,*) on TAPE(16)
open_dataset(P, opt)
read_dataset(P, bf1, *, opt)


Table 4: Supported I/O functions.

open_dataset(name, opt)
close_dataset(name, opt)
read_dataset(name, buffer, portion, opt)
write_dataset(name, buffer, portion, opt)

read_dataset(Q, bf2, *, opt)

In this case, the application first associates two tape-resident arrays with an abstract data set space T. Then the access directive indicates that 16 processors are going to access the respective portions of P and Q in a row-wise fashion. The application afterwards opens P, an activity which most probably forces the HSS to stage the data set P from the tape sub-system to the disk sub-system. This also triggers the MDMS to advise the HSS to pre-stage the data set Q as well from tape to disk, as this array is associated with P and most probably the two arrays will be used together. Assuming that the default layout is row-major, the MDMS also sets the necessary parameters in the structure pointed to by opt to enable prefetching; no collective I/O is required. Consequently, the following two calls to read_dataset take advantage of prefetching. Notice that our directives allow two levels of prefetching in this small example: first (through the use of associate) from tape to disk (this is called pre-staging), and then (through the use of access) from disk to memory (this is called prefetching). Also, note that (when necessary) the read_dataset call uses sub-filing to stage only the relevant parts of the data from tape to disk.

Finally, consider the following example in which the access pattern of a data set P changes during the course of execution. In this example, there are 32 processors involved.

open_dataset(P, opt)
access P(BLOCK,*) on DISK(32)
read_dataset(P, bf1, *, opt)
restorage P(*,BLOCK) on DISK(32)
access P(*,BLOCK)
read_dataset(P, bf2, *, opt)
write_dataset(P, bf3, *, opt)
write_dataset(P, bf4, *, opt)

The application first opens the data set P, and then discloses that its access pattern is row-major. Assuming again the row-major storage layout as the default, no collective I/O is required. But then the application issues a restorage directive, which strongly advises a storage layout change (on disks) from the default row-major layout to a column-major layout. The reason for this becomes clear when we look at the next access directive in the sequence, which says that the remaining read_dataset and write_dataset commands will access the data in column-major fashion. If the layout of the data is not changed from row-major to column-major, each of the remaining three I/O calls will have to use collective I/O; that involves interprocessor communication which (depending on the type of the network and distributed media) can be costly in this case, as 32 processors are involved. Therefore, the programmer decides that it might be better to store the data in the way that it will be accessed later. That is, by storing the data in column-major order, the last three calls can take advantage of prefetching, as the access pattern and the storage pattern are now identical. Notice, however, that the use of the restorage directive for a data set that has already been created (as in this example) implies a data re-distribution on disks and can be quite costly, so it should be used with discretion. Similar to the tables used to keep track of user directives (see Figure 4), we use database tables to hold the necessary information about the read/write calls.


Table 5: Total I/O times (in seconds) for the Astro-2D application (data set size is 256 MB).

             Paragon                 SP-2
             64 procs   128 procs    32 procs   64 procs
Original     64.95      87.02        23.46      39.67
Optimized    27.44      49.37        14.05      11.23

Table 6: Total I/O times (in seconds) for the Astro-3D application (data set size is 8 MB).

             Paragon                 SP-2
             64 procs   128 procs    32 procs   64 procs
Original     51.04      81.45        109.93     211.47
Optimized    36.41      68.02        3.33       3.51

5 Experiments

In this section we present some performance numbers from an ongoing implementation of our MDMS. The experiments were run on an IBM SP-2 and an Intel Paragon. Each node of the SP-2 is an RS/6000 Model 390 with 256 megabytes of memory and has an I/O sub-system containing four 9-gigabyte SSA disks attached to it. The nodes on the Paragon (Intel i860 XP), on the other hand, are divided into three groups: compute nodes, HIPPI nodes, and service nodes. The total memory capacity of the compute partition is around 14.4 gigabytes. The platform where the experiments were conducted has 3 service nodes, each with a RAID SCSI disk array attached to it.

We used four different applications: three of them are used to measure the benefits of collective I/O for disk-resident data sets in parallel file systems; the last one is used to see how pre-staging performs in HPSS [11] for tape-resident data. We also performed experiments with sub-filing using synthetic access patterns and present here the first results from these experiments.

Table 5 shows the total I/O times for a 2D astrophysics template (Astro-2D) on the Intel Paragon and the IBM SP-2. Here, 'Original' refers to the code without collective I/O, and 'Optimized' denotes the code with collective I/O. In all cases, the MDMS is run at Northwestern University. The important point here is that in both the 'Original' and the 'Optimized' versions the user code is essentially the same; the only difference is that in the 'Optimized' case the user code sends directives to the MDMS. The MDMS then automatically determines that collective I/O should be performed; this hint is then sent by the MDMS to the HSS. As a result, impressive reductions in I/O times are observed. Since the number of I/O nodes is fixed on both the Paragon and the SP-2, increasing the number of processors in general causes an increase in the I/O times. Tables 6 and 7 report similar results for a 3D astrophysics code (Astro-3D) and for an unstructured (irregular

Table 7: Total I/O times (in seconds) for the unstructured code (data set size is 64 MB).

             Paragon                 SP-2
             64 procs   128 procs    32 procs   64 procs
Original     76.30      142.73       547.61     488.13
Optimized    1.68       0.94         1.25       2.13


Table 8: Total I/O times (in seconds) for the volume rendering code on 4 processors (data set size is 64 MB).

File No        1        2        3        4
Original       31.18    19.20    61.86    40.22
Optimized      11.90    11.74    20.10    18.38

Table 9: Total I/O times (in seconds) for the volume rendering code on 8 processors (data set size is 64 MB).

File No        1        2        3        4
Original       18.79    37.69    21.02    14.70
Optimized      10.74    6.23     4.49     6.42

data access pattern) code, respectively. Since these codes have not yet been modified to work through the MDMS, the results reported here were obtained by hand. Nevertheless, they indicate the order of potential savings once collective I/O is used.

Our next example is a parallel volume rendering application. As in the previous experiments, the MDMS is run at Northwestern University. The application itself, on the other hand, is executed on Argonne National Lab's SP-2, and the HPSS at the San Diego Supercomputer Center (SDSC) is used as the HSS. In the 'Original' code, four data files are opened and parallel volume rendering is performed. In the 'Optimized' code ('Original' code + user directives), the four data sets (corresponding to the four data files) are associated with each other, and pre-staging (from tape to disk) is applied to these data sets. Tables 8 and 9 give the total read times for each of the four files for the 'Original' and 'Optimized' codes for the 4 and 8 processor cases, respectively. The results reveal that, for both the 4 and 8 processor cases, pre-staging reduces the I/O times significantly. We should mention that, in every application we experimented with, the time spent by the application in negotiating with the MDMS is less than 1 second. Considering the fact that for large-scale applications I/O times are likely to be huge (even when optimized), an overhead in this range is acceptable.

In order to measure the usefulness of sub-filing, we performed experiments using the HPSS at SDSC. We used the low-level routines of the SDSC Storage Resource Broker (SRB) to access the HPSS files. SRB is a client-server middle-ware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets [1].

We experimented with different access patterns in order to evaluate the benefits of the library that implements sub-filing. Table 10 gives the start and end coordinates (on a two-dimensional global array) as well as the number of elements read/written for each access pattern (A through H). Note that the coordinate (0,0) corresponds to the upper-left corner of the array. In each case, the accessed array is a large two-dimensional array of floating-point elements (10 GB of data in total). We used two different sub-file (chunk) sizes, referred to below as the small chunk and the big chunk.

Columns 5 through 10 of Table 10 show the performance results obtained. For each operation (read or write) we give the response time (in seconds) of a naive access strategy and the gains obtained against it by our library, which employs sub-filing. The naive strategy reads/writes the required portion from/to the array directly; i.e., it does not use sub-filing, and the entire array is stored as a single large file. For the sub-filing cases we show the percentage reduction in the response time of the naive scheme. For example, in access pattern A, sub-filing with the small chunk size substantially reduced the response time of the read operation. Figures 6 and 7 show the results obtained in graphical form. Note that the y-axes of the figures are logarithmically scaled.

Page 18: Data Management for Large-ScaleScientific Computations in High Performance ...thakur/papers/mahmut_sdm.pdf · 2002-01-27 · Data Management for Large-ScaleScientific Computations

Table 10: Access patterns, execution times, and percentage gains. For each access pattern (A through H), the table lists the start and end coordinates of the accessed region, the number of elements accessed, and, for both write and read operations, the response time (in seconds) without sub-filing together with the percentage gains obtained with the small-chunk and big-chunk sub-filed versions.

In the patterns A and D, where a 4 MB square chunk is accessed at the upper-left corner and around the middle, respectively, the small chunk size outperforms the large chunk size, as the latter accesses extra data elements that do not belong to the required portion. In the pattern H, on the other hand, increasing the chunk size reduces the number of I/O calls, which in turn results in the best response time. In B and G, 16 MB of data are accessed in orthogonal directions. In G, since we access a sub-column portion of a row-major array, the naive case must issue one I/O call per accessed row; in B, the naive strategy reaches the same volume of data with far fewer (but longer) I/O calls. Consequently, the impact of sub-filing is more pronounced in G. In E, the naive strategy outperforms sub-filing, as the entire portion can be accessed in a single I/O call. Note that the HPSS allows users to access portions of data residing on tape; the response time of the naive solution for the access pattern E would increase dramatically on tape architectures where the granularity of access is a file. By comparing the response times of A, B, C, and E, we note that the response times are dominated by the number of I/O calls (in the naive version) and of chunks (in the sub-filed versions, which also correspond to I/O calls) rather than by the volume of data accessed. Finally, in the pattern F (whose response time in the naive case was calculated using interpolation from A and G), the sub-filing strategy has the best performance of all and brings a substantial improvement in write calls.
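The dependence on the number of I/O calls rather than on data volume can be seen with a back-of-the-envelope count. The helper below (illustrative, not part of the library) returns how many contiguous requests a naive, single-file strategy needs for a rectangular region of a row-major array: one per accessed row, unless full rows are accessed.

/* Sketch: number of contiguous I/O requests a naive strategy needs to access
 * a 'rows x cols' rectangle of a row-major array with 'array_cols' columns. */
long naive_calls(long rows, long cols, long array_cols)
{
    if (cols == array_cols)      /* full rows: the whole region is contiguous */
        return 1;
    return rows;                 /* otherwise one request per accessed row */
}

For two regions of equal area, the region spanning more rows thus pays proportionally more calls, which is why sub-filing helps G more than B.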

Overall, the sub-filing strategy performs very well compared to the naive strategy, which performs individual accesses to a large file, except for the cases where the access pattern and the storage pattern of the array match exactly. Even in that case, a suitable chunk shape can be chosen to match the response time of the naive strategy. However, automatic selection of optimal chunk sizes (considering the future access patterns) is beyond the scope of this work.

6 Related Work

There are many proposed techniques for optimizing I/O accesses. These techniques can be divided into three main groups: parallel file system and run-time system optimizations [21, 6, 9, 17, 19, 14], compiler optimizations [3, 18, 15], and application analysis and optimization [18, 5, 24, 15].

The closest work to ours is the one done by Brown et al. [4]. They propose an architecture similar to ours; however, they do not handle the advanced I/O optimizations proposed in this paper. They build their meta-data system on top of HPSS using DB2 from IBM. In our case, the HSS and the MDMS are loosely coupled, allowing us to experiment with different hierarchical storage systems.

Baru et al. [1] investigate the use of high-level unified interfaces to data stored on file systems and DBMSs. Their system maintains meta-data for data sets, resources, users, and methods (access functions) and provides the ability to create, update, store, and query this meta-data. While the type of meta-data maintained


Figure 6: Execution times for write operations (response times in seconds, on a logarithmic scale, for access patterns A through H: small chunk, big chunk, and without sub-filing).

Figure 7: Execution times for read operations (response times in seconds, on a logarithmic scale, for access patterns A through H: small chunk, big chunk, and without sub-filing).


by them is an extension of the meta-data maintained by a typical operating system, our meta-data also includes performance-related meta-data, which enables automatic high-level I/O optimizations as explained in this paper.

There are several works on I/O characterization of large applications. Three I/O-intensive applications from the Scalable I/O Initiative Application Suite are studied in [12]. An I/O-intensive three-dimensional parallel application code is used to evaluate the I/O performance of the IBM SP and Intel Paragon in [28]. They found the IBM SP to be faster for read operations and the Paragon for writes. del Rosario and Choudhary [15] discuss several grand-challenge problems and the demands that they place on the I/O performance of parallel machines. The characterization information can be very useful in our framework in selecting suitable user-level directives to implement in order to capture access patterns in a better manner. Finally, some amount of work has been done in the area of out-of-core compilation [3], [2].

7 Conclusions

In this paper we present a program development environment based on maintaining performance-related system-level meta-data. This environment consists of a user code, a meta-data management system (MDMS), and a hierarchical storage system (HSS), and provides a seamless data management and manipulation facility for use by large-scale scientific applications. It combines the advantages of file systems and DBMSs without incurring their respective disadvantages, and provides location transparency (through the use of data set names rather than file names or URLs), resource transparency (through the use of ASDs), and access function transparency (through the automatic invocation of high-level I/O optimizations like collective I/O and data sieving).

Also, by storing meta-data and providing means to manipulate it, our framework is able to manage distributed resources in a heterogeneous environment. Preliminary results obtained using several applications are encouraging and motivate us to complete our implementation and conduct extensive experiments using large-scale data-intensive applications. Currently, we have only a single API (which can be used from within the application code); we are also working on GUI and UNIX-style command-line interfaces.

Acknowledgments

This research was supported by the Department of Energy under the Accelerated Strategic Computing Initiative (ASCI) Academic Strategic Alliance Program (ASAP) Level 2, under subcontract No. W-7405-ENG-48 from Lawrence Livermore National Laboratories. We would like to thank Reagan Moore for discussions and help with SDSC resources. We thank Mike Wan and Mike Gleicher for helping us with the implementation of the volume rendering code and in understanding the SRB and the HPSS. We thank Larry Schoof and Wilbur Johnson for providing the unstructured code used in this paper. We thank Celeste Matarazzo, John Ambrosiano, and Steve Louis for discussions and their input. Finally, we gratefully acknowledge NPACI for allowing us to use their machines at the SDSC.

References

[1] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In Proc. CASCON'98 Conference, Toronto, Canada, Dec 1998.

[2] R. Bordawekar, J. M. del Rosario, and A. Choudhary. Design and implementation of primitives for parallel I/O. In Proceedings of Supercomputing'93, November 1993.


[3] R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny. A model and compilation strategy for out-of-core data parallel programs. In Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, pages 1-10, July 1995.

[4] P. Brown, R. Troy, D. Fisher, S. Louis, J. R. McGraw, and R. Musick. Meta-data sharing for balanced performance. In Proc. the First IEEE Meta-data Conference, Silver Spring, Maryland, 1996.

[5] P. Cao, E. Felten, and K. Li. Application-controlled file caching policies. In Proc. the 1994 Summer USENIX Technical Conference, pages 171-182, June 1994.

[6] A. Choudhary, R. Bordawekar, M. Harry, R. Krishnaiyer, R. Ponnusamy, T. Singh, and R. Thakur. PASSION: parallel and scalable software for input-output. NPAC Technical Report SCCS-636, Sept 1994.

[7] A. Choudhary and M. Kandemir. System-level meta-data for high performance data management. To appear in Proc. the Third IEEE Meta-Data Conference, Bethesda, Maryland, April 6-7, 1999.

[8] P. F. Corbett, D. G. Feitelson, J-P. Prost, and S. J. Baylor. Parallel access to files in the Vesta file system. In Proc. Supercomputing'93, pp. 472-481, Nov 1993.

[9] P. Corbett, D. Fietelson, S. Fineberg, Y. Hsu, B. Nitzberg, J. Prost, M. Snir, B. Traversat, and P. Wong. Overview of the MPI-IO parallel I/O interface. In Proc. Third Workshop on I/O in Parallel and Distributed Systems, IPPS'95, Santa Barbara, CA, April 1995.

[10] T. H. Cormen and D. M. Nicol. Out-of-core FFTs with parallel disks. ACM SIGMETRICS Performance Evaluation Review, 25:3, December 1997, pp. 3-12.

[11] R. A. Coyne, H. Hulen, and R. Watson. The high performance storage system. In Proc. Supercomputing'93, Portland, OR, November 1993.

[12] P. E. Crandall, R. A. Aydt, A. A. Chien, and D. A. Reed. Input/output characteristics of scalable parallel applications. In Proceedings of Supercomputing'95.

[13] J. R. Davis. Datalinks: Managing external data with DB2 universal database. IBM Corporation White Paper, August 1997.

[14] J. del Rosario, R. Bordawekar, and A. Choudhary. Improved parallel I/O via a two-phase run-time access strategy. In Proc. the 1993 IPPS Workshop on Input/Output in Parallel Computer Systems, April 1993.

[15] J. del Rosario and A. Choudhary. High performance I/O for parallel computers: problems and prospects. IEEE Computer, March 1994.

[16] M. Drewry, H. Conover, S. McCoy, and S. Graves. Meta-data: quality vs. quantity. In Proc. the Second IEEE Meta-data Conference, Silver Spring, Maryland, 1997.

[17] C. S. Ellis and D. Kotz. Prefetching in file systems for MIMD multiprocessors. In Proc. the 1989 International Conference on Parallel Processing, pages I:306-314, St. Charles, IL, August 1989. Pennsylvania State Univ. Press.

[18] M. Kandaswamy, M. Kandemir, A. Choudhary, and D. Bernholdt. Performance implications of architectural and software techniques on I/O-intensive applications. In Proc. the International Conference on Parallel Processing (ICPP'98), Minneapolis, MN, Aug 1998.

[19] J. F. Karpovich, A. S. Grimshaw, and J. C. French. Extensible file systems (ELFS): An object-oriented approach to high performance file I/O. In Proc. the Ninth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 191-204, Oct 1994.

[20] D. Kotz. Disk-directed I/O for MIMD multiprocessors. In Proc. the 1994 Symposium on Operating Systems Design and Implementation, pages 61-74. USENIX Association, Nov 1994.

[21] D. Kotz. Multiprocessor file system interfaces. In Proc. the Second International Conference on Parallel and Distributed Information Systems, pages 194-201. IEEE Computer Society Press, 1993.

[22] T. Madhyastha and D. Reed. Intelligent, adaptive file system policy selection. In Proc. Frontiers of Massively Parallel Computing, Annapolis, MD, pp. 172-179, Oct 1996.


[23] G. Memik, M. Kandemir, and A. Choudhary. A run-time library for tape resident data. Technical Report CPDC-TR-9909-014, Center for Parallel and Distributed Computing, Northwestern University, September 1999.

[24] R. H. Patterson, G. A. Gibson, and M. Satyanarayanan. A status report on research in transparent informed prefetching. ACM Operating Systems Review, 27(2), pp. 21-34, April 1993.

[25] B. Rullman. Paragon parallel file system. External Product Specification, Intel Supercomputer Systems Division.

[26] K. E. Seamons and M. Winslett. Multidimensional array I/O in Panda 1.0. Journal of Supercomputing, 10(2):191-211, 1996.

[27] M. Stonebraker. Object-Relational DBMSs: Tracking the Next Great Wave. Morgan Kaufman Publishers, ISBN: 1558604529, 1998.

[28] R. Thakur, W. Gropp, and E. Lusk. An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application. In Proc. of the 3rd Int'l Conf. of the Austrian Center for Parallel Computation (ACPC) with special emphasis on Parallel Databases and Parallel I/O, September 1996. Lecture Notes in Computer Science 1127, Springer-Verlag, pp. 24-35.

[29] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. To appear in Proc. the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999.

[30] S. Toledo and F. G. Gustavson. The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations. In Proc. Fourth Annual Workshop on I/O in Parallel and Distributed Systems, May 1996.