CAPS team
Compilation et Architecture pour les Processeurs Superscalaires et Spécialisés
Compiler and Architecture for superscalar and embedded processors
CAPS project
CAPS members
- 2 INRIA researchers: A. Seznec, P. Michaud
- 2 professors: F. Bodin, J. Lenfant
- 11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec
- 3 engineers: S. Bihan, P. Villalon, J. Simonnet
CAPS themes
Two interacting activities:
- High performance microprocessor architecture
- Performance oriented compilation
CAPS Grail
- Performance at the best cost
- Progress in computer science and applications is driven by performance
CAPS path to the Grail
- Defining the tradeoffs between:
  - what should be done through hardware
  - what can be done by the compiler
- for maximum performance
- or for minimum cost
- or for minimum size, power, ...
Need for high-performance processors
- Current applications
  - general purpose: scientific, multimedia, databases
  - embedded systems: cell phones, automotive, set-top boxes, ...
- Future applications
  - don't worry: users have a lot of imagination!
- New software engineering techniques are CPU hungry:
  - reusability, generality
  - portability, extensibility (indirections, virtual machines)
  - safety (run-time verifications)
  - encryption/decryption
CAPS (ancient) background
- Ancient background in hardware and software management of ILP
  - decoupled pipeline architectures
  - OPAC, a hardware matrix floating-point coprocessor
  - software pipelining for LIW
- Supercomputing background
  - interleaved memories
  - Fortran-S
CAPS background in architecture
- Solid knowledge in microprocessor architecture
  - technological watch on microprocessors
  - A. Seznec worked with the Alpha Development Group in 1999-2000
- Research in cache architecture
- Research in branch prediction mechanisms
CAPS background in compilers
- Software optimizations for cache memories
  - Numerical algorithms on dense structures
  - Optimizing data layout
- Many prototype environments for parallel compilers:
  - CT++ (with CEA): image processing C++ library for a SIMD architecture
  - Menhir: a parallel compiler for MatLab
  - IPF (with Thomson-LER): Fortran compiler for image processing on MasPar
  - Sage (with Indiana): infrastructure for source level transformation
We build on
- SALTO: System for Assembly-Language Transformations and Optimizations
  - retargetable assembly source-to-source preprocessor
  - Erven Rohou's Ph.D.
- TSF: scripting language for program transformation on top of ForeSys (Simulog)
  - Yann Mevel's Ph.D.
Salto overview
- Assembly source-to-source preprocessor
- Fine grain machine description
- Independent from compilers
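To make "assembly source to source" concrete, here is a minimal sketch of such a pass in Python. It is not SALTO's API: the `mov rD, rS` syntax, the `#` comment marker, and the name `remove_self_moves` are assumptions chosen for illustration. The pass reads assembly text, drops moves of a register onto itself, and emits assembly text again, so it can sit between any compiler and any assembler.

```python
import re

# Hypothetical syntax: "mov rD, rS" copies register rS into rD.
_MOV = re.compile(r"^\s*mov\s+(\w+)\s*,\s*(\w+)\s*(?:#.*)?$")

def remove_self_moves(asm_text: str) -> str:
    """Return the assembly text without moves of a register onto itself."""
    kept = []
    for line in asm_text.splitlines():
        match = _MOV.match(line)
        if match and match.group(1) == match.group(2):
            continue  # "mov r1, r1" has no effect, so drop it
        kept.append(line)  # labels and all other instructions pass through
    return "\n".join(kept)

example = """loop:
    mov r1, r1
    add r2, r2, r1
    mov r3, r2
    bne r2, r0, loop"""
print(remove_self_moves(example))
```

A real tool of this kind would rely on a machine description rather than a hard-coded pattern, which is what makes it retargetable.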
Compiler activities
- Code optimizations for embedded applications
  - infrastructures rather than compilers
  - optimizing compiler strategies rather than new code optimizations
- Global constraints
  - performance / code size / low power (starting)
- Focus on interactive tools rather than automatic code tuning
  - case based reasoning
  - assembly code optimizations
Computer aided hand tuning
- Automatic optimization has many shortcomings
  - rather provide the user with a testbed to hand-tune applications
- Target applications
  - Fortran codes and embedded C applications
- Our approach
  - case based reasoning
  - static code analysis and pattern matching
  - profiling
  - learning techniques
  - the user is ultimately responsible
CAHT
- Prototype built on
  - Foresys: Fortran interactive front-end (from Simulog)
  - TSF: scripting language for program transformation
  - Sage++: infrastructure for source level transformation
Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)
- ATLLAS objectives:
  - Has the compiler done a good job?
  - Try to match source and optimized assembly at fine grain
- Development/analysis environment:
  - Models for both source and assembly
  - Global and local analysis (WCET, ...) at both levels
  - Interactive environment for code visualization and manual/automatic analysis and optimization
- Built using Salto and Sage++:
  - Retargetable with compilers and architectures
ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: Tuning method
[Flowchart; labels: Source Code, compilation, Assembly Code, profiling, Atllas, Good?, Yes, Half-Automatic or Manual Source Optimisations, Half-Automatic or Manual Assembly Optimisations, Post-Processing, Processing, Support]
Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)
ALISE
- enhanced SALTO for code optimization:
  - better integration with code generation
  - interface with front-end
  - interface for profiling data
- targets global optimization
- based on component software optimization engines
- Answer to a real need from industry: a retargetable infrastructure
ALISE
- Environment for:
  - global assembly code optimization
  - providing optimization alternatives
- Support for new embedded processor ISAs with ILP support (VLIW, EPIC)
  - Predicated instructions
  - Functional unit clusters, ...
ALISE
[Architecture diagram; labels: Architecture Description, D to M, Architecture Model, Text Input, P to IR, Intermediate representation, Opt 1, Opt 2, Opt n, IR to Ass (Emit), Optimized Program, High Level API, Interfaces, User interface, G.U.I., Intermediate Code, External Infrastructure]
Preprocessor for media processors (MEDEA+ Mesa project)
- Multimedia instructions on embedded and general-purpose processors, but:
  - no consensus on multimedia (MMD) instructions among manufacturers: saturated arithmetic or not, different instructions, ...
- Multimedia instructions are not well handled by compilers:
  - but performance depends heavily on them
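The "saturated arithmetic or not" distinction can be illustrated on unsigned 8-bit values. This is a sketch in Python, assuming 8-bit unsigned lanes; the function names are illustrative and do not correspond to any particular instruction set.

```python
def add_wrap_u8(a: int, b: int) -> int:
    """Unsigned 8-bit addition with wrap-around (modulo 256)."""
    return (a + b) & 0xFF

def add_sat_u8(a: int, b: int) -> int:
    """Unsigned 8-bit addition that clamps the result to 255."""
    return min(a + b, 0xFF)

# Brightening two pixels by 100: 200 + 100 does not fit in 8 bits.
pixels = [50, 200]
wrapped = [add_wrap_u8(p, 100) for p in pixels]
saturated = [add_sat_u8(p, 100) for p in pixels]
print(wrapped, saturated)  # [150, 44] [150, 255]
```

With wrap-around the bright pixel turns dark, which is why media code wants saturation. In C source the clamp appears only as a compare-and-select idiom, so a tool has to recognize that idiom before it can map it to a saturating instruction.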
Preprocessor for media processors: our approach
- C source-to-source preprocessor
- user oriented idiom recognition:
  - easy to retarget
  - target dedicated recognition
- exploiting loop parallelism
  - vectorization techniques
  - multiprocessor systems
- available soon
- Collaboration with STMicroelectronics
Iterative compilation
- Embedded systems:
  - Compile time is not critical
  - Performance/code size/power are critical
  - One can often rely on profiling
- Classical compiler: local optimizations
  - but constraints are GLOBAL
- Proof of concept for code size (Rohou's Ph.D.)
  - new Ph.D. beginning in September 2000
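The idea of iterative compilation under a global constraint can be sketched as a search loop: try each configuration, measure it, and keep the best one that fits the budget. The Python sketch below is an illustration only. The two tuning knobs, the function names, and above all the cost formulas in `evaluate` are synthetic placeholders standing in for "compile, run, profile"; they are not measurements and not the method of the thesis cited above.

```python
from itertools import product

# Hypothetical tuning knobs; a real system would compile and profile instead.
UNROLL_FACTORS = [1, 2, 4, 8]
INLINE_CHOICES = [False, True]

def evaluate(unroll, inline):
    """Stand-in for 'compile, run, measure': returns (cycles, code_size).

    The formulas are synthetic placeholders, not measurements.
    """
    cycles = 1000 // unroll + (0 if inline else 200)
    code_size = 400 * unroll + (300 if inline else 0)
    return cycles, code_size

def iterative_search(size_budget):
    """Try every configuration, keep the fastest one that fits the budget."""
    best = None
    for unroll, inline in product(UNROLL_FACTORS, INLINE_CHOICES):
        cycles, size = evaluate(unroll, inline)
        if size > size_budget:
            continue  # violates the global code-size constraint
        if best is None or cycles < best[0]:
            best = (cycles, size, unroll, inline)
    return best  # None when no configuration fits

print(iterative_search(size_budget=2000))  # (250, 1900, 4, True)
```

Exhaustive enumeration is affordable here only because compile time is not critical and the toy search space is tiny; larger spaces would need a guided search.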
High performance instruction set simulation
- Embedded processors:
  - parallel development of silicon, ISA, compiler and applications
- Need for flexible instruction set simulation:
  - high performance
  - simulation of large codes
  - debugging
  - retargetable to experiment with new ISAs and various microarchitecture options
- First results: up to 50x faster than ad-hoc simulator
ABSCISS: Assembly Based System for Compiled Instruction Set Simulation
[Tool-flow diagram; labels: C Source, tmcc, TriMedia Assembly, tmas, TriMedia Binary, tmsim, Architecture Description, ABSCISS, C/C++ Source, gcc, Compiled simulator]
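The difference between interpretive and compiled instruction set simulation can be sketched on a toy example. The Python code below is an illustration, not ABSCISS: the two-instruction ISA and all names are assumptions, and host code is generated as Python source rather than as C/C++ compiled with gcc. The interpreter decodes each instruction every time it executes; the compiled simulator pays the decoding cost once, at translation time.

```python
# Toy target ISA (hypothetical): ("addi", rd, rs, imm) and ("add", rd, rs1, rs2).
PROGRAM = [
    ("addi", 1, 0, 5),   # r1 = r0 + 5
    ("addi", 2, 0, 7),   # r2 = r0 + 7
    ("add", 3, 1, 2),    # r3 = r1 + r2
]

def interpret(program, regs):
    """Interpretive simulation: decode every instruction each time it runs."""
    for op, rd, a, b in program:
        if op == "addi":
            regs[rd] = regs[a] + b
        elif op == "add":
            regs[rd] = regs[a] + regs[b]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs

def compile_simulator(program):
    """Compiled simulation: translate the program once into host code."""
    lines = ["def simulate(regs):"]
    for op, rd, a, b in program:
        if op == "addi":
            lines.append(f"    regs[{rd}] = regs[{a}] + {b}")
        elif op == "add":
            lines.append(f"    regs[{rd}] = regs[{a}] + regs[{b}]")
        else:
            raise ValueError(f"unknown opcode {op}")
    lines.append("    return regs")
    namespace = {}
    exec("\n".join(lines), namespace)  # decoding cost is paid here, once
    return namespace["simulate"]

simulate = compile_simulator(PROGRAM)
print(interpret(PROGRAM, [0] * 4), simulate([0] * 4))
```

Both paths compute the same register state; the compiled version simply has no opcode dispatch left in its inner path.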
Enabling superscalar processor simulation
- Complete O-O-O microprocessor simulation:
  - 10000-100000x slower than real hardware
  - cannot simulate realistic applications, only slices
  - even fast mode emulation is slow (50-100x): simulation is generally limited to slices at the beginning of the application
  - representativeness?
- Calvin2 + DICE:
  - combines direct execution with simulation
  - really fast mode: 1-2x slowdown
  - enables simulating slices distributed over the whole application
Calvin2 + DICE
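A back-of-the-envelope calculation shows why a fast mode close to native speed makes distributed slices affordable. The sketch below is not part of Calvin2 or DICE. It takes the slowdown figures quoted for detailed simulation (10000x) and for the fast mode (2x); the run length, the slice count, and the linear weighting that ignores mode-switching and warm-up costs are all assumptions.

```python
def overall_slowdown(detail_fraction, detail_slowdown, fast_slowdown):
    """Weighted slowdown when only a fraction of instructions is simulated in detail."""
    return detail_fraction * detail_slowdown + (1 - detail_fraction) * fast_slowdown

def slice_starts(total_instructions, num_slices, slice_length):
    """Start positions of equally sized slices spread evenly over the run."""
    stride = total_instructions // num_slices
    if slice_length > stride:
        raise ValueError("slices would overlap")
    return [i * stride for i in range(num_slices)]

# Assumed workload: 10 slices of 1M instructions in a 10-billion-instruction run.
total, n, length = 10_000_000_000, 10, 1_000_000
fraction = n * length / total                  # 0.001 of the run is detailed
print(slice_starts(total, n, length)[:3])      # [0, 1000000000, 2000000000]
print(overall_slowdown(fraction, 10_000, 2))   # about 12x overall
```

Under the same assumptions, a 50x fast mode would give roughly 60x overall, because reaching the slices dominates the cost.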
Moving tools to IA64
- New 64-bit ISA from Intel/HP:
  - Explicitly Parallel Instruction Computing
  - Predicated execution
  - Advanced loads (i.e. speculative)
- A very interesting platform for research!
- Porting SALTO and the Calvin2+DICE approach to IA64
- Exploring new trade-offs enabled by instruction sets:
  - predicting the predicates?
  - advanced loads against predicting dependencies
  - ultimate out-of-order execution against compiler
Low power, compilation, architecture, ... (just beginning :=)
- Power consumption becomes a major issue:
  - Embedded and general purpose
- Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
  - Is it different from performance optimization?
  - Global constraint optimization
  - Instruction Set Architecture support?
- Architecture:
  - High order bits are generally null:
    - registers and memory
    - ALUs
Caches and branch predictors
- International CAPS visibility in architecture =
  - skewed associative cache
  - + decoupled sectored cache
  - + multiple block ahead branch prediction
  - + skewed branch predictor
- Continue recurrent work on these topics:
  - multiple block ahead + complexity/accuracy tradeoffs
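The principle of a skewed associative cache is that each way is indexed by a different function of the address, so blocks that collide in one way usually do not collide in the other. The Python sketch below illustrates this with two ways. The cache size and the XOR-based hash in `index_way1` are assumptions picked for simplicity, not the index functions of any published design.

```python
INDEX_BITS = 4                      # 16 lines per way (assumed size)
MASK = (1 << INDEX_BITS) - 1

def index_way0(block_addr: int) -> int:
    """Conventional indexing: low-order bits of the block address."""
    return block_addr & MASK

def index_way1(block_addr: int) -> int:
    """Skewed indexing: XOR the low bits with the next group of bits."""
    return (block_addr ^ (block_addr >> INDEX_BITS)) & MASK

# Three blocks that collide in way 0 (same low 4 bits).
blocks = [0x013, 0x023, 0x033]
print([index_way0(b) for b in blocks])  # [3, 3, 3]
print([index_way1(b) for b in blocks])  # [2, 1, 0]
```

In a conventional 2-way set-associative cache these three blocks would compete for the two lines of one set; with skewing they map to distinct lines in the second way and can all be resident at once.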
Simultaneous Multithreading
- Sharing functional units among several processes
- Among the first groups working on this topic
  - S. Hily's Ph.D.
- SMT behavior well understood for independent threads
  - now, focus on parallel threads from a single application
- Current research directions:
  - speculative multithreading
  - ultimate performance with a single thread through predicting threads
  - performance/complexity tradeoffs: SMT/CMP/hybrid
Enlarging the instruction window (supported by Intel)
- In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.
- Limitations are:
  - size of the window
  - number of physical registers
- Prescheduling: separate data flow scheduling from resource arbitration.
  - coarser units of work?
- Reducing the number of physical registers:
  - how to detect when a physical register is dead?
  - Per group validation?
- revisiting the CISC/RISC war?
Unwritten rule on superscalar processor designs
- For general purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit
4-cluster WSRS architecture (supported by Intel)
- Half the read ports, one fourth the write ports
- Register file:
  - Silicon area x 1/8
  - Power x 1/2
  - Access time x 0.6
- Gains on:
  - bypass network
  - selection logic
Multiprocessor on a chip
- Not just replicating board level solutions!
- A way to manage a large on-chip cache capacity:
  - how can a sequential application efficiently use a distributed cache?
  - architectural support for distributing a sequential application over several processors?
  - how should instructions and data be distributed?
HIPSOR
HIgh Performance SOftware Random number generation
- Need for unpredictable random number generation:
  - sequences that cannot be reproduced
- State of the art:
  - < 100 bit/s using the operating system
  - 75 Kbit/s using the hardware generator on Pentium III
- The internal state of a superscalar processor cannot be reproduced
  - use this state to generate unpredictable random numbers
HIPSOR (2)
- 1000s of unmonitorable states modified by OS interrupts
- Hardware clock counter to indirectly probe these states
- Combined with in-line pseudo-random number generation
- 100 Mbit/s unpredictable random numbers
- ARC INRIA with CODES
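The general idea of probing hard-to-reproduce machine state through a clock counter can be sketched as follows. This is not the HIPSOR algorithm and makes no claim about its throughput or security. It assumes Python's `time.perf_counter_ns` as a stand-in for a hardware cycle counter and uses SHA-256 as the mixing step; the function names and sample counts are illustrative.

```python
import hashlib
import time

def gather_timing_bytes(samples: int = 256) -> bytes:
    """Collect the low byte of successive timer deltas around a small workload."""
    out = bytearray()
    previous = time.perf_counter_ns()
    for i in range(samples):
        sum(range(50 + (i & 7)))             # small workload whose duration jitters
        now = time.perf_counter_ns()
        out.append((now - previous) & 0xFF)  # keep only the noisiest bits
        previous = now
    return bytes(out)

def unpredictable_bytes(n: int = 32) -> bytes:
    """Condense timing samples into at most 32 output bytes with SHA-256."""
    if not 0 < n <= 32:
        raise ValueError("n must be between 1 and 32")
    return hashlib.sha256(gather_timing_bytes()).digest()[:n]

print(unpredictable_bytes(16).hex())
```

The output differs from run to run because the timer deltas depend on caches, interrupts, and scheduling, which is the effect the slides describe. How much entropy such samples actually carry on a given machine is not established by this sketch.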