CAPS team
Compilation et Architecture pour les Processeurs Superscalaires et Spécialisés
Compiler and Architecture for superscalar and embedded processors

CAPS project

CAPS members
  2 INRIA researchers: A. Seznec, P. Michaud
  2 professors: F. Bodin, J. Lenfant
  11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec
  3 engineers: S. Bihan, P. Villalon, J. Simonnet

CAPS themes
  Two interacting activities
    High performance microprocessor architecture
    Performance oriented compilation

CAPS Grail
  Performance at the best cost
  Progress in computer science and its applications is driven by performance

CAPS path to the Grail
  Defining the tradeoffs between:
    what should be done through hardware
    what can be done by the compiler
  for maximum performance
  or for minimum cost
  or for minimum size, power, ...

Need for high-performance processors
  Current applications
    general purpose: scientific, multimedia, databases
    embedded systems: cell phones, automotive, set-top boxes, ...
  Future applications
    don't worry: users have a lot of imagination!
  New software engineering techniques are CPU hungry:
    reusability, generality
    portability, extensibility (indirections, virtual machines)
    safety (run-time verifications)
    encryption/decryption

CAPS (ancient) background
  Earlier work on hardware and software management of ILP
    decoupled pipeline architectures
    OPAC, a hardware matrix floating-point coprocessor
    software pipelining for LIW
  Supercomputing background
    interleaved memories
    Fortran-S

CAPS background in architecture
  Solid knowledge in microprocessor architecture
    technology watch on microprocessors
    A. Seznec worked with the Alpha Development Group in 1999-2000
  Research in cache architecture
  Research in branch prediction mechanisms

CAPS background in compilers
  Software optimizations for cache memories
    numerical algorithms on dense structures
    optimizing data layout
  Many prototype environments for parallel compilers:
    CT++ (with CEA): image processing C++ library for a SIMD architecture
    Menhir: a parallel compiler for MatLab
    IPF (with Thomson-LER): Fortran compiler for image processing on MasPar
    Sage (with Indiana): infrastructure for source-level transformation

We build on
  SALTO: System for Assembly-Language Transformations and Optimizations
    retargetable assembly source-to-source preprocessor
    Erven Rohou's Ph.D.
  TSF: scripting language for program transformation on top of ForeSys (Simulog)
    Yann Mevel's Ph.D.

SALTO overview
  Assembly source-to-source preprocessor
  Fine-grain machine description
  Independent from compilers

Compiler activities
  Code optimizations for embedded applications
    infrastructures rather than compilers
    optimizing compiler strategies rather than new code optimizations
  Global constraints
    performance / code size / low power (starting)
  Focus on interactive tools rather than automatic code tuning
    case-based reasoning
    assembly code optimizations

Computer aided hand tuning
  Automatic optimization has many shortcomings
    rather provide the user with a testbed to hand-tune applications
  Target applications
    Fortran codes and embedded C applications
  Our approach
    case-based reasoning
    static code analysis and pattern matching
    profiling
    learning techniques
    the user is ultimately responsible

CAHT
  Prototype built on
    Foresys: Fortran interactive front-end (from Simulog)
    TSF: scripting language for program transformation
    Sage++: infrastructure for source-level transformation

Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)
  ATLLAS objectives:
    Has the compiler done a good job?
    Try to match source and optimized assembly at fine grain
  Development/analysis environment:
    Models for both source and assembly
    Global and local analysis (WCET, ...) at both levels
    Interactive environment for code visualization and manual/automatic analysis and optimization
  Built using Salto and Sage++:
    Retargetable with compilers and architectures

ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: Tuning method
  [Flowchart; labels: Source Code, compilation, Assembly Code, profiling, Atllas, "Good?" (Yes), Half-Automatic or Manual Source Optimisations, Half-Automatic or Manual Assembly Optimisations, Post-Processing, Processing, Support]

ALISE: Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)
  Enhanced SALTO for code optimization:
    better integration with code generation
    interface with front-end
    interface for profiling data
  Targets global optimization
  Based on component software optimization engines
  Answer to a real need from industry: a retargetable infrastructure

ALISE
  Environment for:
    global assembly code optimization
    providing optimization alternatives
  Support for new embedded processor ISAs with ILP support (VLIW, EPIC)
    predicated instructions
    functional unit clusters, ...

ALISE
  [Block diagram; labels: Architecture Description, D to M, Architecture Model, Text Input, P to IR, Intermediate representation, Opt 1, Opt 2, Opt n, IR to Ass (Emit), Optimized Program, High Level API, Interfaces, User interface, G.U.I., Intermediate Code, External Infrastructure]

Preprocessor for media processors (MEDEA+ Mesa project)
  Multimedia instructions are available on embedded and general-purpose processors, but:
    no consensus on MMD instructions among manufacturers:
      saturated arithmetic or not, different instructions, ...
    multimedia instructions are not well handled by compilers:
      yet performance depends heavily on them

Preprocessor for media processors: our approach
  C source-to-source preprocessor
  User-oriented idiom recognition:
    easy to retarget
    target-dedicated recognition
  Exploiting loop parallelism
    vectorization techniques
    multiprocessor systems
  Available soon
  Collaboration with STMicroelectronics

Iterative compilation
  Embedded systems:
    compile time is not critical
    performance / code size / power are critical
    one can often rely on profiling
  Classical compiler: local optimizations
    but constraints are GLOBAL
  Proof of concept for code size (Rohou's Ph.D.)
    new Ph.D. beginning in September 2000
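The idea of searching over compilation choices under a global constraint can be sketched as follows. This is a minimal illustration only, not the CAPS tool: the transformation names, their cycle and size factors, and the `evaluate` cost model (a stand-in for "compile, then profile") are all hypothetical.

```python
from itertools import product

# Hypothetical per-transformation effects (assumptions for illustration only):
# each entry maps a transformation name to (cycle factor, code-size factor).
TRANSFORMS = {
    "unroll4": (0.80, 1.60),
    "inline": (0.90, 1.25),
    "compact": (1.05, 0.70),
}

def evaluate(enabled, base_cycles=1000.0, base_size=400.0):
    """Stand-in for 'compile, then profile': returns (cycles, code size)."""
    cycles, size = base_cycles, base_size
    for name in enabled:
        cycle_factor, size_factor = TRANSFORMS[name]
        cycles *= cycle_factor
        size *= size_factor
    return cycles, size

def iterative_search(size_budget):
    """Try every on/off combination; keep the fastest one that fits the budget."""
    best = None
    names = sorted(TRANSFORMS)
    for mask in product([False, True], repeat=len(names)):
        enabled = tuple(n for n, on in zip(names, mask) if on)
        cycles, size = evaluate(enabled)
        if size <= size_budget and (best is None or cycles < best[1]):
            best = (enabled, cycles, size)
    return best  # None when no combination fits the budget
```

With this toy model, the fastest combination overall does not fit a tight code-size budget, so the search settles on a different mix. That is the sense in which the constraint is global rather than local.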

High performance instruction set simulation
  Embedded processors:
    parallel development of silicon, ISA, compiler and applications
  Need for flexible instruction set simulation:
    high performance
    simulation of large codes
    debugging
    retargetable, to experiment with:
      new ISAs
      various microarchitecture options
  First results: up to 50x faster than ad-hoc simulator

ABSCISS: Assembly Based System for Compiled Instruction Set Simulation

[Tool-flow diagram; labels: C Source, tmcc, TriMedia Assembly, tmas, TriMedia Binary, tmsim, ABSCISS, Architecture Description, C/C++ Source, gcc, Compiled simulator]
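The difference between interpretive and compiled simulation can be sketched on a toy example. This is an illustration under stated assumptions, not ABSCISS itself: the three-opcode ISA, the 32-bit register width, and the function names are hypothetical, and Python's `exec` stands in for the host compiler (gcc in the flow above).

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers

# Hypothetical three-address toy ISA, not TriMedia: (opcode, dest, src1, src2).
PROGRAM = [
    ("li", 1, 6, None),   # r1 = 6
    ("li", 2, 7, None),   # r2 = 7
    ("mul", 3, 1, 2),     # r3 = r1 * r2
    ("add", 3, 3, 1),     # r3 = r3 + r1
]

def interpret(program, nregs=8):
    """Interpretive simulation: decode every instruction each time it runs."""
    r = [0] * nregs
    for op, d, a, b in program:
        if op == "li":
            r[d] = a & MASK
        elif op == "add":
            r[d] = (r[a] + r[b]) & MASK
        elif op == "mul":
            r[d] = (r[a] * r[b]) & MASK
        else:
            raise ValueError(f"unknown opcode {op}")
    return r

def translate(program):
    """Compiled simulation: decode once and emit host source for the program."""
    lines = ["def simulate(nregs=8):", "    r = [0] * nregs"]
    for op, d, a, b in program:
        if op == "li":
            lines.append(f"    r[{d}] = {a} & {MASK}")
        elif op == "add":
            lines.append(f"    r[{d}] = (r[{a}] + r[{b}]) & {MASK}")
        elif op == "mul":
            lines.append(f"    r[{d}] = (r[{a}] * r[{b}]) & {MASK}")
        else:
            raise ValueError(f"unknown opcode {op}")
    lines.append("    return r")
    return "\n".join(lines)

def build_simulator(program):
    """Compile the generated source with the host compiler (here: exec)."""
    namespace = {}
    exec(translate(program), namespace)
    return namespace["simulate"]
```

Both paths produce the same register state. The compiled one pays the decoding cost once, at translation time, rather than on every simulated instruction.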

Enabling superscalar processor simulation
  Complete O-O-O microprocessor simulation:
    10,000-100,000x slower than real hardware
    cannot simulate realistic applications, only slices
  Even fast-mode emulation is slow (50-100x):
    simulation generally limited to slices at the beginning of the application
    representativeness?
  Calvin2 + DICE:
    combines direct execution with simulation
    really fast mode: 1-2x slowdown
    enables simulating slices distributed over the whole application

Calvin2 + DICE

Moving tools to IA64
  New 64-bit ISA from Intel/HP:
    Explicitly Parallel Instruction Computing
    Predicated execution
    Advanced loads (i.e. speculative)
  A very interesting platform for research!
  Porting SALTO and the Calvin2+DICE approach to IA64
  Exploring new trade-offs enabled by instruction sets:
    predicting the predicates?
    advanced loads against predicting dependencies
    ultimate out-of-order execution against compiler

Low power, compilation, architecture (just beginning :=)
  Power consumption becomes a major issue:
    embedded and general purpose
  Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
    Is it different from performance optimization?
    Global constraint optimization
    Instruction Set Architecture support?
  Architecture:
    High-order bits are generally null:
      registers and memory
      ALUs

Caches and branch predictors
  International CAPS visibility in architecture =
    skewed associative cache
    + decoupled sectored cache
    + multiple block ahead branch prediction
    + skewed branch predictor
  Continue recurrent work on these topics:
    multiple block ahead + tradeoffs complexity/accuracy
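For readers unfamiliar with the first item, the principle of a skewed-associative cache (each way is indexed by a different hash of the block address) can be sketched as below. This is a simplified illustration: the XOR-based index function and the alternating replacement policy are assumptions chosen for brevity, not the functions or policies used in the CAPS work.

```python
class SkewedCache:
    """Two-way skewed-associative cache holding block addresses (tags only)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.sets = 1 << index_bits
        self.ways = [[None] * self.sets, [None] * self.sets]
        self.next_victim = 0  # simplistic replacement: alternate between ways

    def index(self, way, block):
        """Way 0 uses the low address bits; way 1 XORs in the next bit field."""
        low = block & (self.sets - 1)
        if way == 0:
            return low
        high = (block >> self.index_bits) & (self.sets - 1)
        return low ^ high

    def access(self, block):
        """Return True on a hit; on a miss, install the block and return False."""
        for way in (0, 1):
            if self.ways[way][self.index(way, block)] == block:
                return True
        way = self.next_victim
        self.ways[way][self.index(way, block)] = block
        self.next_victim ^= 1
        return False
```

Blocks `0x00`, `0x10` and `0x20` all map to entry 0 in way 0, but to entries 0, 1 and 2 in way 1, so after a short warm-up the three blocks can reside in the cache together. In a conventional two-way set-associative cache with the same 16 sets they would all compete for the same two entries.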

Simultaneous Multithreading
  Sharing functional units among several processes
  Among the first groups working on this topic
    S. Hily's Ph.D.
    SMT behavior well understood for independent threads
    now, focus on parallel threads from a single application
  Current research directions:
    speculative multithreading
      ultimate performance with a single thread through predicting threads
    performance/complexity tradeoffs: SMT/CMP/hybrid

Enlarging the instruction window (supported by Intel)
  In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.
  Limitations are:
    size of the window
    number of physical registers
  Prescheduling: separate data flow scheduling from resource arbitration.
    coarser units of work?
  Reducing the number of physical registers:
    how to detect when a physical register is dead?
    per-group validation?
  Revisiting the CISC/RISC war?

Unwritten rule on superscalar processor designs
  For general purpose registers:
    Any physical register can be the source or the result of any instruction executed on any functional unit

4-cluster WSRS architecture (supported by Intel)
  Half the read ports, one fourth the write ports
  Register file:
    Silicon area x 1/8
    Power x 1/2
    Access time x 0.6
  Gains on:
    bypass network
    selection logic

Multiprocessor on a chip
  Not just replicating board-level solutions!
  A way to manage a large on-chip cache capacity:
    how can a sequential application efficiently use a distributed cache?
    architectural support for distributing a sequential application on several processors?
    how should instructions and data be distributed?

HIPSOR
  HIgh Performance SOftware Random number generation
  Need for unpredictable random number generation:
    sequences that cannot be reproduced
  State of the art:
    < 100 bit/s using the operating system
    75 Kbit/s using the hardware generator on Pentium III
  Internal state of a superscalar processor cannot be reproduced
    use this state to generate unpredictable random numbers

HIPSOR (2)
  1000s of unmonitorable states modified by OS interrupts
  Hardware clock counter to indirectly probe these states
  Combined with in-line pseudo-random number generation
  100 Mbit/s unpredictable random numbers
  ARC INRIA with CODES
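The combination of a clock counter with in-line pseudo-random generation can be illustrated with the sketch below. It is not the HIPSOR algorithm and has not been evaluated for cryptographic use: `time.perf_counter_ns()` is used as a stand-in for the hardware clock counter, the generator is a plain 64-bit xorshift, and the `reseed_every` parameter is a hypothetical knob.

```python
import time

MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of a 64-bit xorshift generator (the in-line pseudo-random part)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64

def unpredictable_bytes(count, reseed_every=8):
    """Generate `count` bytes, periodically folding timer jitter into the state."""
    state = (time.perf_counter_ns() & MASK64) or 1
    out = bytearray()
    while len(out) < count:
        if len(out) % reseed_every == 0:
            # Stand-in for reading the hardware clock counter: the exact value
            # depends on machine state the program cannot observe directly.
            state ^= time.perf_counter_ns() & MASK64
            state = state or 1  # xorshift must never run from the all-zero state
        state = xorshift64(state)
        out.append(state & 0xFF)
    return bytes(out)
```

Calling `unpredictable_bytes(32)` returns 32 bytes that differ from run to run. How unpredictable they really are depends on how much hidden machine state the timer actually exposes, and this sketch does not measure that.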
