Simulation Techniques.ppt

Embed Size (px)

Citation preview

  • 8/17/2019 Simulation Techniques.ppt

    1/39

    Computer

    ArchitectureSimulators

     Alex RamirezEines i Tecniques

    de Mesura

    Master CANS

    * Some contents taken from Sangyeun Cho, CS/COE1541 course U. of Pittsburg.

  • 8/17/2019 Simulation Techniques.ppt

    2/39

    2

    What is a simulator?

    ●  A tool to reproduce the behavior of a computing device

    ● Why use a simulator?

    ● Obtain fine-grain details about internal behavior 

    ● Performance analysis

    ● Enable software development on not available platforms

    ● Obtain performance predictions for candidate system architectures

  • 8/17/2019 Simulation Techniques.ppt

    3/39

    3

    Taxonomy of simulation tools

     Architecture simulators

    Functional Performance / Timing

    Trace-driven Execution-drivenTrace-driven Execution-driven

    User code Full system

  • 8/17/2019 Simulation Techniques.ppt

    4/39

    4

    Functional vs. Timing simulators

    ● Functional simulators implement the visible architecture state

    ● Maintain the correctness of the programmer vision of the architecture● Main purpose: software development and/or emulation

    ● Eg. Virtual machines (simnow, qemu, vmware)

    ● Timing simulators implement the microarchitecture details

    ● Model the system internals tructures● Processor pipeline

    ● Memory hierarchy

    ● Branch predictors

    ● Interconnection network

    ● ...

    ● Functional simulation is much faster than performance simulation

  • 8/17/2019 Simulation Techniques.ppt

    5/39

    5

    Trace vs. Execution driven

    ● Trace driven simulators

    ● Instrument application, and execute on real platform

    ● Instrumentation records an execution trace

    ● Recorded trace is used as input to the simulator 

    ● Execution driven simulators

    ● The application runs directly on top of the simulator 

    ● Simulators maintains both application state and architecture state

    ● Trace-driven simulation is usually faster than execution driven

    ● Only needs to maintain the architecture state

    ● No need to reproduce the computation parts of the application

    ● Trace driven simulator makes host and target system independent

    ● Obtain traces on machine A, simulate on machine B● Traces enable simulation of propietary applications & input sets

    ● Take traces away, independent of inputs & binaries

    ● Traces can not capture dynamic properties of the application

  • 8/17/2019 Simulation Techniques.ppt

    6/39

    6

    User-code vs Full-system

    ● User-code simulators only simulate the application code

    ● System calls and I/O fall-back to functional simulation● Often calls the native OS to perform the functional emulation

    ● Requires that host OS and target OS are the same

    ● Full-system simulators include OS and I/O devices

    ● Functional and timing simulation of OS codes● Functional and timing simulation of all devices

    ● Disks, Network, ...

  • 8/17/2019 Simulation Techniques.ppt

    7/397

    Building a supercomputer from the ground up

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    Processor exploits ILP

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    F WMAD

    Multicore exploits TLP Node board connects

    multiple chips

    Rack connects

    multiple nodes

    Many interconnected racks

    build a supercomputer 

  • 8/17/2019 Simulation Techniques.ppt

    8/398

    Multiple levels of abstraction

    ● Simulation can be performed at multiple levels of abstraction

    ● Cluster-level

    ● Processors as atomic blocks

    ● No simulation of pipeline or memory system

    ● Detailed simulation of interconnection network

    ● Shared memory node

    ● Processor microarchitecture as atomic block

    ● Detailed simulation of memory system

    ● No interaction with I/O and interconnection network

    ● Processor

    ● Detailed simulation of processor piepeline stages

    ● Detailed simulation of first leevl caches

    ● Little interaction with off-chip memory system

    ● No interaction with I/O and interconnection network

  • 8/17/2019 Simulation Techniques.ppt

    9/399

    The Zen of architecture simulators

    ● Can't build a simulator that achieves

    everything ...

    ● FPGA prototypes

    ● Fast + accurate, but not flexible

    ● Detailed software models

    ●  Accurate + flexible, but slow

    ●  Abstract software models

    ● Fast + flexible, but not accurate

    Speed

     Accuracy Flexibility

  • 8/17/2019 Simulation Techniques.ppt

    10/3910

    Developing architecture simulators

    ● Monolithic C code

    ● Models architecture in low-level c code

    ● Fast, Accurate (but error-prone), not reallyflexible

    ● Need to model interfaces + timing of

    components

    ● Object-oriented (C++, Java)

    ● Objects closely match architecturecomponents

    ● Methods implement the object interfaces

    ● Still need to model timing of components

    ● Domain-specific languages

    ● Modular infrastructures

    ● Programmer defines components +

    interfaces

    ● Programmer defines componnets

    interconnect

    ● Simulation negine models timing

  • 8/17/2019 Simulation Techniques.ppt

    11/3911

    The simulation speed problem

    ● Trace-driven simulation

    ● Collect traces on one machine

    ● Run simulations in a different machine

    ● Run multiple parallel simulations

    ● No need for total correctness

    ● Easier for first-approach data

    ● Execution-driven simulation

    ● Must run application and simulator on the same machine

    ● Cross-ISA emulation is costly

    ● In all cases, simulation is slow (and traces are huge)

    ● 100x to 100.000x slower than real runs

    ● 1 min. of real application can take

    ● 2h to 2 years of simulation

    ● Gigabytes of trace storage

  • 8/17/2019 Simulation Techniques.ppt

    12/3912

    Simulator speed example

    338x147,648x43h 13m 41sw/Ruby + Opal

    89x39,131x11h 27m 25sw/Ruby

    1437x7m 41ssimics

    25x4,029x1h 11m 07ssim-outorder 

    1158x2m 47ssim-fast----1.054sNative

    Ratio to FunctionalRatio to NativeTime

    ● Simplescalar toolset

    ● sim-fast provides functional emulation

    ● sim-outorder provides timing simulation of processor out of order pipeline

    ● Simics + GEMS toolset● simics provides multicore full-system functional simulation

    ● Ruby provides timing simulation of the caches + memory hierarchy

    ● Processor abstracted as fixed 1IPC

    ● Opal provides timing simulation of the out-of-order processor pipeline

    * gcc from SPEC'00 small input on Xeon 3.8GHz

  • 8/17/2019 Simulation Techniques.ppt

    13/3913

    Don't simulate faster: Simulate LESS

    ● Simulation is a single-thread process

    ● Even when simulating parallel hardware● Processors do not get any faster: we have hit the  power  wall

    ● Simulation speed no longer improves with technology

    ● Parallel machines used to run multiple simulations in parallel

    ● We can not make a faster simulator ● Do not simulate the entire system

    ● Do not simulate the entire application

  • 8/17/2019 Simulation Techniques.ppt

    14/3914

    Downsizing the simulated platform

    ● Simulate a smaller system

    ● eg. Multiprocessor of 16 cores instead of 1024● Does not expose scalability issues

    ● Competition for shared resources

    ● Conflicts

    ● Race conditions

    ● Simulate a smaller application

    ● eg. Matrix Multiply of 1K x 1K instead of 1M x 1M

    ● Does not exercise the hardware properly

    ● Smaller working set fits in cache

    ● Capacity and conflict issues● Less data / code reuse

    ● Heavier weight of initialization vs. steady state

  • 8/17/2019 Simulation Techniques.ppt

    15/3915

    Simulation techniques

    ● Sampling

    ● Simulate random samples of the application● Fast-forwarding

    ● Reaching the sample point in the application

    ● Warm-up

    ● Fast simulation of components previous to measuring stage

    ● Checkpointing● Dump application and architecture state before the simulation sample

    ● Phase detection

    ● Select (non-random) samples based on application analysis

  • 8/17/2019 Simulation Techniques.ppt

    16/3916

    Simulation sampling

    ● Statistical approach

    ● Do not examine the entire population

    ● Interview a representative sample

    ● Mathematical approach

    ● confidence margin (eg. 95%)

    ● confidence interval (eg. +/- 2.5)

    ● Two sampling approaches● Systematic sampling

    ● Sample every N instructions

    ● Random sampling

    U

     Actual simulator

    measurement

    N instructions (total benchmark execution)

  • 8/17/2019 Simulation Techniques.ppt

    17/3917

    Number of required samples

    ● Sample population is proportional to

    ● Variance of the target metric● Desired confidence interval

    ● Desired confidence level

    ● n >= ( level * variance / (1 - interval) ) ^ 2

  • 8/17/2019 Simulation Techniques.ppt

    18/39

    18

    SimFlex sample sizes

    ● Multiprocessor applications exhibit much higher variance

    ● Larger samples required to stabilize measurements

    95 +/- 550.000 cycles100.000 cyclesOLTP (TPC-C)

    Web server 

    99.7 +/- 31.000 instr 2.000 instr SPEC CPU

    Confidence intervalSimulation unitDetailed warm-up Application

  • 8/17/2019 Simulation Techniques.ppt

    19/39

    19

    Simulator warm-up

    ● Can't simulate samples on an empty  architecture state

    ● Caches do not start in invalid state

    ● Branch predictor, TLB, ...

    ● Operating system state● Files open / close, read / write pointers, ...

    ● The architecture takes some time to warm-up

    ● The larger the structure, the longer the warm-up

    Functional simulation

    Warm-up of large structures

    W

    Detailed simulator

    warm-up

    U

     Actual simulator

    measurement

    N instructions (total benchmark execution)

  • 8/17/2019 Simulation Techniques.ppt

    20/39

    20

    Checkpointing and sampling simulation

    ● Store warmed-up state before each sample to a checkpoint file●  Amortizes the functional simulation overhead time over more detailed simulations

    ●  Allows parallel simulation of all samples

    Functional simulation

    Warm-up of large structures

    W

    Detailed simulator

    warm-up

    U

     Actual simulator

    measurement

    N instructions (total benchmark execution)

    Restore checkpoint state

  • 8/17/2019 Simulation Techniques.ppt

    21/39

    21

    Non-random sampling

    ● Exploit application's repetitive behavior to select valid samples

    ●  Assume application behavior is linked to static code execution

    ● Each phase corresponds to a static section of the code

  • 8/17/2019 Simulation Techniques.ppt

    22/39

    22

    Basic block vector 

    ● Count executions of each basic block

    ● For the whole program execution● For every N instructions executed

    ● Typical N is 100 million instructions

    ● Normalize vectors to the total number of basic blocks in the BBV

    ● The sum of all elements in the BBV adds up to 1

    ● Compare sample BBV to global BBV

    ● Manhattan distance

    ● Sum of absolute differences produces a value between 0 and 2

    ● Euclidean distance

    ● SQRT of the sum of the squared difference

  • 8/17/2019 Simulation Techniques.ppt

    23/39

    23

    BBV comparison

    ● Comparison of BBVs also shows

    periodic behavior in the application

    ● BBV difference graph used to

    ● Identify initialization phase

    ● Identify repetitive interval

    ● Fourier analysis

  • 8/17/2019 Simulation Techniques.ppt

    24/39

    24

    SimPoint issues

    ● The most representative BBV might be far ahead in the code

    ● Large fast-forward period● Pick the earliest significant sample

    ●  A single sample might not represent the whole application

    ● Pick one sample from each application phase

    ● Selected samples are architecture dependent

    ● However, the same code with a different compiler has the same BBV

  • 8/17/2019 Simulation Techniques.ppt

    25/39

    25

    Multithreaded workloads

    ● Random samples of multiple threads do not overlap in time

    ● Invalidates simulation of multithreaded systems

    ● Taking a vertical sample does not guarantee the proper alignment either 

    ● Thread speed is not the same during

    ● Functional simulation

    ● Warm-up

    ● Simulation

    ● Fast-forward is measured in instructions, not time

  • 8/17/2019 Simulation Techniques.ppt

    26/39

    26

    Alternatives to simulation

    ● Statistical simulation

    ●  Analytical modeling●  Abstract simulation

    ● Hyerarchical simulation

  • 8/17/2019 Simulation Techniques.ppt

    27/39

    27

    Task-based programming model

    ● Programmer annotates the code to identify tasks, inputs, and outputs

    ● Total task working set must fit in the LS

    ● Runtime library takes care of 

    ● Detecting parallelism

    ● Bundling tasks together 

    ● Transferring data in / out of the LS

    ●  Automatically performs double buffering for tasks in a bundle

    The first Tasks on a Bundle

    require to wait for DMAs to

    finished

    The following Tasks on a Bundle

    don't have to wait much, since their 

    data has been requested

    before the previous Task was executed

  • 8/17/2019 Simulation Techniques.ppt

    28/39

    28

    The TaskSim simulator 

    ● TaskSim is a system simulator based on application level

    traces

    ● Thread phases + inter-thread dependencies + DMA events

    ●  Abstract simulation of CPU bursts

    ● Obtain burst duration from trace●  Aplly CPU speed factor based on phase id

    ● Detailed (cycle-accurate) simulation of DMA controller, caches,

    interconnect, memory controller, and DRAM

  • 8/17/2019 Simulation Techniques.ppt

    29/39

    29

    Sample TaskSim application trace

    Helper Thread 

    CPU 8, 15.51us Prepare bundle

    CPU 9, 1.19us Prepare submissionCPU 11, 1.51us Submit bundle

    SIGNAL 8 Bundle 8 is ready

    CPU 16, 0us Low level wait for events

    WAIT 101 Wait for Task 101

    CPU 7, 5,52us Schedule

    CPU 8, 15.61us Prepare bundle

    SPE Thread 8

    CPU 19, 4521.23us Wait for tasks

    WAIT 8 Wait for Bundle 8

    CPU 20, 6.42us Get task description

    CPU 21, 1.35us Task Stage in

    CPU 23, 561.21us Task execution

    In the simulation, duration of wait

    phases will be ignored, as it depends

    on the WAIT event

  • 8/17/2019 Simulation Techniques.ppt

    30/39

    30

    Where does CPU burst timing come from?

    ● CPU bursts in the TaskSim trace are annotated with their duration

    ● TaskSim can then adjust such CPU burst duration

    ● Change duration to a fixed amount (eg. DMA Wait goes to 1ns)

    ● Multiply duration for a given factor (eg. SPU computation gets 2x faster)

    ● CPU burst duration obtained from● The real execution at the time the trace was collected

    ● Cycle-accurate simulation on another simulator 

    ● Take an instruction-level trace through another simulator 

    ● Simulate a sample in an execution-driven simulator 

    ●  All CPU bursts can be simulated independently on such cycle-accurate

    models

    ● Hierarchical simulation

  • 8/17/2019 Simulation Techniques.ppt

    31/39

    31

    Dynamic task scheduling automatically achieves load balancing

    ● The simulator task scheduler dynamically allocates tasks (bundles) to the first

    available processor 

    ● Simulate a variable number of processors from a single application trace

  • 8/17/2019 Simulation Techniques.ppt

    32/39

    32

    Matrix Multiply simulations (8 to 256 CPU)

    Original execution, 8 SPU, 25.6 GB/s

    16 SPU, 25.6 GB/s

    3x PPU, 16 SPU, 25.6 GB/s

    5x PPU, 32 SPU, 25.6 GB/s

    3 PPU, 10x PPU, 256 SPU, 409.6 GB/s

  • 8/17/2019 Simulation Techniques.ppt

    33/39

    33

    MatMul zoom on 256 CPU simulation

    ● Not even 3 masters at 10x speed can keep up with 256 workers● Wrong task creation + scheduling strategy (depth first vs. breatdh first)

  • 8/17/2019 Simulation Techniques.ppt

    34/39

    34

    Task generation scheme scalability

    ● Task generation (green) on the master task limits scalability (on the left)

    ● Parallelization of task generation (on the right) is crucial to avoid this bottleneck

    16p

    32p

    64p

  • 8/17/2019 Simulation Techniques.ppt

    35/39

    35

    Simulator results also allow analysis of per-module behavior 

    ● Paraver trace shows cache accesses / ns for each of the 4 cache banks● 128-byte interleave evenly balances pressure on the 4 caches

    ● 128 KB interleave is only using 1 out of 4 caches at any given time

    ● Worst case quotient of data size (or DMA size?) vs. interleave grain

    ● 256KB interleave is using 2 out of 4 caches

    Cache interleave every 128 bytes

    Cache interleave every 128 KB

    Cache interleave every 256 KB

  • 8/17/2019 Simulation Techniques.ppt

    36/39

    36

    Also produces statistics (counters + histograms) in addition to traces

    ● Graphs show the number of simultaneous data transfers on the cluster and global buses● Little intra-cluster traffic

    ●  All traffic competes to reach the global bus port

    ● High traffic on the global bus

    ● Continously handles 2-4 transfers

    ● Very little use for more than 4 simultaneous transfers

  • 8/17/2019 Simulation Techniques.ppt

    37/39

    37

    Why FFT3D does not perform well with 32 cache modules?

    ● Cache write-allocate policy hurts 1st FFT pass too badly

    ● Even if afterwards the whole working set fits in the cache

    ● TransposeYZ works better with cache due tor educed page conflicts and lower latency

    FFT3D with 800GB/s bandwidth

    FFT3D with 32 cache blocks (800GB/s bandwidth) and 100GB/s memory bandwidth

    Cache is better

    for some

    access patterns

  • 8/17/2019 Simulation Techniques.ppt

    38/39

    38

    Hyerarchical simulation methodology

    ● First execution

    ● Trace real application at a high level

    of abstraction (CPU bursts,synchronization, communicationevents)

    ● Find Representative segment ofexecution

    ● Non-linear filtering and spectralanalysis

    ● 10-100x smaller trace

    ● CPU burst clustering algorithm

    ● Density based clustering usingperformance counters

    ● Selection of CPU burstsrepresentatives

    ● 10-100x smaller trace● Second execution

    ● Trace CPU bursts representatives atmicroarchitecture level

  • 8/17/2019 Simulation Techniques.ppt

    39/39

    Trace Obtention: WRF