IBM Systems Group © 2009 IBM Corporation Dan Braden [email protected] AIX Advanced Technical Support http://w3.ibm.com/support/americas/ pseries Disk IO Tuning in AIX 6.1 and 5.3 June 1, 2009

The AIX IO stack
Disk growth
Customers are doubling storage capacities every 12-18 months
Actuator and rotational speed increasing relatively slowly
Network bandwidth - doubling every 6 months
Why is disk IO tuning important?
Approximate CPU cycle time 0.0000000005 seconds
Approximate memory access time 0.000000270 seconds
Approximate disk access time 0.010000000 seconds
Memory access takes 540 CPU cycles
Disk access takes 20 million CPU cycles, or 37,037 memory accesses
System bottlenecks are being pushed to the disk
Disk subsystems are using cache to improve IO service times
Customers now spend more on storage than on servers
*
In the past 8 years, storage density increased 100X, but rotational speed only increased 3X.
© 2009 IBM Corporation
Application metrics
Response time
Run time
System metrics
Size based on maximum sustainable thruputs
Bandwidth and thruput sometimes mean the same thing, sometimes not
For tuning - it's good to have a short running job that's representative of your workload
Performance metrics
Random workloads
Sequential workloads
You need more thruput
© 2009 IBM Corporation
NFS caches file attributes
The AIX IO stack
JFS uses persistent pages for cache
JFS2 uses client pages for cache
Queues exist for both adapters and disks
Adapter device drivers use DMA for IO
Disk subsystems have read and write cache
Disks have memory to store commands/data
IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
IOs adjacent in a file/LV/disk can be coalesced
IOs greater than the maximum IO size supported will be split up
Write cache
Disk
*
Note that for random IO, coalescing of IOs is a good thing, and splitting up of IOs is a bad thing. The reason is that a disk can do only so many IOPS, and splitting up of IOs results in more physical IOPS.
© 2009 IBM Corporation
Data layout
Data layout affects IO performance more than any tunable IO parameter
Good data layout avoids dealing with disk hot spots
An ongoing management issue and cost
Data layout must be planned in advance
Changes are often painful
Best practice: evenly balance IOs across all physical disks
Random IO best practice:
For disk subsystems
Create VGs with one LUN from every array
Spread all LVs across all PVs in the VG
*
Making changes to data layout, after the fact, to go from an unbalanced layout, to best practice, generally requires quite a bit of effort, and a backup and restore of the data. It may also require reconfiguration of the SAN, so plan ahead. Many administrators spend quite a bit of time managing disk hot spots. This can be avoided by using data layout best practices.
The data layout best practices don’t assume that we know what the IO rates are to file systems or LVs, or that the IO rates don’t vary over time. But they do assume that IO to a LV is uniform across the LV over time. Note that IO rates to LVs typically do vary over time, e.g. during production, data loads, backups, batch jobs, etc. And IO rates to different LVs typically do vary, e.g., to database logs, indices and data tables.
Note that it’s not always possible to follow these practices. E.g., on the ESS, DS8000 and DS6000, we have arrays of slightly different sizes. But since they are (typically) 6+P or 7+P RAID 5 arrays, we can assume they are equivalent since they are close to the same size.
© 2009 IBM Corporation
Random IO data layout
Use a random order for the hdisks for each LV
Disk subsystem
Create RAID arrays with data stripes a power of 2
RAID 5 arrays of 5 or 9 disks
RAID 10 arrays of 2, 4, 8, or 16 disks
Create VGs with one LUN per array
Create LVs that are spread across all PVs in the VG using a PP or LV strip size >= a full stripe on the RAID array
Do application IOs equal to, or a multiple of, a full stripe on the RAID array
*
Matching the application IO size to that of a full stripe on the RAID array (or a multiple thereof) assures that we get all disks in the array doing IO, and for writes we take advantage of RAID 5 full stripe writes, which significantly reduce the RAID 5 write penalty. It’s worth adding that for ESS we’ve seen better sequential performance with PP sizes from 8-16 MB (or with LV striping with the LV strip size equal to 8 or 16 MB). For higher thruput, do application IOs that span all physical disks, i.e., of a size equal to (RAID array full stripe x PP size x # of RAID arrays) if using PP striping or (RAID array full stripe x LV strip size x # of RAID arrays).
It’s recommended to use PP striping (a.k.a. poor man’s striping, or as a maximum inter-disk allocation policy) instead of LV striping as we can dynamically change the stripe width for PP striping, while for LV striping we cannot: a backup, recreation of the LV, and restore is required. So creating the VG with a PP size of 8 or 16 MB assists with performance for sequentially accessed LVs.
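As an illustration of PP striping (a hedged sketch, not from the original slides; the VG, LV and hdisk names and sizes are hypothetical), a scalable VG with a 16 MB PP size and an LV spread across all its PVs could be created as follows:
# mkvg -S -s 16 -y datavg hdisk2 hdisk3 hdisk4 hdisk5 <- scalable VG with 16 MB PPs
# mklv -y datalv -e x datavg 256 <- -e x uses the maximum inter-disk allocation policy (PP striping)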
© 2009 IBM Corporation
Random and sequential mix best practice:
Use the random IO best practices
*
Matching the data stripe size to the IO size assures that we get all disks in the array doing IO, and for writes we take advantage of RAID 5 full stripe writes, which significantly reduce the RAID 5 write penalty. It’s worth adding that for ESS we’ve seen better sequential performance with PP sizes from 8-16 MB (or with LV striping with the LV strip size equal to 8 or 16 MB).
It’s recommended to use PP striping (a.k.a. poor man’s striping, or as a maximum inter-disk allocation policy) instead of LV striping as we can dynamically change the stripe width for PP striping, while for LV striping we cannot: a backup, recreation of the LV, and restore is required. So creating the VG with a PP size of 8 or 16 MB assists with performance for sequentially accessed LVs.
© 2009 IBM Corporation
Use Big or Scalable VGs
Both support no LVCB header on LVs (only important for raw LVs)
LVCBs can lead to issues with IOs split across physical disks
Big VGs require using the mklv -T O option to eliminate the LVCB
Scalable VGs have no LVCB
Only Scalable VGs support mirror pools (AIX 6100-02)
For JFS2, use inline logs
For JFS, one log per file system provides the best performance
If using LVM mirroring, use active MWC
Passive MWC creates more IOs than active MWC
Use RAID in preference to LVM mirroring
Reduces IOs as there’s no additional writes for MWC
Use PP striping in preference to LV striping
*
The LVCB header is 512 bytes. Thus, an IO which spans LV strip or PP boundaries (where the PPs are on different disks) could be split. This will increase IOPS. And for applications that assume IOs are atomic, this can cause data integrity issues.
Mirror write consistency assures that mirrors are consistent after a system crash. Options include off, active (the default), and passive. Turning MWC off can lead to data corruption after a system crash – though one can avoid this by running “syncvg -f <vgname>” after a crash and before accessing data with an application – but this takes a long time.
Active MWC does additional IOs to what’s called the MWC cache, which is on the outer edge of the PV to journal where LVM is doing mirrored writes; thus doubling the number of IOs for a mirrored write. Passive MWC does IOs to the VGSA, and after a system crash, when the application reads data from one copy of a mirror, LVM checks the mirrors to ensure they are consistent (and makes them consistent if they aren’t) and in the meantime, LVM scrubs the mirror copies in the background to make sure they are consistent. Active MWC is a pay me now and forever overhead, while passive MWC is a pay me now and later (for awhile after a system crash) overhead.
© 2009 IBM Corporation
LVM limits
Max PPs per VG and max LPs per LV restrict your PP size
Use a PP size that allows for growth of the VG
Valid LV strip sizes range from 4 KB to 128 MB in powers of 2 for striped LVs
*
To create striped LVs with strip (called stripe) sizes > 128 KB, one needs to use the command line. LTG sizes are automatically set in AIX 5.3.
Note that the largest PV supported by AIX (for other than rootvg) is 1 TB – 1PP for 32 bit kernels and 2 TB-1 PP for 64 bit kernels.
© 2009 IBM Corporation
Usually disk actuator limited
Usually limited on the interconnect to the disk actuators
To determine application IO characteristics
Use filemon
# filemon -o /tmp/filemon.out -O lv,pv -T 500000; sleep 90; trcstop
or at AIX 6.1
# filemon -o /tmp/filemon.out -O lv,pv,detailed -T 500000; sleep 90; trcstop
Check for trace buffer wraparounds, which may invalidate the data; if they occur, run filemon with a larger -T value or a shorter sleep
Use lvmstat to get LV IO statistics
Use iostat to get PV IO statistics
*
Most environments and applications have a mix of random and sequential IO. Even completely random IO environments will usually have sequential IO for backups and some sequential IO for batch jobs. So performance must be considered for both. Usually a customer will have a greater performance concern for production, backup windows, or batch jobs. So work on performance during the time of concern.
This filemon command only collects IO at the LV and PV layers. Other IO (segment and file) is often IO to memory, not to disk. Output is to /tmp/filemon.out in this case.
You can also run filemon in offline mode:
From AIX 6.1 TL2 (& 5.3 TL9), you can run the following commands for invoking filemon in offline mode:
# trace -a -T 768000 -L 10000000 -o trace.out -J filemon
# sleep 90 (or your program of interest)
# trcstop
# filemon -i trace.out -n gensyms.out -O all,detailed -o filemon.out
Prior to AIX 6.1 TL2 (& 5.3 TL9), you can run the following commands for invoking filemon in offline mode:
# trace -a -T 768000 -L 10000000 -o trace.out -J filemon
# sleep 90 (or your program of interest)
# trcstop
© 2009 IBM Corporation
Most Active Logical Volumes
------------------------------------------------------------------------
Most Active Physical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume                description
1.00 11669 6478 30.3 /dev/hdisk0 N/A
1.00 6247484 4256 10423.1 /dev/hdisk77 SAN Volume Controller Device
1.00 6401393 10016 10689.3 /dev/hdisk60 SAN Volume Controller Device
1.00 5438693 3128 9072.8 /dev/hdisk69 SAN Volume Controller Device
filemon summary reports
*
These reports are in order of utilization, which is the percent of time the LV or PV is active, and not necessarily in order of which device is doing the most IO, or the best or worst IO service times. These reports show if we’re doing high rates of IO and the R/W ratio. But it doesn’t tell us if the IO is sequential or random.
© 2009 IBM Corporation
Detailed reports at PV and LV layers (only for one LV shown)
Similar reports for each PV
VOLUME: /dev/rms09_lv description: /RMS/bormspr0/oradata07
reads: 23999 (0 errs)
read sizes (blks): avg 439.7 min 16 max 2048 sdev 814.8
read times (msec): avg 85.609 min 0.139 max 1113.574 sdev 140.417
read sequences: 19478
read seq. lengths: avg 541.7 min 16 max 12288 sdev 1111.6
writes: 350 (0 errs)
write sizes (blks): avg 16.0 min 16 max 16 sdev 0.0
write times (msec): avg 42.959 min 0.340 max 289.907 sdev 60.348
write sequences: 348
write seq. lengths: avg 16.1 min 16 max 32 sdev 1.2
seeks: 19826 (81.4%)
seek dist (blks): ... max 157270944 sdev 44289553.4
time to next req(msec): avg 12.316 min 0.000 max 537.792 sdev 31.794
throughput: 17600.8 KB/sec
*
These reports show the number of reads (23999), the average read size (439.7 blocks = 219.85 KB), the average read IO service time (85.609 ms), the number of read sequences (19478), and the average read sequence length (541.7 blocks = 270.85 KB). That there are fewer read sequences than reads indicates that the LVM device driver coalesced many of these IOs into fewer IOs. Equivalent statistics are shown for writes. Note that for many of the statistics the first number is the average, the second is the minimum, the third is the maximum, and the final number is the standard deviation. Also, a block here is 512 bytes.
Seeks is the number of disk seeks actually required (81.4% of the IO requests). A low percentage here indicates sequential IO, a percentage close to 100 indicates totally random IO.
© 2009 IBM Corporation
Look for balanced IO across the disks
Lack of balance may be a data layout problem
Depends upon PV to physical disk mapping
LVM mirroring scheduling policy also affects balance for reads
IO service times in the detailed report are more definitive on data layout issues
Dissimilar IO service times across PVs indicates IOs are not balanced across physical disks
Look at most active LVs report
Look for busy file system logs
Look for file system logs serving more than one file system
Using filemon
*
Unbalanced IO, in terms of KB/s, to each PV can indicate a data layout problem. But this isn’t necessarily true if the PV to physical disk mapping isn’t a 1 PV to 1 physical disk mapping, or isn’t a 1 PV to 1 RAID array mapping (where the arrays are of the same size and type). And it’s not necessarily true when we’re doing LVM mirroring since, by default, most reads go to the first copy of the data (the parallel scheduling policy). Alternatively, if the read scheduling policy is round robin, then half the IOs go to each copy.
If you see a busy file system log, check the VG (lsvg -l) and file systems (lsfs -q) to see what file systems the log services. If it services more than one file system, then adding a new file system log can improve performance. Adding a log reduces serialization of access to the log. For JFS, one can also turn off logging (if the customer is willing to lose all the data in the file system in case of a crash) via the nointegrity mount option.
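For example (a hedged sketch; the LV, VG and file system names are hypothetical), a dedicated JFS log could be added and assigned to a busy file system roughly as follows:
# mklv -t jfslog -y fslog01 datavg 1 <- create a one-PP JFS log LV
# logform /dev/fslog01 <- format it as a log (answer y)
# chfs -a log=/dev/fslog01 /data01 <- point the file system at the new log
# umount /data01; mount /data01 <- remount to pick up the change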
© 2009 IBM Corporation
Look for increased IO service times between the LV and PV layers
Inadequate file system buffers
Inadequate disk buffers
Inadequate disk queue_depth
Inadequate adapter queue depth can lead to poor PV IO service time
i-node locking: decrease file sizes or use cio mount option if possible
Excessive IO splitting down the stack (increase LV strip sizes)
Disabled interrupts
syncd: reduce file system cache for AIX 5.2 and earlier
Tool available for this purpose: script on AIX and spreadsheet
Using filemon
*
Note that IO to files and segments is often to memory. However, one can determine what file is involved for the segment report using the following procedure (assuming one isn’t doing IO to raw LVs, or using CIO or DIO, as AIX will allocate a segment for each file – or multiple segments if the file is >256 MB):
Determine the file being used as follows:
# svmon -S <segid>
This produces a device name and inode. To determine the file, run:
# ncheck -i <inode> <lvname>
© 2009 IBM Corporation
Use a meaningful interval, 30 seconds to 15 minutes
The first report is since system boot (if sys0’s attribute iostat=true)
Examine IO balance among hdisks
Look for bursty IO (based on syncd interval)
Useful flags:
-a Adapter report (IOs for an adapter)
-m Disk path report (IOs down each disk path)
-s System report (overall IO)
-A or -P For standard AIO or POSIX AIO
-D for hdisk queues and IO service times
-l puts data on one line (better for scripts)
-p for tape statistics (AIX 5.3 TL7 or later)
-f for file system statistics (AIX 6.1 TL1)
*
tty: tin tout avg-cpu: % user % sys % idle % iowait
24.7 71.3 8.3 2.4 85.6 3.6
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 2.2 19.4 2.6 268 894
hdisk1 5.0 231.8 28.1 1944 11964
hdisk2 5.4 227.8 26.9 2144 11524
hdisk3 4.0 215.9 24.8 2040 10916
...
# iostat -ts <interval> <count> For total system statistics
System configuration: lcpu=4 drives=2 ent=1.50 paths=2 vdisks=2
tty: tin tout avg-cpu: % user % sys % idle % iowait physc % entc
0.0 8062.0 0.0 0.4 99.6 0.0 0.0 0.7
Kbps tps Kb_read Kb_wrtn
82.7 20.7 248 0
Kbps tps Kb_read Kb_wrtn
80.7 20.2 244 0
*
© 2009 IBM Corporation
Percent of time the CPU is idle and waiting on an IO so it can do some more work
High %iowait does not necessarily indicate a disk bottleneck
Your application could be IO intensive, e.g. a backup
You can make %iowait go to 0 by adding CPU intensive jobs
Low %iowait does not necessarily mean you don't have a disk bottleneck
The CPUs can be busy while IOs are taking unreasonably long times
Conclusion: %iowait is a misleading indicator of disk performance
What is %iowait?
Provides IO statistics for LVs, VGs and PPs
You must enable data collection first for an LV or VG:
# lvmstat -e -v <vgname>
# lvmstat -sv rootvg 3 10
Logical Volume iocnt Kb_read Kb_wrtn Kbps
hd8 212 0 848 24.00
hd4 11 0 44 0.23
hd2 3 12 0 0.01
hd9var 2 0 8 0.01
..
.
# lvmstat -l lv00 1
*
The lvmstat command is mainly useful for characterizing IO for LVs or filesystems over time. You can also use it to examine IO rates to non-inline file system logs. It provides IO rates and average IO size (KB_read+KB_wrtn)/iocnt.
© 2009 IBM Corporation
Test sequential read thruput from a device:
# timex dd if=<device> of=/dev/null bs=1m count=100
Test sequential write thruput to a device:
# timex dd if=/dev/zero of=<device> bs=1m count=100
Note that /dev/zero writes the null character, so writing this character to files in a file system will result in sparse files
For file systems, either create a file, or use the lptest command to generate a file, e.g., # lptest 127 32 > 4kfile
Test multiple sequential IO streams – use a script and monitor thruput with topas:
dd if=<device1> of=/dev/null bs=1m count=100 &
dd if=<device2> of=/dev/null bs=1m count=100 &
*
Note that the “r” device is a character file while the non-r device is a block file. Usually with dd one will get better performance with the character file (or r device).
© 2009 IBM Corporation
http://www.ibm.com/collaboration/wiki/display/WikiPtype/nstress
# ndisk -R -f ./tempfile_10MB -r 50 -t 60
Command: ndisk -R -f ./tempfile_10MB -r 50 -t 60
Synchronous Disk test (regular read/write)
No. of processes = 1
Sync type: none = just close the file
Number of files = 1
Run time = 60 seconds
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 331550 5517.4 | 21.55 22069.64 60.09
*
The ndisk program has a lot of features. You can change the R/W ratio, do IO to different files, do different types of IO to different files (including random which is shown above, or sequential), change the number of processes, etc. It even has a database simulation program.
Here, the dd command creates a 10 MB file of all zeros that we use for testing.
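The dd command itself isn’t shown in this transcript; a minimal sketch of creating the 10 MB test file (the file name is taken from the ndisk example above) would be:
# dd if=/dev/zero of=./tempfile_10MB bs=1m count=10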
The sync type here makes a big difference. Here we “just close the file” and let the syncd daemon flush the IO to disk. If we sync every IO, thruput will be substantially lower. Many applications require that certain IOs be on disk before the application can proceed. So there is a distinction between synchronous IOs and asynchronous IOs (speaking in the context of a programmer). Here we achieved thruput of 21.55 MB/s to a vscsi disk, but only because we really did the IO to file system cache and let syncd do all the writes.
Here’s a different run of ndisk:
# ndisk -R -q -f ./tempfile_10MB -r 50 -t 60
Command: ndisk -R -q -f ./tempfile_10MB -r 50 -t 60
Synchronous Disk test (regular read/write)
No. of processes = 1
Sync type:FSYNC = fsync() system call after each write I/O
Number of files = 1
Run time = 60 seconds
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 10853 180.9 | 0.71 723.43 60.01
© 2009 IBM Corporation
Dealing with cache effects
Prime the cache (recommended)
Run the test twice and ignore the first run or
Use # cat <file> > /dev/null to prime file system and disk cache
or Flush the cache
For disk subsystems, use # cat <unused file(s)> > /dev/null
The unused file(s) must be larger than the disk subsystem read cache
It's recommended to prime the cache, as most applications will be using it and you've paid for it, so you should use it
Write cache
If we fill up the write cache, IO service times will be at disk speed, not cache speed
Use a long running job
*
You’ll find that when you run tests where cache is involved, that your results will vary due to the cache. So you may want to run your job several times, and throw out the first result, and use the average of the rest.
© 2009 IBM Corporation
AIX 6.1 Restricted Tunables
Generally should only be changed per AIX Support
Display all the tunables using: # <ioo|vmo|schedo|raso|nfso|no> -FL
Display non-restricted tunables without the -F flag
smit access via # smitty tuningDev
Dynamic change will show a warning message
Permanent changes require a confirmation
Permanent changes will result in a warning message at boot in the error log
Some restricted tunables relating to disk IO tuning include:
Most aio tunables
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
8755 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
2365 external pager filesystem I/Os blocked with no fsbuf
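These counters come from vmstat -v; for example (a minimal sketch, the grep string is just a convenience):
# vmstat -v | grep -i blocked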
Run your workload
Increase the resources
For pbufs, increase pv_min_pbuf with ioo or see the next slide
For psbufs, stop paging or add paging spaces
For fsbufs, increase numfsbufs with ioo
For external pager fsbufs, increase j2_nBufferPerPagerDevice (not available in 6.1) and/or j2_dynamicBufferPreallocation with ioo
For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the NFS4 equivalents)
Unmount and remount your filesystems, and repeat
Tuning IO buffers
*
As IOs travel thru layers in the IO stack, buffers are used to keep track of them. So a lack of buffer means an IO must wait until an IO completes and a buffer is freed up. We can be generous with buffers on 64 bit kernels, less so on 32 bit kernels due to limited memory for the kernel.
JFS2 buffers can increase dynamically; thus, we have the j2_dynamicBufferPreallocation variable which affects how quickly buffers are added to the kernel for JFS2 file systems. The j2_nBufferPerPagerDevice is the initial setting, but buffers are added/deleted dynamically past the initial value per the j2_dynamicBufferPreallocation attribute. Note the j2_nBufferPerPagerDevice is not available in AIX 6.1.
Keep in mind that these counts are since system boot. So a few thousand temporarily blocked IOs over several days isn’t worth spending much time on. But hundreds of blocked IOs/sec is. Reducing these blocked IOs will improve the delta between LV and PV layers in a filemon trace.
Use “ioo -L” to see what the settings are, the defaults, the allowable ranges, and whether the change is dynamic, requires a remount, reboot, or other work to get the change implemented.
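As a hedged example (the values are illustrative, not recommendations), the buffer tunables above could be examined and raised with ioo:
# ioo -L j2_dynamicBufferPreallocation <- show the current value, default and range
# ioo -p -o numfsbufs=1024 <- JFS fsbufs; takes effect at remount
# ioo -p -o j2_dynamicBufferPreallocation=32 <- JFS2 dynamic buffer preallocation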
© 2009 IBM Corporation
pbufs pinned memory buffers - keep track of IOs for hdisks
System wide resource at AIX 5.2 or earlier (pv_min_pbuf)
Configurable for VGs at AIX 5.3
lvmo -v VgName -o Tunable [ =NewValue ]
lvmo [-v VgName] -a
# lvmo -a
vgname = rootvg
pv_pbuf_count = 512 - Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512 - Current pbufs available for the VG
max_vg_pbuf_count = 16384 - max pbufs available for the VG, requires varyoff/varyon
pervg_blocked_io_count = 1 - delayed IO count since last varyon for the VG
pv_min_pbuf = 512 - Minimum number of pbufs added when PV is added to any VG
global_blocked_io_count = 1 - System wide delayed IO count
To increase a VG’s pbufs:
# lvmo -v <vgname> -o pv_pbuf_count=<new value>
pv_min_pbuf is tuned via ioo and takes effect when a VG is varied on
Increase value, collect statistics and change again if necessary
© 2009 IBM Corporation
The AIX IO stack
Disk buffers, pbufs, at this layer
Multi-path IO driver (optional)
All IO requires memory
All input goes to memory, all output comes from memory
File system cache consumes memory
File system cache takes CPU cycles to manage
Initial tuning recommendations:
minperm% Start with the default of 3% (keep minperm% < numperm%)
strict_maxperm=0 Default
strict_maxclient=1 Default
lru_file_repage=0 Default is 1 at 5.3, 0 at 6.1
lru_poll_interval=10 Now is the default
page_steal_method=1 Default at 5.3 is 0 and 6.1 is 1
Initial settings for AIX 5.3 and 6.1:
minfree = max(960, lcpus * 120) / (# of mempools)
maxfree = minfree + (Max Read Ahead * lcpus) / (# of mempools)
The delta between maxfree and minfree should be a multiple of 16 if also using 64 KB pages
Initial settings for AIX 5.2:
minfree = max(960, lcpus * 120)
maxfree = minfree + (Max Read Ahead * lcpus)
*
Note that the strategy for setting minfree and maxfree is to assure that every logical CPU has at least enough free memory to read in the maximum file system read ahead value. The lru_file_repage setting assures that if file system cache is > minperm, only file system cache pages are stolen; otherwise, AIX may be doing IO to page space, writing out working storage. The lru_poll_interval tells the lrud daemon to enable interrupts every 10 ms (in this case) to see if any IOs have completed. Part of the time lrud runs, it disables interrupts, and all IO completions are signaled to the CPU via an interrupt.
A change made in AIX 5.3 is that the free list will be minfree times the number of memory pools, whereas in 5.2, it was minfree.
Another approach to setting minfree is to use “vmstat -I <interval>” (note that -I is a capital i) and for each interval add fi plus fo. Then take the maximum value from all the intervals and use that for minfree.
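A minimal sketch of that approach (the interval and count are arbitrary):
# vmstat -I 5 12 <- sum fi + fo for each interval; use the largest sum for minfree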
Here’s some sample output showing how to determine the number of memory pools:
# echo mempool \* | kdb
Preserving 1392938 bytes of symbol table
First symbol __mulh
START END <name>
(0)> mempool *
memp_frs+010000 00 000 000E5D57 000 001 000C613C
memp_frs+010280 01 001 000DA500 002 003 000CB1F3
This has 2 memory pools, memp_frs+010000 and memp_frs+010280.
The page_steal_method tunable, when set to 1, has separate lists of memory pages for file system cache and working storage so when memory is needed, the appropriate pages will be scanned to free up memory.
© 2009 IBM Corporation
Read ahead detects that we're reading sequentially and gets the data before the application requests it
Reduces %iowait
Too much read ahead means you do IO that you don't need
Operates at the file system layer - sequentially reading files
Set maxpgahead for JFS and j2_maxPgReadAhead for JFS2
Values of 1024 for max page read ahead are not unreasonable
Disk subsystems read ahead too - when sequentially reading disks
Tunable on DS4000, fixed on ESS, DS6000, DS8000 and SVC
If using LV striping, use strip sizes of 8 or 16 MB
Avoids unnecessary disk subsystem read ahead
Be aware of application block sizes that always cause read aheads
Read ahead
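For example (a hedged sketch; 1024 is simply the value mentioned above, not a universal recommendation), the maximum read ahead could be raised with ioo:
# ioo -p -o j2_maxPgReadAhead=1024 <- JFS2 maximum read ahead, in 4 KB pages
# ioo -p -o maxpgahead=32 <- JFS equivalent (use a power of 2)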
Tune numclust for JFS
Tune j2_nPagesPerWriteBehindCluster for JFS2
Larger values allow IO to be coalesced
When the specified number of sequential 16 KB clusters are updated, start the IO to disk rather than wait for syncd
Write behind tuning for random writes to a file
Tune maxrandwrt for JFS
Tune j2_maxRandomWrite and j2_nRandomCluster for JFS2
Max number of random writes allowed to accumulate to a file before additional IOs are flushed, default is 0 or off
j2_nRandomCluster specifies the number of clusters apart two consecutive writes must be in order to be considered random
If you have bursty IO, consider using write behind to smooth out the IO rate
Write behind
*
Using write behind is beneficial if iostat shows that we have bursty IO at the syncd interval. These schedule IOs sooner rather than later, so we balance out IOs over time, at the cost of potentially not coalescing IOs. For disk subsystems with large write caches, it may be better to turn off write behind, as we are able to transfer data to disk subsystem cache quickly and let the disk subsystem write out to disk when disk IOPS are available.
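A hedged example of the write behind tunables discussed above (values are illustrative only):
# ioo -p -o numclust=64 <- JFS sequential write behind (16 KB clusters)
# ioo -p -o j2_nPagesPerWriteBehindCluster=64 <- JFS2 sequential write behind
# ioo -p -o j2_maxRandomWrite=32 <- JFS2 random write behind (0 = off)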
© 2009 IBM Corporation
syncd - system wide file system cache flushing
Historical Unix feature to improve performance
Applies to asynchronous IOs (not necessarily aio)
inode is locked when each file is synced
There is a tradeoff:
Longer intervals can:
Create bursty IO
Increased IOPS reduces IO service times
AIX 5.3 keeps a list of dirty pages in cache (new feature)
There can be too much filesystem cache
Somewhere around 24 GB for pre-AIX 5.3
sync_release_ilock=1 releases inode lock during syncs, but not recommended when creating/deleting many files in a short time
© 2009 IBM Corporation
No time for atime
Ingo Molnar (Linux kernel developer) said:
"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
If you have a lot of file activity, you have to update a lot of timestamps
File timestamps
File last modified time (mtime)
File last access time (atime)
New mount option noatime disables last access time updates for JFS2
File systems with heavy inode access activity due to file opens can have significant performance improvements
First customer benchmark efix reported 35% improvement with DIO noatime mount (20K+ files)
Most customers should expect much less for production environments
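For example (a sketch; the file system name is hypothetical), noatime can be set at mount time or made persistent with chfs:
# mount -o noatime /data01
# chfs -a options=noatime /data01 <- records the option in /etc/filesystems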
APARs
Says to throw the data out of file system cache
rbr is release behind on read
rbw is release behind on write
rbrw is both
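For example (hypothetical file system names), release behind is enabled as a mount option:
# mount -o rbr /oradata/redo <- release behind on read
# mount -o rbrw /backup <- release behind on read and write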
DIO: Direct IO
Half the kernel calls to do the IO
Half the memory transfers to get the data to the application
Requires the application be written to use DIO
CIO: Concurrent IO
*
You need to know how applications are accessing a particular file system before you can intelligently use the release behind options. For example with Oracle running in archive log mode, as the redo logs are filled up (written to) and a log switch occurs, we write the redo log data to the archive log. Then this data isn’t needed anymore except in extraordinary circumstances. So using rbr for the redo log and rbw for the archive log would be appropriate. If there is a long time between writing to the redo log and copying it the archive log, then using rbrw may be appropriate for the redo log.
© 2009 IBM Corporation
i-node locking: when 2 or more threads access the same file, and one is a write, the write will block all read threads at this level
The AIX IO stack
Disk
IOs must be aligned on file system block boundaries
IOs that don’t adhere to this will dramatically reduce performance
Avoid large file enabled JFS file systems - block size is 128 KB after 4 MB
# mount -o dio
DIO process
Kernel initiates disk IO
Application issues read request
Kernel checks FS cache
Data not found - kernel does disk IO
Data transferred FS cache
Application issues write request
Kernel writes data to FS cache and returns acknowledgment to app
Application continues
Normal file system IO process
File system reads
File system writes
Mount options
Concurrent IO for JFS2 only at AIX 5.2 ML1 or later
# mount -o cio
Assumes that the application ensures data integrity for multiple simultaneous IOs to a file
Changes to meta-data are still serialized
I-node locking: when two or more threads doing IO to the same file (at least one of which is a write) are at the file system layer of the IO stack, reads will be blocked while a write proceeds
Provides raw LV performance with file system benefits
Requires an application designed to use CIO
For file system maintenance (e.g. restore a backup) one usually mounts without cio during the maintenance
© 2009 IBM Corporation
Disabling file system journaling
You may lose data in the event of a system crash!
Improves performance for meta-data changes to file systems
When you frequently add, delete or change the size of files
JFS
JFS2
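A hedged sketch of each (the file system names are hypothetical; as noted above, data may be lost in a crash):
# mount -o nointegrity /jfs_scratch <- JFS: disable metadata logging
# mount -o log=NULL /jfs2_scratch <- JFS2 (AIX 6.1): mount without a log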
© 2009 IBM Corporation
New JFS2 Sync Tunables
ioo JFS2 Sync Tunables
The file system sync operation can be problematic in situations where there is very heavy random I/O activity to a large file. When a sync occurs all reads and writes from user programs to the file are blocked. With a large number of dirty pages in the file the time required to complete the writes to disk can be large. New JFS2 tunables are provided to relieve that situation.
j2_syncPageCount
Limits the number of modified pages that are scheduled to be written by sync in one pass for a file. When this tunable is set, the file system will write the specified number of pages without blocking IO to the rest of the file. The sync call will iterate on the write operation until all modified pages have been written.
Default: 0 (off), Range: 0-65536, Type: Dynamic, Unit: 4KB pages
j2_syncPageLimit
Overrides j2_syncPageCount when a threshold is reached. This is to guarantee that sync will eventually complete for a given file. Not applied if j2_syncPageCount is off.
Default: 16, Range: 1-65536, Type: Dynamic, Unit: Numeric
If application response times are impacted by syncd, try j2_syncPageCount settings from 256 to 1024. Smaller values improve short-term response times, but still result in larger syncs that impact response times over larger intervals.
These will likely require a lot of experimentation, and detailed analysis of IO behavior.
*
j2_syncPageCount Default: 0 Range: 0-65536 Type: Dynamic Unit: 4KB pages
When you are running an application that uses file system caching and does large numbers of random writes, you might need to adjust this setting to avoid lengthy application delays during sync operations. The recommended values are in the range of 256 to 1024. The default value of zero gives the normal sync behavior of writing all “dirty pages” in a single call. Small values for this tunable result in longer sync times and shorter delays in application response time. Larger values have the opposite effects.
j2_syncPageLimit Default: 16 Range: 1-65536 Type: Dynamic Unit: Numeric
Set this tunable when j2_syncPageCount is set, and increase it if the effect of the j2_syncPageCount change does not reduce application response times. The recommended values are in the range of 1 to 8000. The j2_syncPageLimit setting has no effect if the j2_syncPageCount setting is 0.
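A hedged example of setting these (values chosen from the ranges suggested above; both tunables are dynamic):
# ioo -o j2_syncPageCount=512
# ioo -o j2_syncPageLimit=4096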
© 2009 IBM Corporation
IO pacing - causes the CPU to do something else after doing a specified amount of IO to a file
Turning it off (the default) improves backup times and thruput
Turning it on ensures that no process hogs CPU for IO, and ensures good keyboard response on systems with heavy IO workload
With N CPUs and N or more sequential IO streams, keyboard response can be sluggish
# chdev -l sys0 -a minpout=256 -a maxpout=513
Normally used to avoid HACMP's dead man switch
Old values of 33 and 24 significantly inhibit thruput but are reasonable for uniprocessors with non-cached disk
AIX 5.3 introduces IO pacing per file system via the mount command
mount -o minpout=256 -o maxpout=513 /myfs
AIX 6.1 uses minpout=4096 maxpout=8193
These values can also be used for earlier releases of AIX
IO Pacing
Otherwise IOs will be synchronous and slow down the application
Set minservers and maxservers via smitty aio
minservers is how many AIO kernel processes will be started at boot
Don't make this too high, it only saves time to startup the processes
maxservers is the maximum number of AIOs that can be processed at any one time
maxreqs is the maximum number of AIO requests that can be handled at one time and is a total for the system.
Run # pstat -a | grep aios | wc -l to see how many have been started - represents the max simultaneous AIOs since boot
Adjust maxservers until maxservers is greater than the number started
This is not applicable to AIO thru raw LVs if the fastpath attribute is enabled
IOs in this case are handed off to the interrupt handler
AIX 5.2 or later, min/maxservers value is per CPU; for 5.1 it is system wide
Typical values for AIX 5.3:
              Default     OLTP      SAP
minservers          1      200      400
maxservers         10      800     1200
maxreqs          4096    16384    16384
*
A reasonable maximum value for maxservers is the sum of the queue_depths for all the disks on the system, as we can’t submit many more than this number of IOs at any one instant.
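On AIX 5.3 these are attributes of the aio0 device; a hedged example (the values are the OLTP column above, not a recommendation) of inspecting and changing them:
# lsattr -El aio0
# chdev -l aio0 -a minservers=200 -a maxservers=800 -a maxreqs=16384 -P <- takes effect after reboot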
© 2009 IBM Corporation
Asynchronous IO tuning
New -A iostat option monitors AIO (or -P for POSIX AIO) at AIX 5.3 and 6.1
# iostat -A 1 1
System configuration: lcpu=4 drives=1 ent=0.50
aio: avgc avfc maxg maxf maxr avg-cpu: %user %sys %idle %iow physc %entc
25 6 29 10 4096 30.7 36.3 15.1 17.9 0.0 81.9
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 100.0 61572.0 484.0 8192 53380
avgc - Average global non-fastpath AIO request count per second for the specified interval
avfc - Average AIO fastpath request count per second for the specified interval for IOs to raw LVs (doesn’t include CIO fast path IOs)
maxg - Maximum non-fastpath AIO request count since the last time this value was fetched
maxf - Maximum fastpath request count since the last time this value was fetched
maxr - Maximum AIO requests allowed - the AIO device maxreqs attribute
If maxg gets close to maxr or maxservers then increase maxreqs or maxservers
© 2009 IBM Corporation
How to turn it on
aioo command option
Can’t be run until aio is loaded
Disabled by default
Application must support CIO
*
For the AIX 6.1 AIO/CIO fast path, this is controlled via the vmo restricted aio_fsfastpath parameter, and by default it’s on.
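On AIX 5.3 the JFS2/CIO fast path is the aio fsfastpath tunable managed with the aioo command; a hedged sketch (assuming the aio extension is loaded):
# aioo -a <- display the current aio tunables
# aioo -o fsfastpath=1 <- enable the fast path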
© 2009 IBM Corporation
AIX 6.1 AIO tuning
No longer store AIO tuning parameters in the ODM
The aio kernel extensions are loaded at system boot
No aio servers are started automatically at boot
They are started when AIO requests are made
AIO servers go away with no activity per aio_server_inactivity and posix_aio_server_inactivity vmo parameters (default of 300 seconds)
The maximum number of AIO requests is limited per the aio_maxreqs and posix_aio_maxreqs vmo parameters (default of 65536)
Only maxservers and aio_maxreqs (or posix_aio_maxreqs) are left to tune
*
By default, the aio_fastpath (controlling IOs to raw LVs) and the aio_fsfastpath (controlling IOs to JFS2 CIO) are restricted parameters and shouldn’t be changed from the default. The default values use the fast paths.
© 2009 IBM Corporation
Multipath IO code submits IO to hdisk driver
SDD queues IOs and won’t submit more than queue_depth IOs to a hdisk
Disable this with # datapath set qdepth disable for heavy IO
SDDPCM does not queue IOs
Hdisk driver has in process and wait queues – in process queue contains up to queue_depth IOs
Hdisk driver submits IOs to adapter driver
Adapter driver has in process and wait queues – FC adapter in process queue contains up to num_cmd_elems IOs
Adapter driver uses DMA to do IOs
Adapter driver submits IOs to disk subsystem
List device attributes with # lsattr -EHl <device>
Attributes with a value of True for user_settable field can be changed
Sometimes you can change these via smit
Allowable values can be determined via:
# lsattr -Rl <device> -a <attribute>
Change attributes with a reboot and:
# chdev -l <device> -a attribute=<newvalue> -P
*
Attributes can sometimes be changed via smit. Usually the device must not be in use to make the change (including all its child devices) without a reboot (in which case the -P flag doesn’t need to be used).
© 2009 IBM Corporation
Queue depth tuning
Don’t increase queue depths beyond what the disk can handle!
IOs will be lost and will have to be retried, which reduces performance
FC adapters
max_xfer_size attribute controls a DMA memory area for data IO
At the default of 0x100000, the DMA area is 16 MB
At other allowable values (e.g. 0x200000) the DMA area is 128 MB
We often need to change this for heavy IO
lg_term_dma attribute controls a DMA memory area for control IO
The default is usually adequate
How to determine if queues are being filled
With SDD: # datapath query devstats and
# datapath query adaptstats
With SDDPCM: # pcmpath query devstats and # pcmpath query adaptstats
With iostat: # iostat -D for data since system boot and
# iostat -D <interval> <count> for interval data
With sar: # sar -d <interval> <count>
*
Purpose
The purpose of this document is to describe how IOs are queued with SDD, SDDPCM, the disk device driver and the adapter device driver, and to explain how these can be tuned to increase performance. This information is also useful for non-SDD or SDDPCM systems.
Where this stuff fits in the IO stack
Following is the IO stack from the application to the disk:
Application
hdisk device driver
adapter device driver
Disk
Note that even though the disk is attached to the adapter, the hdisk driver code is utilized before the adapter driver code. So this stack represents the order software comes into play over time as the IO traverses the stack.
Why do we need to simultaneously submit more than one IO to a disk?
This improves performance. And this would be performance from an application's point of view. This is especially important with disk subsystems where a virtual disk (or LUN) is backed by multiple physical disks. In such a situation, if we only could submit a single IO at a time, we'd find we get good IO service times, but very poor thruput. Submitting multiple IOs to a physical disk allows the disk to minimize actuator movement (using an "elevator" algorithm) and get more IOPS than is possible by submitting one IO at a time. The elevator analogy is appropriate. How long would people be waiting to use an elevator if only one person at a time could get on it? In such a situation, we'd expect that people would wait quite a while to use the elevator (queueing time), but once they got on it, they'd get to their destination quickly (service time).
Where are IOs queued?
As IOs traverse the IO stack, AIX needs to keep track of them at each layer. So IOs are essentially queued at each layer in the IO stack. Generally, some number of in flight IOs may be issued at each layer and if the number of IO requests exceeds that number, they reside in a wait queue until the required resource becomes available. So there is essentially an "in process" queue and a "wait" queue at each layer (SDD and SDDPCM are a little more complicated).
At the file system layer, file system buffers limit the maximum number of in flight IOs for each file system. At the LVM layer, hdisk buffers limit the number of in flight IOs. At the SDD layer, IOs are queued if the dpo device's attribute, qdepth_enable, is set to yes (which it is by default). Some releases of SDD do not queue IOs so it depends on the release of SDD. SDDPCM on the other hand does not queue IOs before sending them to the disk device driver. The hdisks have a maximum number of in flight IOs that's specified by its queue_depth attribute. And FC adapters also have a maximum number of in flight IOs specified by num_cmd_elems. The disk subsystems themselves queue IOs and individual disks can accept multiple IO requests. Here are an ESS hdisk's attributes:
# lsattr -El hdisk33
location Location Label True
max_transfer 0x40000 N/A True
q_type simple Queuing TYPE True
qfull_dly 20 delay in seconds for SCSI TASK SET FULL True
queue_depth 20 Queue DEPTH True
reserve_policy single_path Reserve Policy True
rw_timeout 60 READ/WRITE time out value True
scbsy_dly 20 delay in seconds for SCSI BUSY True
scsi_id 0x620713 SCSI ID True
start_timeout 180 START unit time out value True
ww_name 0x5005076300cd96ab FC World Wide Name False
The default queue_depth is 20, but can be changed to as high as 256 for ESS, DS6000 and DS8000.
Here's a FC adapter's attributes:
# lsattr -El fcs0
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
The default queue depth (num_cmd_elems) for FC adapters is 200 but can be increased up to 2048.
Here's the dpo device's attributes for one release of SDD:
# lsattr -El dpo
persistent_resv yes Subsystem Supports Persistent Reserve Command False
qdepth_enable yes Queue Depth Control True
When qdepth_enable=yes, SDD will only submit queue_depth IOs to any underlying hdisk (where queue_depth here is the value for the underlying hdisk's queue_depth attribute). When qdepth_enable=no, SDD just passes on the IOs directly to the hdisk driver. So the difference is, if qdepth_enable=yes (the default), IOs exceeding the queue_depth will queue at SDD, and if qdepth_enable=no, then IOs exceeding the queue_depth will queue in the hdisk's wait queue. In other words, SDD with qdepth_enable=no and SDDPCM do not queue IOs and instead just pass them to the hdisk drivers. Note that at SDD 1.6, it's preferable to use the datapath command to change qdepth_enable, rather than using chdev, as then it's a dynamic change, e.g., datapath set qdepth disable will set it to no. Some releases of SDD don't include SDD queueing, and some do, and some releases don't show the qdepth_enable attribute. Either check the manual for your version of SDD or try the datapath command to see if it supports turning this feature off.
If you've used both SDD and SDDPCM, you'll remember that with SDD, each LUN has a corresponding vpath and an hdisk for each path to the vpath or LUN. And with SDDPCM, you just have one hdisk per LUN. Thus, with SDD one can submit queue_depth x # paths to a LUN, while with SDDPCM, one can only submit queue_depth IOs to the LUN. If you switch from SDD using 4 paths to SDDPCM, then you'd want to set the SDDPCM hdisks to 4x that of SDD hdisks for an equivalent effective queue depth. And migrating to SDDPCM is recommended as it's more strategic than SDD.
Both the hdisk and adapter drivers have "in process" and "wait" queues. Once the queue limit is reached, the IOs wait until an IO completes, freeing up a slot in the service queue. The in process queue is also sometimes referred to as the "service" queue.
It's worth mentioning, that many applications will not generate many in flight IOs, especially single threaded applications that don't use asynchronous IO. Applications that use asynchronous IO are likely to generate more in flight IOs.
What tools are available to monitor the queues?
For AIX, one can use iostat (at AIX 5.3 or later) and sar (5.1 or later) to monitor some of the queues. The iostat -D command generates output such as:
hdisk6 xfer: %tm_act bps tps bread bwrtn
4.7 2.2M 19.0 0.0 2.2M
read: rps avgserv minserv maxserv timeouts fails
0.0 0.0 0.0 0.0 0 0
write: wps avgserv minserv maxserv timeouts fails
19.0 38.9 1.1 190.2 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
15.0 0.0 83.7 0.0 0.0 136
Here, the avgwqsz is the average wait queue size, and avgsqsz is the average service queue size. The average time spent in the wait queue is avgtime. The sqfull value has changed from initially being a count of the times we've submitted an IO to a full queue, to now where it's the rate of IOs submitted to a full queue. The example report shows the prior case (a count of IOs submitted to a full queue), while newer releases typically show decimal fractions indicating a rate. It's nice that iostat -D separates reads and writes, as we would expect the IO service times to be different when we have a disk subsystem with cache. The most useful report for tuning is just running "iostat -D" which shows statistics since system boot, assuming the system is configured to continuously maintain disk IO history (run # lsattr -El sys0, or smitty chgsys to see if the iostat attribute is set to true).
The sar -d command changed at AIX 5.3, and generates output such as:
16:50:59 device %busy avque r+w/s Kbs/s avwait avserv
16:51:00 hdisk1 0 0.0 0 0 0.0 0.0
hdisk0 0 0.0 0 0 0.0 0.0
The avwait and avserv are the average times spent in the wait queue and service queue respectively. And avserv here would correspond to avgserv in the iostat output. The avque value changed; at AIX 5.3, it represents the average number of IOs in the wait queue, and prior to 5.3, it represents the average number of IOs in the service queue.
SDD provides the "datapath query devstats" and "datapath query adaptstats" commands to show hdisk and adapter queue statistics. SDDPCM similarly has "pcmpath query devstats" and "pcmpath query adaptstats". You can refer to the SDD manual for syntax, options and explanations of all the fields. Here's some devstats output for a single LUN:
Device #: 0
I/O: 29007501 3037679 1 0 40
SECTOR: 696124015 110460560 8 0 20480
Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
21499 10987037 18892010 1335598 809036
and here's some adaptstats output:
Adapter #: 0
I/O: 439690333 24726251 7 0 258
SECTOR: 109851534 960137182 608 0 108625
Here, we're mainly interested in the Maximum field which indicates the maximum number of IOs submitted to the device since system boot. Note that Maximum for devstats will not exceed queue_depth x # paths for SDD when qdepth_enable=yes. But Maximum for adaptstats can exceed num_cmd_elems as it represents the maximum number of IOs submitted to the adapter driver and includes IOs for both the service and wait queues. If, in this case, we have 2 paths and are using the default queue_depth of 20, then the 40 indicates we've filled the queue at least once and increasing queue_depth can help performance.
One can similarly monitor adapter queues and IOPS: for adapter IOPS, run # iostat -at <interval> <# of intervals> and for adapter queue information, run # iostat -aD, optionally with an interval and number of intervals.
How to tune
First, one should not indiscriminately just increase these values. It's possible to overload the disk subsystem or cause problems with device configuration at boot. So the approach of adding up the hdisk's queue_depths and using that to determine the num_cmd_elems isn't wise. Instead, it's better to use the maximum IOs to each device for tuning. When you increase the queue_depths and number of in flight IOs that are sent to the disk subsystem, the IO service times are likely to increase, but throughput will increase. If IO service times start approaching the disk timeout value, then you're submitting more IOs than the disk subsystem can handle. If you start seeing IO timeouts and errors in the error log indicating problems completing IOs, then this is the time to look for hardware problems or to make the pipe smaller.
A good general rule for tuning queue_depths, is that one can increase queue_depths until IO service times start exceeding 15 ms for small random reads or writes or one isn't filling the queues. Once IO service times start increasing, we've pushed the bottleneck from the AIX disk and adapter queues to the disk subsystem. Two approaches to tuning queue depth are 1) use your application and tune the queues from that or 2) use a test tool to see what the disk subsystem can handle and tune the queues from that based on what the disk subsystem can handle. The ndisk tool (part of the nstress package available on the internet at http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nstress) can be used to stress the disk subsystem to see what it can handle. The author's preference is to tune based on your application IO requirements, especially when the disk is shared with other servers.
Caches will affect your IO service times and testing results. Read cache hit rates typically increase the second time you run a test and affect repeatability of the results. Write cache helps performance until, and if, the write caches fill up at which time performance goes down, so longer running tests with high write rates can show a drop in performance over time. For read cache either prime the cache (preferably) or flush the cache. And for write caches, consider monitoring the cache to see if it fills up and run your tests long enough to see if the cache continues to fill up faster than the data can be off loaded to disk. Another issue when tuning and using shared disk subsystems, is that IO from the other servers will also affect repeatability.
Examining the devstats, if you see that for SDD, the Maximum field = queue_depth x # paths and qdepth_enable=yes, then this indicates that increasing the queue_depth for the hdisks may help performance - at least the IOs will queue on the disk subsystem rather than in AIX. It's reasonable to increase queue depths about 50% at a time.
Regarding the qdepth_enable parameter, the default is yes which essentially has SDD handling the IOs beyond queue_depth for the underlying hdisks. Setting it to no results in the hdisk device driver handling them in its wait queue. In other words, with qdepth_enable=yes, SDD handles the wait queue, otherwise the hdisk device driver handles the wait queue. There are error handling benefits to allowing SDD to handle these IOs, e.g., if using LVM mirroring across two ESSs. With heavy IO loads and a lot of queueing in SDD (when qdepth_enable=yes) it's more efficient to allow the hdisk device drivers to handle relatively shorter wait queues rather than SDD handling a very long wait queue by setting qdepth_enable=no. In other words, SDD's queue handling is single threaded where there's a thread for handling each hdisk's queue. So if error handling is of primary importance (e.g. when LVM mirroring across disk subsystems) then leave qdepth_enable=yes. Otherwise, setting qdepth_enable=no more efficiently handles the wait queues when they are long. Note that one should set the qdepth_enable parameter via the datapath command as it's a dynamic change that way (using chdev is not dynamic for this parameter).
If error handling is of concern, then it's also advisable, assuming the disk is SAN switch attached, to set the fscsi device attribute fc_err_recov to fast_fail rather than the default of delayed_fail. And if making that change, I also recommend changing the fscsi device dyntrk attribute to yes rather than the default of no. These attributes assume a SAN switch that supports this feature.
For the adapters, look at the adaptstats column. And set num_cmd_elems=Maximum or 200 whichever is greater. Unlike devstats with qdepth_enable=yes, Maximum for adaptstats can exceed num_cmd_elems.
Then after running your application during peak IO periods look at the statistics and tune again.
It's also reasonable to use the iostat -D command or sar -d to provide an indication if the queue_depths need to be increased.
The downside of setting queue depths too high, is that the disk subsystem won't be able to handle the IO requests in a timely fashion, and may even reject the IO or just ignore it. This can result in an IO time out, and IO error recovery code will be called. This isn't a desirable situation, as the CPU ends up doing more work to handle IOs than necessary. If the IO eventually fails, then this can lead to an application crash or worse.
Queue depths with VIO
When using VIO, one configures VSCSI adapters (for each virtual adapter in a VIOS, known as a vhost device, there will be a matching VSCSI adapter in a VIOC). These adapters have a fixed queue depth that varies depending on how many VSCSI LUNs are configured for the adapter. There are 512 command elements of which 2 are used by the adapter, 3 are reserved for each VSCSI LUN for error recovery and the rest are used for IO requests. Thus, with the default queue_depth of 3 for VSCSI LUNs, that allows for up to 85 LUNs to use an adapter: (512 - 2) / (3 + 3) = 85 rounding down. So if we need higher queue depths for the devices, then the number of LUNs per adapter is reduced. E.g., if we want to use a queue_depth of 25, that allows 510/28 = 18 LUNs. We can configure multiple VSCSI adapters to handle many LUNs with high queue depths, each requiring additional memory. One may have more than one VSCSI adapter on a VIOC connected to the same VIOS if you need more bandwidth.
Also, one should set the queue_depth attribute on the VIOC's hdisk to match that of the mapped hdisk's queue_depth on the VIOS.
For a formula, the maximum number of LUNs per virtual SCSI adapter (vhost on the VIOS or vscsi on the VIOC) is =INT(510/(Q+3)) where Q is the queue_depth of all the LUNs (assuming they are all the same).
Note that to change the queue_depth on an hdisk at the VIOS requires that we unmap the disk from the VIOC and remap it back.
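For example (the hdisk name and value are hypothetical), matching a VIOC hdisk's queue_depth to the backing device on the VIOS:
# chdev -l hdisk4 -a queue_depth=20 -P <- takes effect after reboot; omit -P if the disk is not in use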
A special note on the FC adapter max_xfer_size attribute
This attribute for the fscsi device, which controls the maximum IO size the adapter device driver will handle, also controls a memory area used by the adapter for data transfers. When the default value is used (max_xfer_size=0x100000) the memory area is 16 MB in size. When setting this attribute to any other allowable value (say 0x200000) then the memory area is 128 MB in size. This memory area is a DMA memory area, but it is different than the DMA memory area controlled by the lg_term_dma attribute (which is used for IO control). This applies to the 2 Gbps FC adapters. The default value for lg_term_dma of 0x800000 is usually adequate.
So for heavy IO and especially for large IOs (such as for backups) it's recommended to set max_xfer_size=0x200000.
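A hedged example of making those adapter changes (the adapter name and the num_cmd_elems value are illustrative; the adapter must be free, or use -P and reboot):
# chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=1024 -P
# shutdown -Fr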
The fcstat command can also be used to examine whether or not increasing num_cmd_elems or max_xfer_size could increase performance
# fcstat fcs0
FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0
No Command Resource Count: 0
This shows an example of an adapter that has sufficient values for num_cmd_elems and max_xfer_size. Non zero value would indicate a situation in which IOs queued at the adapter due to lack of resources, and increasing num_cmd_elems and max_xfer_size would be appropriate.
Note that changing max_xfer_size uses memory in the PCI Host Bridge chips attached to the PCI slots. The salesmanual, regarding the dual port 4 Gbps PCI-X FC adapter states that "If placed in a PCI-X slot rated as SDR compatible and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept at the default setting of 0x100000 (1 megabyte) when both ports are in use. The architecture of the DMA buffer for these slots does not accommodate larger max_xfer_size settings"
If there are too many FC adapters and too many LUNs attached to the adapter, this will lead to issues configuring the LUNs. Errors will look like:
LABEL: DMA_ERR
IDENTIFIER: 00530EA6
Sequence Number: 863
Machine Id: 00C3BCB04C00
Node Id: p595back
0000 0000 1000 0003
So if you get these errors, you'll need to change the max_xfer_size back to the default value.
© 2009 IBM Corporation
Adapter queue depth tuning
With SDDPCM use # pcmpath query adaptstats
With SDD use # datapath query adaptstats
Use the fcstat command to see counts of IOs delayed since system boot:
# fcstat fcs0
No DMA Resource Count: 4490 <- Increase max_xfer_size
No Adapter Elements Count: 105688 <- Increase num_cmd_elems
No Command Resource Count: 133 <- Increase num_cmd_elems

# pcmpath query adaptstats
I/O: 1105909 78 3 0 200
SECTOR: 8845752 0 24 0 88
*
These statistics are since system boot (or when the devices were configured whichever is later). The adaptstats Maximum field indicates the maximum number of IOs queued since system boot so if this value equals the num_cmd_elems, it indicates that increasing that parameter can help performance.
© 2009 IBM Corporation
SAN attached FC adapters
Allows dynamic SAN changes
Set FC fabric event error recovery fast_fail to yes if the switches support it
Switch fails IOs immediately without timeout if a path goes away
Switches without support result in errors in the error log
# lsattr -El fscsi0
fc_err_recov delayed_fail FC Fabric Event Error RECOVERY Policy True
scsi_id 0x1c0d00 Adapter SCSI ID False
sw_fc_class 3 FC Class for Fabric True
# chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
# shutdown -Fr
*
The dynamic tracking is mainly important for allowing dynamic SAN changes such as moving a cable from one switch port to another, changing the SAN domain ID, adding a cascading switch, etc. Otherwise, SAN changes such as these will require an outage.
The fast_fail setting is especially important if using LVM mirroring across disk subsystems; otherwise, it may take LVM several minutes to determine that a disk subsystem has failed.
© 2009 IBM Corporation
System configuration: lcpu=4 drives=2 paths=2 vdisks=0
hdisk0 xfer: %tm_act bps tps bread bwrtn
0.0 0.0 0.0 0.0 0.0
read: rps avgserv minserv maxserv timeouts fails
0.2 8.3 0.4 19.5 0 0
write: wps avgserv minserv maxserv timeouts fails
0.1 1.1 0.4 2.5 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 3.0 1020
AIX 5.3 iostat options
Options to summarize IO queueing
sqfull = number of times the hdisk driver’s service queue was full
At AIX 6.1 this is changed to a rate, number of IOPS submitted to a full queue
avgserv = average IO service time
avgsqsz = average service queue size
This can't exceed queue_depth for the disk
avgwqsz = average wait queue size
Waiting to be sent to the disk
If avgwqsz is often > 0, then increase queue_depth
If sqfull in the first report is high, then increase queue_depth
© 2009 IBM Corporation
# sar -d 1 2
System configuration: lcpu=2 drives=1 ent=0.30
10:01:37 device %busy avque r+w/s Kbs/s avwait avserv
10:01:38 hdisk0 100 36.1 363 46153 51.1 8.3
10:01:39 hdisk0 99 38.1 350 44105 58.0 8.5
Average hdisk0 99 37.1 356 45129 54.6 8.4
AIX 5.3 new sar output
sar -d formerly reported zeros for avwait and avserv
avque definition changes in AIX 5.3
avque - average IOs in the wait queue
Waiting to get sent to the disk (the disk's queue is full)
Values > 0 indicate increasing queue_depth may help performance
Used to mean number of IOs in the disk queue
avwait - time (ms) spent waiting in the wait queue
*
VIO
The VIO Server (VIOS) uses multi-path IO code for the attached disk subsystems
The VIO client (VIOC) always uses SCSI MPIO if accessing storage thru two VIOSs
In this case only entire LUNs are served to the VIOC
At AIX 5.3 TL5 and VIO 1.3, hdisk queue depths are user settable attributes (up to 256)
Prior to these levels VIOC hdisk queue_depth=3
Set the queue_depth at the VIOC to that at the VIOS for the LUN
*
The vscsi adapter queue is fixed at 512 command elements, two of which are used by the adapter, and 3 are reserved for error recovery for every device using the adapter. The remaining queue slots are used for IOs. Thus, one would not want too many virtual disks managed by a virtual scsi adapter, depending on the queue_depth for the hdisks. At the VIOS, one runs:
$ lsdev -dev hdisk2 -attr
attribute value description user_settable
algorithm fail_over Algorithm True
dist_tw_width 50 Distributed Error Sample Time True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid 00cddeec68220f190000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH False
reserve_policy single_path Reserve Policy True
size_in_mb 36400 Size in Megabytes False
to display the device attributes.
[Chart: IOPS vs. IO service time for a 15,000 RPM disk]