Understanding the Components of Performance
The time to service a transaction, or any other unit of work, has several components: input/output, CPU, disk, memory, activity level, seizes/locks, network, and errors. Think of each of these components as a server. Requests for service flow through the system from server to server, and some servers can handle multiple requests at one time. Any one or several of these components can be the bottleneck and the source of the problem for some transactions or units of work. The key is to eliminate as suspects those components you know are not causing the problem and then isolate the specific component(s) contributing to the performance problem.
Focusing on the Problem
Sometimes you hear "the system is slow" or "my job is taking longer to run than it used to" or "there's a slowdown on system xyz." Before digging into the problem, it's a good idea to consider what might be happening.
- Does the problem appear to be affecting only batch, only interactive, a specific subset of jobs, or all jobs on the system?
- Does the problem happen only at certain times of the day, week, month?
- Does it happen all the time or intermittently?
- Did the problem just start happening? If not, when did it start? How long has it been occurring?
- What, if anything, has changed on the system since the problem(s) started? Do the changes coincide with the beginning of poor performance?
- Is the system partitioned? If yes, there are different questions to ask: How much Commercial Processing Workload (CPW) is allocated to this partition? How much Interactive CPW is allocated to this partition? Are the processors shared or dedicated? There is some overhead associated with shared processors.
System Evaluation and Areas of Consideration
Once you have some answers, there are a few things to check and consider:
- What is the system model? What is the processor feature? What is the interactive feature? This will determine total CPW and interactive CPW available for this model. If interactive response time is a problem, the system might be hitting its interactive threshold. This is indicated by message CPI1479 in the history log.
- What is the OS/400 release? This may determine whether the problem(s) might be release-related.
- What is the database group PTFs level? This may determine whether the system is back-level or current with database PTFs.
- What is the cumulative PTF package level? This may determine whether the system is back-level or current in OS/400 and Licensed Internal Code (LIC) PTFs.
- Adjusting system values such as QMCHPOOL, QMAXACTLVL, QPFRADJ, QACTJOB, and QTOTJOB can improve performance and resolve certain performance problems.
QMCHPOOL represents the size of the machine storage pool. If the value is set too small, system performance can be severely inhibited. For new systems, the recommended initial size of QMCHPOOL is at least two to three times the reserved size of the pool, depending on faulting in the pool.
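As a sketch of that sizing rule, assuming you already know the reserved size of the machine pool (the function name and the 2.5 default are illustrative; the text only gives the two-to-three-times range):

```python
def qmchpool_initial_size(reserved_kb, factor=2.5):
    """Guideline starting size for QMCHPOOL: two to three times the
    reserved size of the machine pool. Lean toward the high end of the
    range if the pool shows faulting."""
    if not 2 <= factor <= 3:
        raise ValueError("guideline factor is between 2 and 3")
    return int(reserved_kb * factor)

# Example: a machine pool with 40,000 KB reserved
print(qmchpool_initial_size(40000, factor=2))  # 80000
```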
QMAXACTLVL represents the number of jobs/threads that can simultaneously compete for memory and CPU. QMAXACTLVL should be set to *NOMAX. Then, use the activity levels within the Work with Shared Storage Pools (WRKSHRPOOL) command to control the activity level of each system pool.
QPFRADJ dynamically adjusts (approximately every 20 seconds) memory and activity levels for all shared pools on the system. Pressing F11 on the WRKSHRPOOL display exposes QPFRADJ's tuning parameters, which let administrators establish priorities for adjusting shared pools, set minimum and maximum pool sizes, and set pool-faulting and job/thread levels. Although setting these parameters constrains QPFRADJ's freedom to adjust the memory of shared pools, QPFRADJ is still worthwhile because it allows the system to benefit from expert cache.
QACTJOB represents the number of active jobs for which auxiliary storage is allocated during an IPL. The value should be set about 10% greater than the number of active jobs shown by the Work with Active Jobs (WRKACTJOB) command during the busiest time of day.
QTOTJOB controls the amount of auxiliary storage that is allocated during an IPL. All jobs are included in this value (i.e., active jobs, jobs on job queues, and jobs with spooled files associated with them). A recommended setting is 10 to 20 percent greater than the total number of jobs shown on the WRKSYSSTS display.
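The two sizing rules above can be sketched as follows (function names and default margins are illustrative, not part of OS/400):

```python
def recommend_qactjob(peak_active_jobs, margin=0.10):
    """QACTJOB guideline: about 10% above the peak active-job count
    observed on WRKACTJOB during the busiest time of day."""
    return int(round(peak_active_jobs * (1 + margin)))

def recommend_qtotjob(total_jobs, margin=0.15):
    """QTOTJOB guideline: 10-20% above the total job count shown on
    the WRKSYSSTS display (active + queued + jobs with spooled files)."""
    return int(round(total_jobs * (1 + margin)))

print(recommend_qactjob(200))        # 220
print(recommend_qtotjob(1000, 0.20)) # 1200
```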
QADLACTJ and QADLTOTJ correspond to QACTJOB and QTOTJOB, respectively. They control the additional number of jobs for which auxiliary storage is allocated once the total for either system value has been reached. Allocation is performed as soon as the storage is needed, so how these values are set can significantly impact performance. Keep these values set at a reasonable number. For example, say QACTJOB is set at 100, QADLACTJ is set at 10, and 99 jobs are active on the system. If two more jobs are started, bringing the total active jobs to 101, the system allocates additional auxiliary storage for 10 more active jobs.
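The worked example above can be checked with a small sketch of the allocation arithmetic (the function name is hypothetical):

```python
def additional_job_allocations(qactjob, qadlactj, active_jobs):
    """How many times the system must extend the active-job storage,
    given base QACTJOB and increment QADLACTJ. Each extension adds
    room for QADLACTJ more jobs."""
    if active_jobs <= qactjob:
        return 0
    overflow = active_jobs - qactjob
    return -(-overflow // qadlactj)  # ceiling division

# Example from the text: QACTJOB=100, QADLACTJ=10, 101 active jobs
# -> one extension, giving capacity for 110 active jobs.
print(additional_job_allocations(100, 10, 101))  # 1
```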
QDYNPTYADJ controls whether the priority of interactive jobs is dynamically adjusted to maintain high performance of batch job processing on AS/400e server model hardware. This adjustment capability is effective only on systems that are rated for both interactive and noninteractive throughput and have Dynamic Priority Scheduling enabled.
QDYNPTYSCD allows you to turn on/off the dynamic priority scheduler. The task scheduler uses this value to determine the algorithm for scheduling jobs running on the system.
- Check the QEZDEBUG output queue. Are there any dumps there? If yes, do they have the same error?
- Check the QEZJOBLOG output queue. Are there a number of large QPJOBLOG files? Are there a lot of error messages in them? Is CL logging turned on?
- Is CPU running too high with poor response times? If yes, this will help focus on CPU as a possible bottleneck.
- Is CPU barely running at all with poor response times? If yes, then eliminate CPU as a bottleneck and think about other possible resource constraints, such as disk, IOP, seizes/locks, and/or network.
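The CPU triage in the last two questions can be sketched as a simple helper. The 90% and 20% thresholds are illustrative assumptions; the text says only "too high" and "barely running at all":

```python
def suspect_bottlenecks(cpu_busy_pct, response_poor):
    """First-pass triage: high CPU with poor response implicates CPU;
    low CPU with poor response points at other resources."""
    if not response_poor:
        return []
    if cpu_busy_pct >= 90:   # assumed reading of "running too high"
        return ["CPU"]
    if cpu_busy_pct <= 20:   # assumed reading of "barely running at all"
        return ["disk", "IOP", "seizes/locks", "network"]
    return ["needs more data"]

print(suspect_bottlenecks(95, True))  # ['CPU']
```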
- Is disk utilization above the 40% guideline? Are there enough disk arms? As higher-capacity disk devices for iSeries systems become available, fewer arms are needed to satisfy capacity requirements. This can lead to configuring too few disk arms for the workload demands placed on them, and a lack of disk arms can bottleneck the processor's performance. To avoid such a bottleneck, a minimum number of disk arms is needed for optimum performance on each processor. This number is independent of the quantity of drives needed to meet the desired storage capacity. (IBM provides an online disk arm calculator for this purpose.)
- Is disk response time too high? The suggested guideline, on average, is that disk response times be below 10 ms (.010).
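A minimal sketch combining the two disk guidelines above (40% busy, 10 ms average response), assuming the measurements are available as plain numbers:

```python
DISK_BUSY_GUIDELINE_PCT = 40.0   # from the text: 40% utilization guideline
DISK_RESP_GUIDELINE_SEC = 0.010  # from the text: 10 ms (.010) average response

def disk_within_guidelines(busy_pct, avg_resp_seconds):
    """True if a disk unit is inside both guideline thresholds."""
    return (busy_pct <= DISK_BUSY_GUIDELINE_PCT
            and avg_resp_seconds < DISK_RESP_GUIDELINE_SEC)

print(disk_within_guidelines(30.0, 0.008))  # True
print(disk_within_guidelines(45.0, 0.008))  # False
```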
- Is machine pool faulting too high? The guideline for the machine pool is fewer than 10 faults per second. If QPFRADJ is turned on, pool adjustment is automatic. However, you can use the WRKSHRPOOL command with F11 to set the minimum and maximum sizes of each pool. (Save a copy of the original screen for future reference before making changes.) Use this guideline for the minimum size of the machine pool: (2 * the reserved size of the machine pool) / the total amount of memory on the system, expressed as a percentage. Give the machine pool a maximum size of 100%. Set the maximums for the other pools according to a reasonable understanding of need. Be careful setting the minimum size for the interactive pool: if you make it too low (and the priorities are equal), memory will oscillate between batch and interactive.
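The machine pool minimum formula above works out as follows (the function name is illustrative):

```python
def machine_pool_min_pct(reserved_kb, total_memory_kb):
    """Guideline minimum for the machine pool, as a percentage of total
    main storage: (2 * reserved size of the machine pool) / total memory."""
    return 100.0 * (2 * reserved_kb) / total_memory_kb

# Example: 50,000 KB reserved on a 4,000,000 KB (~4 GB) system
print(machine_pool_min_pct(50000, 4000000))  # 2.5
```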
- Where is the majority of the workload running? (For instance, are all jobs running out of *BASE, or are different subsystems using different pools? Are the pools private or shared?) If the majority of the workload is running out of *BASE, there is the possibility of many jobs competing for the same resources on a busy system. Consider separating batch work from other work on the system.
- File sizes: How big are the files used most often? (Use DSPFD to find out.)
- Number of deleted records: Large files (gigabytes) with large numbers of deleted records (almost half or over half are deleted records) could be a problem. If an application is doing a full table scan over these files, over half the records being read aren't even useful. These files would be good candidates for reorganizing (RGZPFM).
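One way to screen DSPFD output for reorganization candidates; the thresholds here are illustrative readings of the text's "large files" and "almost half or over half" rules of thumb, not fixed limits:

```python
def rgzpfm_candidate(total_records, deleted_records,
                     min_deleted_ratio=0.4, min_total=1_000_000):
    """Flag files that are both large and mostly deleted records,
    making them candidates for RGZPFM."""
    if total_records < min_total:
        return False
    return deleted_records / total_records >= min_deleted_ratio

print(rgzpfm_candidate(2_000_000, 1_000_000))  # True
print(rgzpfm_candidate(2_000_000, 100_000))    # False
```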
- Is expert cache turned on? Expert cache works by minimizing the effect of synchronous DASD I/Os on a job. Best candidates for performance improvement are jobs that are most affected by synchronous DASD I/Os. Once started, expert cache monitors the DASD I/O activity and logical reference pattern for each database file that is accessed within a shared storage pool. Then, it dynamically adjusts the size and type of I/Os for these files to maximize the use of main storage and minimize the number of DASD I/Os. Reducing the number of DASD I/Os, particularly synchronous I/Os, can result in quicker processing. For interactive jobs, this generally means better response time. For batch jobs, it can mean completing current batch work in less time or doing additional work within an existing batch window.
Performance Data Collection
Once you have some answers and know some key things about the system, the next step is to consider the collection of performance data. Keep the following in mind when considering data collection:
- What type of data needs to be collected? System level data includes pools, disk, CPU, and communication lines. Application level data includes specific jobs, programs, procedures, and subsystems.
- What level of data needs to be collected? General, specific, Performance Explorer (PEX), sample/trace data, other.
A number of tools allow interactive review of system performance:
- WRKSYSACT is the quickest way to analyze a problem situation. It shows only the jobs that have been active during the last observation interval. And it uses fewer system resources than the other commands discussed here. (Note: Performance Tools Licensed Program Product - 5722PT1 is required.)
- WRKSYSACT's View 4 lists allocated and deallocated storage assigned to a job/task. If a runaway job is long-running, you may be able to identify it here. The Storage field shows storage usage information that can be sorted by allocated storage, deallocated storage, and net storage to help detect jobs that are using large amounts of storage. Jobs where allocated storage is increasing dramatically are candidates for further investigation.
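The sort described above can be sketched over exported job records; the tuple layout here is an assumption for illustration, not the actual WRKSYSACT outfile format:

```python
def top_storage_growers(jobs, n=3):
    """Sort job records by net storage (allocated - deallocated),
    descending, to surface runaway-job candidates. Each record is
    (job_name, allocated, deallocated) in any consistent unit."""
    return sorted(jobs, key=lambda j: j[1] - j[2], reverse=True)[:n]

sample = [("PAYROLL", 500, 480), ("RUNAWAY", 9000, 100), ("QBATCH", 50, 50)]
print(top_storage_growers(sample, n=1))  # [('RUNAWAY', 9000, 100)]
```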
- WRKSYSSTS shows the number of jobs in the system, disk usage in the system ASP, and the number of addresses used. All memory pools, database and non-database faults, and activity-level changes can be monitored at a glance. If only one pool has a high non-database faulting rate, find out which subsystem uses that pool and monitor that subsystem with the WRKACTJOB command to find out what jobs are active.
- WRKACTJOB is used to examine the CPU used and disk I/O operations performed by each currently active job. Sort this display by CPU % to find the jobs consuming the most CPU. Information about response time, run priority, and the pool in which each job runs is also displayed. The I/O figures shown are averages over the observation period.
- WRKDSKSTS shows performance and status information about disk units on the system. Pay attention to column "% busy." Use it as an indicator to look at the System or Component report. Do not use these values for capacity planning.
- DSPPFRDTA can be used to analyze either real-time data or data previously collected. (Note: Performance Tools Licensed Program Product - 5722PT1 required.)
Performance Tools Available
With the variety of applications that can run on the iSeries, system performance problems don't always yield easy solutions. Of the many tools available, it's sometimes difficult to determine which to use. Here is a list of tools and their recommended order of usage. Use this list as a guideline to help you get started.
1. Performance monitor (STRPFRMON prior to V4R4) or Collection Services (STRPFRTRC V4R5 or higher). Use Performance Tools LPP (5722PT1) to run reports over collected performance data.
- System Report
- Component Report
- Job Summary Report
- Transaction and Transition Report (from trace data only)
2. WRKSYSACT--Display the data or put it into an outfile.
3. WRKACTJOB
4. WRKJOB
5. iDoctor tool set--Within the Job Watcher, you can watch a specific job or set of jobs and/or do a system-wide watch to gather statistics over all the jobs on the system. Within the PEX Analyzer, you can collect various types of information:
- PEX Stats Flat is one of the best tools to get a system-wide view of the most active programs, so it's good to use if you don't know where to start. It shows what programs and/or MI instructions are using the most CPU, the call count for each program, and disk I/O activity. Based on this information, you can identify which programs should be investigated further. Try to run it during a heavy/peak workload.
- PEX Stats Hier is one of the best tools to see program activity in a particular job. It shows call/return flow of programs within a job, call count for each program, CPU usage, and disk I/O activity for each program, as well as CPU used by each job in the call.
- PEX Profile identifies high-level language statement hot spots (high CPU consumption) in programs or service programs. It gathers CPU usage information over a selected set of programs or service programs.
- PEX Task Switch Trace identifies a number of run-time situations, regardless of what job or task they occur in. It answers questions such as why is a job waiting, who/what is it waiting for, who/what woke it up, and what were they doing up until now? Note: Because of the large amount of data collected, this should be run only for very short periods of time on larger systems and not over all jobs.
- How useful PEX Stats Hier is in a given situation depends on what the earlier, system-wide collections find.
The Tools to Get Started
You should now have a better indication of what the problem might be and where to place your focus. You can feel confident knowing that many performance tools are available to assist in your analysis, from a high-level overview to a very low-level, detailed view.
Sandi Chromey is a Senior IT/Architect Specialist with IBM Global Services. She provides performance support to both internal and external customers within IBM Global Services. Sandi has been with IBM for 22 years, 11 of which have been in IT. She also has experience in iSeries development and component testing.
MC Press Online