It sounded simple at first. Several people at one of my clients remote sites were complaining to the support center that their jobs were taking longer to completesometimes hours longerthan they ever did before. This was a busy system, with 30,000 jobs submitted every day. When I began to look into the problem, I had trouble zeroing in on just those problem jobs. The job names and user IDs werent unique enough to easily distinguish them from jobs submitted at other sites. To make things more difficult, the only performance data available only covered a single week, precluding any chance for a comparison with performance data from normal times.
This article describes the process I went through to identify the reason why some jobs were running longer than they should. The article will also introduce you to the utility I used, GENPFRLOG, to make the analysis easier. Ill also describe how I converted the old RPG III-based utility into a browser-based utility that uses Java servlets.
I wrote the GENPFRLOG utility to scan the history log and build a database that I could analyze more easily than the raw history log. Its also useful when performance data isnt available and facilitates analysis by putting data into one place, in contrast to the previous method of storing performance and history log data. I soon realized that there was a lot of data to plow through, but another benefit to this technique was that I could build an index over the file to optimize my research.
Well, This Isnt Working...
In this kind of situation, the first thing I typically do is gather background information. In this case, I spoke with those users who originally reported a problem. Primarily, the users were concerned about their entered status orders report, which they frequently submitted to batch throughout the day. There didnt seem to be a pattern. Mostly, the jobs behaved as expected, but at odd intervals, the jobs would take longer than normal to complete. Sometimes these delays would occur in the wee hours of the morning, or sometimes in the middle of the day.
Next, I spoke with some of the IT employees who were familiar with the application. They told me there was a sequence of jobs that were scheduled to run at regular intervals, and there were also jobs that were submitted on request. Some of the jobs were
specific to a remote site, but some processed data for all the sites. They all worked together and could impact the data for the report in question.
To complicate matters, another server would periodically submit jobs remotely to the system, sometimes up to a thousand jobs at once. My ears perked up when I heard that, and I made a mental note to follow up on that issue.
Since I had performance data from the performance monitor available for the past week, I first tried to use the performance tools reports. These reports didnt tell me anything I could use. In fact, it seemed that the job usually ran in just a few minutes. I needed another way to look into the problem. At this point, I still wasnt even sure I had identified the correct instance of the job that was a problem. There certainly didnt seem to be a correlation between job runtimes and what the users at the remote site were experiencing.
I decided to use my GENPFRLOG utility. Composed of a command, a CL program, and an RPG program, GENPFRLOG uses entries in the system history log, much like a utility described by Gene Gaunt in his article SYSOP: Using the System Log (in the March 1999 isuue of MC). I took a different approach, however; I found that by using the job start message (CPF1124) along with the job completion message (CPF1164), I could determine the entire life history of the job, from submission to initiation to completion.
My utility reads through QHST* files for the given date and time range. Records for the job start messages are written to a table called PERFLOGP. When the program finds a job completion message, it retrieves the record from the job start and updates the remaining fields.
After I ran the utility, I used SQL to analyze the data collected. I sorted the table by total duration, from job submission to completion, in descending order. After a couple of passes, I filtered out jobs that werent relevant and saw some of the jobs that had been identified earlier. The interesting thing about these jobs was that the total runtimes werent really that long, but the time between job submissions and job initiation was often significant. It looked like these jobs spent a lot of time in the job queue. But why?
Using SQL again (how did I ever get along without it?), I generated a list of jobs that went through the same subsystem as the jobs in question and were submitted during the same time period or a little before. Interestingly enough, I didnt see the jobs submitted by the remote server, as I expected. I did see some other jobs, though: query jobs.
I ran the GENPFRLOG utility again, going back as far as I could so that I could compare that earlier history log data to the current weeks data. Ive found that history log data is often kept for a month or more, compared to a typical week or so for performance data. I set the selection criteria to select the same jobs identified earlier and found dramatically different times for the duration spent in the job queue. Something had changed to make these jobs wait longer before beginning to execution.
It looked like the query jobs were implicated, but they ran through a different job queue. I was able to put a rough time line together and find that at certain times, up to three query jobs were running concurrently while the jobs in question were in the job queue. A quick look at the subsystem description confirmed my suspicions: The maximum activity level was set to three.
When I spoke with the people in production, they confirmed that the query jobs listed were indeed important. But the entered status orders job was more important. The production folks and I decided to try creating a new job queue and adding another job queue entry to the subsystem description. Overall resource utilization wasnt too bad; it appeared that the system could handle additional work. We changed the job submission to use the new job queue, bumped up the activity level, and sat back and watched.
Aha!
No News Is Good News
After a week had passed, the support center hadnt received any calls regarding these jobs. This was a nice change. To confirm our success, I again ran my utility and reviewed the total time spent in the job queue compared to data from before our change. The data confirmed what the silent telephones were telling us. The average time from submission to completion dropped from 29 minutes and 34 seconds to 2 minutes and 33 seconds, with only a few seconds in the job queue. We looked at the performance data to confirm that we hadnt put too much of a load on the system. We also reviewed times in job queues for other jobs to ensure that we hadnt caused other jobs to wait longer to begin executing. It looked like the project was complete.
You might think that I could have just looked at the job queues to see if jobs were waiting. I did, but there were usually so many jobs going through the queues, and so quickly, that it was difficult to keep track of it all. Plus, it wasnt easy to tell in advance just when the jobs would be coming through, so it was a little like being a policeman sitting in front of a bank and hoping that a bank robber would come by.
The Same, But Different
When I contacted the editors of Midrange Computing about this article, they told me We would really like this to be a PC GUI interface. I thought, why not? Id been playing with Java a bit but didnt really have a good project to work with, which for me is usually the most effective way to learn something new.
I developed a series of HTML documents and Java servlets to organize the different utility functions, as well as some new functionality, into a browser-based interface. This series also integrated my legacy utility, since I didnt want to recreate that part of the utility. I took the opportunity to augment my utility to make my analysis easier, providing the additional ability to view all resulting data, run a canned query, set a filter, set sort order, enter SQL on the fly, and clear data.
There was a steep learning curve. The little bit of Java experience I had was all on a Windows 95 platform, primarily with console applications. I had to learn how to set up and use the AS/400 Toolbox for Java, Qshell on the AS/400, servlets with the WebSphere application server, plus adapt the examples I saw to work with servlets. Most AS/400 Java programming examples that Ive been able to find are for console applications and applets. But I stuck with it and learned some Java along the way.
Now, I have an AS/400 command interface I can use, along with interactive SQL. I also have a browser interface that has performance benefits from running as a noninteractive job, and which will let me easily run SQL statements on systems where the SQL/400 feature wasnt purchased. In Figure 1, you can see what the data table looks like.
A key technique to reusing the existing legacy utility is the CommandCall class from the Toolbox. I found it easier to use commands than to call a program, but in an environment where many calls are made, this may not be an appropriate strategy. An example is shown here:
AS400 sys = new AS400(localhost,userid,password);
CommandCall myCommand = new CommandCall(sys);
myCommand.run(cmd.toString());
You can use the techniques and utility Ive described in this article to investigate and find hard data for cases in which time spent in job queues is a problem. For complete instructions on installing and using the utility, refer to the readme.txt file in the zip file
youll download from MC at www. midrangecomputing.com/mc. The utility and techniques described here are also useful if performance data isnt available, but history logs are. The history log data is well suited to frequency distribution and trend analysis. In
addition, try some of the techniques described here to put a new face on some of your own utilities, enhancing their usability and functionality at the same time.
REFERENCES AND RELATED MATERIALS
AS/400e Work Management Version 4 (SC41-5306-03, CD-ROM SK3T-0118-04)
Beginning Java. Ivor Horton. Chicago, Illinois: Wrox Press, 1997
Instant Java Servlets. Phil Hanna. Berkeley, California: Osborne/McGraw-Hill, 2000
Joe Plutas Install Java 1.1/400 Web page: www.zappie.net/Java/Javatorium/installi.htm
SYSOP: Using the System Log, Gene Gaunt, MC, March 1999
Figure 1: You can use a browser to view your performance data table.
LATEST COMMENTS
MC Press Online