Learn critical cleanup techniques for jobs that are ended by end-job commands.
Author's note: This article is one of a series of articles to memorialize Simon Coulter, an outstanding IBM i expert who contributed much to the prosperity of the IBM i platform. I thank Gwen Hanna, Simon's partner, for providing the biography of Simon at the end of the first article of "Simon's Solutions."
As one of the most robust operating systems and application systems in the world, IBM i provides numerous mechanisms to automatically reclaim resources occupied by resource containers, such as activation groups (ACTGRPs) or jobs that end unexpectedly. However, it's still possible for resources to be left as orphans by a job that ends unexpectedly. Examples include a pointer-based mutex created by the job with the keep-valid option set to true, or a permanent MI object created without inserting its addressability into a context object (library). Worse, an unexpected end-of-job might damage the integrity of application data.
As you might have seen, one of the most common reasons for an unexpected end-of-job is an end-job operation initiated by an operator. One of Simon's posts in December 2011 covered most of the cleanup techniques that can be used to monitor for end-job operations. These techniques are based on different rationales and mechanisms, and studying them could help you understand the platform better.
In December 2011, Michael Smith started a discussion in the midrange-l mailing list on techniques that can be used to monitor for end-job operations. Simon listed a range of methods that can be used to achieve the goal in this post, Re: Condition Handlers and CL:
The only ways I know to monitor for ENDJOB/ENDSBS/ENDSYS are:
1/ Signal handler - see the signal handler API section of the Unix API reference manual. Specifically, the sigaction() function to monitor for the SIGTERM signal.
2/ Invocation exit programs - see the C runtime and MI library reference manuals. atexit() is used to specify a procedure that will be invoked during normal end. atiexit() is used specify a program that will be invoked during abnormal end.
3/ ILE APIs - see the Activation Group and Control Flow APIs in the ILE CEE API manual. Specifically, the CEERTX and CEE4RAGE APIs. The handler registered with CEERTX is invoked for ENDJOB *IMMED, and is invoked after the controlled delay time elapses for ENDJOB *CNTRLD. It also gets invoked for any exit except a return.
4/ Scope messages - see the message handler API manual. Specifically, the QMHSNDSM API (as mentioned in other threads). If you send the scope message to *EXT you can monitor for ENDJOB etc.
Since ending a job does not raise an exception in that job you cannot use the condition handler APIs.
Regards,
Simon Coulter.
Before Getting Started
At the MI level, an active IBM i job has an associated MI process that is identified by a Process Control Space MI object. The lifetime of an MI process consists of three phases: the initiation phase, the problem phase, and the termination phase. Each phase is initiated by invoking the corresponding phase program. It might surprise you a little that an MI process does the work it is expected to do in the problem phase. The system pointers to the initiation phase, problem phase, and termination phase programs of an MI process can be materialized via the Materialize Process Attributes (MATPRATR) MI instruction with option hex 19, hex 1B, and hex
An MI process could end for the following internal termination reasons:
- Return from first invocation in problem phase
- Return from first invocation in initiation phase and no problem phase program specified
- Terminate Thread MI instruction issued against the initial thread by a thread in the process
- Terminate Process MI instruction issued by a thread within the process
- An unhandled signal with a default signal-handling action of terminate the process or terminate the request was delivered to the process
- Exception was not handled by the initial thread in the process
And an MI process could end for the following external termination reasons:
- Terminate Process MI instruction issued explicitly to the process by a thread in another process
- Terminate Thread MI instruction issued explicitly to the initial thread of the process by a thread in another process
The Terminate Process and Terminate Thread MI instructions are blocked and cannot be issued from a
- Number of seconds specified by the DELAY parameter when OPTION(*CNTRLD) is specified
- Number of seconds specified by the QENDJOBLMT system value when OPTION(*IMMED) is specified
Details about monitoring for end job operations via the asynchronous signal support are discussed in section Monitor for End Job Operations by Catching the SIGTERM Asynchronous Signal.
It is important to keep in mind that once the signal handler for SIGTERM of a job receives control (or in other words, the SIGTERM asynchronous signal is caught successfully), the MI process associated with the job isn't regarded as being terminated externally; therefore, cleanup mechanisms targeting abnormal end-of-job (e.g., invocation exit procedures registered by CEERTX in order to monitor for abnormal end-of-invocations) will not be fired.
Besides monitoring for the SIGTERM signal, another way to know that a job is currently being ended by an end job operation is by retrieving the process status indicators via the MATPRATR MI instruction with option hex
For a job being ended by an end job command, to know the value of the OPTION parameter of an end job command issued to the job, you can check the 1-byte Controlled Cancel field in the Language/Utility Work Area (LUWA) at offset hex 10. A value of '0' indicates OPTION(*IMMED), and a value of '1' indicates OPTION(*CNTRLD), or in other words, a controlled cancellation of the current job is being performed. I discussed the LUWA in my article "How Much Do You Know About Job Switches?" The system built-in _LUWRKA returns the space pointer addressing the current job's LUWA. The Controlled Cancel field is referred to as End Status in CL and Work Management APIs documentation and can also be retrieved via the Retrieve Job Attributes (RTVJOBA) command with the ENDSTS parameter or the Retrieve Job Information (QUSRJOBI) API with format JOBI0600. For your convenience, examples of retrieving the Controlled Cancel field (End Status) via the three methods are listed below.
     h dftactgrp(*no)

      /if defined(HAVE_I5TOOLKIT)
      /copy mih-pgmexec
     d luwa            ds                  likeds(luwa_t)
     d                                     based(spp)
      /else
      * Prototype of system BIF _LUWRKA
     d luwrka          pr              *   extproc('_LUWRKA')
     d luwa            ds                  qualified
     d                                     based(spp)
     d     ctrl_cancel
      /endif
     d spp             s               *

      /free
           spp = luwrka();
           dsply 'End Status' '' luwa.ctrl_cancel;
           // To change the current value of the Controlled Cancel
           // field, type a character and then press Enter.

           *inlr = *on;
      /end-free
DCL        VAR(&CNTRLD) TYPE(*CHAR) LEN(1)
RTVJOBA    ENDSTS(&CNTRLD)
SNDPGMMSG  MSG('Controlled Cancel' *BCAT &CNTRLD)
      * eoj02.rpgle
     d jobi0600        ds
     d     ctrl_cancel
     d len             s             10i 0 inz(328)

     c                   call      'QUSRJOBI'
     c                   parm                    jobi0600
     c                   parm                    len
     c                   parm      'JOBI0600'    fmt_name          8
     c                   parm      '*'           job_name         26
     c                   parm      *blanks       int_id           16
     c     'End Status'  dsply                   ctrl_cancel
     c                   seton                                        lr
Be aware that the LUWA approach can not only retrieve but also change the Controlled Cancel field (End Status). For example, you can change the Controlled Cancel field via t179.rpgle and check it via eoj01.clp or eoj02.rpgle, like the following:
4 > call t179
    DSPLY  End Status    0
? x
4 > call eoj01
    Controlled Cancel x
4 > call eoj02
    DSPLY  End Status    x
? *N
You can also change the Controlled Cancel field to '1' using t179.rpgle and then check the output of DSPJOB OPTION(*STSA) (or WRKJOB OPTION(*STSA)). The Controlled end requested field would look like the following:
Controlled end requested . . . . . . . . . :   YES
Monitor for End Job Operations by Catching the SIGTERM Asynchronous Signal
Signals are a POSIX-defined Interprocess Communication (IPC) mechanism. POSIX.1 defines a "signal" as a mechanism by which a process may be notified of, or affected by, an event occurring in the system. Please refer to Using Signal APIs for detailed information about Signal concepts and Signal Management on IBM i. As mentioned above, when users issue an end job command to an IBM i job, a SIGTERM asynchronous signal is delivered to the target job. For this reason, monitoring for the SIGTERM signal with a signal handler registered for SIGTERM via the sigaction() API is a workable way to monitor for end job operations.
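For readers who think in C, the registration step described above can be sketched in portable POSIX C. This is a minimal sketch rather than IBM i specific code: the names install_sigterm_handler, termination_requested, and on_sigterm are hypothetical, and the SIGTERM value comes from the platform's own <signal.h> instead of being assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* used only by the checks in the usage note */
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t can be written safely from a handler. */
static volatile sig_atomic_t bye = 0;

/* Signal handler (hypothetical name): only records that SIGTERM arrived. */
static void on_sigterm(int sig)
{
    (void)sig;
    bye = 1;
}

/* Register on_sigterm() for SIGTERM. Returns 0 on success, -1 on error. */
int install_sigterm_handler(void)
{
    struct sigaction act;

    memset(&act, 0, sizeof act);
    act.sa_handler = on_sigterm;
    act.sa_flags = 0;
    sigemptyset(&act.sa_mask);
    sigaddset(&act.sa_mask, SIGTERM);  /* hold further SIGTERMs while the handler runs */
    return sigaction(SIGTERM, &act, NULL);
}

/* Returns nonzero once a SIGTERM has been caught. */
int termination_requested(void)
{
    return bye != 0;
}
```

A caller would typically check termination_requested() between units of work. In a quick single-threaded test, raise(SIGTERM) can stand in for an end-job operation: after assert(install_sigterm_handler() == 0) and raise(SIGTERM), termination_requested() is expected to return nonzero.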
RPG Example of Monitoring an Asynchronous SIGTERM Signal
Signal APIs are ILE procedures exported by service program (*SRVPGM) QSYS/QP0SSRV1 and therefore can be utilized by all ILE High-Level Languages (HLLs). The following is an ILE RPG example, eoj03.rpgle, that demonstrates the steps of catching the SIGTERM asynchronous signal.
      * @file eoj03.rpgle
     h dftactgrp(*no)

      * Prototype of sigemptyset() (Initialize and empty signal set)
     d sigemptyset     pr            10i 0 extproc('sigemptyset')
     d     set
      * Prototype of sigaddset() (Add signal to signal set)
     d sigaddset       pr            10i 0 extproc('sigaddset')
     d     set
     d     sig                       10i 0 value
      * Signal action structure
     d sigaction_t     ds                  qualified
     d     sa_handler                  *   procptr
     d     sa_mask
     d     sa_flags                  10i 0
     d     sa_sigaction                *   procptr
      * Prototype of sigaction() (Examine and change signal action)
     d sigaction       pr            10i 0 extproc('sigaction')
     d     sig                       10i 0 value
     d     act                             likeds(sigaction_t)
     d                                     const
     d     oact                            likeds(sigaction_t)
     d                                     options(*omit)
     d SIGTERM         c                   6

     d act             ds                  likeds(sigaction_t)
     d r               s             10i 0 inz(0)
     d bye             s               n   inz(*off)
      * Prototype of sleep() (Suspend processing for interval of time)
     d sleep           pr            10u 0 extproc('sleep')
     d     seconds                   10u 0 value
      * Prototype of signal handler
     d oops            pr
     d     sig                       10i 0 value

      /free
           // Register signal handler for SIGTERM
           act.sa_flags = 0;
           act.sa_handler = %paddr(oops);
           r = sigemptyset(act.sa_mask);
           r = sigaddset(act.sa_mask : SIGTERM);
           r = sigaction(SIGTERM : act : *omit);

           // Check @var bye periodically
           dow not bye;
               // Do my work
               sleep(5);
               dsply 'Sleepy ... zzz' 'QSYSOPR';
           enddo;

           // Do necessary cleanup work
           dsply 'Doing cleanup' 'QSYSOPR';

           dsply 'See you :p' 'QSYSOPR';
           *inlr = *on;
      /end-free

      * signal handler
     p oops            b
     d oops            pi
     d     sig                       10i 0 value
      /free
           dsply 'Inside signal handler' 'QSYSOPR';
           // Set on @var bye
           bye = *on;
      /end-free
     p                 e
Submit a batch job that runs program EOJ03:
SBMJOB CMD(CALL EOJ03)
Then end the submitted job by issuing the ENDJOB *IMMED command. Messages sent to the QSYSOPR message queue might look like the following:
DSPLY  Sleepy ... zzz
DSPLY  Sleepy ... zzz
DSPLY  Sleepy ... zzz
DSPLY  Inside signal handler
Job 528005/LJL/A was ended by user LJL.
DSPLY  Sleepy ... zzz
DSPLY  Doing cleanup
DSPLY  See you :p
Pass Control Back from the Signal Handler to the Main Procedure/Program Directly
In the above RPG example, a global static indicator variable bye is used to communicate between the main procedure and the signal handler. The signal handler oops notifies the main procedure that a SIGTERM signal is caught by setting on bye. The main procedure checks the value of bye periodically. Once bye is set on, the main procedure completes necessary end-of-job cleanup and quits. An alternative method is to directly pass control back to the main procedure from inside the signal handler via a non-local goto:
- Save the stack environment in the main procedure (or a procedure that is expected to retain control after a SIGTERM signal is caught) via the setjmp() or sigsetjmp() API.
- When a SIGTERM signal is caught successfully inside the signal handler registered for SIGTERM, restore the saved stack environment via the longjmp() or siglongjmp() APIs so that the main procedure receives control and runs from the HLL statement where setjmp() is invoked.
All four of these functions are declared in <setjmp.h> (QSYSINC/H.SETJMP). The longjmp() function is an ILE procedure exported by *SRVPGM QSYS/QC2UTIL1. The setjmp() function is implemented as a system built-in (__setjmp) instead of an ILE procedure, and the RPG prototype for __setjmp and structure jmp_buf_t can be found in mih-undoc.rpgleinc.
Note that, in addition to saving the current stack environment, sigsetjmp() can optionally save the current signal mask (the set of blocked signals of the calling thread). The stack environment and signal mask saved by sigsetjmp() can subsequently be restored by siglongjmp().
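The mask-saving behavior of sigsetjmp() matters when the jump is made from inside a signal handler, because the caught signal is normally blocked while its handler runs. The following is a minimal sketch in portable POSIX C, not IBM i specific code; the names run_interruptibly, simulate_end_job, sigterm_is_blocked, and cleanup_point are hypothetical. It saves the signal mask with sigsetjmp(env, 1), jumps out of the SIGTERM handler with siglongjmp(), and lets the caller confirm that SIGTERM is not left blocked afterward.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Saved stack environment plus signal mask (hypothetical name). */
static sigjmp_buf cleanup_point;

/* SIGTERM handler: jump straight back to the saved environment. */
static void on_sigterm(int sig)
{
    (void)sig;
    siglongjmp(cleanup_point, 1);
}

/* Stand-in for an end-job operation, used for demonstration only. */
void simulate_end_job(void)
{
    raise(SIGTERM);
}

/* Returns 1 if SIGTERM is currently blocked in the calling thread, else 0. */
int sigterm_is_blocked(void)
{
    sigset_t current;

    sigemptyset(&current);
    sigprocmask(SIG_BLOCK, NULL, &current);  /* NULL set: query only */
    return sigismember(&current, SIGTERM);
}

/*
 * Runs work() with a SIGTERM handler installed.
 * Returns 0 if work() finished, 1 if SIGTERM interrupted it, -1 on error.
 */
int run_interruptibly(void (*work)(void))
{
    struct sigaction act, old;
    int interrupted;

    assert(work != NULL);
    memset(&act, 0, sizeof act);
    act.sa_handler = on_sigterm;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGTERM, &act, &old) != 0)
        return -1;

    /* Second argument 1: also save the current signal mask. */
    if (sigsetjmp(cleanup_point, 1) == 0) {
        work();
        interrupted = 0;
    } else {
        interrupted = 1;  /* arrived here via siglongjmp() */
    }

    /* Remove the handler so it can never jump into a finished call. */
    sigaction(SIGTERM, &old, NULL);
    return interrupted;
}
```

Calling run_interruptibly(simulate_end_job) is expected to return 1, and sigterm_is_blocked() is expected to return 0 afterward because siglongjmp() restored the mask that was in effect when sigsetjmp() was called.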
Benefits of the setjmp/longjmp (or sigsetjmp/siglongjmp) approach include these:
- End-of-job cleanup work can be started without any delay once a SIGTERM signal is caught.
- The signal handler can pass control to any invocation entry (that has been recorded via setjmp() in the saved stack environment) currently available on the invocation stack (aka call stack)—for example, the caller program of the program that registered the signal handler.
The following are parts of the source code of two RPG examples that show how to pass control back to a program's caller from a signal handler via the setjmp/longjmp approach. Program EOJ05 (eoj05.rpgle) saves the stack environment in a structure of type jmp_buf_t and then calls EOJ06 (eoj06.rpgle), passing the jmp_buf_t structure. EOJ06 registers a signal handler for SIGTERM. When a SIGTERM signal is caught, the signal handler restores the stack environment via the jmp_buf_t structure passed to EOJ06 and lets EOJ05 receive control again. Note that setjmp() (sigsetjmp()) returns 0 when it returns directly. When it returns as a result of a longjmp() (siglongjmp()) call, it returns the value argument of longjmp(), or 1 if that value argument is 0.
EOJ05
     d pos             ds                  likeds(jmp_buf_t)
      * EOJ06
     d eoj06           pr                  extpgm('EOJ06')
     d     pos                             likeds(jmp_buf_t)

      /free
           // Save current stack environment
           if setjmp(pos) = -1;
               // SIGTERM is caught
               dsply 'Stack environment restored' 'QSYSOPR';
           else;
               eoj06(pos);
           endif;

           // Do end-of-job cleanup
           dsply 'Doing cleanup' 'QSYSOPR';
           *inlr = *on;
      /end-free
EOJ06
     d i_main          pr                  extpgm('EOJ06')
     d     pos                             likeds(jmp_buf_t)

     d i_main          pi
     d     pos                             likeds(jmp_buf_t)

      /free
           // Register signal handler for SIGTERM
           act.sa_flags = 0;
           act.sa_handler = %paddr(oops);
           r = sigemptyset(act.sa_mask);
           r = sigaddset(act.sa_mask : SIGTERM);
           r = sigaction(SIGTERM : act : *omit);

           sleep(600);
           *inlr = *on;
      /end-free

      * signal handler
     p oops            b
     d oops            pi
     d     sig                       10i 0 value
      /free
           dsply 'Inside signal handler' 'QSYSOPR';
           // Jump to the saved stack environment
           longjmp(pos : -1);
      /end-free
     p                 e
Submit a batch job that runs EOJ05 via a Submit Job (SBMJOB) command and then end it:
SBMJOB JOB(ABC) CMD(CALL EOJ05)
ENDJOB ABC *IMMED
The output in the QSYSOPR message queue might look like the following:
DSPLY  Inside signal handler
Job 528278/LJL/A was ended by user LJL.
DSPLY  Stack environment restored
DSPLY  Doing cleanup
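The return-value rule for setjmp() noted earlier (the value passed to longjmp(), or 1 when that value is 0) can be checked with a few lines of standard C. This is only an illustrative sketch: the function name setjmp_result_class and the marker value 99 are hypothetical. A switch is used because the C standard restricts the contexts in which a setjmp() call may appear.

```c
#include <assert.h>
#include <setjmp.h>

/*
 * Calls setjmp(), immediately jumps back with longjmp(env, value), and
 * classifies what setjmp() returned the second time:
 *   1  -> setjmp() returned 1
 *  -1  -> setjmp() returned -1
 *  99  -> setjmp() returned some other nonzero value
 */
int setjmp_result_class(int value)
{
    jmp_buf env;

    /* The standard allows setjmp() as the whole controlling expression
       of a switch, so the result is classified here directly. */
    switch (setjmp(env)) {
    case 0:
        longjmp(env, value);  /* does not return */
    case 1:
        return 1;
    case -1:
        return -1;
    default:
        return 99;
    }
}
```

With this sketch, setjmp_result_class(-1) is expected to return -1, which mirrors the `setjmp(pos) = -1` test in EOJ05, while setjmp_result_class(0) is expected to return 1 because longjmp() never makes setjmp() return 0.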
Finally, if you have critical work that you don't want interrupted by an asynchronous SIGTERM signal, you can use the sigprocmask() (Examine and change blocked signals) API to block the SIGTERM signal temporarily and unblock it after you get the work done. The following example code is extracted from eoj04.rpgle.
      * Prototype of sigprocmask() (Examine and change blocked signals)
     d sigprocmask     pr            10i 0 extproc('sigprocmask')
     d     how                       10i 0 value
     d     new_set
     d     old_set

      /free
           // Register signal handler for SIGTERM
           act.sa_flags = 0;
           act.sa_handler = %paddr(oops);
           r = sigemptyset(act.sa_mask);
           r = sigaddset(act.sa_mask : SIGTERM);
           r = sigaction(SIGTERM : act : *omit);

           // Save current stack environment
           if setjmp(pos) = -1;
               // SIGTERM is caught
               dsply 'Stack environment restored' 'QSYSOPR';
           else;
               // Doing my work; please don't disturb me!
               // Block SIGTERM
               sigprocmask(SIG_BLOCK : act.sa_mask : *OMIT);
               sleep(30);
               // Unblock SIGTERM
               sigprocmask(SIG_UNBLOCK : act.sa_mask : *OMIT);
           endif;

           // Do end-of-job cleanup
           dsply 'Doing cleanup' 'QSYSOPR';
           *inlr = *on;
      /end-free
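The block/unblock pattern can also be sketched in portable POSIX C to show what happens to a SIGTERM that arrives during the protected work: it stays pending and is delivered when the signal is unblocked. This is a minimal single-threaded sketch, not IBM i specific code; the names run_protected, simulate_end_job, and sigterm_caught are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* used only by the checks in the usage note */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught = 0;

/* SIGTERM handler: only records that the signal was delivered. */
static void on_sigterm(int sig)
{
    (void)sig;
    caught = 1;
}

/* Stand-in for an end-job operation, used for demonstration only. */
void simulate_end_job(void)
{
    raise(SIGTERM);
}

/* Returns 1 once the handler has run, else 0. */
int sigterm_caught(void)
{
    return caught != 0;
}

/*
 * Runs critical() with SIGTERM blocked.
 * Returns 1 if a SIGTERM arrived during critical() and was held back
 * until the end of the critical section, 0 if none arrived, -1 on error.
 */
int run_protected(void (*critical)(void))
{
    struct sigaction act;
    sigset_t block, old, pending;
    int held_back;

    memset(&act, 0, sizeof act);
    act.sa_handler = on_sigterm;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGTERM, &act, NULL) != 0)
        return -1;

    /* Block SIGTERM and remember the previous mask. */
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    critical();  /* work that must not be interrupted */

    /* A SIGTERM sent meanwhile is pending and the handler has not run. */
    sigemptyset(&pending);
    sigpending(&pending);
    held_back = (sigismember(&pending, SIGTERM) == 1) && (caught == 0);

    /* Restore the previous mask; a pending SIGTERM is delivered here. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    return held_back;
}
```

One design choice differs from the eoj04.rpgle extract: the sketch restores the previously saved mask with SIG_SETMASK instead of calling SIG_UNBLOCK, so a SIGTERM that the caller had already blocked stays blocked. Calling run_protected(simulate_end_job) is expected to return 1, with sigterm_caught() returning 1 afterward.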
Monitor for End-of-ACTGRP via ACTGRP Exit Programs/Procedures
The end of an IBM i job causes all activation groups (ACTGRPs) within the job to end. Monitoring for the end of an ACTGRP is therefore a good way to ensure that resources scoped to the ACTGRP are released properly. Simon mentioned three methods of registering ACTGRP exit programs/procedures in the post: the Register Activation Group Exit Procedure (CEE4RAGE/CEE4RAGE2) APIs and the C library routines atexit() and atiexit() exported by *SRVPGM QSYS/QC2UTIL1.
- atexit() registers procedures that are called when an ACTGRP ends normally.
- atiexit() registers programs that are called when an ACTGRP ends abnormally.
- CEE4RAGE/CEE4RAGE2 registers procedures that are called when an ACTGRP ends (either normally or abnormally).
The only difference between CEE4RAGE and CEE4RAGE2 is the first parameter of the registered ACTGRP exit procedure: a 4-byte ACTGRP mark for CEE4RAGE and an 8-byte ACTGRP mark for CEE4RAGE2.
The atiexit() function is documented in the ILE C/C++ for AS/400 MI Library and is declared in <mipgexec.h>, aka source member QSYSINC/H.MIPGEXEC. Prototypes of CEE4RAGE/CEE4RAGE2 and related structures are defined in <leenv.h> (QSYSINC/H.LEENV) and <letype.h> (QSYSINC/H.LETYPE). The prototype of the atexit() function can be found in <stdlib.h> (QSYSINC/H.STDLIB).
Activation group exit procedures, registered by CEE4RAGE, are called in the reverse order of their registration. If a procedure fails, subsequent procedures will not be called. The ILE C/C++ Run-Time Library Functions manual recommends that, for portability, you should use the atexit() function to register a maximum of 32 functions. The functions are processed in a last-in, first-out (LIFO) order. The atiexit() function can register only one ACTGRP exit program. If there are multiple calls to atiexit(), the last call to atiexit() takes effect.
It is important to note that none of these three registration methods can be used in the default ACTGRPs!
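The last-in, first-out order of atexit() handlers can be observed with a small experiment in portable POSIX C. This is a sketch under stated assumptions, not IBM i code: it relies on standard C behavior at normal process exit rather than on ACTGRP end, it uses fork() and a pipe only so that the parent can observe what the child's handlers did, and the names atexit_call_order, report, first, second, and third are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe used by the child's exit handlers. */
static int report_fd = -1;

/* Each handler reports one character so the parent can see the order. */
static void report(char c)
{
    if (write(report_fd, &c, 1) != 1)
        _exit(2);
}
static void first(void)  { report('1'); }
static void second(void) { report('2'); }
static void third(void)  { report('3'); }

/*
 * Forks a child that registers first, second, third (in that order)
 * and then exits normally. Stores the order in which the handlers ran
 * in out as a string. Returns 0 on success, -1 on error.
 */
int atexit_call_order(char *out, size_t outsz)
{
    int fds[2];
    int status;
    pid_t pid;
    ssize_t n;
    size_t len = 0;

    assert(out != NULL);
    if (outsz < 2 || pipe(fds) != 0)
        return -1;

    fflush(NULL);  /* avoid duplicating buffered output in the child */
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: register the handlers, then end normally. */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(first) != 0 || atexit(second) != 0 || atexit(third) != 0)
            _exit(1);
        exit(0);
    }

    /* Parent: collect what the child's handlers wrote. */
    close(fds[1]);
    while (len + 1 < outsz &&
           (n = read(fds[0], out + len, outsz - 1 - len)) > 0)
        len += (size_t)n;
    out[len] = '\0';
    close(fds[0]);

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With an 8-byte buffer, atexit_call_order() is expected to return 0 and fill the buffer with "321": the handler registered last runs first.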
In the following RPG example, eoj07.rpgle, all three methods to register ACTGRP exits mentioned above are utilized so that you can observe the behaviors of these cleanup mechanisms that target end-of-ACTGRP. Call EOJ07 with the one-character parameter 'a' or 'A' to end the ACTGRP abnormally before EOJ07 returns. (Program EOJ07 uses a named ACTGRP called EOJ07; see the control specification of eoj07.rpgle.) The Abnormal End (CEE4ABN) ILE CEE API is used in eoj07.rpgle to end the ACTGRP abnormally. To end ACTGRP EOJ07 normally after calling EOJ07, you can use the RCLACTGRP command: RCLACTGRP EOJ07 OPTION(*NORMAL). To end ACTGRP EOJ07 abnormally after program EOJ07 returns, issue the RCLACTGRP command with OPTION(*ABNORMAL): RCLACTGRP EOJ07 OPTION(*ABNORMAL).
      * @file eoj07.rpgle
     h dftactgrp(*no) bnddir('QC2LE') actgrp('EOJ07')
      * Prototype of system BIF _RSLVSP2
     d rslvsp2         pr                  extproc('_RSLVSP2')
     d     syp                         *   procptr
     d     opt
      * Prototype of atiexit()
     d atiexit         pr            10i 0 extproc('atiexit')
     d     exit_pgm                    *   value procptr
     d     exit_arg
      * Prototype of atexit()
     d atexit          pr            10i 0 extproc('atexit')
     d     exit_proc                   *   value procptr
      * Prototype of the Abnormal End (CEE4ABN) API
     d cee4abn         pr
     d     raise_TI                  10i 0
     d     cel_rc_mod                10i 0
     d     user_rc                   10i 0
      * Prototype of the Register Activation Group Exit
      * Procedure (CEE4RAGE) API
     d cee4rage        pr
     d     exit_proc                   *   procptr
     d     fc

     d eoj08           s               *   procptr
     d rt              s
     d arg             s
     d r               s             10i 0
     d a               s             10i 0 inz(0)
     d b               s             10i 0 inz(0)
     d c               s             10i 0 inz(0)

      * Prototype of the ACTGRP exit procedure registered by atexit()
     d agp_exit        pr
      * Prototype of the ACTGRP exit procedure registered by CEE4RAGE
     d agp_exit2       pr
     d     ag_mark                   10u 0
     d     reason                    10u 0
     d     result_code               10u 0
     d     user_rc                   10u 0
      *
      * Prototype of me -- EOJ07
      *
      * Set @var flag to A to end the current ACTGRP abnormally.
      *
     d i_main          pr                  extpgm('EOJ07')
     d     flag

     d exit_proc       s               *   procptr
     d fc              s

     d i_main          pi
     d     flag

      /free
           // Resolve a SYSPTR to ACTGRP exit program EOJ08
           rt = *allx'00';
           %subst(rt:1:2) = x'0201';
           %subst(rt:3:30) = *blanks;
           %subst(rt:3:10) = 'EOJ08';
           rslvsp2(eoj08 : rt);

           // Register ACTGRP exit procedures/program
           r = atiexit(eoj08 : arg);
           r = atexit(%paddr(agp_exit));
           exit_proc = %paddr(agp_exit2);
           cee4rage(exit_proc : fc);

           if flag = 'a' or flag = 'A';
               // End the current ACTGRP abnormally
               cee4abn(a : b : c);
           endif;
           *inlr = *on;
      /end-free

     p agp_exit        b
     d agp_exit        pi
      /free
           dsply 'AGP-EXIT - atexit' 'QSYSOPR';
      /end-free
     p                 e

     p agp_exit2       b
     d agp_exit2       pi
     d     ag_mark                   10u 0
     d     reason                    10u 0
     d     result_code               10u 0
     d     user_rc                   10u 0

      /free
           dsply 'AGP-EXIT - CEE4RAGE' 'QSYSOPR';
      /end-free
     p                 e
The source of the ACTGRP exit program EOJ08, eoj08.clp, is the following. It simply sends the character string that is passed to it to the QSYSOPR message queue.
PGM        PARM(&MSG)
DCL        VAR(&MSG) TYPE(*CHAR) LEN(20)
SNDMSG     MSG(&MSG) TOMSGQ(*SYSOPR)
Call EOJ07 three times as shown:
CALL EOJ07 ABNORMALLY_END_ACTGRP   /* (1) */
CALL EOJ07 NORMALLY_RETURN         /* (2) */
RCLACTGRP EOJ07 OPTION(*NORMAL)    /* End ACTGRP EOJ07 normally */
CALL EOJ07 NORMALLY_RETURN         /* (3) */
RCLACTGRP EOJ07 OPTION(*ABNORMAL)  /* End ACTGRP EOJ07 abnormally */
You might get the following messages in the QSYSOPR message queue:
From . . . :   LJL   12/09/15   10:07:30
atiexit                      /* (1) */
DSPLY  AGP-EXIT - CEE4RAGE   /* (1) */
DSPLY  AGP-EXIT - atexit     /* (2) */
DSPLY  AGP-EXIT - CEE4RAGE   /* (2) */
From . . . :   LJL   12/09/15   10:07:35
atiexit                      /* (3) */
DSPLY  AGP-EXIT - CEE4RAGE   /* (3) */
Appendix A: RPG Example of Retrieving Problem Phase Program of the Current Job
ILE RPG program t180.rpgle retrieves a system pointer addressing the problem phase program of the current job via the Materialize Process Attributes (MATPRATR) MI instruction with option hex 1B and then materializes the context name and object name of it via the Materialize System Object (MATSOBJ) MI instruction.
     h dftactgrp(*no)

      /copy mih-prcthd
      /copy mih-mchobs

     d pratr           ds                  likeds(matpratr_ptr_tmpl_t)
     d objatr          ds                  likeds(matsobj_tmpl_t)
     d opt             s
     d ppp             s

      /free
           matpratr1(pratr : opt);
           matsobj(objatr : pratr.ptr);
           ppp = %trim(objatr.ctx_name) + '/' + %trim(objatr.obj_name);
           dsply 'Problem phase program' '' ppp;
           *inlr = *on;
      /end-free
To retrieve the problem phase program of a common batch job, you can submit a job to the QBATCH subsystem that calls program T180:
SBMJOB CMD(CALL PGM(T180)) JOBQ(QBATCH)
The message sent to the QSYSOPR message queue by T180 might look like this:
DSPLY  Problem phase program    QSYS/QCMD
Or suppose you submit a job to the QBATCH subsystem with routing data 'QCMD38' that calls program T180:
SBMJOB CMD(CALL PGM(T180)) JOBQ(QBATCH) RTGDTA(QCMD38)
The message sent to the QSYSOPR message queue by T180 might look like this:
DSPLY  Problem phase program    QSYS/QCL