5.16. Checking when a job ends abnormally
5.16.1. Job manager exit code
If a submitted job ends abnormally, the cause can be identified from the job manager's exit code (PJM CODE).
To output the job manager exit code, specify the -s or -S option when executing pjsub.
An example of job submission is as follows.
[_LNlogin]$ pjsub -s ./sample.sh
When the job ends, job statistics are output with the following file name:
| File name | Description |
|---|---|
| Job name.Job ID.stats | This file contains job statistical information. |
The PJM code list is as follows.
| PJM CODE | Meaning |
|---|---|
| 0 | Successful job completion |
| 1 | CANCEL by the pjdel command issued by the user |
| 2 | REJECT based on job acceptance judgment. The pjsub command ends with an error |
| 3 | Execution rejected by the job manager exit function. The job was not run |
| 4 | HOLD by the pjhold command issued by the user |
| 6 | CANCEL by step job dependency expression. The job was not run |
| 7 | CANCEL by deadline forced specification |
| 8 | CANCEL by job manager exit function. The job was not run |
| 9 | EXIT during job reconstruction because re-execution is not allowed |
| 11 | Job execution timeout due to violation of elapsed time limit |
| 12 | Forced termination due to excessive memory usage |
| 13 | Forced termination due to excessive disk usage |
| 14 | Waiting for job status transition of bulk job (currently not output) |
| 16 | Termination due to inaccessibility of the current directory or the standard input / standard output / standard error output file |
| 18 | A job that had run beyond its minimum executable time was terminated because a subsequent job started or a deadline schedule started. In the former case the REASON item is “ANOTHER JOB STARTED”; in the latter it is “DEADLINE SCHEDULE STARTED” |
| 20 | Node down |
| 21 | Shell execution failure |
| 22 | ICC error |
| 23 | Termination by OOM Killer operation |
| 25 | HA failure |
| 26 | Error in prologue and epilogue processing |
| 27 | Job resource management exit processing error |
| 28 | Abnormal job execution environment |
| 29 | The specified job execution environment is invalid |
| 30 | Suspended due to suspend or resume processing failure |
| 100 | Job manager internal error |
| 120 | Job scheduler internal error |
| 140 | Job resource management internal error |
| 160 | Tofu library internal error. The job was not run |
| 180 | Tiered storage internal error |
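The lookup described above could be scripted roughly as follows. This is a minimal sketch, not an official tool: the function names `pjm_code_from_stats` and `pjm_code_meaning` are hypothetical, and the sketch assumes the statistics file contains a line of the form `PJM CODE : <number>`, which should be checked against an actual `.stats` file. Only a few codes from the list are mapped; the wording of each description is abbreviated.

```shell
# Extract the PJM code from a job statistics file.
# ASSUMPTION: the file has a line like "PJM CODE   : 11"; verify on a real file.
pjm_code_from_stats() {
  sed -n 's/^[[:space:]]*PJM CODE[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Map a few PJM codes from the list to short descriptions.
pjm_code_meaning() {
  case "$1" in
    0)  echo "Successful job completion" ;;
    1)  echo "Canceled by pjdel" ;;
    11) echo "Elapsed time limit exceeded" ;;
    12) echo "Excessive memory usage" ;;
    13) echo "Excessive disk usage" ;;
    20) echo "Node down" ;;
    23) echo "Terminated by OOM Killer" ;;
    100|120|140|160|180) echo "Internal error" ;;
    *)  echo "See the PJM code list" ;;
  esac
}
```

For example, `pjm_code_meaning "$(pjm_code_from_stats sample.sh.12345.stats)"` would print the description for the code recorded in that (hypothetical) file.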
5.16.2. Message output during job execution
| Error message *1 | Job output file *2 | Change of mpiexec output target *3 | Reference manual |
|---|---|---|---|
| PLE nnnn plexec | Job name.Job ID.err | No | Job Operation Software: Command Reference, End-user’s Guide |
| PJM nnnn xxxxxx | Job name.Job ID.err | No | Job Operation Software: Command Reference, End-user’s Guide |
| mpi:: | Job name.Job ID.err | Yes | MPI User’s Guide |
| jwennnn | Job name.Job ID.err | Yes | Development Studio Fortran/C/C++ Runtime Messages |
Note
*1: Indicates the character string included in the message. Classify messages by this string.
*2: Indicates the message output target when no option changing the mpiexec output target is specified.
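The classification by identifying string can be sketched as a small shell function. The function name `manual_for_message` is hypothetical, and the patterns assume that `nnnn` stands for four decimal digits; adjust them if actual messages differ.

```shell
# Suggest which manual to consult, based on the identifying string in a message.
# ASSUMPTION: "nnnn" in the table means exactly four decimal digits.
manual_for_message() {
  case "$1" in
    *"mpi::"*)
      echo "MPI User's Guide" ;;
    *jwe[0-9][0-9][0-9][0-9]*)
      echo "Development Studio Fortran/C/C++ Runtime Messages" ;;
    *"PLE "[0-9][0-9][0-9][0-9]*|*"PJM "[0-9][0-9][0-9][0-9]*)
      echo "Job Operation Software manuals" ;;
    *)
      echo "unknown" ;;
  esac
}
```

It could be applied to each line of the `.err` file, for example with `while IFS= read -r line; do manual_for_message "$line"; done < file`.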
5.16.3. PJM 0079 ERROR REASON list
This section lists the REASON values for PJM 0079 error messages raised by the operation resource check (GATE CHECK).
When a message of the form “[ERR.] PJM 0079 pjsub Arbitrary character string.” is displayed, look up its REASON in the tables below.
5.16.3.1. At job acceptance
A job may be rejected at acceptance time if GATE CHECK detects an error.
| Error code | REASON | Error type | Overview |
|---|---|---|---|
| Q01 | group is not active (ACC->QUE) | Group enable check error | The submitting group is disabled due to expiration. |
| Q02 | user is not active (ACC->QUE) | User enable check error | The submitting user is disabled due to expiration. |
| Q03 | Node is too few (ACC->QUE) | Lower limit check error [Number of nodes] | The number of nodes specified at job submission (-L node) is smaller than the permitted value. |
| Q04 | Elapse limit is too short (ACC->QUE) | Lower limit check error [elapse] | The elapsed time limit specified at job submission (-L elapse) is smaller than the permitted value. |
| Q05 | Use resource is too few (ACC->QUE) | Lower limit check error [Node time product] | The node time product (-L node, -L elapse) at job submission is smaller than the permitted value. |
| Q06 | Node is too many (ACC->QUE) | Upper limit check error [Number of nodes] | The number of nodes specified at job submission (-L node) exceeds the permitted value. |
| Q07 | Elapse limit is too long (ACC->QUE) | Upper limit check error [elapse] | The elapsed time limit specified at job submission (-L elapse) exceeds the permitted value. |
| Q08 | Use resource is too many (ACC->QUE) | Upper limit check error [Node time product] | The node time product (-L node, -L elapse) at job submission exceeds the permitted value. |
| Q09 | Computing resources shortage occurred.[group] (ACC->QUE) | Remaining resource check error (Assignment to the group) | The remaining resources of the group are less than the expected resource consumption of the job. |
| Q10 | Computing resources shortage occurred.[rsc-grp] (ACC->QUE) | Remaining resource check error (Assignment to the resource group in a group) | The remaining resources of the resource group in the group are less than the expected resource consumption of the job. |
| Q11 | Computing resources shortage occurred.[user] (ACC->QUE) | Remaining resource check error (Assignment to the user in a group and a resource group) | The remaining resources of the user in the group and resource group are less than the expected resource consumption of the job. |
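To make the "node time product" checked by Q05 and Q08 concrete, the following sketch multiplies the node count by the elapsed time limit. It rests on assumptions: the function name `node_time_product` is hypothetical, the elapsed limit is taken to be in `HH:MM:SS` form, and the result is expressed in node-seconds, which may not be the unit the system uses for its limits.

```shell
# Compute nodes x elapsed time, in node-seconds.
# ASSUMPTIONS: $1 is the node count, $2 is the elapsed limit as HH:MM:SS;
# the unit the system actually uses for its limits may differ.
node_time_product() {
  echo "$2" | awk -F: -v n="$1" '{ printf "%.0f\n", n * ($1 * 3600 + $2 * 60 + $3) }'
}
```

For instance, 4 nodes with an elapsed limit of 01:30:00 gives 4 × 5400 = 21600 node-seconds.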
5.16.3.2. At job execution
A job may be rejected at execution time if GATE CHECK detects an error.
| Error code | REASON | Error type | Overview |
|---|---|---|---|
| S01 | group is not active (QUE->SIN) | Group enable check error | The submitting group is disabled due to expiration. |
| S02 | user is not active (QUE->SIN) | User enable check error | The submitting user is disabled due to expiration. |
| S03 | need more resource [group] (QUE->SIN) | Remaining resource check error (Assignment to the group) | The remaining resources of the group are less than the expected resource consumption of the job. |
| S04 | need more resource [rsc-grp] (QUE->SIN) | Remaining resource check error (Assignment to the resource group in a group) | The remaining resources of the resource group in the group are less than the expected resource consumption of the job. |
| S05 | need more resource [user] (QUE->SIN) | Remaining resource check error (Assignment to the user in a group and a resource group) | The remaining resources of the user in the group and resource group are less than the expected resource consumption of the job. |
| S06 | requeue, remaining resource shortage [group] (QUE->SIN) | Remaining resource check error (Assignment to the group) | The remaining resources of the group are less than the expected resource consumption of the job. The job may become executable depending on the results of running (submitted) jobs. |
| S07 | requeue, remaining resource shortage [rsc-grp] (QUE->SIN) | Remaining resource check error (Assignment to the resource group in a group) | The remaining resources of the resource group in the group are less than the expected resource consumption of the job. The job may become executable depending on the results of running (submitted) jobs. |
| S08 | requeue, remaining resource shortage [user] (QUE->SIN) | Remaining resource check error (Assignment to the user in a group and a resource group) | The remaining resources of the user in the group and resource group are less than the expected resource consumption of the job. The job may become executable depending on the results of running (submitted) jobs. |
| S09 | requeue, Node is too few (QUE->SIN) | Lower limit check error [Number of nodes] | The number of nodes of the submitted job (-L node) is smaller than the permitted value. |
| S10 | requeue, Elapse limit is too short (QUE->SIN) | Lower limit check error [elapse] | The elapsed time limit of the submitted job (-L elapse) is smaller than the permitted value. |
| S11 | requeue, Use resource is too few (QUE->SIN) | Lower limit check error [Node time product] | The node time product (-L node, -L elapse) of the submitted job is smaller than the permitted value. |
| S12 | requeue, Node is too many (QUE->SIN) | Upper limit check error [Number of nodes] | The number of nodes of the submitted job (-L node) exceeds the permitted value. |
| S13 | requeue, Elapse limit is too long (QUE->SIN) | Upper limit check error [elapse] | The elapsed time limit of the submitted job (-L elapse) exceeds the permitted value. |
| S14 | requeue, Use resource is too many (QUE->SIN) | Upper limit check error [Node time product] | The node time product (-L node, -L elapse) of the submitted job exceeds the permitted value. |
| S15 | Computing resources shortage occurred.(QUE->RNA) | Remaining resource check error (Assignment to the group) | The remaining resources of the group are less than the expected resource consumption of the job. The job may become executable depending on the results of running (submitted) jobs. |
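Because every REASON string carries a transition tag such as (ACC->QUE) or (QUE->SIN), the phase in which GATE CHECK rejected the job can be read off the message. A minimal sketch, with the hypothetical function name `gate_check_phase`:

```shell
# Report whether a PJM 0079 REASON string refers to job acceptance or job execution,
# based on the transition tag that appears in the REASON lists.
gate_check_phase() {
  case "$1" in
    *"(ACC->QUE)"*)                echo "acceptance" ;;
    *"(QUE->SIN)"*|*"(QUE->RNA)"*) echo "execution" ;;
    *)                             echo "unknown" ;;
  esac
}
```

A result of `acceptance` points to the Q01 to Q11 codes, and `execution` to the S01 to S15 codes.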
5.16.4. CPU clock change (Power capping)
When the power consumption of a job exceeds the threshold set in the system, the CPU clock of the nodes used by the job is forcibly reduced.
To check whether a job was affected by this function, refer to the ‘POWER CAPPING DATE’ item. If the job was affected, the time at which the threshold was exceeded is output in ‘POWER CAPPING DATE’.
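One way to look for this item from the command line is sketched below. It assumes, without confirmation from this section, that ‘POWER CAPPING DATE’ appears as a line in the job statistics file; the function name `show_power_capping` is hypothetical, and how an unaffected job is represented is not specified here, so the output is printed as-is for manual inspection.

```shell
# Print any 'POWER CAPPING DATE' lines found in the given file.
# ASSUMPTION: the item is written to the job statistics (.stats) file.
show_power_capping() {
  grep -F 'POWER CAPPING DATE' "$1" || echo "POWER CAPPING DATE not found in $1"
}
```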