5.12. Displaying a job status
This section explains how to check the execution status of jobs you have submitted.
To check job status, use the pjstat command.
5.12.1. Normal job
This section explains how to check a job's status after submission and during execution.
5.12.1.1. Display a job list
To display a list of jobs and their status, use the pjstat command.
[_LNlogin]$ pjstat
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
238 job.sh NM RUN user1 11/17 09:01:41 0001:00:00 12:2x3x2 - - -
239 bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 12:2x3x2 - - -
240 step.sh ST RUN user1 11/17 09:01:42 - - - - -
241 job2.sh NM RUN user1 11/17 09:01:42 0001:00:00 2 - - -
See also
By default, only jobs that the user running the pjstat command is allowed to reference are displayed.
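For scripted monitoring, the job list shown above can be turned into records. The following is a minimal Python sketch, not part of the pjstat tooling: parse_pjstat_list is a hypothetical helper, and it assumes the column layout of the listing above, where START_DATE is printed as two whitespace-separated tokens and job names contain no spaces.

```python
def parse_pjstat_list(text):
    """Parse a default pjstat job list into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    date_idx = header.index("START_DATE")
    jobs = []
    for line in lines[1:]:
        fields = line.split()
        # Assumption: START_DATE appears as "MM/DD hh:mm:ss" (two tokens),
        # so a row with one extra token has its date and time merged back.
        if len(fields) == len(header) + 1:
            fields[date_idx:date_idx + 2] = [" ".join(fields[date_idx:date_idx + 2])]
        if len(fields) != len(header):
            raise ValueError(f"unexpected row layout: {line!r}")
        jobs.append(dict(zip(header, fields)))
    return jobs


# Sample copied from the listing above.
SAMPLE = """\
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
238 job.sh NM RUN user1 11/17 09:01:41 0001:00:00 12:2x3x2 - - -
239 bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 12:2x3x2 - - -
240 step.sh ST RUN user1 11/17 09:01:42 - - - - -
241 job2.sh NM RUN user1 11/17 09:01:42 0001:00:00 2 - - -
"""

if __name__ == "__main__":
    for job in parse_pjstat_list(SAMPLE):
        print(job["JOB_ID"], job["ST"], job["START_DATE"])
```

On a login node the same function could be fed the captured output of pjstat instead of SAMPLE. Rows that do not fit the assumed layout raise an error rather than being silently misparsed.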
5.12.1.2. Display specific jobs
To display a specific job, pass its job ID or sub job ID to pjstat.
[_LNlogin]$ pjstat 238
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
238 job.sh NM RUN user1 11/17 09:01:41 0001:00:00 12:2x3x2 - - -
[_LNlogin]$ pjstat '239[1]'
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
239[1] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 12:2x3x2 - - -
Multiple job IDs and sub job IDs can be specified when executing pjstat.
[_LNlogin]$ pjstat 238 239
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
238 job.sh NM RUN user1 11/17 09:01:41 0001:00:00 12:2x3x2 - - -
239 bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 12:2x3x2 - - -
Multiple job IDs and sub job IDs can be specified in two ways: by enumeration or as a range.
Enumeration
[_LNlogin]$ pjstat 238 239 # Job ID
[_LNlogin]$ pjstat '239[1]' '239[2]' '239[3]' '239[4]' # Bulk job's sub job ID
[_LNlogin]$ pjstat 240_0 240_1 240_2 240_3 # Step job's sub job ID
Range specification
A range can be specified by connecting job IDs, bulk numbers, or step numbers with a hyphen.
[_LNlogin]$ pjstat 238-240 # Job ID 238, 239, 240
[_LNlogin]$ pjstat '239[1-5]' # Sub job ID 239[1], 239[2], 239[3], 239[4], 239[5]
[_LNlogin]$ pjstat 240_0-4 # Sub job ID 240_0, 240_1, 240_2, 240_3, 240_4
Attention
When specifying a sub job ID of a bulk job as an argument to the pjstat command, enclose it in single quotes so that the square brackets ‘[]’ are not processed by the shell.
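The range notation and the quoting rule above can be mirrored in a small script. The following Python sketch is illustrative only: expand_job_spec and pjstat_command_line are hypothetical helpers that reproduce the expansion described in this section on the client side, and shlex.quote from the standard library supplies the single quotes needed for bulk sub job IDs.

```python
import re
import shlex


def expand_job_spec(spec):
    """Expand one pjstat-style ID argument into individual job or sub job IDs."""
    m = re.fullmatch(r"(\d+)\[(\d+)-(\d+)\]", spec)   # bulk range, e.g. 239[1-5]
    if m:
        job, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
        return [f"{job}[{n}]" for n in range(lo, hi + 1)]
    m = re.fullmatch(r"(\d+)_(\d+)-(\d+)", spec)       # step range, e.g. 240_0-4
    if m:
        job, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
        return [f"{job}_{n}" for n in range(lo, hi + 1)]
    m = re.fullmatch(r"(\d+)-(\d+)", spec)             # job ID range, e.g. 238-240
    if m:
        return [str(n) for n in range(int(m.group(1)), int(m.group(2)) + 1)]
    return [spec]                                      # already a single ID


def pjstat_command_line(specs):
    """Build a shell-safe command line; IDs containing [] get single quotes."""
    return " ".join(["pjstat"] + [shlex.quote(s) for s in specs])


if __name__ == "__main__":
    print(expand_job_spec("239[1-5]"))
    print(pjstat_command_line(["238", "239[1]", "240_0"]))
```

Quoting matters only when the command passes through a shell. If the arguments are handed to subprocess.run as a list, no shell is involved and the brackets need no escaping.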
5.12.1.3. pjstat output items
The items in the pjstat output and their meanings are as follows.
Items of the job information list
Item name | Contents |
---|---|
JOB_ID | Job ID |
JOB_NAME | Job name |
MD | Job model (in the examples above, NM is a normal job, BU a bulk job, and ST a step job) |
ST | Current processing status of the job. For details, refer to Job status items below. |
USER | Executing user name |
START_DATE | If the job has not yet been executed, the scheduled execution start time is displayed. If the job is running or has finished, the actual start time is displayed. |
ELAPSE_LIM | Displays the elapsed time limit. |
NODE_REQUIRE | Displays the number of nodes and the node shape (for example, 12:2x3x2). |
VNODE | Displays the number of virtual nodes. |
CORE | Displays the number of CPU cores per virtual node. |
V_MEM | Displays the amount of memory per virtual node. |
5.12.1.4. Job status items
The values of the job status (ST) item are listed below.
Display content | Status | Description |
---|---|---|
ACC | ACCEPT | Job submission is accepted. |
CCL | CANCEL | The job has been canceled by an instruction from the job submitter or administrator. |
ERR | ERROR | Execution was canceled because the job management function detected an error; the job is kept in the submitted state. |
EXT | EXIT | The job has finished. |
HLD | HOLD | Job execution is stopped and the job is held in the submitted state. |
QUE | QUEUED | The job is accepted and waiting for its turn to execute. |
RJT | REJECT | Job acceptance is rejected. |
RNA | RUNNING-A | The resources necessary for job execution have been acquired. |
RNE | RUNNING-E | Epilogue processing is being executed. |
RNO | RUNOUT | The job is being terminated. |
RNP | RUNNING-P | Prologue processing is being executed. |
RSM | RESUME | Resume processing is in progress. |
RUN | RUNNING | The job is being executed. |
SPD | SUSPENDED | The job is suspended. |
SPP | SUSPEND | Suspend processing is in progress. |
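When the ST column is processed by a script, the table above can be encoded as a lookup. The following Python sketch is one possible encoding: JOB_STATUS copies the abbreviations and status names from the table, while status_name and is_finished are hypothetical helpers. Treating REJECT, EXIT, and CANCEL as finished follows the description of the -H option in 5.12.6.

```python
# Abbreviation -> status name, copied from the job status table.
JOB_STATUS = {
    "ACC": "ACCEPT",
    "CCL": "CANCEL",
    "ERR": "ERROR",
    "EXT": "EXIT",
    "HLD": "HOLD",
    "QUE": "QUEUED",
    "RJT": "REJECT",
    "RNA": "RUNNING-A",
    "RNE": "RUNNING-E",
    "RNO": "RUNOUT",
    "RNP": "RUNNING-P",
    "RSM": "RESUME",
    "RUN": "RUNNING",
    "SPD": "SUSPENDED",
    "SPP": "SUSPEND",
}

# States that the -H option description lists as finished.
FINISHED_STATES = {"REJECT", "EXIT", "CANCEL"}


def status_name(code):
    """Return the full status name, or the code itself if it is unknown."""
    return JOB_STATUS.get(code, code)


def is_finished(code):
    """True if the abbreviation maps to REJECT, EXIT, or CANCEL."""
    return status_name(code) in FINISHED_STATES


if __name__ == "__main__":
    for code in ("QUE", "RUN", "EXT"):
        print(code, status_name(code), is_finished(code))
```

A polling script could, for example, stop waiting once is_finished returns True for the job's ST value.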
5.12.1.5. List of output messages
The messages that can be displayed when the -v option is specified to the pjstat command are listed below.
Message | Description |
---|---|
(none) | No error |
ANOTHER JOB STARTED | A job that was running beyond its minimum runnable time has been terminated to run a subsequent job. |
DEADLINE SCHEDULE STARTED | A job that was running beyond its minimum job execution time was terminated because the deadline schedule started. |
ELAPSE LIMIT EXCEEDED | The elapsed time limit has been exceeded. |
FILE IO ERROR | The directory that was current when the job was submitted cannot be accessed. |
GATE CHECK | Canceled by the job manager exit function. |
IMPOSSIBLE SCHED | Scheduling failed. |
INSUFF CPU | There is a physical shortage of CPUs. |
INSUFF MEMORY | There is a physical memory shortage. |
INSUFF NODE | The number of nodes is physically insufficient. |
INSUFF CustomResourceName | The custom resource defined by the resource name CustomResourceName is insufficient. |
INTERNAL ERROR | Internal error. |
INVALID HOSTFILE | The specified host file does not match. |
LIMIT OVER MEMORY | The memory limit was exceeded during job execution. |
LOST COMM | All-to-all communication of parallel processes is not guaranteed. |
NO CURRENT DIR | The directory that was current when the job was submitted, or the standard input, standard output, or standard error output file, could not be accessed. |
NOT EXIST CustomResourceName | A custom resource with the resource name CustomResourceName is not defined. |
RESUME FAIL | Resume failed. |
RSCGRP NOT EXIST | Resource group does not exist. |
RSCGRP STOP | The resource group has stopped. |
RSCUNIT NOT EXIST | Resource unit does not exist. |
RSCUNIT STOP | The resource unit has stopped. |
RUNLIMIT EXCEED | The maximum number of concurrent job executions has been exceeded. |
SUSPEND FAIL | Suspend failed. |
USELIMIT EXCEED | The job is waiting for execution because of the limit on the number of nodes or CPU cores used simultaneously. |
USER NOT EXIST | The job execution user does not exist in the system. |
WAIT SCHED | The job has been excluded from scheduling because the maximum number of jobs subject to scheduling has been reached. |
Other character strings | If the message by |
5.12.1.6. Job count display by status
If --with-summary is specified to the pjstat command, the number of jobs in each status is displayed along with the job details.
The information displayed and its meaning are shown below.
[_LNlogin]$ pjstat --with-summary
ACCEPT QUEUED RUNING RUNOUT HOLD ERROR REJECT EXIT CANCEL TOTAL
n n n n n n n n n n
s n n n n n n n n n n
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
XXXXXX XXXXXXXX XX XX XXXX MM/DD hh:mm:ss hhhh:mm:ss-hhhh:mm:ss nnn:XXxYYxZZ nnnnn nnnn nnnMiB
See also
In the per-status job counts, the line beginning with s shows the number of jobs including sub jobs.
If pjstat --summary is specified, only the per-status job counts are displayed.
pjstat command output items (number of jobs by status)
Item | Description |
---|---|
ACCEPT | Displays the number of jobs waiting to be accepted. |
QUEUED | Displays the number of jobs waiting to be executed. |
RUNING | Displays the number of running jobs. |
RUNOUT | Displays the number of jobs waiting to be completed. |
HOLD | Displays the number of jobs held by the user. |
ERROR | Displays the number of jobs held due to errors. |
REJECT | Displays the number of jobs that have been rejected. |
EXIT | Displays the number of finished jobs. |
CANCEL | Displays the number of canceled jobs. |
TOTAL | Displays the total number of jobs across all statuses. |
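The summary block can be read into two dictionaries, one for jobs and one for the line that includes sub jobs. The Python sketch below is an illustration under stated assumptions: parse_summary is a hypothetical helper, the counts in SAMPLE are made up for the example (the guide shows only n placeholders), and the layout is assumed to be exactly the three lines shown in this section.

```python
def parse_summary(text):
    """Return (jobs, jobs_including_sub_jobs) as dicts keyed by status column."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, job_row, sub_row = rows[0], rows[1], rows[2]
    if sub_row[0] != "s":
        raise ValueError("expected the third line to start with 's'")
    jobs = dict(zip(header, (int(v) for v in job_row)))
    with_sub = dict(zip(header, (int(v) for v in sub_row[1:])))  # drop leading "s"
    return jobs, with_sub


# Made-up counts, for illustration only.
SAMPLE = """\
ACCEPT QUEUED RUNING RUNOUT HOLD ERROR REJECT EXIT CANCEL TOTAL
0 1 3 0 0 0 0 0 0 4
s 0 5 7 0 0 0 0 0 0 12
"""

if __name__ == "__main__":
    jobs, with_sub = parse_summary(SAMPLE)
    print("running jobs:", jobs["RUNING"])
    print("running incl. sub jobs:", with_sub["RUNING"])
```

Only the first three non-empty lines are used, so the job detail lines that --with-summary prints after the counts are ignored.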
5.12.2. Step job
The -E option is required to display sub job information. Because sub jobs are executed sequentially in order of submission, the next sub job is not scheduled until the preceding sub job has finished.
Normal display
[_LNlogin]$ pjstat
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
240 step.sh ST RUN user1 11/17 09:01:42 - - - - -
Display including sub jobs (-E option added)
[_LNlogin]$ pjstat -E
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM ...(omitted)...
240 step.sh ST RUN user1 11/17 09:01:42 - ...(omitted)...
240_0 step.sh ST RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
240_1 step.sh ST QUE user1 - 0001:00:00 ...(omitted)...
240_2 step.sh ST QUE user1 - 0001:00:00 ...(omitted)...
240_3 step.sh ST QUE user1 - 0001:00:00 ...(omitted)...
240_4 step.sh ST QUE user1 - 0001:00:00 ...(omitted)...
5.12.3. Bulk job
The -E option is required to display sub job information.
Display example of parent job only
[_LNlogin]$ pjstat
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM NODE_REQUIRE VNODE CORE V_MEM
239 bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 12:2x3x2 - - -
Display example including sub jobs (-E option added)
[_LNlogin]$ pjstat -E
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM ...(omitted)...
239 bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
239[1] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
239[2] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
239[3] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
239[4] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
239[5] bulk.sh BU RUN user1 11/17 09:01:42 0001:00:00 ...(omitted)...
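For step and bulk jobs it is often enough to know how many sub jobs are in each state. The Python sketch below tallies the ST column of a pjstat -E listing per parent job. sub_job_status_counts is a hypothetical helper, and it assumes the ID forms used in this guide (N_M for step sub jobs, N[M] for bulk sub jobs) with JOB_ID in the first column and ST in the fourth.

```python
import re
from collections import Counter, defaultdict

# Matches sub job IDs only: 240_0 (step) or 239[1] (bulk).
SUB_JOB_ID = re.compile(r"(\d+)(?:\[\d+\]|_\d+)")


def sub_job_status_counts(text):
    """Return {parent job ID: {status: count}} for the sub job rows of a listing."""
    counts = defaultdict(Counter)
    for line in text.splitlines()[1:]:       # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        match = SUB_JOB_ID.fullmatch(fields[0])
        if match:                            # parent rows such as "240" are skipped
            counts[match.group(1)][fields[3]] += 1
    return {parent: dict(c) for parent, c in counts.items()}


# Rows taken from the step job example; trailing columns are left out.
STEP_SAMPLE = """\
JOB_ID JOB_NAME MD ST USER START_DATE ELAPSE_LIM
240 step.sh ST RUN user1 11/17 09:01:42 -
240_0 step.sh ST RUN user1 11/17 09:01:42 0001:00:00
240_1 step.sh ST QUE user1 - 0001:00:00
240_2 step.sh ST QUE user1 - 0001:00:00
240_3 step.sh ST QUE user1 - 0001:00:00
240_4 step.sh ST QUE user1 - 0001:00:00
"""

if __name__ == "__main__":
    print(sub_job_status_counts(STEP_SAMPLE))
```

For the step job example this reports one running and four queued sub jobs, which matches the sequential execution described in 5.12.2.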
5.12.4. Computer resources
Computing resource usage for each assignment can be checked with the accountj command.
Computing resource usage and coefficients for each job can be checked with the pjstata command.
For the usage of each command, refer to “3.1. Command explanation” in the “User Support Tool User’s Guide”.
5.12.5. Job statistical information
Job statistical information is written to a stats file when a job is submitted with the -s or -S option of pjsub.
Job statistical information can also be checked with the pjstat command, using the following options.
pjstat {-s|-S} --history [day=N] [jobid]
Usage example
[_LNlogin]$ pjstat -S --history day=7 22269
For the items of the job statistical information, refer to the pjstatsinfo(7) manual page.
[_LNlogin]$ man pjstatsinfo
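When the statistics query is issued from a script, the argument list can be assembled from the synopsis above. The Python sketch below is a hypothetical wrapper (pjstat_history_args is not part of any official tool). The 1 to 90 bound on day comes from the option table in 5.12.6, and the function only builds the list without running anything.

```python
def pjstat_history_args(job_id=None, days=None, per_node=False):
    """Build the argument list for: pjstat {-s|-S} --history [day=N] [jobid]."""
    args = ["pjstat", "-S" if per_node else "-s", "--history"]
    if days is not None:
        # The option table in 5.12.6 gives 1 to 90 as the valid range.
        if not 1 <= days <= 90:
            raise ValueError("day must be an integer from 1 to 90")
        args.append(f"day={days}")
    if job_id is not None:
        args.append(str(job_id))
    return args


if __name__ == "__main__":
    # Reproduces the usage example above.
    print(" ".join(pjstat_history_args(22269, days=7, per_node=True)))
```

On a login node the returned list could be passed to subprocess.run(args, capture_output=True, text=True) to collect the output. Because the list bypasses the shell, no quoting is needed.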
5.12.5.1. Electric power information
The job statistics include the following power information:
Compute cores per CMG: average power consumption (Recommended)
L2 cache per CMG: average power consumption (Estimation)
Memory per CMG: average power consumption (Estimation)
Tofu: average power consumption (Estimation)
CPU peripherals: average power consumption (Recommended)
Optical module: average power consumption (Estimation)
PCI-E: average power consumption (Recommended)
Node: average power consumption (Estimation)
Node: average power consumption (Actual measurement)
Attention
Understanding the interval for obtaining power-related information
- The power measurements from which the power information in the job statistics is derived are taken every minute. For jobs that run for less than one minute, power may not be measured at a suitable time. In that case, the output value can differ greatly from the power actually used.
- The AVG power values output in the job statistics are derived from the measurements taken at 1-minute intervals.
- On Supercomputer Fugaku, if the power consumption exceeds the value set by the operations side during job execution, the CPU clock of the nodes used by the job may be forcibly reduced. The statistical output serves as an indication of the job's power consumption.
- For a more detailed investigation of power consumption, use PowerAPI or a profiler.
Output example of the job statistical information (excerpt)
AVG POWER CONSUMPTION OF CORES/CMG(0) (IDEAL) : 21.394424
AVG POWER CONSUMPTION OF CORES/CMG(1) (IDEAL) : 21.355174
AVG POWER CONSUMPTION OF CORES/CMG(2) (IDEAL) : 21.362863
AVG POWER CONSUMPTION OF CORES/CMG(3) (IDEAL) : 21.363552
ENERGY CONSUMPTION OF CORES/CMG(0) (IDEAL) : 0.012945
ENERGY CONSUMPTION OF CORES/CMG(1) (IDEAL) : 0.012922
ENERGY CONSUMPTION OF CORES/CMG(2) (IDEAL) : 0.012926
ENERGY CONSUMPTION OF CORES/CMG(3) (IDEAL) : 0.012927
AVG POWER CONSUMPTION OF L2CACHE/CMG(0) (IDEAL) : 1.349224
AVG POWER CONSUMPTION OF L2CACHE/CMG(1) (IDEAL) : 1.337747
AVG POWER CONSUMPTION OF L2CACHE/CMG(2) (IDEAL) : 1.323286
AVG POWER CONSUMPTION OF L2CACHE/CMG(3) (IDEAL) : 1.323401
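The per-CMG values in the excerpt above can be aggregated per node with a few lines of code. The Python sketch below assumes the "NAME : value" layout shown in the excerpt. parse_stats and sum_per_cmg are hypothetical helpers, and the units are whatever the stats file uses (see pjstatsinfo(7)).

```python
import re

# "ITEM NAME : numeric value", as in the excerpt above.
STAT_LINE = re.compile(r"^\s*(.*\S)\s+:\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_stats(text):
    """Return {item name: float} for every numeric 'NAME : value' line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


def sum_per_cmg(stats, prefix):
    """Sum all items whose name starts with prefix, e.g. one value per CMG."""
    return sum(v for name, v in stats.items() if name.startswith(prefix))


# Values copied from the excerpt above.
SAMPLE = """\
AVG POWER CONSUMPTION OF CORES/CMG(0) (IDEAL) : 21.394424
AVG POWER CONSUMPTION OF CORES/CMG(1) (IDEAL) : 21.355174
AVG POWER CONSUMPTION OF CORES/CMG(2) (IDEAL) : 21.362863
AVG POWER CONSUMPTION OF CORES/CMG(3) (IDEAL) : 21.363552
AVG POWER CONSUMPTION OF L2CACHE/CMG(0) (IDEAL) : 1.349224
AVG POWER CONSUMPTION OF L2CACHE/CMG(1) (IDEAL) : 1.337747
AVG POWER CONSUMPTION OF L2CACHE/CMG(2) (IDEAL) : 1.323286
AVG POWER CONSUMPTION OF L2CACHE/CMG(3) (IDEAL) : 1.323401
"""

if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    print(round(sum_per_cmg(stats, "AVG POWER CONSUMPTION OF CORES/CMG("), 6))
    print(round(sum_per_cmg(stats, "AVG POWER CONSUMPTION OF L2CACHE/CMG("), 6))
```

Lines whose value is not a plain number, such as date fields, are skipped by the pattern.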
5.12.5.2. Electric power limitation function (Power capping)
If the power consumption of a job exceeds the threshold and the power limit function is activated, the time at which the threshold was exceeded is output to ‘POWER CAPPING DATE’ in the job statistical information.
Refer to ‘POWER CAPPING DATE’ to confirm whether the power limit has been applied.
5.12.5.3. Performance information output
Job statistical information includes counters (PERF COUNT) from which performance information can be calculated.
These counters are output per node, so they are output only when the -S option is specified.
Output example (excerpt)
PERF COUNT 1 : 4385694044
PERF COUNT 2 : 3968
PERF COUNT 3 : 0
PERF COUNT 4 : 267598
PERF COUNT 5 : 328895
PERF COUNT 6 : 0
PERF COUNT 7 : 2398547
PERF COUNT 8 : 98275
PERF COUNT 9 : 237498
Performance information is calculated from these values as follows.
Performance information | Calculation |
---|---|
Number of execution cycles | SUM(PERF COUNT 1) |
Number of floating-point operations | SUM(PERF COUNT 2) + SUM(PERF COUNT 3) x 4 |
Number of memory read requests | SUM(PERF COUNT 4) / 12 |
Number of memory write requests | SUM(PERF COUNT 5) / 12 |
Number of sleep cycles | SUM(PERF COUNT 6) |
EFFECTIVE_INST_SPEC* | SUM(PERF COUNT 7) |
SIMD_INST_RETIRED* | SUM(PERF COUNT 8) |
SVE_INST_RETIRED* | SUM(PERF COUNT 9) |
*reference: https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_PMU_Events_v1.3.pdf
Attention
Note that the performance information calculated from the job statistics is per job, not per executed application.
If the xospastop command or a profiler is executed in the job script, PERF COUNT is not calculated.
If you need detailed performance information of the application, please use a profiler.
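The formulas in the table above can be applied mechanically once the PERF COUNT values of every node have been collected. The Python sketch below is a worked instance of those formulas. performance_summary is a hypothetical helper, SUM is taken over the nodes of the job, and the sample values are the ones from the output example.

```python
def performance_summary(per_node_counts):
    """Apply the table's formulas to a list of {counter number: value}, one per node."""

    def total(counter):
        # SUM(PERF COUNT n): add the counter over all nodes of the job.
        return sum(node.get(counter, 0) for node in per_node_counts)

    return {
        "execution_cycles": total(1),
        "floating_point_operations": total(2) + total(3) * 4,
        "memory_read_requests": total(4) / 12,
        "memory_write_requests": total(5) / 12,
        "sleep_cycles": total(6),
        "EFFECTIVE_INST_SPEC": total(7),
        "SIMD_INST_RETIRED": total(8),
        "SVE_INST_RETIRED": total(9),
    }


# PERF COUNT 1..9 of one node, copied from the output example.
NODE = {1: 4385694044, 2: 3968, 3: 0, 4: 267598, 5: 328895,
        6: 0, 7: 2398547, 8: 98275, 9: 237498}

if __name__ == "__main__":
    for name, value in performance_summary([NODE]).items():
        print(f"{name}: {value}")
```

For a multi-node job, pass one dictionary per node. The result describes the job as a whole, in line with the note above that these figures are per job rather than per application.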
5.12.6. Job status display command options
Typical options of the job status display command (pjstat) are shown below.
For details, refer to “Job Operation Software End-user’s Guide”.
Option name | Function |
---|---|
-H, --history [day=value] | Displays only information on jobs that have finished processing (in the REJECT, EXIT, or CANCEL state). If the day=value argument of -H or --history is omitted, information on jobs that finished in the past three days is output. day=value outputs information on jobs that finished within the past number of days given by value. The value can be an integer from 1 to 90. The supercomputer Fugaku holds up to 90 days of information. Example: pjstat -H day=90 |
-v | Additionally displays job information that is not output in the standard format. |
-s | Outputs detailed information such as resource usage and resource limit values. For the output items, refer to Statistical information. If specified together with the -v or -S option, pjstat outputs an error message and terminates. |
-S | In addition to the information output by the -s option, the per-node information for the job is also output. For the output items, refer to Statistical information. If specified together with the -v or -s option, pjstat outputs an error message and terminates. |
-E, --expand | When sub jobs exist, the list of sub jobs is also output. |
--limit | Displays the limit values and current usage for user job submission. |
--help | Shows help. |