3.1.11.2. Advanced Performance Profiler

The Advanced Performance Profiler measures and outputs execution performance information for specified sections of an application.

3.1.11.2.1. Overview

The Advanced Performance Profiler consists of two commands: the fapp command, which measures profile data, and the fapppx command, which outputs the profile result from the measured data. The Advanced Performance Profiler measures and outputs the following information.

  • Statistical time information

  • MPI communication cost information

  • CPU performance analysis information

The flow of using the Advanced Performance Profiler is as follows.

[Figure: Flow of using the Advanced Performance Profiler (AdvancedProfiler_01.png)]

3.1.11.2.2. Addition of measurement interval specification routine

Add measurement interval specification routines / functions to the source code to specify the interval over which profile data is measured.
The measurement interval specification routines can be called as Fortran subroutines or as C/C++ functions.
When using the C/C++ functions, you must either declare the function prototypes or include the profiler header file.

Language type  Header file     Subroutine / function name  Argument                                   Function
Fortran        None            fapp_start                  (name, number, level)                      Start measuring information
                               fapp_stop                   (name, number, level)                      End measuring information
C/C++          fj_tool/fapp.h  void fapp_start             (const char *name, int number, int level)  Start measuring information
                               void fapp_stop              (const char *name, int number, int level)  End measuring information

[Argument details]

Argument  Description
name      Group name (basic character scalar). A group name consists of letters, numbers, and underscores. Other characters cannot be used.
number    Detail number (4-byte integer)
level     Priority level (4-byte integer, 0 or greater)

Note

The group name and detail number together identify a measurement section name. If the priority level is higher than the value of the fapp command’s -L option, the section is not measured.
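The relationship between the priority level and the -L option can be modeled as a single comparison. The following is only a sketch of the rule stated in this note; the function name section_is_measured is hypothetical and is not part of the profiler library.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper for illustration; it is NOT declared in fj_tool/fapp.h.
 * option_level:  value given to the fapp command's -L option
 * section_level: third argument passed to fapp_start / fapp_stop
 * A section is measured only when its level does not exceed the -L value. */
bool section_is_measured(int option_level, int section_level)
{
    return section_level <= option_level;
}
```

Under this model, running fapp with -L 2 measures sections whose level is 0, 1, or 2, and skips sections whose level is 3 or higher.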

Attention

  • When calling the subroutines / functions multiple times with the same measurement section name, be sure to call them in the order fapp_start, then fapp_stop. If fapp_start is called again before fapp_stop, or if fapp_stop is called before fapp_start, a warning message is output and the call is ignored.

  • If the measurement section names differ, fapp_start or fapp_stop may be called consecutively without any problem.

  • If the process ends without calling fapp_stop, the profile data for that section is not measured.

  • If the same measurement section name is measured multiple times, all the measurement results are added together.

  • Specify the same value for the argument level in fapp_start and fapp_stop. If different values are specified, unintended results may occur depending on the value of the fapp command’s -L option.

  • If “all” is specified for the argument name and 0 for number, the measurement covers the entire program.

  • In the case of an MPI program, call the subroutine / function with the same measurement section name in all processes to be measured. Profile data is not collected for processes in which it was not called.
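The pairing rules above can be illustrated with a small bookkeeping model. This is a sketch under stated assumptions, not the profiler's implementation: the names model_start, model_stop, and model_completed are hypothetical, and the section key is formed by joining the group name and detail number (for example "foo" and 1 give "foo1"), following the comments in the examples in this section.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SECTIONS 16
#define NAME_LEN 64

/* Hypothetical bookkeeping record; the real profiler internals may differ. */
struct section {
    char name[NAME_LEN]; /* group name + detail number, e.g. "foo1" */
    int  open;           /* 1 between an accepted start and its stop */
    int  completed;      /* number of accepted start/stop pairs */
};

static struct section sections[MAX_SECTIONS];
static int n_sections = 0;

/* Find the record for (group, number), creating it on first use. */
static struct section *lookup(const char *group, int number)
{
    char key[NAME_LEN];
    snprintf(key, sizeof key, "%s%d", group, number);
    for (int i = 0; i < n_sections; i++)
        if (strcmp(sections[i].name, key) == 0)
            return &sections[i];
    if (n_sections == MAX_SECTIONS)
        return NULL;
    struct section *s = &sections[n_sections++];
    strcpy(s->name, key);
    s->open = 0;
    s->completed = 0;
    return s;
}

/* Returns 1 if the call is accepted, 0 if it would be ignored. */
int model_start(const char *group, int number)
{
    struct section *s = lookup(group, number);
    if (s == NULL || s->open)
        return 0;        /* start called again before stop */
    s->open = 1;
    return 1;
}

/* Returns 1 if the call is accepted, 0 if it would be ignored. */
int model_stop(const char *group, int number)
{
    struct section *s = lookup(group, number);
    if (s == NULL || !s->open)
        return 0;        /* stop called before start */
    s->open = 0;
    s->completed++;      /* repeated measurements are added together */
    return 1;
}

/* Number of completed measurements recorded for the section. */
int model_completed(const char *group, int number)
{
    struct section *s = lookup(group, number);
    return s == NULL ? 0 : s->completed;
}
```

In this model, a second model_start("foo", 1) before the matching stop returns 0, while starting "bar" inside "foo" is accepted because the section names differ.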

An example of using the measurement interval specification routine is shown below.

  • Fortran example

    1. Sample specification example

    program main
    ...
    call fapp_start("foo",1,0)         ! Start measurement of measurement section name "foo1"
    do i=1,10000
      ...
      call fapp_start("bar",1,0)       ! Start measurement of measurement section name "bar1"
      do j=1,10000
       ...
      end do
      call fapp_stop("bar",1,0)        ! End measurement of measurement section name "bar1"
    end do
    call fapp_stop("foo",1,0)          ! End measurement of measurement section name "foo1"
    end program main
    
    2. Example of measuring all processes (measurement starts before calling the mpi_init subroutine)

    call fapp_start("foo",1,0)         ! Start measuring
    call mpi_init(err)
    ...
    call mpi_finalize(err)
    call fapp_stop("foo",1,0)          ! End measuring
    
    3. Example of measuring all processes (measurement starts immediately after calling the mpi_init subroutine)

    call mpi_init(err)
    call fapp_start("foo",1,0)         ! Start measuring
    ...
    call fapp_stop("foo",1,0)          ! End measuring
    call mpi_finalize(err)
    
    4. Example of measuring only process 0

    call mpi_init(err)
    call mpi_comm_rank(mpi_comm_world,rank,err)
    if(rank==0) then
      call fapp_start("foo",1,0)       ! Only process 0, start measuring
    end if
      ...
    if(rank==0) then
      call fapp_stop("foo",1,0)        ! Only process 0, end measuring
    end if
    call mpi_finalize(err)
    
  • C/C++ example

    1. Sample specification example

    #include "fj_tool/fapp.h"          // Include header file
    ...
    int main(void)
    {
      int i,j;
      fapp_start("foo",1,0);           // Start measuring the measurement section name "foo1"
      for(i=0;i<10000;i++){
        ...
        fapp_start("bar",1,0);         // Start measuring the measurement section name "bar1"
        for(j=0;j<10000;j++){
           ...
        }
        fapp_stop("bar",1,0);          // End measuring the measurement section name "bar1"
      }
      fapp_stop("foo",1,0);            // End measuring the measurement section name "foo1"
      return 0;
    }
    
    2. Example of measuring all processes (measurement starts before calling the MPI_Init function)

    fapp_start("foo",1,0);             // Start measuring
    MPI_Init(&argc, &argv);
    ...
    MPI_Finalize();
    fapp_stop("foo",1,0);              // End measuring
    
    3. Example of measuring all processes (measurement starts immediately after calling the MPI_Init function)

    MPI_Init(&argc, &argv);
    fapp_start("foo",1,0);             // Start measuring
    ...
    fapp_stop("foo",1,0);              // End measuring
    MPI_Finalize();
    
    4. Example of measuring only process 0

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if(rank==0){
      fapp_start("foo",1,0);           // Only process 0, start measuring
    }
    ...
    if(rank==0){
      fapp_stop("foo",1,0);            // Only process 0, end measuring
    }
    MPI_Finalize();
    

3.1.11.2.3. Compiling / Linking

The tool library required to use the Advanced Performance Profiler is linked by default when compiling / linking.
Therefore, no special options need to be specified, as the following examples show.
Compile / link on the login node or on a compute node.
  • Compile / link example (MPI program)

[_LNlogin]$ mpifrtpx  -Kfast,parallel  "Source file name"
  • Compile / link example (sequential / thread parallel program)

[_LNlogin]$ frtpx  -Kfast,parallel  "Source file name"

Attention

When compiling and linking in separate steps, also specify at link time the optimization options that were specified at compile time, so that the appropriate profiler library is linked.

For programs using OpenMP, specify the -Kopenmp option when linking.

3.1.11.2.3.1. About Fortran translation options

The following translation options for the profiler are available when translating Fortran programs.
These options take effect at link time.

Option      Description
-Nfjprof    Link the tool library. If omitted, -Nfjprof is enabled.
-Nnofjprof  Do not link the tool library. The profiler cannot be used.

3.1.11.2.3.2. About C/C++ translation options

For the C/C++ language, there are two compilation modes: trad mode and clang mode.
The translation options for the profiler in each mode are listed below.
These options take effect at link time.

Mode   Option          Description
trad   -Nfjprof        Link the tool library. If omitted, -Nfjprof is enabled.
trad   -Nnofjprof      Do not link the tool library. The profiler cannot be used.
clang  -ffj-fjprof     Link the tool library. If omitted, -ffj-fjprof is enabled.
clang  -ffj-no-fjprof  Do not link the tool library. The profiler cannot be used.

3.1.11.2.4. Measurement of profile data

Here is an example shell script for running the Advanced Performance Profiler.
In the example shown here, CPU performance analysis information and MPI communication cost information are measured. For details, see “Profiler User’s Guide”.
#!/bin/bash -x
#
#PJM -L "node=8"
#PJM -L "elapse=01:00:00"
#PJM -x PJM_LLIO_GFSCACHE=/vol000N
#PJM -g groupname
#PJM -s
#

LD="./sample_mpi"
MPIEXEC="mpiexec"
#
fapp -C -d ./tmp -Icpupa,mpi -Hevent=statistics ${MPIEXEC} ${LD}

Attention

If any of the following operations is performed on the profile data measured by the fapp command, correct operation is not guaranteed.

  • Edit profile data

  • Add, delete, and rename profile data

3.1.11.2.5. Profiler option

This section describes the options of the fapp command.

Option

Description

-C
(Required option)

Specifies measurement of profile data. If this option is omitted, an error message is output and the execution of the program ends.

-d profile_data
(Required option)
Specify the directory for storing profile data. If this option is omitted, an error message is output and the execution of the program ends.
In profile_data, specify the directory name for storing the profile data as a relative or absolute path. The specified directory must be new or empty.
When analyzing a program that changes the current directory during execution, specify profile_data as an absolute path.
The Profiler creates a subdirectory for every 1000 files under profile_data. Therefore, even for large jobs, you only need to specify profile_data.

-Hitem

Specify the measurement of CPU performance analysis information.
If -Inocpupa option is specified, a warning message is output and this option is disabled.
item:{event=event | event_raw=event_raw } [,method={fast | normal},mode={all | user}]
event=event

Measures the information used for CPU Performance Analysis Reports. Specify one of the following for event.

{ pa1 | pa2 | pa3 | pa4 | pa5 | pa6 | pa7 | pa8 | pa9 | pa10 | pa11 | pa12 | pa13 | pa14 | pa15 | pa16 | pa17 | statistics}

event_raw=event_raw

Measures CPU performance analysis information for the specified PMU event numbers. For event_raw, specify the event number corresponding to the CPU in decimal or hexadecimal notation. Up to 8 event_raw values can be specified, separated by commas (,).

method=fast

Specify the measurement method for CPU performance analysis information. When this suboption is specified, high-precision CPU performance analysis information is measured by a method that directly measures hardware information.

method=normal

Specify the measurement method for CPU performance analysis information. When this suboption is specified, CPU performance analysis information is measured via the OS. If omitted, method=normal is used.

mode=all

Specify the measurement mode for CPU performance analysis information. When this suboption is specified, performance is measured in kernel mode and user mode.

mode=user

Specify the measurement mode for CPU performance analysis information. When this suboption is specified, performance measurement is performed in user mode.

-Iitem
(Hyphen + capital letter I)
Specify the items to be measured by the Advanced Performance Profiler. To specify multiple items, separate them with commas (,).
item :{{cpupa | nocpupa} | {mpi | nompi}}
The behavior when this option is omitted depends on the item to be measured.
For CPU performance characteristics, -Icpupa is enabled if the -H option is specified, and -Inocpupa is enabled if the -H option is not specified.
For MPI cost information, mpi is enabled if the target is an MPI program, and nompi is enabled if the target is a non-MPI program.
cpupa:

Measures the CPU performance characteristics.

nocpupa:

Does not measure the CPU performance characteristics.

mpi:

Measures MPI Cost information.

nompi:

Does not measure MPI Cost information.

-L level

Specify the start level of the measurement target.
For level, specify an integer in the range 0 to 2,147,483,647. This option applies to the third argument, level, of the measurement section specification routines: only sections for which the value of this option is greater than or equal to the third argument level are measured.
When this option is omitted, -L 0 is enabled.

exec-file [ exec_option … ]

Specify the executable file and its options for profile data measurement. For MPI programs, specify the command line starting from mpiexec.

3.1.11.2.6. Output profile result

To output Advanced Performance Profiler information from the data created by the fapp command, use the fapppx command.
Run the fapppx command on the login node.

The command used to output the profile result depends on the node. An execution example of the fapppx command follows the list.

  • On the login node

    Use the fapppx command.

  • On a compute node

    Use the fapp command.

[Command execution example]

[_LNlogin] $ fapppx -A -pall -o tmp.txt -d Fprofd_stati

In this example, output of information for all processes is specified (-pall). The input file used for the CPU Performance Analysis Report can also be output by specifying the -tcsv option.

3.1.11.2.6.1. fapppx command option

Option

Function / measurement value (unit)

-A
(Required option)

Specify output processing of profile results.

-d profile_data
(Required option)

Specify the directory where the profile data is stored in profile_data as a relative or absolute path.

-Iitem
(Hyphen + capital letter I)
Specify the items to be output as profile results.
To specify multiple items, separate them with commas (,).
item:{{cpupa | nocpupa} | {mpi | nompi}}
cpupa:

Output the CPU performance characteristics. If omitted, cpupa is enabled.

nocpupa:

Do not output the CPU performance characteristics. This cannot be specified together with the -tcsv option.

mpi:

Output MPI Cost information. If omitted and the target is an MPI program, mpi is enabled.

nompi:

Do not output MPI Cost information. If omitted and the target is a non-MPI program, nompi is enabled.

-o outfile

Specify the output destination of the profile result. For outfile, specify the output file name as a relative or absolute path, or specify “stdout”.
When this option is omitted, -ostdout option is enabled.

-p p_no

Specify the process to be output to the profile result.
For p_no, specify one of the following: N, input=n, limit=m, or all.
When this option is omitted, -pinput=0,limit=16 is enabled. Multiple p_no values can be specified for -p, separated by commas (,).
For example: -p3,5,limit=10.
N … :

The information for the process numbers specified in N is output first. If information for a specified process number does not exist, that specification is ignored. If multiple N values are specified, they are output in the specified order.

input=n:

Reads information for n processes in descending order of cost. If n is 0 or exceeds the number of processes, information for all processes is read. When this sub option is omitted, input=0 is enabled. The sub options input=n and limit=m can be specified at the same time.

limit=m:

Outputs information for m processes in descending order of cost. Processes that are not output are still included in the denominator when ratios are calculated. If m is 0 or exceeds the number of input processes, information for all processes is output. When this sub option is omitted, limit=16 is enabled.

all:

Reads information for all processes and outputs it in descending order of cost. This is the same as specifying -pinput=0,limit=0. This sub option is valid only when neither input=n nor limit=m is specified.

-t{csv | text}

Specify the output format of the profile result.

csv:

Output the profile result in CSV format. This cannot be specified together with the -Inocpupa option.

text:

Output the profile result in text format. When this is specified, CPU performance analysis information is not output even if the -Icpupa option is enabled.
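How a -p argument such as 3,5,limit=10 breaks down into its parts can be illustrated with a small parser. This is a minimal sketch, not the parser used by fapppx: the struct p_option and the function parse_p_option are hypothetical names, the defaults follow the description of -p above (input=0, limit=16), and combinations the command might reject (such as all together with limit=m) are not checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PROCS 32

/* Hypothetical container for a parsed -p argument. */
struct p_option {
    int procs[MAX_PROCS]; /* process numbers N, in the order given */
    int n_procs;
    int input;            /* input=n (0 means all processes) */
    int limit;            /* limit=m (0 means all processes) */
};

/* Parse a whole decimal integer; returns 0 on success, -1 otherwise. */
static int parse_int(const char *s, int *value)
{
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return -1;
    *value = (int)v;
    return 0;
}

/* Parse the text that follows -p. Returns 0 on success, -1 if malformed. */
int parse_p_option(const char *arg, struct p_option *out)
{
    char buf[256];
    out->n_procs = 0;
    out->input = 0;   /* documented default */
    out->limit = 16;  /* documented default */
    if (strlen(arg) >= sizeof buf)
        return -1;
    strcpy(buf, arg);
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "all") == 0) {
            out->input = 0;   /* all is the same as input=0,limit=0 */
            out->limit = 0;
        } else if (strncmp(tok, "input=", 6) == 0) {
            if (parse_int(tok + 6, &out->input) != 0)
                return -1;
        } else if (strncmp(tok, "limit=", 6) == 0) {
            if (parse_int(tok + 6, &out->limit) != 0)
                return -1;
        } else {
            int n;
            if (out->n_procs == MAX_PROCS || parse_int(tok, &n) != 0)
                return -1;
            out->procs[out->n_procs++] = n;
        }
    }
    return 0;
}
```

For the argument 3,5,limit=10, this sketch yields process numbers 3 and 5 to be output first, input left at its default of 0, and limit set to 10.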

3.1.11.2.6.2. Profile result

The profile result consists of the following statistical information.
The output of each item can be restricted with the -I option of the fapppx command.
For details of each item, refer to “Profiler User’s Guide”-“3.2.2 Detail of Profile Result”.
  • Profile data measurement environment information

  • Statistical time information

  • MPI communication cost information

  • CPU performance analysis information