3. Chainer for K on Fugaku

3.1. License

The patches and procedures are scheduled to be provided upstream to the OSS project. Chainer is distributed under the MIT license.

3.2. Installed version (Cf. checkout.sh, site-packages and so on.)

  • Chainer ver.4.5.0

    • ChainerMN ver.1.3.1

    • ChainerK ver.1.1.0

  • Python ver.3.8.2

    • mpi4py ver.3.0.3

3.3. Files

/vol0001/apps/oss/ChainerK-4.5.0/
  bin: binaries
  lib: libraries
    lib/python3.8/site-packages: Chainer and Python modules
  include: include files
  build: scripts, patches and so on.
  example: examples of MNIST and ResNet-50

3.4. Build

Refer to the build directory. Adjust $PATH and the other paths to your environment before use; the scripts may need to be corrected from time to time as the system changes. The build takes about one hour.

./checkout.sh
pjsub go.sh

We adjusted the compilation options for the build. To avoid a timeout while building Python, we first built with -O3. We then commented out the part that timed out and rebuilt over the result with -Kfast.

3.5. Execution

This environment can be run from anywhere by setting $PATH. Because the Python modules consist of a large number of files, imports (importlib) can take a long time due to the high load of MDS (metadata server) accesses. You can stage the common binaries by copying them with llio_transfer, referring to "8.2.4.2. Tips for common file distribution" of the Users Guide. You can also stage all files by archiving them with tar, distributing the archive with llio_transfer, and expanding it.
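The two staging approaches described above can be sketched as follows. This is a minimal illustration, not the procedure from the Users Guide: the helper names (list_common_files, stage_common_files, pack_prefix, unpack_archive) are hypothetical, the choice of "common" files (python3 plus the shared libraries directly under lib) is an assumption, and llio_transfer is assumed to accept a file path as its argument. Outside a Fugaku job, where llio_transfer is not available, the sketch only reports what it would stage.

```shell
#!/bin/bash
# Sketch only: helper names and the choice of files are assumptions.

# List files that every process reads at startup: the interpreter and the
# shared libraries directly under lib/.
list_common_files() {
    local prefix=$1
    if [ -x "${prefix}/bin/python3" ]; then
        echo "${prefix}/bin/python3"
    fi
    if [ -d "${prefix}/lib" ]; then
        find "${prefix}/lib" -maxdepth 1 -type f -name '*.so*' | sort
    fi
}

# Stage each listed file with llio_transfer when the command is available
# (assumed: inside a Fugaku job); otherwise only report what would be staged.
stage_common_files() {
    local prefix=$1 f
    list_common_files "${prefix}" | while IFS= read -r f; do
        if command -v llio_transfer >/dev/null 2>&1; then
            llio_transfer "${f}"
        else
            echo "would stage: ${f}"
        fi
    done
}

# Pack the whole installation into a single archive, so that only one file
# has to be distributed. Keep the archive outside the prefix directory.
pack_prefix() {
    local prefix=$1 archive=$2
    tar -C "${prefix}" -cf "${archive}" .
}

# Expand a distributed archive into a destination directory.
unpack_archive() {
    local archive=$1 dest=$2
    mkdir -p "${dest}" && tar -C "${dest}" -xf "${archive}"
}
```

In a job script, one would call stage_common_files "${PREFIX}" before starting Python, or create the archive once with pack_prefix and expand it on each node with unpack_archive. The destination directory for the expansion depends on the node-local storage configured for the job and is left to the reader.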

You can find examples for MNIST and ResNet-50 under the example directory.

The procedure is described in the above reference. Please use it after adjusting the paths.

pjsub go.sh

The environment settings needed to run are as follows, with $PREFIX set to the installation path.

TCSDS=1.2.27b                         # Fujitsu language environment version
export PREFIX=${PWD}/../..            # installation path

# System environment
module switch lang/tcsds-${TCSDS}     # load the Fujitsu language environment set above
export LD_LIBRARY_PATH=${PREFIX}/lib:${LD_LIBRARY_PATH}
export PATH=${PREFIX}/bin:${PATH}     # path of python3
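
After adjusting the paths, it can help to confirm that python3 really resolves inside the installation before submitting a job. The following is a minimal sketch of such a check; check_python_prefix is a hypothetical helper and not part of the distributed scripts.

```shell
#!/bin/bash
# Sketch only: check_python_prefix is a hypothetical helper.

# Succeed only if python3, looked up with ${prefix}/bin at the front of PATH,
# is the one installed under the given prefix.
check_python_prefix() {
    local prefix=$1 found
    # The PATH change stays local to the command substitution subshell.
    found=$(PATH="${prefix}/bin:${PATH}"; command -v python3) || return 1
    case "${found}" in
        "${prefix}/bin/"*) echo "ok: ${found}" ;;
        *) echo "unexpected python3: ${found}" >&2; return 1 ;;
    esac
}
```

For example, check_python_prefix "${PREFIX}" && python3 -c 'import chainer; print(chainer.__version__)' prints the resolved interpreter and then the Chainer version when the environment is set up as intended.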

Execution results are included in log. They were obtained by running in 2.2 GHz interactive mode. The multi-node examples are results from the previous version.

3.6. Notes

To improve speed, tuned libraries developed by R-CCS are incorporated into the framework. At present, these components support only basic computations such as those used in image recognition. Depending on the network you want to use, it may therefore not run at high performance.

Please contact your support desk (R-CCS support desk or HPCI support desk) with questions about performance and about requirements for versions, Python modules, and so on, and provide your network scripts. Please note that support may take several months.

3.7. History

  • September 12, 2020 (Sat) build under the tcsds-1.2.26b and release

  • September 20, 2020 (Sun) build under the tcsds-1.2.26b and modify

    • fix the maxpooling3d problem with -Infinity

    • disable OpenMP for img2col in the ChainerK library

  • October 27, 2020 (Tue) build under the tcsds-1.2.27b

    • add the examples of largepage

    • add the examples of fapp and PA

    • modify the scripts

  • October 15, 2021 (Fri) release these documents (Ver.1.0)

  • December 07, 2021 (Tue) update these documents (Ver.1.1), adding the usage of llio