**4.4 Principal component analysis (PCA)**

Files for this tutorial (tutorial-4.4.tar.gz)

In this tutorial, we perform principal component analysis (PCA). PCA is useful to understand the dynamical behaviour of biomolecules by projecting the trajectory on small dimensions in terms of principal components (PCs). To perform PCA, we first use the `eigmat_analysis`

tool, followed by the `prjcrd_analysis`

tool, both of which are available in the GENESIS analysis tool sets. The former computes the eigenvalue and eigenvectors of the variance-covarience matrix, and the latter carries out a projection of the snapshot structures onto PC vectors.

In this tutorial, we use the output files obtained in tutorial 1.1 and tutorial 4.3. (Please work on these tutorials first if you have not done it yet.) Untar `tutorial-4.4.tar.gz`

in the same directory as tutorial-1.1 and tutorial-4.3:

```
$ ls
tutorial-1.1 tutorial-4.3
$ tar -zxf tutorial-4.4.tar.gz
$ ls
tutorial-1.1 tutorial-4.3 tutorial-4.4
$ cd tutorial-4.4
$ ls
run.sh run1.inp run2.inp
```

In tutorial-1.1, we carried out an MD simulation of a small protein, BPTI, in solution. The MD simulation was performed for 500,000 steps with `crdout_period = 500`

, resulting in a trajectory file containing 1,000 snapshots. In tutorial-4.3, we analyzed RMSF of all residues in BPTI. Here, we calculate the PCs of BPTI and project all snapshots of the trajectory on to a plane of two PCs.

The control file, `run1.inp`

, is for `eigmat_analysis`

:

```
[INPUT]
pcafile = ../tutorial-4.3/run2.pca # PCA file
[OUTPUT]
valfile = run1.val # VAL file
vecfile = run1.vec # VEC file
cntfile = run1.cnt # CNT file
```

We obtain `valfile`

, `vecfile`

, and `cntfile`

, which contain eigenvalues, eigenvectors, and contribution of each mode to the overall dynamics, respectively. The following figure shows the plot of the accumulated contribution of each PC mode (4th column in `cntfile`

).

The control file, `run2.inp`

, for `prjcrd_analysis`

is as follows:

```
[INPUT]
reffile = ../tutorial-1.1/1_setup/ionize.pdb # PDB file
pdb_avefile = ../tutorial-4.3/run1_ave.pdb # PDB file (Average coordinates)
pdb_aftfile = ../tutorial-4.3/run1_aft.pdb # PDB file (Fitted Average coordinates)
valfile = run1.val # VAL file
vecfile = run1.vec # VEC file
[OUTPUT]
prjfile = run2.prj # PRJ file
```

In `[INPUT]`

section, reffile is the reference structure used in tutorial-1.1, and `pdb_avefile`

and `pdbaftfile`

are the average coordinates of the analysis and fitting atoms, respectively, obtained in tutorial-4.3. `valfile`

and `vecfile`

are the output of `eigmat_analysis`

.

`[TRAJECTORY]`

, `[SELECTION]`

, and `[FITTING]`

sections, which follow the above, are the same as before. See tutorial-4.1 for details.

```
[OPTION]
check_only = NO # (YES/NO)
vcv_matrix = Global # (GLOBAL/LOCAL)
num_pca = 2 # # of principal components
analysis_atom = 1 # analysis target atom group
```

In `[OPTION]`

section, we specify the number of PCs to project the trajectory. The PCs are selected in a decreasing order of their contributions.

The analysis is executed by `run.sh`

script:

```
$ cat run.sh
#!/bin/bash
# set the number of OpenMP threads
export OMP_NUM_THREADS=1
# perform analysis with eigmat_analysis and prjcrd_analysis
$ eigmat_analysis run1.inp | tee run1.out
$ prjcrd_analysis run2.inp | tee run2.out
```

Now, let’s run the script:

```
$ chmod +x run.sh
$ ./run.sh
$ ls
run.sh run1.cnt run1.inp run1.out run1.val
run1.vec run2.inp run2.out run2.prj
```

We obtain `run2.prj`

which contains the projection of each structure onto the calculated principal component axes. The i-th principal component μi of each snapshot is defined by the dot product:

\( \displaystyle \mu_i = \mathbf{v}_i \cdot (\mathbf{q}- \langle \mathbf{q} \rangle) \)

where \(\displaystyle \mathbf{v}_i \) is the i-th eigenvector, **q** is the coordinates of each snapshot in the trajectory, and \( \langle \mathbf{q} \rangle \) is the

averaged coordinates. The following figure is the plot of 2nd and 3rd columns of `run2.prj`

:

```
$ gnuplot
gnuplot> set xlabel "PC1"
gnuplot> set ylabel "PC2"
gnuplot> set xrange [-1:1]
gnuplot> set yrange [-1:1]
gnuplot> set size square
gnuplot> plot "run2.prj" using 2:3
```

*Written by Kiyoshi Yagi@RIKEN Theoretical molecular science laboratory
*

*July, 17, 2016*