GENESIS Tutorial 17.1

17.1 Cryo-EM flexible fitting using a simulated map

Contents

Preparation
1. Making a simulated density map of the target state
2. Building the initial structure
3. Energy minimization
4. Flexible fitting
5. Trajectory analysis
- 5.1 Cross-correlation coefficient and RMSD
- 5.2 MolProbity score
References

Cryo-electron microscopy (Cryo-EM) is a powerful tool to determine three-dimensional structures of biomolecules at near atomic resolution. Flexible fitting has been widely utilized to model the atomic structure from the experimental density map [1]. One of the commonly used methods is the MD-based flexible fitting [2]. In GENESIS, c.c.-based flexible fitting is available in ATDYN and SPDYN [3]. In addition to the conventional methods, flexible fitting with the replica-exchange schemes (REUSfit [4]) or with the all-atom Go-model (MDfit [5]) is available in GENESIS. Here, we introduce a basic procedure to perform MD-based flexible fitting using a simulated density map with the GB/SA implicit solvent model (see also Tutorial 8.1) [6]. Experimental density map will be used in Tutorial 17.2 (under construction).

Preparation

First of all, let’s download the tutorial file (tutorial19-17.1.zip). In this tutorial, we additionally use the SITUS program package. If you have not installed it yet in your computer, please download and install the program in advance. Hereafter, we assume that the binary programs of SITUS ver. 3.1 are installed in /home/user/Situs_3.1/bin/. Since we use the CHARMM36m force field parameters, we make a symbolic link to the CHARMM toppar directory (see Tutorial 2.1).

# Put the tutorial file in the Works directory
$ cd ~/GENESIS_Tutorials-2019/Works
$ mv ~/Downloads/tutorial19-17.1.zip ./
$ unzip tutorial19-17.1.zip

# Clean up the directory
$ mv tutorial19-17.1.zip TRASH

# Update the README file
$ echo "tutorial-17.1: Cryo-EM flexible fitting using a simulated map" >> README

# Check the contents of Tutorial 17.1
$ cd tutorial-17.1
$ ln -s ../../Data/Parameters/toppar_c36_jul21 ./toppar
$ ln -s ../../Programs/genesis-1.7.1/bin ./bin
$ ls 
1_emmap  3_minimize  5_analysis  toppar
2_build  4_fitting   bin

1. Making a simulated density map of the target state

Before dealing with a “real experimental” density map, let’s practice the flexible fitting using a “simulated” density map. Here, we employ the glucose/galactose binding protein (GGBP). Recently, the open and closed states of GGBP have been solved with the X-ray crystallography. We would like to create a target density map from the crystal structure of the open state (PDB: 2FW0), and carry out the flexible fitting from the closed state (PDB: 2FVY). Therefore, we know the “answer” of the fitting.

First, we download the PDB file of the open state (PDB: 2FW0), and create a processed PDB file that contains only protein heavy atoms by using VMD.

# Download PDB file of the open state of GGBP 
$ cd 1_emmap
$ wget https://files.rcsb.org/download/2FW0.pdb

# Generate a new PDB file that contains protein heavy atoms
$ vmd -dispdev text 2FW0.pdb 
vmd > set sel [atomselect top "protein and not hydrogen and not altloc B"] 
vmd > $sel writepdb open.pdb
vmd > exit

To generate a synthetic density map from open.pdb, we use emmap_generator in the GENESIS analysis tool. We generate a control file INP using the “-h ctrl” option, and edit the file.

# Generate a control file of emmap_generator 
$ ../bin/emmap_generator -h ctrl > INP
$ ls
2FW0.pdb  INP  open.pdb

We specify the following parameters in the control file. We create a map with the 5 Å resolution (sigma = 2.5 Å), where the file format is set to SITUS. The variable auto_margin makes a margin between the map edge and minimum or maximum coordinates of protein atoms according to the specified margin_size. Note that the original PDB file (2FW0.pdb) cannot be used as input in this tool.

# Control file of emmap_generator
#
[INPUT]
pdbfile       = open.pdb 

[OUTPUT]
mapfile       = open.sit

[OPTION]
check_only    = NO        # (YES/NO)
allow_backup  = NO        # (YES/NO)
map_format    = SITUS     # (SITUS)
auto_margin   = YES       # (YES/NO)
margin_size_x = 20.0      # margin size
margin_size_y = 20.0      # margin size
margin_size_z = 20.0      # margin size
voxel_size    = 1.0       # voxel size
sigma         = 2.5       # resolution parameter
tolerance     = 0.001     # tolerance (0.001 = 0.1%)

After running emmap_generator for INP, let’s view the generated density map and open.pdb by using VMD. The following is an example to load the map file by the command line. This can be also done by using a mouse as follows [File > New Molecule > Browse > Select “open.sit” > Load].

# Generate synthetic EM density map (open.sit)
$ ../bin/emmap_generator INP > log
$ ls
2FW0.pdb  INP  log  open.pdb  open.sit

# Visualization by using VMD
$ vmd -situs ../1_emmap/open.sit
vmd > mol load pdb open.pdb

Figure 1. Comparison between the PDB and synthetic density map. Let’s change the representation of the density map by specifying “Wireframe” or by moving a seek bar of the “Isovalues”, which show the absolute value of the density data.

2. Building the initial structure

Second, we prepare the initial structure of the closed state (PDB: 2FVY). Again, we download the PDB file and create a processed PDB file that contains only protein heavy atoms:

# Download PDB file of the closed state of GGBP
$ cd ../2_build/
$ wget https://files.rcsb.org/download/2FVY.pdb
$ ls
2FVY.pdb

# Generate PDB file containing protein heavy atoms
$ vmd -dispdev text 2FVY.pdb
vmd > set sel [atomselect top "protein and not hydrogen and not altloc B"] 
vmd > $sel writepdb proa.pdb
vmd > exit
$ ls
2FVY.pdb  proa.pdb

Let’s visualize proa.pdb and open.sit by using VMD. We can see that the position of the initial structure is significantly shifted from the target density. We have to fit the initial structure to the map as much as possible with a rigid docking protocol before the flexible fitting. For this purpose, we use the colores program in SITUS (for details, see here). We run the following command, where the resolution of 5 Å and 4 CPU cores are specified. It may take a few minutes for this calculation.

# Perform rigid docking using SITUS
$ /home/user/Situs_3.1/bin/colores ../1_emmap/open.sit proa.pdb -res 5 -nprocs 4

We obtained several PDB files, named col_best_00X.pdb. We should select the best model among them. The cross-correlation coefficient (c.c.) between the target and simulated densities is written in the header information of each PDB file. We see that col_best_001.pdb shows the best c.c., which means that this PDB is the best fitted model.

# Check the c.c. value in the obtained PDB files
$ ls col_*.pdb
col_best_001.pdb  col_best_003.pdb  col_best_005.pdb  col_best_007.pdb
col_best_002.pdb  col_best_004.pdb  col_best_006.pdb

$ grep "Unnormalized correlation coefficient:" col_*.pdb
col_best_001.pdb:REMARK Unnormalized correlation coefficient: 0.876832
col_best_002.pdb:REMARK Unnormalized correlation coefficient: 0.798847
col_best_003.pdb:REMARK Unnormalized correlation coefficient: 0.798821
col_best_004.pdb:REMARK Unnormalized correlation coefficient: 0.764940
col_best_005.pdb:REMARK Unnormalized correlation coefficient: 0.751088
col_best_006.pdb:REMARK Unnormalized correlation coefficient: 0.750453
col_best_007.pdb:REMARK Unnormalized correlation coefficient: 0.405601

Let’s compare col_best_001.pdb and open.sit by using VMD to check whether the protein is actually superimposed onto the map.

We use col_best_001.pdb as the initial coordinates of the flexible fitting. In order to use the CHARMM force field in the flexible fitting, we create PDB and PSF files with the all-atom model by using VMD (for details, see Tutorial 2.2):

$ vmd -dispdev text
vmd > package require psfgen 
vmd > resetpsf
vmd > topology ../toppar/top_all36_prot.rtf 
vmd > pdbalias residue HIS HSD 
vmd > pdbalias atom ILE CD1 CD 
vmd > segment PROA {pdb col_best_001.pdb} 
vmd > coordpdb col_best_001.pdb PROA 
vmd > guesscoord 
vmd > writepdb closed.pdb 
vmd > writepsf closed.psf
vmd > exit

We obtained closed.pdb and closed.psf. In addition to these files, we make a PDB file with the all-atom model for open.pdb, which will be used for the analysis of the RMSD with respect to the target state.

$ cd ../1_emmap
$ vmd -dispdev text
vmd > package require psfgen
vmd > resetpsf
vmd > topology ../toppar/top_all36_prot.rtf 
vmd > pdbalias residue HIS HSD 
vmd > pdbalias atom ILE CD1 CD 
vmd > segment PROA {pdb open.pdb} 
vmd > coordpdb open.pdb PROA 
vmd > guesscoord 
vmd > writepdb target.pdb 
vmd > exit

3. Energy minimization

Before running the MD-based flexible fitting, energy minimization is usually needed to remove steric clashes in the initial structure. The control file is already contained in the directory. Here, we simply perform the energy minimization. We execute ATDYN for INP. After the calculation, min.dcd and min.rst are obtained.

# Perform energy minimization using 16 CPU cores
$ cd ../3_minimize
$ export OMP_NUM_THREADS=4
$ mpirun -np 4 ../bin/atdyn INP > log
$ ls
INP  log  min.dcd  min.rst

4. Flexible fitting

Then, we carry out the flexible fitting. The control file is already included in the directory.

# Change directory to run the simulation
$ cd ../4_fitting
$ ls
INP

# Take a look at the control file
$ less INP

The following shows the most important parts in the control file. You can recognize that the flexible fitting is a kind of “restrained MD simulation”, and most parameters are common with the conventional MD simulations. We apply the biasing potential on the protein heavy atoms, where the force constant of the bias is set to 5,000 kcal/mol. emfit_sigma is a resolution parameter, which is usually set to the half of the resolution of the target density map.

[INPUT]
topfile = ../toppar/top_all36_prot.rtf
parfile = ../toppar/par_all36m_prot.prm
pdbfile = ../2_build/closed.pdb
psffile = ../2_build/closed.psf
rstfile = ../3_minimize/min.rst

[SELECTION]
group1          = all and not hydrogen

[RESTRAINTS]
nfunctions      = 1
function1       = EM                  # apply restraints from EM density map
constant1       = 5000                # force constant in Ebias = k*(1 - c.c.)
select_index1   = 1                   # apply restraint force on protein heavy atoms

[EXPERIMENTS]
emfit           = YES                 # perform EM flexible fitting
emfit_target    = ../1_emmap/open.sit # target EM density map
emfit_sigma     = 2.5                 # half of the map resolution (5 A)

We execute ATDYN for INP. The following is an example to execute mpirun using 16 CPU cores:

# Perform flexible fitting
$ export OMP_NUM_THREADS=4
$ mpirun -np 4 ../bin/atdyn INP > log
$ ls
INP  log  run.dcd  run.pdb  run.rst

After the calculation, we obtain run.dcd, run.pdb, and run.rst, where the PDB file is the structure at 50 ps. The log file contains time courses of the energy and c.c. (Column 19: RESTR_CVS001). Let’s view the trajectory by using VMD. You can see that the structure is well fitted to the target density.

$ vmd -situs ../1_emmap/open.sit
vmd > mol load pdb ../2_build/closed.pdb psf ../2_build/closed.psf
vmd > mol addfile run.dcd

Figure 2. Comparison between the Initial (green) and fitted (red) structures.

5. Trajectory analysis

5.1 Cross-correlation coefficient and RMSD

First, we analyze the time courses of c.c., which can be easily obtained by using the following command. We select the value in the Column 19 (RESTR_CVS001) in the log file. You can see that the c.c. is successfully increased during the fitting.

# Change directory for analysis
$ cd ../5_analysis

# Analyze time courses of c.c.
$ grep "INFO:" ../4_fitting/log | awk '{print $3, $19}' | grep -v "RESTR" > cc.log
$ less cc.log

Second, we analyze the RMSD with respect to the target structure, because we know what should be the correct “answer” in this flexible fitting. Here, we use rmsd_analysis in the GENESIS analysis tool set, as in Tutorial 3.3. We use target.pdb as reference coordinates. Please, confirm that the number of atoms as well as the order of atoms in target.pdb and closed.pdb are identical to each other, otherwise the atom selection or RMSD calculation cannot be done correctly.

# Analysis of the RMSD
$ ../bin/rmsd_analysis INP
$ ls
INP  cc.log  output.rms

We obtain output.rms. Let us plot the RMSD and c.c. by using gnuplot. We can see that the RMSD with respect to the target state is quickly decreased as the c.c. increased. The c.c. is eventually converged in 50 ps.

# Plot the c.c. and RMSD data
$ gnuplot
gnuplot> set encoding iso
gnuplot> set y2tics
gnuplot> set ytics nomirror
gnuplot> set xlabel "Time (ps)"
gnuplot> set ylabel "c.c."
gnuplot> set y2label "RMSD (\305)"
gnuplot> unset key
gnuplot> plot 'cc.log' u ($0/5):2 w l,'output.rms' u ($0/5):2 w l axis x1y2

Figure 3. Time courses of c.c. (purple line) and Cα RMSD with respect to the target structure (green line).

5.2 MolProbity score

Finally, we evaluate the quality of the obtained structure. One of the commonly used criteria is the MolProbity score [7], which represents how good the structure is as a protein. It can be computed with the GUI server. Let’s access to the MolProbity Server, and upload run.pdb obtained from the flexible fitting.

After the process, we click “Analyze geometry without all-atom contacts”.

Confirm that the “Clashes & clashscore” for Charts, plots, and tables is checked.

We found that the MolProbity score of run.pdb was 1.43.

References

F. Tama et al., J. Mol. Biol., 337, 985-999 (2004).
M. Orzechowski and F. Tama, Biophys. J., 95, 5692-5705 (2008).
T. Mori et al., Structure, 27, 161-174.e3 (2019).
O. Miyashita et al., J. Comput. Chem., 38, 1447-1461 (2017).
P. Whitford et al., PNAS, 108, 18943-18948 (2011).
A. Onufriev et al., Proteins, 55, 383-394 (2004).
V. B. Chen et al., Acta Cryst., D66, 12-21 (2010).

Written by Takaharu Mori@RIKEN Theoretical molecular science laboratory
April. 21, 2022

GENESIS

Generalized-Ensemble Simulation System