17.2 Multi-scale cryo-EM flexible fitting
MD-based flexible fitting to the cryo-electron microscopy (Cryo-EM) density maps at resolutions close to 5 Å is challenging, when a protein domain has to undergo a sophisticated conformational change, such as domain rotation. The protocol presented here and described in [1] involves a few steps, that significantly increase the chances for the correct fitting in those difficult cases. It employs the replica-exchange flexible fitting to obtain a correct protein movement/rotation [2]. To speed up the calculations, the coarse-grained Go-model developed by Karanicolas and Brooks (KB Go-model) [3,4] is used at this step. Then, to map the coarse-grained model to the full-atom model, the targeted MD method is used. Finally, the obtained structure is refined with the all-atom MD flexible fitting [5].
Before going through this tutorial, it is advised to complete the basic tutorials concerning the basic MD-based flexible fitting protocol (Tutorial 17.1), the MD simulations using the GB/SA model (Tutorial 8.1), the coarse-grained MD simulations using the KB Go-model (Tutorial 7.1) and the replica-exchange MD (REMD) simulations (Tutorial 10.1).
Preparation
The tutorial files are available for download here: tutorial-17.2.tar.gz . As described in Tutorial 17.1, installing the SITUS program package and making a symbolic link to the CHARMM toppar directory is necessary.
$ cd Tutorials
$ mv ~/Downloads/tutorial-17.2.tar.gz ./
$ tar -xvzf tutorial-17.2.tar.gz
$ cd tutorial-17.2
$ ln -s ../../Others/CHARMM/toppar ./
$ ls
01_CG_emmap 02_CG_build 03_CG_REMD_fitting 04_CG_analysis
05_AA_emmap 06_AA_build 07_AA_minimize 08_AA_targeted_MD
09_AA_refinement 10_AA_analysis toppar
1. Coarse-grained (CG) simulations
1.1. Preparing a simulated density map of the target state
In this tutorial, a small protein (adenylate kinase) in open state will be fitted to a simulated density map in the closed state. First, the density map for the coarse-grained simulations will be generated as the target for the fitting, basing on the PDB: 1AKE. Downloading and modifying the PDB file is necessary, so that the structure contains only protein chain A with the non-hydrogen atoms:
# Download the PDB file of the closed state of adenylate kinase
$ cd 01_CG_emmap
$ wget https://files.rcsb.org/download/1AKE.pdb
# Generate a new PDB file that contains protein heavy atoms
$ vmd -dispdev text 1AKE.pdb
vmd > set sel [atomselect top "chain A and protein and not hydrogen and not altloc B"]
vmd > $sel writepdb closed.pdb
vmd > exit
The synthetic density map is generated with the emmap_generator with the provided INP file and should be visualized in VMD, as described in Tutorial 17.1.
# Generate synthetic EM density map (CG_closed.sit)
$ /home/user/GENESIS/bin/emmap_generator INP > log
1.2. Building the initial structure
The structure of adenylate kinase in the open state should be downloaded and modified in the same fashion as it was done with structure in the closed state:
# Download PDB file of the open state of adenylate kinase
$ cd ../02_CG_build/
$ wget https://files.rcsb.org/download/4AKE.pdb
$ ls
4AKE.pdb
# Generate a new PDB file that contains protein heavy atoms
$ vmd -dispdev text 4AKE.pdb
vmd > set sel [atomselect top " chain A and protein and not hydrogen and not altloc B"]
vmd > $sel writepdb open.pdb
vmd > exit
Now, it is possible to visualize the open.pdb in VMD, together with the situs map CG_closed.sit. The structure of adenylate kinase in the open state is shifted from the map, so the initial rigid body fitting step with colores is necessary:
# Perform rigid docking using SITUS
$ /home/user/Situs_3.1/bin/colores ../01_CG_emmap/CG_closed.sit open.pdb -res 5 -nprocs 4
This calculation results in 7 new col_*.pdb files with fitted structures. For further calculations, we should choose the structure with the highest cross-correlation coefficient (c.c.) between the fitted structure and simulated densities:
# Check the c.c. value in the obtained PDB files
$ grep "Unnormalized correlation coefficient:" col_*.pdb
col_best_001.pdb:REMARK Unnormalized correlation coefficient: 0.709285
col_best_002.pdb:REMARK Unnormalized correlation coefficient: 0.614500
col_best_003.pdb:REMARK Unnormalized correlation coefficient: 0.601603
col_best_004.pdb:REMARK Unnormalized correlation coefficient: 0.590259
col_best_005.pdb:REMARK Unnormalized correlation coefficient: 0.590186
col_best_006.pdb:REMARK Unnormalized correlation coefficient: 0.580958
col_best_007.pdb:REMARK Unnormalized correlation coefficient: 0.562662
As a result, col_best_001.pdb will be used as the initial structure to generate the coarse-grained KB Go-modelparameters. For this purpose, we will use the MMTSB Web Service, as described in details in the Tutorial 7.1: Coarse-grained MD simulation with the KB Go-model. First, we should prepare the col_best_001.pdb file for the submission to the server:
# "clean" the file by deleting the entries other than ATOM
$ grep -e "^ATOM" col_best_001.pdb > col_best_001_edited.pdb
Next, we submit it to the MMTSB server, with a reference tag (i.e. open). The tar-ball file is sent to your email address. We unpack it in the working directory:
# extract the tar-ball file
$ tar -xvf open.tar
To build the Go-model for simulations, several modifications have to be applied to the files from MMTSB, such as replacing the atom names. Superimposing the coordinates of molecule in GO_open.pdb file on the original coordinates submitted to the server can be done using the CA atoms positions. Then it is possible to generate the go.psf and go.pdb files for the CG simulation using VMD. For the details, please see Tutorial 7.1. This is done with the provided setup.tcl script:
##### read pdb
mol load pdb GO_open.pdb
mol load pdb col_best_001_edited.pdb
##### calculate the shift matrix using the CA atoms positions
set base_shifted [atomselect 0 "name CA"]
set base_original [atomselect 1 "name CA"]
set matrix [measure fit $base_shifted $base_original]
##### move the GO_open.pdb structure
set shifted_molecule [atomselect 0 "all"]
$shifted_molecule move $matrix
##### replace residue names with G1, G2, G3, ...
set residue_list [lsort -unique [$shifted_molecule get resid]]
foreach i $residue_list {
set resname_go [format "G%d" $I]
set res [atomselect 0 "resid $i" frame all]
$res set resname $resname_go
}
$shifted_molecule writepdb tmp.pdb
##### generate PSF and PDB files
package require psfgen
resetpsf
topology GO_open.top
segment PROT {
first none
last none
pdb tmp.pdb
}
regenerate angles dihedrals
coordpdb tmp.pdb PROT
# write psf and pdb files
writepsf go.psf
writepdb go.pdb
exit
The script can be run with VMD:
$vmd -dispdev text <setup.tcl | tee run.out
The structure should be visually inspected. The positions of coordinates should coincide with the density map, so it is worth to open and see also the ../01_CG_emmap/CG_closed.sit in VMD:
$ vmd -situs ../01_CG_emmap/CG_closed.sit
vmd > mol load pdb ../02_CG_build/go.pdb psf ../02_CG_build/go.psf
1.3. Flexible fitting with replica-exchange scheme
When using the coarse-grained models, it is not necessary to run minimization and equilibration. We can run the production step immediately. For real applications, all the simulations shown in this tutorial should have more time steps to achieve a better convergence. The input file for the production run using 32 replicas is provided in the next folder:
# Change directory to run the CG simulation
$ cd ../03_CG_REMD_fitting
$ ls
INP
Have a look at the parameters provided in the INP file, both for REMD and flexible fitting parts:
[INPUT]
topfile = ../02_CG_build/GO_open.top # topology file
parfile = ../02_CG_build/GO_open.param # parameter file
psffile = ../02_CG_build/go.psf # protein structure file
pdbfile = ../02_CG_build/go.pdb # PDB file
[OUTPUT]
dcdfile = md_{}.dcd
rstfile = md_{}.rst
logfile = log_{}.log
remfile = md_{}.rem
[ENERGY]
forcefield = KBGO # [CHARMM,GO]
electrostatic = CUTOFF # [CUTOFF,PME]
switchdist = 18.0 # switch distance
cutoffdist = 20.0 # cutoff distance
pairlistdist = 21.5 # pairlist distance
[DYNAMICS]
integrator = LEAP # [LEAP,VVER]
nsteps = 1000000 # number of MD steps for 20ns
timestep = 0.02 # timestep (ps)
eneout_period = 100 # output energy
crdout_period = 10000
rstout_period = 1000000
nbupdate_period = 10 # update nonbond list
iseed = 1
[REMD]
dimension = 1
exchange_period = 100
iseed = 1
type1 = RESTRAINT
nreplica1 = 32
cyclic_params1 = NO
rest_function1 = 1
[CONSTRAINTS]
rigid_bond = YES # constraints all bonds
fast_water = NO # settle constraint
shake_tolerance = 1.0e-6 # tolerance (Angstrom)
[ENSEMBLE]
ensemble = NVT # [NVE,NVT,NPT,NPAT]
tpcontrol = BERENDSEN # [NO,BERENDSEN,NOSE-HOOVER,LANGEVIN,GAUSS] for [LEAP]
temperature = 200 # initial temperature (K)
tau_t = 1.0 # thermostat friction (ps-1) in [LANGEVIN]
[BOUNDARY]
type = NOBC
[SELECTION]
group1 = all
[RESTRAINTS]
nfunctions = 1
function1 = EM
constant1 = 500 524 555 593 636 684 737 794 856 921 991 1065 1144 1228 1318 1415 1519 \
1632 1755 1889 2035 2195 2371 2564 2777 3011 3269 3553 3865 4209 4586 5000
select_index1 = 1
[EXPERIMENTS]
emfit = YES
emfit_target = ../01_CG_emmap/CG_closed.sit
emfit_sigma = 2.5 # Sigma in simulated map definition
emfit_tolerance = 0.01 # Tolerance for simulated map calculation
emfit_period = 1 # update of the force
The set of 32 force constants provided here was previously tested in reference [1] to ensure sufficient overlaps between the potential energies of replicas and, as a result, the exchange ratios in the range of 0.2 – 0.3 for most of the studied protein systems. The exchange period (100) was also previously tested and yielded better results for tested proteins than exchange periods 10 or 1000. Note, that by decreasing the exchange period to 0 you can automatically perform 32 independent fitting runs at different force constants.
Now, we are ready to execute mpirun using 64 CPU cores (the number of CPU cores should be a multiplication of the number of replicas). This calculation may take a few hours and should be submitted as a batch job on the cluster.
# Perform REMD flexible fitting
$ export OMP_NUM_THREADS=2
$ mpirun -np 32 /home/user/GENESIS/bin/atdyn INP > log
Many files are produced as a result of the simulation: md_*.dcd, md_*.rem, md_*.rst, log_*.log for each replica * and log. The log_*.log files contain information about time course of the energy and c.c. of each replica. After the fitting, it is necessary to visualize at least one trajectory with VMD to ensure that the structure is well fitted to the target density.
$ vmd -situs ../01_CG_emmap/CG_closed.sit
vmd > mol load pdb ../02_CG_build/go.pdb psf ../02_CG_build/go.psf
vmd > mol addfile md_1.dcd
1.4. Analysis of the fitting results
It is time to analyze the changes in c.c. during the simulation, saved in the column 14: RESTR_CVS001 in the log_*.log files. It can be done simultaneously for all 32 replicas in a “for” loop with a simple bash script, provided in the tutorial files. This script also shows which replica have given the highest c.c. Have a look at the script:
#!/bin/sh
for REP in `seq 1 32`
do
grep 'INFO:' ../03_CG_REMD_fitting/log_${REP}.log | awk '{printf ("%10s\t%20s\n",$2,$14)}' | grep -v "RESTR" > temp1
awk -v var=log_${REP}.log '{print var,"\t",$0}' temp1 > cc-list$REP
rm temp1
highest_cc=$(awk '{print $3}' cc-list$REP | sort -n | tail -1)
grep "$highest_cc" cc-list$REP >> temp2
done
sort -rk3 temp2 > the-highest-cc.txt
rm temp2
We run the bash script in the following way:
# Change directory for analysis
$ cd ../04_CG_analysis
# Run the bash script
$ chmod +x analyze-cc.sh
$ ./analyze-cc.sh
The script has generated a series of files cc-list* for each replica and a single file the-highest-cc.txt. All files contain the log name in the first column, time step in the second column and c.c. in the last column.
1.5. Choosing the model with the highest cross-correlation coefficient
The file the-highest-cc.txt gathers the highest c.c. values from all replicas and it is sorted according to the c.c. value in the last column:
log_1.log 634300 0.9065
log_15.log 223000 0.9064
log_11.log 623100 0.9064
log_5.log 248000 0.9063
log_8.log 316300 0.9062
...
It is visible that the replica 1 reached the highest c.c. 0.9065 in the time step 634300. Thus, we encourage the users to calculate the RMSD and to generate a graph “Time courses of c.c. and Cα RMSD with respect to the target structure” for replica 1 by yourself. This can be done in a very similar way as it was described in the Tutorial 17.1.
For the targeted MD simulation, the frame from the REMD simulation that has the highest c.c. has to be extracted from the trajectory:
$ vmd -dispdev text
vmd > mol load pdb ../02_CG_build/go.pdb psf ../02_CG_build/go.psf
vmd > mol addfile ../03_CG_REMD_fitting/md_1.dcd
vmd > animate write pdb CG_open.pdb beg 63 end 63
vmd > exit
2. All atom (AA) simulations
2.1. Preparing a simulated density map of the target state
The simulated density map for the AA simulations is prepared in a similar way as for CG simulations, but a smaller voxel size is used. The synthetic density map is again generated with the emmap_generator and should be visualized.
# Generate synthetic EM density map (AA_closed.sit)
$ /home/user/GENESIS/bin/emmap_generator INP > log
2.2. Building the initial structure
We use the best structure fitted with Situs (../02_CG_build/col_best_001.pdb) as the initial coordinates for the targeted MD simulation. Here, we use the CHARMM force field and the initial pdb and psf files can be prepared with VMD as in Tutorial 2.2:
$ cd ../06_AA_build
$ vmd -dispdev text
vmd > package require psfgen
vmd > resetpsf
vmd > topology ../toppar/top_all36_prot.rtf
vmd > pdbalias residue HIS HSD
vmd > pdbalias atom ILE CD1 CD
vmd > segment PROA {pdb ../02_CG_build/col_best_001.pdb}
vmd > coordpdb ../02_CG_build/col_best_001.pdb PROA
vmd > guesscoord
vmd > writepdb open.pdb
vmd > writepsf open.psf
vmd > exit
Now, it is time to prepare the target pdb file for the targeted MD simulation. We will use the MD frame with the highest c.c. derived in section 1.5. Choosing the model with the highest cross-correlation coefficient. Unfortunately, this MD frame does not have any other atoms than CA and the target pdb in GENESIS needs to have the same number of atoms as the pdb file used in this simulation, even if we select only the CA atoms as a target. Therefore, we copy the open.pdb file, substitute the coordinates of the CA atoms with the coordinates from the CG_open.pdb and save the new “fake” pdb file as CG_open2.pdb. The bash script prepare-target.sh for this operation is given in the folder 06_AA_build.
# Run the bash script
$ chmod +x prepare-target.sh
$ ./prepare-target.sh
2.3. Energy minimization
Before the targeted MD simulation, the structure needs to undergo energy minimization to remove steric clashes. The input file for this calculation is provided.
# Perform energy minimization using 16 CPU cores
$ cd ../07_AA_minimize
$ export OMP_NUM_THREADS=4
$ mpirun -np 4 /home/user/GENESIS/bin/atdyn INP > log
2.4. Targeted MD towards the chosen model from CG simulations
Next, we perform the targeted MD in implicit solvent.
# Change directory to run targeted MD
$ cd ../08_AA_targeted_MD
The control file is given in the folder. For the targeted simulation, the CA atoms are chosen:
[SELECTION]
group1 = an:CA
[RESTRAINTS]
nfunctions = 1
function1 = RMSD
constant1 = 500
reference1 = 0
select_index1 = 1
[FITTING]
fitting_method = TR+ROT
fitting_atom = 1
mass_weight = NO
Now it is necessary to extract the last frame from the targeted fitting. Also, after the targeted MD, the structure might be still shifted a little with respect to the positions of the CA atoms. This is why we additionally superimpose the obtained structure on the CA positions after REMD fitting in VMD with the provided setup.tcl script:
##### save the last frame
mol load pdb ../06_AA_build/open.pdb psf ../06_AA_build/open.psf
mol addfile md.dcd
animate write pdb targeted_last.pdb beg last end last
mol delete all
##### read pdb
mol load pdb targeted_last.pdb
mol load pdb ../06_AA_build/CG_open2.pdb
##### calculate the shift matrix using the CA atoms positions
set base_shifted [atomselect 1 "name CA"]
set base_original [atomselect 2 "name CA"]
set matrix [measure fit $base_shifted $base_original]
##### move the targeted_last.pdb structure
set shifted_molecule [atomselect 1 "all"]
$shifted_molecule move $matrix
$shifted_molecule writepdb targeted_shifted.pdb
exit
Let’s visualize the targeted MD trajectory with VMD. We need to check if the CA atoms in targeted_ shifted.pdb are well fitted to the CA positions from the KB Go-model simulations and how big is the difference in CA positions before and after the superimposing:
$ vmd -pdb ../06_AA_build/open.pdb ../06_AA_build/open.psf md.dcd
vmd > mol load pdb targeted_shifted.pdb
vmd > mol load pdb ../04_CG_analysis/CG_open.pdb
2.5. Refinement of the structure with flexible fitting
The last step of the multi-scale protocol involves the refinement of the obtained fitted model with the MD flexible fitting using CHARMM force field in implicit solvent. The control file is also provided in the directory.
# Change directory to run the MD fitting
$ cd ../09_AA_refinement
The biasing potential set on the protein heavy atoms is very high, because the protein is already fitted to the density and we want to refine the positions of the side chains. See the part of the input file responsible for the fitting:
[SELECTION]
group1 = all and heavy
[RESTRAINTS]
nfunctions = 1
function1 = EM
constant1 = 20000
select_index1 = 1
[EXPERIMENTS]
emfit = YES
emfit_target = ../05_AA_emmap/AA_closed.sit
emfit_sigma = 2.5 # Sigma in simulated map definition
emfit_tolerance = 0.01 # Tolerance for simulated map calculation
emfit_period = 1 # update of the force
The MD flexible fitting simulation can be executed with ATDYN using 16 CPU cores:
# Perform flexible fitting
$ export OMP_NUM_THREADS=4
$ mpirun -np 4 /home/user/GENESIS/bin/atdyn INP > log
The files run.dcd, run.pdb, run.rst and log are produced during the simulation.
2.6. Analysis of the fitting results
The RMSD with respect to the target structure, the time course of c.c. and MolProbity score can be easily analyzed for the MD flexible fitting refinement trajectory. However, this was already described in details in Tutorial 17.1 so the users are advised to do those analyses by themselves.
# Change directory for analysis
$ cd ../10_AA_analysis
Here, we find the highest value of c.c. in the final simulation. The following commands extract the time step (first column) and c.c. value (second column) and sort according to the c.c. value:
# Find the highest c.c.
$ grep "INFO:" ../09_AA_refinement/log| awk '{print $2, $19}' | grep -v "RESTR" > cc.log
$ sort -rk 2 cc.log | head -5
39500 0.9840
47000 0.9837
46500 0.9837
34000 0.9837
48500 0.9836
The structure characterized with the highest c.c. is supposed to be the best structure obtained in the fitting simulation. It is reached at the 39500 time step with the crdout_period = 500, so it will be the 79th frame in the trajectory. Let’s extract this structure from the trajectory and compare it with the initial structure and the best structures obtained in the CG REMD fitting and in targeted MD. We should also visualize the trajectory in VMD to see how well the structure fits the density:
# Find the best fitted structure after the MD refinement
$ vmd -situs ../05_AA_emmap/AA_closed.sit
vmd > mol load pdb ../06_AA_build/open.pdb psf ../06_AA_build/open.psf
vmd > mol addfile ../09_AA_refinement/md.dcd
vmd > animate write pdb final_best.pdb beg 79 end 79
# Load the initial structure after Situs fitting
vmd > mol load pdb ../02_CG_build/col_best_001.pdb
# Load the best fitted structure after REMD
vmd > mol load pdb ../04_CG_analysis/CG_open.pdb
# Load the best fitted structure after targeted MD
vmd > mol load pdb ../08_AA_targeted_MD/targeted_shifted.pdb
# Load the best fitted structure after MD refinement
vmd > mol load pdb final_best.pdb
The structures are compared in the figure below. The target map with the initial structure after Situs fitting is in cyan, the best fitted structure after REMD is shown as red spheres, the best fitted structure after targeted MD is in green, the best fitted structure after MD refinement is in orange together with the target structure, again in cyan. The multi-scale protocol facilitated obtaining a structure very close to the target one. Adenylate Kinase shows a relatively easy domain movement and it was used in this tutorial due to its small size that makes the calculations less time-consuming. However, the multi-scale protocol is dedicated mainly for the protein systems with complex domain movements, that are hard to capture using a one-step flexible fitting simulation.
References
[1] M. Kulik et al., Front. Mol. Biosci., 8, 631854 (2021).[2] O. Miyashita et al., J. Comput. Chem., 38, 1447-1461 (2017).
[3] J. Karanicolas and C. Brooks, Prot. Sci., 11, 2351-2361 (2002).
[4] J. Karanicolas and C. Brooks, J. Mol. Biol., 334, 309-325 (2003).
[5] T. Mori et al., Structure, 27, 161-174 (2019).
Written by Marta Kulik@RIKEN Theoretical molecular science laboratory & University of Warsaw
February, 2021