16.6 The enzyme reaction 2: Free-energy calculation
Contents
1. Introduction
In this section, we demonstrate the replica-exchange umbrella sampling (REUS) simulation based on a QM/MM potential and calculate the free-energy profile of an enzyme reaction. REUS [1] is one of the enhanced sampling methods, in which multiple MD simulations (replicas) are carried out with different restraint potentials, exchanging them stochastically. In tutorial 11.1, REUS has been performed with classical MM-MD. This tutorial is similar to it but differs in:
- The reaction coordinate is set to a pre-determined minimum energy path (MEP).
- QM/MM-MD simulations are carried out using QSimulate-QM.
Following tutorial 16.5, the method is applied to a proton transfer reaction of dihyroxyacetone phosphate (DHAP) catalyzed by triosephosphate isomerase (TIM) [2]. Although the MEP and restart files are provided in the tutorial file, it is recommended to first work on tutorial 16.5, where the MEP is obtained by the string method.
The target proton transfer reaction and selected atomic distances (r1 – r7) are schematically shown in Fig. 1 (a). The variation of r1 – r7 along the MEP is shown in Fig. 1 (b). The figure indicates that r1 – r4 strongly vary along the path, whereas r5 and r7 are rather insensitive. r6 is relatively flat compared to r1 – r4, yet it does change 0.1 Å before and after the reaction. Therefore, the five atomic distances, r1 – r4 and r6 , well represent the reaction coordinate. Here, we carry out the off-lattice REUS using these atomic distances as a collective variable (CV).
2. Setup of window
Download the tutorial file (tutorial19-16.6.zip, or github), unzip it, and proceed to tutorial-16.6/1.tim. The directory contains three sub-directories.
$ unzip tutorial19-16.6.zip $ cd tutorial-16.6/1.tim $ ls 2.equil/ 4.mep/ 5.reus/
2.equil
contains psf and pdb files, and a restart file,
$ ls 2.equil step4.11_qmmm_nvt.rst step4_nvt_100.pdb step4_nvt_100.psf
and 4.mep
contains the information of the MEP,
$ ls 4.mep/2.analysis rpath_93.dat
You are welcome to overwrite these files with your results in tutorial 16.5 (or add 5.reus
to your tutorial16.5/1.tim
). Proceed to 5.reus
, and you will find seven sub-directories.
$ cd 5.reus $ ls 0.window/ 2.equil2/ 3.prod3/ 4.pmf/ 1.equil1/ 2.equil2_analysis/ 3.prod3_analysis/
Now, let us generate the windows of US,
$ cd 0.window $ ls make_window.f90 make_window.sh win_rr.gpi
make_window.f90
is a fortran program to generate the windows, and make_window.sh
is a script to run the program.
$ cat make_window.sh #!/bin/bash gfortran make_window.f90 -o make_window ./make_window -dat ../../4.mep/2.analysis/rpath_93.dat \ -ds 0.1 -ndim 5 -idx 1,2,3,4,6 > win_rr.log
The first line compiles the program. “gfortran” can be replaced with other fortran compilers. The second line executes the program. The options are:
- -dat : the information of the MEP
- -ds : the interval of window (in Å)
- -ndim : the number of dimensions
- -idx : optionally specifies which distances are used. r1-rx are selected with “-ndim x” by default. “-idx 1,2,3,4,6” specifies r1-r4 and r6.
Now run the script,
$ ./make_window.sh $ ls make_window make_window.f90 make_window.sh win_rr.dat win_rr.gpi win_rr.log
The values of r1-r4 and r6 are written for each window in win_rr.dat,
$ cat win_rr.dat cat win_rr.dat 1 2.5300 1.0980 1.7620 1.8690 2.7770 2 2.4319 1.0968 1.7876 1.8607 2.7627 ... 21 1.0160 1.8780 2.2570 1.6640 2.6780
The variation along the MEP can be visualized by gnuplot,
$ gnuplot win_rr.gpi
$ ls
make_window make_window.sh win_rr.gpi win_rr.pdf
make_window.f90 win_rr.dat win_rr.log
Note that 21 windows are set by the program. The number of window depends on the window interval, “-ds”. A large “ds” value reduces the number of window and the computational cost, yet with a higher risk that the neighboring windows have less or insufficient overlap of the probability distribution. The window interval of 0.1 Å with a force constant of 100 kcal/mol/Å2 is often used for chemical reactions. Nonetheless, it is always good to check the overlap in the initial equilibration steps, as we shall see in the next sub-section.
3. Equilibration 1
The equilibration MDs are done in two steps (Fig. 3). First, the MD is carried out sequentially starting from the first window to the last one. Each window is propagated for 500 fs at the level of DFTB3/MM. Then, in the next step, MDs of all windows (replicas) are carried out in parallel for 1 ps at the level of B3LYP-D3/MM.
Proceed to 1.equil
,
$ cd 1.equil $ ls geninp1.sh qsimulate.json run.sh template.inp toppar
geninp1.sh
is a script to generate input files of GENESIS for each window based on a template file, template.inp
. The template file is shown below:
[INPUT] topfile = toppar/top_all36_prot.rtf, toppar/top_all36_cgenff.rtf parfile = toppar/par_all36_prot.prm, toppar/par_all36_cgenff.prm strfile = toppar/toppar_water_ions.str, toppar/toppar_dhap.3.str psffile = ../../2.equil/step4_nvt_100.psf # protein structure file pdbfile = ../../2.equil/step4_nvt_100.pdb # PDB file reffile = ../../2.equil/step4_nvt_100.pdb # reference file rstfile = ../../2.equil/step4.11_qmmm_nvt.rst # restart file [OUTPUT] dcdfile = equil1_ID.dcd rstfile = equil1_ID.rst [ENERGY] forcefield = CHARMM electrostatic = CUTOFF switchdist = 16.0 # switch distance cutoffdist = 18.0 # cutoff distance pairlistdist = 19.5 # pair-list distance water_model = NONE vdw_force_switch = YES [DYNAMICS] integrator = VVER timestep = 0.0005 # timestep (ps) nsteps = 1000 # number of MD steps crdout_period = 500 eneout_period = 500 rstout_period = 500 nbupdate_period = 10 stoptr_period = 10 iseed = 20190319 [CONSTRAINTS] rigid_bond = YES # constraints all bonds involving hydrogen shake_tolerance = 1.0D-10 hydrogen_type = BOTH fast_water = YES noshake_index = 4 9 10 11 # don't constraint these hydrogen atoms [ENSEMBLE] ensemble = NVT tpcontrol = BUSSI # thermostat temperature = 300.0 # temerature (K) [BOUNDARY] type = NOBC spherical_pot = yes # spherical potential [QMMM] qmtyp = qsimulate # QSimulate-QM qmcnt = qsimulate.json # control file of QSimulate-QM workdir = equil1_ID basename = job qmsave_period = 500 qmmaxtrial = 1 qmatm_select_index = 1 exclude_charge = group [SELECTION] group1 = sid:DHA or (sid:TIMA and (rno:95 or rno:165) and \ not (an:CA | an:C | an:O | an:N | an:HN | an:HA)) group2 = atomno:1900 or atomno:5687 # COM of TIMA/TIMB group3 = atomno:1442 # NE2 of HSE95 group4 = atomno:1443 # HE2 of HSE95 group5 = atomno:2559 # OE1 of GLU165 group6 = atomno:2560 # OE2 of GLU165 group7 = atomno:7584 # O2 of DHAP group8 = atomno:7585 # C3 of DHAP group9 = atomno:7587 # HO3 of DHAP group10 = atomno:7588 # H31 of DHAP group11 = atomno:7589 # H32 of DHAP [RESTRAINTS] nfunctions = 6 function1 = POSI constant1 = 10.0 select_index1 = 2 function2 = DIST # r1(OE2-H31) constant2 = 300.0 reference2 = R1 select_index2 = 6 10 function3 = DIST # r2(C3-H31) constant3 = 300.0 reference3 = R2 select_index3 = 8 10 function4 = DIST # r3(OE1-HO3) constant4 = 300.0 reference4 = R3 select_index4 = 5 9 function5 = DIST # r4(O2-HE2) constant5 = 300.0 reference5 = R4 select_index5 = 7 4 function6 = DIST # r6(NE2-HO3) constant6 = 300.0 reference6 = R5 select_index6 = 3 9
Important options are highlighted in red with comments in blue. Note that:
- [INPUT]: The files in
2.equil
are used to restart the job.rstfile=../../2.equil/step4.11_qmmm_nvt.rst
is for the first window. Other windows restarts from the previous window . - [OUTPUT], [QMMM]: “ID” is replaced by window ID.
- [ENERGY]: The switch and cutoff distances are longer than usual.
- [DYNAMICS]: The timestep is shorter than usual.
- [CONSTRAINTS]:
noshake_index
specifies hydrogen atoms where SHAKE is disabled. The selection indicies 4, 9, 10, and 11 refer to HE2 of His95, HO3, H31, H32 of DHAP, respectively. - [QMMM]: QSimulate-QM is specified as a QM program.
qmexe
is not needed, because GENESIS and QSimulate-QM are linked through dynamic libraries. - [SELECTION]: group1 is the QM region (DHAP and sidechain of His95 and Glu165), and group2-11 are used for the restraint.
- [RESTRAINTS]: The first function is a positional restraint of the center of mass of TIMA and TIMB. Functions 2 – 6 are the restraints of the atomic distances, r1 – r4 and r6, respectively. R1, R2, …, R5 are replaced by the reference values of each window.
qsimulate.json
is a control file of QSimulate-QM. In this case, we use the DFTB3 method,
{ "bagel" : [ { "title" : "molecule", "basis" : "dftb" }, { "title" : "force", "method" : [ { "title" : "dftb", "charge" : -3, "thresh" : 1.0e-5 } ] } ]}
Now, let’s run the script, geninp1.sh
,
$ ./geninp1.sh $ ls equil1_1.inp equil1_2.inp equil1_3.inp ...
equil1_xx.inp
are input files of each window. Note that equil1_n.inp
restarts from equil1_(n-1).rst
, as specified in rstfile
of [INPUT]. run.sh
is a script to run the job,
#!/bin/bash # export LD_LIBRARY_PATH=/path/to/qsimulate/lib:$LD_LIBRARY_PATH ... (1) export PATH=$PATH:/path/to/genesis/bin ... (2) export OMP_NUM_THREADS=4 export BAGEL_NUM_THREADS=${OMP_NUM_THREADS} export MKL_NUM_THREADS=${OMP_NUM_THREADS} export I_MPI_PERHOST=4 export I_MPI_DEBUG=5 nimg=$(wc ../0.window/win_rr.dat |awk '{print $1}') for i in `seq 1 $nimg`; do mpiexec.hydra -n 2 atdyn equil1_${i}.inp >& equil1_${i}.out ... (3) done exit 0
- Set the
LD_LIBRARY_PATH
to where the dynamic libraries of QSimulate-QM are installed. - Set the
PATH
to where GENESIS is installed. - Sequential run of equil1_1, equil1_2, … and equil1_nimg.
Now, run the job:
$ ./run.sh
4. Equilibration 2
When the job is done, proceed to 2.equil2
,
$ cd ../2.equil2 $ ls geninp2.sh qsimulate.json run.sh template.inp toppar
geninp2.sh
generates an input file based on template.inp
. The template file is similar to before, but now there is [REMD] section to run all windows (replicas) in parallel. We only show the different parts below.
[INPUT] ... rstfile = OLDNAME_{}.rst # restart file [OUTPUT] logfile = NEWNAME_{}.log # log file of each replica dcdfile = NEWNAME_{}.dcd # DCD trajectory file remfile = NEWNAME_{}.rem # parameter index file rstfile = NEWNAME_{}.rst # restart file [REMD] dimension = 1 # dimension exchange_period = 0 # no exchange iseed = 3141592 # random number seed type1 = RESTRAINT # REUS nreplica1 = NREP # number of replicas cyclic_params1 = NO # Yes, if the parameter is periodic rest_function1 = 2 3 4 5 6 # off-lattice REUS [DYNAMICS] integrator = VVER timestep = 0.0005 # timestep (ps) nsteps = 2000 # 1 ps in total crdout_period = 20 # output for analysis eneout_period = 20 ... [ENSEMBLE] ... tau_t = 0.5 # accelerate the equilibration [QMMM] ... workdir = NEWNAME ... [RESTRAINTS] nfunctions = 6 ... function2 = DIST # r1(OE2-H31) constant2 = FC1 reference2 = R1 select_index2 = 6 10 ...
- [INPUT], [OUTPUT], [QMMM]: NEWNAME and OLDNAME are basename of files. They are given by the arguments of
geninp2.sh
(see below). - [REMD]
- exchange_period: The period of exchange attempt. No attempt is made when exchange_period=0.
- type1=RESTRAINT: Invokes REUS in the first dimension
- nreplica1: The number of replicas of the first dimension
- rest_function1: The restraint function used for REUS. Here, off-lattice REUS is invoked in which multiple restraints are merged into a single reaction coordinate.
- [DYNAMICS]: crdout_period and eneout_period are set to 20 to save the data for analyses.
- [ENSEMBLE]: tau_t = 0.5 is smaller than the default. This enables to equilibrate the system faster (~0.5 ps).
- [RESTRAINTS]: FC1 and R1 are replaced by the force constants and the reference distances of the windows (replicas).
qsimulate.json
is a control file of QSimulate-QM, which now specifies the B3LYP-D3/aug-cc-pVDZ level for DFT calculations. Refer to tutorial-16.5 for details on the options.
Now, run geninp2.sh
with NEWNAME and OLDNAME in the first and second argument, respectively,
$ ./geninp2.sh equil2 ../1.equil1/equil1
$ ls
equil2_reus.inp geninp.sh qsimulate.json run.sh
template.inp toppar
euil2_reus.inp
is the resulting input file. Note that the force constants and the reference distances are written in a single line for all windows:
$ grep -e constant2 -e reference2 equil2_reus.inp constant2 = 100.0 100.0 ... 100.0 100.0 reference2 = 2.5300 2.4319 ...1.0232 1.0160
run.sh
is a script to run the job,
$ cat run.sh #!/bin/bash # export LD_LIBRARY_PATH=/path/to/qsimulate/lib:$LD_LIBRARY_PATH export PATH=$PATH:/path/to/genesis/bin export OMP_NUM_THREADS=4 export BAGEL_NUM_THREADS=${OMP_NUM_THREADS} export MKL_NUM_THREADS=${OMP_NUM_THREADS} export I_MPI_PERHOST=4 export I_MPI_DEBUG=5 mpiexec.hydra -n 168 atdyn equil2_reus.inp >& equil2_reus.out exit 0
Here, we specify 8 MPI processes per replica, and request 8 MPI x 21 replicas = 168 MPI processes in total. Now we run the job,
$ ./run.sh
When the job is finished, it is a good point to do a sanity check before the production run. Go to 2.equil2_analysis
,
$ cd ../2.equil2_analysis
$ ls
analysis.sh pathcv.inp rmsd_analysis.inp
trj_analysis.inp rst_convert.inp ...
analysis.sh
is a script to run analysis tools using the input files shown above. Running the script yields an output for each window as follow,
$ ./analysis.sh 1 2 . . 21 $ ls equil2_1.dis equil2_1.pathcv equil2_1.pdb equil2_1.rms ...
Let us examine the results.
(1) Visualize the pdb files (the last snapshot of the trajectory) of each window,
$ vmd -e equil2.vmd
Make sure that the reaction takes place as intended and that there is nothing collapsing. For example, it happend to me before that the phosphate group left from DHAP. Unexpected weird (or perhaps interesting!) things can happen, so it is very important to visually check the structure.
(2) Plot the RMSD of backbone heavy atoms of proteins,
$ gnuplot rmsd.gpi
The command plots RMSD of each replica and yields rmsd.pdf,
The value of RMSD is normally around 1 to 2. If the RMSD is much larger or abruptly changing along the simulation time, it is a sign that there is something happening in the protein structure.
(3) Check the distribution.
$ gnuplot dist.gpi $ gnuplot pathcv.gpi
The command gives the distribution of r1/r2 (dist.pdf), and the probability of pathCV [3,4] (pathcv.pdf and pathcv_all.pdf). The results are shown in Fig. 5.
Although the number of sampling (100 points) is still few, these figures already tell that the distributions of each window are reasonably overlapped. One of the failure cases is that the distributions are much narrow and sharp, so that they have few overlap each other. In another case, the cumulative distribution exhibits a cleft, where the probability suddenly drops to zero. When the distribution is problematic, it is better to reset the parameters, i.e., add more windows and/or adjust the force constants, because longer simulation rarely solves the issue.
5. Production run
Now, we perform the production run. Proceed to 3.prod3
,
$ cd ../3.prod3 $ ls geninp2.sh qsimulate.json run.sh template.inp toppar
Again, we generate the input based on template.inp
. It is almost the same as the previous one except for [REMD] and [DYNAMICS] sections,
[REMD] dimension = 1 exchange_period = 100 # attempt the exchange every 50 fs ... [DYNAMICS] integrator = VVER timestep = 0.0005 # timestep (ps) nsteps = 2000 # 1 ps in total crdout_period = 10 # sample every 5 fs eneout_period = 10 # energy output period rstout_period = 100 # write restart
Most importantly, the replica exchange is attempted every 100 steps. Note that mod(rstout_period
, exchange_period
) and mod(nsteps
,(2*exchange_period
*dimension
)) must be zero.
Now, generate the input and run the simulation,
$ ./geninp2.sh prod3 ../2.equil2/equil2
$ ls
geninp2.sh prod3_reus.inp qsimulate.json ...
$ ./run3.sh
This command carries out REUS MD simulations for 1 ps. It often happens that we want to extend the MD simulation. This can be done by,
$ ./geninp2.sh prod4 prod3
$ ls
geninp2.sh prod4_reus.inp qsimulate.json ...
$ ./run4.sh
prod4
restarts prod3
and extends the MD for another 1 ps. Note that the length of MD can be adjusted by nsteps
in the template file and the number of jobs. For example, two jobs with nstep = 2000
is the same as a single job with nstep = 4000
. It is also possible to divide the simulation into a smaller pieces. Such a flexibility is often useful when the computational resources is busy.
When prod3
and prod4
are both finished, let us now analyze the results. We first check the calculation. Proceed to 3.prod3_analysis
,
$ cd ../3.prod3_analysis/ $ ls acceptance_ratio.sh prod4.vmd replica_index.sh rmsd_analysis.inp analysis.sh replica_index.gpi rmsd_analysis.gpi rst_convert.inp
(1) analysis.sh
is a script to run rmsd_analysis and rst_convert for each window. Running the script yields an output for each window as follow,
$ ./analysis.sh 1 2 . . 21 $ ls prod4_1.pdb prod4_1.rms ...
The pdb files can be visualized by VMD,
$ vmd -e prod4.vmd
and RMSD is plotted by gnuplot,
$ gnuplot rmsd_analysis.gpi
These results look reasonable. Nonetheless, we emphasize again that the sanity check is important to detect possible errors in the simulation.
(2) Given the output file of REUS, acceptance_ratio.sh
yields the acceptance ratio of each replica,
$ ./acceptance_ratio.sh ../3.prod3/prod4_reus.out 1 > 2 0.15 2 > 3 0.25 3 > 4 0.2 4 > 5 0.5 5 > 6 0.2 6 > 7 0.3 7 > 8 0.2 8 > 9 0.25 9 > 10 0.2 10 > 11 0.1 11 > 12 0.2 12 > 13 0.05 13 > 14 0.1 14 > 15 0.05 15 > 16 0.1 16 > 17 0.2 17 > 18 0.3 18 > 19 0.5 19 > 20 0.35 20 > 21 0.35
Although there are several replicas with low acceptance ratio (in red), most replicas show an acceptance ratio of 0.2 – 0.3.
(3) Let us also check the time course of the index of replica.
$ ./replica_index.sh $ gnuplot replica_index.gpi
The result of one of the parameters (parameter 9) is shown in Fig. 7. It is observed that parameter 9 is exchanged among replica ID 6 – 12, but not in the whole space. A similar tendency is observed in other replicas as well, suggesting that the MD simulation needs to extend to achieve random walk in the replica space.
6. The potential of mean-force (PMF)
Now, we calculate the PMF along the reaction path. Proceed to 4.pmf
and find six sub-directories,
$ cd ../4.pmf $ ls 1.sort_dcd/ 2.calc_dist/ 3.mbar/ 4.calc_pathcv/ 5.pmf_pathcv/ 6.pmf_r1r2/
For clarity, the procedure is outlined in Fig. 8, where the number in the box corresponds to the number of the directories. Also, the analysis tools used in each step are shown in yellow text.
The script to run the analysis tools, run.sh
, are prepared in all directories,
$ ls */run.sh 1.sort_dcd/run.sh 3.mbar/run.sh 5.pmf_pathcv/run.sh 2.calc_dist/run.sh 4.calc_pathcv/run.sh 6.pmf_r1r2/run.sh
These files start with the following lines,
#!/bin/bash
export PATH=${PATH}:/path/to/genesis/bin
Change “/path/to
” to the directory where GENESIS is installed.
6.1. Sort DCD
Because the REUS simulation prints the coordinates (dcd files) in terms of replica ID, we first sort the coordinates in terms of parameter ID. Proceed to 1.sort_dcd,
$ cd 1.sort_dcd $ ls remd_convert3.inp remd_convert4.inp run.sh
remd_convert3.inp
is an input file of remd_convert
, which is shown below,
[INPUT] reffile = ../../../2.equil/step4_nvt_100.pdb dcdfile = ../../3.prod3/prod3_{}.dcd remfile = ../../3.prod3/prod3_{}.rem # REMD parameter ID file logfile = ../../3.prod3/prod3_{}.log # REMD energy log file [OUTPUT] pdbfile = prod3_param.pdb # PDB file trjfile = prod3_param{}.dcd # trajectory file logfile = prod3_param{}.log # REMD energy log file [SELECTION] group1 = all [FITTING] fitting_method = NO [OPTION] convert_type = PARAMETER # (REPLICA/PARAMETER) num_replicas = 21 # total number of replicas nsteps = 2000 # nsteps in [DYNAMICS] exchange_period = 100 # exchange_period in [REMD] crdout_period = 10 # crdout_period in [DYNAMICS] eneout_period = 10 # eneout_period in [DYNAMICS] trjout_format = DCD # (PDB/DCD) trjout_type = COOR # (COOR/COOR+BOX) trjout_atom = 1 # atom group centering = NO # shift center of mass pbc_correct = NO # (NO/MOLECULE)
remd_convert
reads the parameter ID from remfile
, and sorts the coordinates in terms of parameter ID with covert_type = PARAMETER
. remd_convert4.inp
converts the dcd files of prod4
in the same way. run.sh
reads,
remd_convert remd_convert3.inp >& remd_convert3.out remd_convert remd_convert4.inp >& remd_convert4.out
Now, run the script,
$ ./run.sh $ ls prod3_param1.dcd prod3_param1.log ... prod4_param1.dcd prod4_param1.log ...
The new dcd files, prod[3,4]_parmID.dcd
, contain the coordinates of the parameter ID. The energies are sorted similarly and written in the log files, though they are not used in the current analysis.
6.2. Calculate distance
Proceed to 2.calc_dist
,
$ cd ../2.calc_dist $ ls dist.gpi run.sh trj_analysis.inp
trj_analysis.inp
is an input file of trj_analysis
, which calculates the distances, r1 – r4 and r6, from dcd files.
[INPUT] psffile = ../../../2.equil/step4_nvt_100.psf reffile = ../../../2.equil/step4_nvt_100.pdb [OUTPUT] disfile = prod3_NUM.dis # distance file [TRAJECTORY] trjfile1 = ../1.sort_dcd/prod3_paramNUM.dcd # trajectory file trjfile2 = ../1.sort_dcd/prod4_paramNUM.dcd # trajectory file md_step1 = 200 # number of MD steps mdout_period1 = 1 # MD output period ana_period1 = 1 # analysis period repeat1 = 2 # the number of repeat trj_format = DCD # (PDB/DCD) trj_type = COOR # (COOR/COOR+BOX) trj_natom = 0 [OPTION] check_only = NO allow_backup = NO distance1 = TIMA:165:GLU:OE2 DHA:249:DHAP:H31 # r1 (OE2 - H31) distance2 = DHA:249:DHAP:C3 DHA:249:DHAP:H31 # r2 (C3 - H31) distance3 = TIMA:165:GLU:OE1 DHA:249:DHAP:HO3 # r3 (OE1 - HO3) distance4 = DHA:249:DHAP:O2 TIMA:95:HSE:HE2 # r4 (O2 - HE2) distance5 = TIMA:95:HSE:NE2 DHA:249:DHAP:HO3 # r6 (NE2 - HO3)
“NUM” is replaced by the parameter ID at runtime in run.sh
. Note that the dcd files of prod3
and prod4
are read at the same time (repeat = 2
) and that all the data (400 points) are printed to prod3_NUM.dis
.
Now, run the script,
$ ./run.sh $ ls dist.gpi prod3_1.dis prod3_2.dis ...
dist.gpi
plots the distribution of r1 / r2,
$ gnuplot dist.gpi
6.3 MBAR
Here, we solve the MBAR equation [5] and obtain the weight of each snapshot. Proceed to 3.mbar
,
$ cd ../3.mbar $ ls mbar.inp run.sh
mbar.inp
is an input file of mbar_analysis
,
[INPUT] cvfile = ../2.calc_dist/prod3_{}.dis # distant files [OUTPUT] fenefile = fene.dat # free energy file weightfile = weight{}.dat # weight file [MBAR] dimension = 1 num_replicas = 21 input_type = US # Umbrella sampling tolerance = 10E-08 # threshold of MBAR iteration temperature = 300.0 # simulation temperature target_temperature = 300.0 # target temperature rest_function1 = 1 2 3 4 5 # r1 - r4, r6 nblocks = 1 self_iteration = 5 Newton_iteration = 40 [RESTRAINTS] constant1 = 100.0 100.0 100.0 ... 100.0 # r1 (OE2 - H31) reference1 = 2.5300 2.4319 2.3358 ... 1.0160 ... is_periodic1 = no ...
run.sh
reads,
export OMP_NUM_THREADS=4
mbar_analysis mbar.inp >& mbar.out
mbar_analysis
is thread-parallelized, and setting the variable, OMP_NUM_THREADS
, accelerates the calculation. Now, run the script,
$ ./run.sh $ ls fene.dat weight1.dat weight2.dat ...
weightID.dat
contains the weight of trajectory snapshots for each parameter ID. Using the weight, we can readily calculate the thermodynamic average of any quantity. In the following subsections, 1D- and 2D-PMFs are calculated in terms of pathCV and r1/r2, respectively.
6.4 1D-PMF along pathCV
Now, proceed to 4.calc_pathcv
,
$ cd ../4.calc_pathcv $ ls pathcv.gpi pathdist.gpi pathcv.inp run.sh
pathcv.inp
is an input file of pathcv_analysis
.
[INPUT] pathfile = ../../0.window/win_rr.dat # the information of discretized path cvfile = ../2.calc_dist/prod3_{}.dis # CV data [OUTPUT] pathcvfile = prod3_{}.pathcv # pathCV and distance [OPTION] nreplica = 21
PathCV [3] represents the reaction coordinate by discrete points and enumerates them from 1 to N (N is the number of points). Here, the discrete points are set to the anchor points of US, given by pathfile
. Then, the program calculates for each data points given by cvfile
the pathCV and path distant, which indicate where in the path the point is ascribed to and how distant it is from the path, respectively.
Now, run the script,
$ ./run.sh $ ls pathcv.inp pathcv.out prod3_1.pathcv prod3_2.pathcv ...
The pathCV and distance are printed in the second and third columns of prod3_x.pathcv
, respectively. These values are plotted by,
$ gnuplot pathcv.gpi $ gnuplot pathdist.gpi
The first line yields Fig. 10 (a). The probability distribution of each parameter has sufficient overlap in terms of pathCV. The second line plots the time course of the path distance of all parameters in a big sheet; here we show only one of them, parameter 9, in Fig. 10 (b). Although the path distance is around 0.01 – 0.02 on average, the trajectory largely deviates from the path in some occations. Since such occational large deviation cause numerical errors, it is recommended to set a cutoff that is 2 – 3 times larger than the average value when calculating the PMF. We use cutoff=0.04 below.
Proceed to 5.pmf_pathcv,
$ cd ../5.pmf_pathcv $ ls pmf.gpi pmf_bw15.inp pmf_bw20.inp run.sh
pmf_bw*.in
are the input file of pmf_analysis
.
[INPUT] weightfile = ../3.mbar/weight{}.dat # weight file cvfile = ../4.calc_pathcv/prod3_{}.pathcv # pathCV distfile = {}.pathdist # path distance [OUTPUT] pmffile = pmf_bw15.dat # potential of mean force [OPTION] nreplica = 21 dimension = 1 temperature = 300 cutoff = 0.04 # cutoff distance grids1 = 1.0 21.0 101 # (min max num_of_bins) band_width1 = 0.15 # sigma of Gaussian kernel is_periodic1 = NO
cutoff
is set to 0.04 based on the result in Fig. 10 (b). In pmf_analysis
, Gaussian kernel is used to smoothen the PMF, and the bandwidth of the Gaussian function is specified by band_width1
. Although the smoothening is useful to reduce the noise, the bandwidth should be carefuly chosen so as to not to incurr any artifact. The bandwidth is set to 0.15 and 0.20 in pmf_dat15.inp
and pmf_dat20.inp
, respectively.
Now, run the job,
$ ./run.sh $ ls pmf_bw15.dat pmf_bw15.inp pmf_bw15.out pmf_bw20.dat pmf_bw20.inp pmf_bw20.out ...
dat files contain the PMF along pathCV with and without the Gaussian smoothing. Finally, the data is plot by gnuplot,
$ gnuplot pmf.gpi
$ ls
pmf.gpi pmf.pdf pmf_bw15.dat ...
6.5. 2D-PMF in r1/r2
Let us calculate a 2-dimensional (2D) PMF as a function of r1 and r2. Since r1 and r2 are already obtained for each snapshots in 2.calc_dist
, we go directly to pmf_analysis
. Proceed to 6.pmf_r1r2
,
$ cd ../6.pmf_r1r2 $ ls 2dsurf_mix.gpi pmf.inp run.sh
pmf.inp
is an input file,
[INPUT] weightfile = ../3.mbar/weight{}.dat # weight file cvfile = ../2.calc_dist/prod3_{}.dis # r1 and r2 [OUTPUT] pmffile = pmf2.dat # potential of mean force [OPTION] nreplica = 21 dimension = 2 # 2D-PMF temperature = 300 output_type = GNUPLOT # print the output in gnuplot style grids1 = 0.8 2.8 101 # (min max num_of_bins) band_width1 = 0.020 # sigma of Gaussian kernel is_periodic1 = NO grids2 = 0.9 2.4 101 # (min max num_of_bins) band_width2 = 0.020 # sigma of Gaussian kernel is_periodic2 = NO
The dimension is set to 2, and the gridsX, band_widthX, is_periodicX (X=1,2) are set for the first and second dimensions, i.e., r1 and r2, respectively. Note that r1 and r2 are written in the second and third columns of prod3_*.dis
, respectively, and thus we can use them as is for cvfile
.
Now, run the script,
$ ./run.sh
$ ls
pmf.inp pmf.out pmf2.dat ...
pmf2.dat contains the data of 2D-PMF in a gnuplot format. Finally, plot the data using gnuplot,
$ gnuplot 2dsurf_mix.gpi
$ ls
2dsurf_mix.gpi 2dsurf_mix.pdf pmf.inp ...
The resulting 2D-PMF in Fig. 12 shows the overall shape of the free-energy landscape. However, some contour lines (4 and 8 kcal/mol, in particular) show wiggle shape, which is an indication of insufficient sampling.
7. Concluding remarks
We have demonstrated REUS simulations with a QM/MM potential using GENESIS/QSimulate-QM for calculating the free-energy landscape of an enzymatic reaction. 1D- and 2D-PMF have been calculated in terms of pathCV and r1/r2, respectively, using the MBAR method.
QM/MM-MD simulations have been performed for 2 ps per replica, thereby 2 x 21 = 42 ps in total. The resulting PMF along pathCV is found to be reasonably converged, yielding a barrier height of ~ 12 kcal/mol. On the other hand, the 2D-PMF in terms of r1/r2 has turned out to be insufficient in the number of sampling.
One of the main issues of REUS with QM/MM-MD is the high computational cost. The number of QM calculations is 6000 points x 21 replica = 126,000 points in this tutorial, and it could be even more in real applications. Therefore, fast QM programs with rich computational resource is a key technical element, e.g., QSimulate-QM with supercomputers, TeraChem with GPGPU clusters, and so on. On the other hand, utilizing the semiempirical methods such as DFTB is another promising direction, since DFTB is orders of magnitude cheaper than DFT (see 1.equil1 vs 2.equil2). For example, one may perform MD simulations with DFTB and reweight the energy landscape to DFT level. Such a multi-level approch will be our next goal.
References
- Y. Sugita, A. Kitao, and Y. Okamoto, J. Chem. Phys. 113, 6042 (2000).
- K. Yagi, S. Ito, and Y. Sugita, J. Phys. Chem. B 125, 4701 – 4713 (2021).
- D. Branduardi, F. L. Gervasio, M. Parrinello, J. Chem. Phys. 126, 054103 (2007).
- Y. Matsunaga, Y. Komuro, C. Kobayashi, J. Jung, T. Mori, and Y. Sugita, J. Phys. Chem. Lett. 7, 1446−1451 (2016).
- M. R. Shirts and J. D. Chodera, J. Chem. Phys. 129, 124194 (2008).