1.1 All-atom MD simulation of BPTI in NaCl solution

Files for this tutorial (tutorial-1.1.tar.gz)

In this section, a step-by-step tutorial for a molecular dynamics simulation with GENESIS is provided. The aim is to guide a new user through the process of building and simulating a system containing a protein (Bovine pancreatic trypsin inhibitor: BPTI) in a box with water molecules and ions. BPTI, a 58-residue globular protein, is a popular molecule in the simulation community because of it’s small size and stability. Indeed, it is the first protein for which a molecular dynamics simulation study was performed (J. A. McCammon et al., 1977 ) and also a 1-millisecond simulation was recently achieved with a special-purpose supercomputer (D. E. Shaw et al., 2010 ). In this tutorial, we examine its stability by a 1-nanosecond simulation using GENESIS.

After downloading and decompressing the tutorial file, you can find several directories (1_setup, 2_minimization, …). In this tutorial, all works will be done in these directories step by step.

# make a new directory to do this tutorial
$ mkdir tutorial-1.1
$ cd tutorial-1.1

# decompress the downloaded file
$ tar -xzf ~/Download/tutorial-1.1.tar.gz
$ ls
1_setup  2_minimization  3_heating  4_equilibration  5_production


1.1.1 Building a simulation system

We build a simulation system containing a BPTI in a simulation box with water molecules and ions using VMD and its plugins, and generate two files required for a simulation with GENESIS:

  1. Psffile, containing the connectivity, mass, and charges of the molecule;
  2. Pdbfile, containing the initial structure of the molecule.

To build a simulation system, we download a crystal (or NMR) structure of BPTI from the RCSB Protein Data Bank (PDB). You can download a PDB file, 4pti.pdb, via a web browser from here . The file is already prepared in the 1_setup folder.

$ cd 1_setup
$ ls 4pti*.pdb
4pti.pdb  4pti_edit.pdb

Let’s look at the header entries of the file. It can tell a lot about this molecule, for example, the structure was determined by X-ray diffraction (described in EXPDTA entry in the file), there is no missing residues, and there are three disulfide bonds between 5-55, 14-38, 30-51 cystein residue pairs (SSBOND entries).

Since the atom naming conventions of PDB and CHARMM are slightly different, we need to manually edit the file to comply with the CHARMM convention using a text editor (such as vi or Emacs). We need to replace O and OXT1 atoms of the C-terminal residue (ALA58) with OT1 and OT2, respectively. The edited version is 4pti_edit.pdb. After the edit, ALA58 should look as follow:

$ less 4pti_edit.pdb
...skipped...
ATOM    449  N   ALA A  58 25.146  29.681  -6.493  1.00 46.21       N
ATOM    450  CA  ALA A  58 25.617  30.840  -7.256  1.00 45.05       C
ATOM    451  C   ALA A  58 25.248  30.735  -8.729  1.00 46.90       C
ATOM    452  OT1 ALA A  58 24.962  31.791  -9.369  1.00 39.78       O
ATOM    453  CB  ALA A  58 27.160  30.980  -7.146  1.00 50.07       C
ATOM    454  OT2 ALA A  58 24.919  29.594  -9.172  1.00 43.54       O
TER     455      ALA A  58
...skipped...

$ diff 4pti*.pdb
817c817
< ATOM    452  O   ALA A  58      24.962  31.791  -9.369  1.00 39.78           O
---
> ATOM    452  OT1 ALA A  58      24.962  31.791  -9.369  1.00 39.78           O
819c819
< ATOM    454  OXT ALA A  58      24.919  29.594  -9.172  1.00 43.54           O
---
> ATOM    454  OT2 ALA A  58      24.919  29.594  -9.172  1.00 43.54           O

Since GENESIS does not provide any programs for building simulation systems, we use other packages such as CHARMM or psfgen (supplied with NAMD). In this section, we use VMD, and psfgen is called via the plug-in interface within VMD. Other VMD plug-ins (solvate and autoionize) are also used for solvation and neutralization, respectively.

A VMD script, setup.tcl, builds the simulation system containing a BPTI molecule in a box with water molecules and ions. The script consists of four parts. In the first part, the crystal water molecules are removed by using the atom selection facility of VMD. The protein structure without crystal water molecules is written in a pdb file (4pti_protein.pdb).

##### read pdb and remove crystal waters 
mol load pdb 4pti_edit.pdb

# set the origin to the center of mass
set all [atomselect top all]
set protein [atomselect top protein]
$all moveby [vecinvert [measure center $protein weight mass]]

# write pdb of protein
$protein writepdb 4pti_protein.pdb
mol delete all

In the second part, psfgen plugin is called. psfgen matches the residues in the original file with the residues defined in topfile (top_all27_prot_lipid.rtf), and then creates disulfide bonds by patching special residues for them. Also, the coordinates of missing hydrogen atoms are guessed from the topology. Then, a psffile (protein.psf) and a pdbfile (protein.pdb) of BPTI without solvent are written.

##### generate psf by using the psfgen plugin
package require psfgen
resetpsf
topology top_all27_prot_lipid.rtf

segment BPTI {
 first nter
 last cter
 pdb 4pti_protein.pdb
}
patch DISU BPTI:5 BPTI:55
patch DISU BPTI:14 BPTI:38
patch DISU BPTI:30 BPTI:51
regenerate angles dihedrals

pdbalias atom ILE CD1 CD
coordpdb 4pti_protein.pdb BPTI

guesscoord

# write psf and pdb files of protein
writepsf protein.psf
writepdb protein.pdb

In the third part, solvate plugin is invoked to solvate BPTI with TIP3P bulk water molecules and fills the box for simulations under the periodic boundary conditions. The option “-t 22.5” in solvate plug-in is a padding distance, which defines the distance between BPTI and the wall of simulation box. The resulting system is written in solvate.psf and solvate.pdb.

#####  solvate with TIP3P waters by using the solvate plugin
mol delete all
package require solvate
solvate protein.psf protein.pdb -rotate -t 22.5 -o solvate

In the fourth part, counter ions are added to neutralize the system using autoionize plug-in. The final results are saved as ionize.psf and ionize.pdb. Finally, we measure the box size and print to screen.

#####  add counter ions by using the autoionize plugin
mol delete all
package require autoionize
autoionize -psf solvate.psf -pdb solvate.pdb -neutralize -cation SOD -anion CLA -seg ION -o ionize

#####  set the origin to the center of mass and check boxsize
set all [atomselect top all]
$all moveby [vecinvert [measure center $all weight mass]]
$all writepdb ionize.pdb

# measure the box size
set minmax [measure minmax $all]
foreach {min max} $minmax { break }
foreach {xmin ymin zmin} $min { break }
foreach {xmax ymax zmax} $max { break }
puts "Box size estimation:"
puts "boxsizex = [expr abs($xmin)+ $xmax + 1.0]"
puts "boxsizey = [expr abs($ymin)+ $ymax + 1.0]"
puts "boxsizez = [expr abs($zmin)+ $zmax + 1.0]"

exit

We can run it with VMD:

$ vmd -dispdev text < setup.tcl | tee run.out

Here, with a “-dispdev text” option, we inhibit any graphical display windows from opening (we do not need any visualization for this set-up). The redirection “<setup.tcl” instructs VMD to process the script.

It is always a good idea to check the final structure by a visual inspection. We can visualize the structure with VMD by using the visualize.tcl script:

##### read structure and coordinates
mol load psf ionize.psf pdb ionize.pdb

#### delete initial line representation
mol delrep 0 top

#### make VDW representation for ions
mol representation VDW 1.000000 12.00000
mol color Name
mol selection {ions}
mol material Opaque
mol addrep top

#### make cartoon representation for protein
mol representation NewCartoon 0.300000 10.000000 4.100000 0
mol color Structure
mol selection {protein}
mol material Opaque
mol addrep top

#### make line representation for water
mol representation Lines
mol color Name
mol selection {water}
mol material Opaque
mol addrep top

#### turn off axes
axes location off

#### set white background
color Display Background 8

The commands contained in the script are invoked after opening vmd by using the ‘-e scriptfile’ command line option:

$ vmd -e visualize.tcl

If the building has successfully finished, the structure looks like follows:

bpti

The size of the simulation box is printed in the end of run.out:

Box size estimation:
boxsizex = 70.82500076293945
boxsizey = 83.25799942016602
boxsizez = 69.09300231933594

This size is rather large for a typical globular protein. However, we insist to this size because we are going to use SPDYN. For the spatial decomposition scheme in SPDYN, the simulation box size in each of direction has to be 5-times larger than the pair list distance (pairlistdist=13.5 distance is in Angstrom). If you prefer a smaller simulation box, decrease the padding distance(-t 22.5) in solvate plug-in,and use ATDYN for your simulations instead.

Quiz: How many ions are there in the system? Do you know why?


1.1.2 Minimization with restraints on the protein

The initial structures often contain non-physical steric clashes or non-equilibrium geometries, especially at the interface between the protein and solvent, because the solvent molecules are randomly placed around the protein in the building process. Thus, before performing a molecular dynamics simulation, it is imperative to remove such clashes by minimizing the potential energy of the system. ATDYN and SPDYN support the steepest descent algorithm for minimization, which moves atoms proportionally to their negative gradients of the potential energy.

Change the directory to 2_minimization directory.

$ cd 2_minimization/
$ ls
run.inp run.sh

The script run.sh is a script to run SPDYN.

$ cat run.sh
#!/bin/bash
rm -i run.rst run.dcd

# set the number of OpenMP threads
$ export OMP_NUM_THREADS=1

# perform minimization with SPDYN by using 8 MPI processes
$ mpirun -np 8 spdyn run.inp | tee run.out

This script runs SPDYN with 8 MPI processes, each using 1 thread for OpenMP.

GENESIS always stops when the output file(s) specified in the input file already exists. This is to prevent an overwrite of an existing output file by mistake. So, when you run a simulation, make sure the dcd, rst files that you are going to create, don’t exist in the current directory.

The control file, run.inp, performs a 1000-step minimization. It consists of several sections, such as [INPUT], [OUTPUT], [ENERGY], etc. In each section, we specify the control parameters for the calculation.

[INPUT]
topfile = ../1_setup/top_all27_prot_lipid.rtf # topology file
parfile = ../1_setup/par_all27_prot_lipid.prm # parameter file
psffile = ../1_setup/ionize.psf               # protein structure file
pdbfile = ../1_setup/ionize.pdb               # PDB file
reffile = ../1_setup/ionize.pdb               # reference for restraints

[OUTPUT]
dcdfile = run.dcd  # DCD trajectory file
rstfile = run.rst  # restart file

In [INPUT] section, input files are specified. topfile (topology file), parfile (parameter file), psffile (PSF file), pdbfile (PDB file with an initial structure) are always required. reffile (reference file) is given as the reference structure for positional restraints. We set these parameters to the files obtained in the last section.

In [OUTPUT] section, output files are specified. SPDYN does not create any output file unless we explicitly specify them. Here, rstfile (restart file) is used for the restart of the simulation.

[ENERGY]
electrostatic    = PME       # [CUTOFF,PME]
switchdist       = 10.0      # switch distance
cutoffdist       = 12.0      # cutoff distance
pairlistdist     = 13.5      # pair-list distance
pme_ngrid_x      = 72        # grid size_x in [PME]
pme_ngrid_y      = 80        # grid size_y in [PME]
pme_ngrid_z      = 72        # grid size_z in [PME] 

In [ENERGY] section, we set the keyword related to the energy and force evaluation. Here, the particle mesh Ewald (PME) method is selected for computing electrostatic interactions, usually combined with the periodic boundary condition (PBC) in [BOUNDARY] section. By default, an interpolation scheme with the lookup table is used for the evaluations of non-bonded interactions.

[MINIMIZE]
nsteps        = 1000     # number of steps
eneout_period = 100      # energy output period
crdout_period = 100      # coordinates output period
rstout_period = 1000     # restart output period

[BOUNDARY]
type          = PBC      # [PBC,NOBC]
box_size_x    = 70.8250  # box size (x) in [PBC]
box_size_y    = 83.2579  # box size (y) in [PBC]
box_size_z    = 69.0930  # box size (z) in [PBC]

[MINIMIZE] section turns on the minimization engine of SPDYN. Here, we specify 1000 steps of the steepest descent algorithm (SD). In [BOUNDARY] section, the boundary condition and the simulation box size are set. Here, we use the periodic boundary condition (PBC). The values of the box size were taken from the previous building step.

[SELECTION]
group1        = an:CA | an:C | an:O | an:N  # index of restraint group 1

[RESTRAINTS]
nfunctions    = 1     # number of functions
function1     = POSI  # restraint function type
constant1     = 10.0  # force constant
select_index1 = 1     # restrained group

In [SELECTION] section, we define a group (group1) consisting of backbone atoms of the protein to impose positional restraints. In [RESTRAINTS] section, positional restraints are specified for the group of backbone atoms defined in the previous [SELECTION] section. The positional restraints are imposed with respect to the reference coordinates given by reffile in [INPUT] section.

Now, we run the simulation by the following command:

$ chmod +x run.sh
$ ./run.sh

This job creates run.out (a standard output), run.dcd (trajectory file), and run.rst (restart file). After the job is done, it is recommended to confirm the decrease of the potential energy. The output format of GENESIS is simple, so we can easily extract the potential energy term with standard UNIX commands:

$ grep "^INFO" run.out | tail -n +2 | awk ’{print $2, $3}’ > pot.dat
$ gnuplot
gnuplot> set xlabel "step"
gnuplot> set ylabel "potential energy [kcal/mol]"
gnuplot> plot "pot.dat" w lp

Pot


1.1.3 Heat-up with restraints on the protein

In this step, we heat up the system from 0.1 K to 300 K during a 100-picosecond molecular dynamics simulation. During the simulation, the positional restraints are imposed again on the protein’s backbone not to disrupt the structure due to abrupt increase in temperature. The files are in 3_heating directory:

# change to the heating directory
$ cd 3_heating/
$ ls
run.inp run.sh

The control file, run.inp, performs a 100-picosecond molecular dynamics simulation, gradually heating the system from 0.1 K to 300 K:

[INPUT]
topfile = ../1_setup/top_all27_prot_lipid.rtf # topology file
parfile = ../1_setup/par_all27_prot_lipid.prm # parameter file
psffile = ../1_setup/ionize.psf               # protein structure file
pdbfile = ../1_setup/ionize.pdb               # PDB file
reffile = ../1_setup/ionize.pdb               # Reference for restraints
rstfile = ../2_minimization/run.rst           # restart file

In [INPUT] section, rstfile is set to the restart file from the previous minimization. [OUTPUT] and [ENERGY] section are the same as before.

[DYNAMICS]
integrator     = LEAP      # [LEAP,VVER]
nsteps         = 50000     # number of MD steps
timestep       = 0.002     # timestep (ps)
eneout_period  = 500       # energy output period
crdout_period  = 500       # coordinates output period
rstout_period  = 50000     # restart output period
annealing      = YES       # simulated annealing
anneal_period  = 500       # annealing period
dtemperature   = 3         # temperature change at annealing (K)

[CONSTRAINTS]
rigid_bond     = YES       # constraints all bonds involving hydrogen

[ENSEMBLE]
ensemble       = NVT       # [NVE,NVT,NPT]
tpcontrol      = LANGEVIN  # thermostat
temperature    = 0.1       # initial temperature (K)

Major differences from the minimization are [DYNAMICS], [CONSTRAINTS], and [ENSEMBLE] sections, which control molecular dynamics simulations. [DYNAMICS] section turns on the molecular dynamics engine. In this section, the parameters related to molecular dynamics are specified. For heating up the system, a simulated annealing protocol is enabled by annealing=YES. The simulated annealing algorithm increases the target temperature by dtemperature K every anneal_period steps. In this case, the temperature is increased by dtemperature=3 K every anneal_period=500 steps. Thus, nsteps=50000 MD steps results into temperature increase by 3*50000/500=300 K. [CONSTRAINTS] section is for constraints during a simulation. rigid_bond is a keyword for the SHAKE algorithm. For TIP3P water molecules, a faster algorithm (SETTLE) is automatically applied if rigid_bond=YES. In [ENSEMBLE] section, we can choose a thermostat and a barostat from several standard options. For typical simulations, LANGEVIN is recommended.

[BOUNDARY]
type           = PBC       # [PBC,NOBC]

In [BOUNDARY] section, we don’t need to write the simulation box size any more, since the box size values are read from the restart file. [SELECTION] and [RESTRAINTS] sections are the same as before.

Now, we run the simulation by the following command:

$ chmod +x run.sh
$ ./run.sh

After the heating up simulation, the increase in temperature can be confirmed in this way:

$ grep "^INFO" run.out | tail -n +2 | awk ’{print $3, $17}’ >temp.dat
$ gnuplot
gnuplot> set xlabel "time [ps]"
gnuplot> set ylabel "temperature [K]"
gnuplot> plot "temp.dat" w lp

Temp

Also, it is recommended to visualize the trajectory to confirm that the protein structure is not disrupted. We can visualize the trajectory with VMD by the following command:

$ vmd -e visualize.tcl

The visualize.tcl script is identical to the one discussed in section 1.1.2, except the first line which loads the coordinates from the output dcd file

##### read structure and coordinates
mol load psf ../1_setup/ionize.psf dcd run.dcd

For real applications, 100 ps may be too short for the heating up process. For your research, longer runs than 1-nanosecond would be recommended. The same is true for the next equilibration step.


1.1.4 Equilibration simulation

In this step, we equilibrate the system under the condition of 300 K and 1 atm by a 100-picosecond molecular dynamics simulation. Note that the ensemble is changed from NVT to NPT. During the simulation, we no longer impose any restraints on the protein, and we relax the whole system for the next production simulation. The files are in 4_equilibration directory:

# change to the equilibration directory
$ cd 4_equilibration/
$ ls
run.inp run.sh

The control file, run.inp, performs a 100-picosecond molecular dynamics simulation under the condition of 300 K and 1 atm:

[INPUT]
topfile = ../1_setup/top_all27_prot_lipid.rtf # topology file
parfile = ../1_setup/par_all27_prot_lipid.prm # parameter file
psffile = ../1_setup/ionize.psf               # protein structure file
pdbfile = ../1_setup/ionize.pdb               # PDB file
rstfile = ../3_heating/run.rst                # restart file

...(skip)...

[DYNAMICS]
integrator    = LEAP      # [LEAP,VVER]
nsteps        = 50000     # number of MD steps
timestep      = 0.002     # timestep (ps)
eneout_period = 500       # energy output period
crdout_period = 500       # coordinates output period
rstout_period = 50000     # restart output period
stoptr_period = 10        # remove translational and rotational motions period

[CONSTRAINTS]
rigid_bond    = YES       # constraints all bonds involving hydrogen

[ENSEMBLE]
ensemble      = NPT       # [NVE,NVT,NPT]
tpcontrol     = LANGEVIN  # thermostat and barostat
temperature   = 300.0     # initial temperature (K)
pressure      = 1.0       # target pressure (atm)

[BOUNDARY]
type           = PBC      # [PBC,NOBC]

Most of the control parameters are the same as those in the previous heat-up simulation except for:

  1. In [INPUT] section, rstfile is set to the restart file of the previous heat-up simulation.
  2. In [DYNAMICS] section, stoptr_period = 10 is added to remove the translation and rotation motion of the whole system, so as to prevent the flying ice cube effect.
  3. In [ENSEMBLE] section, ensemble is set to NPT and pressure = 1.0 is set to specify the target pressure.
  4. [SELECTION] and [RESTRAINTS] sections are removed, since we perform a restraint-free simulation.

Now, we run the simulation by the following command:

$ chmod +x run.sh
$ ./run.sh

After the equilibration, it is always a good practice to check the convergence of quantities related to thermodynamic conditions, such as temperature, pressure, and volume. For example, the volume can ben checked by

$ grep "^INFO" run.out | tail -n +2 | awk ’{print $3, $17}’ >volume.dat
$ gnuplot
gnuplot> set xlabel "time [ps]"
gnuplot> set ylabel "volume [Angs^3]"
gnuplot> plot "volume.dat" w lp

Vol

Quiz: How do you obtain the density (in g/mol) from the above volume data? What value should the density converge to?


1.1.5 Production simulation

Finally, we perform a production 1-nanosecond molecular dynamics simulation. By running such a relatively long simulation, we can analyze the stability of BPTI in solution. The files are in 5_production directory:

# change to the production directory
$ cd 5_production/
$ ls
run.inp run.sh

The control file, run.inp, performs a 1-nanosecond molecular dynamics simulation at 300 K and 1 atm:

[INPUT]
topfile = ../1_setup/top_all27_prot_lipid.rtf # topology file
parfile = ../1_setup/par_all27_prot_lipid.prm # parameter file
psffile = ../1_setup/ionize.psf               # protein structure file
pdbfile = ../1_setup/ionize.pdb               # PDB file
rstfile = ../4_equilibration/run.rst          # restart file

..(skip)..

[DYNAMICS]
integrator    = LEAP      # [LEAP,VVER]
nsteps        = 500000    # number of MD steps
timestep      = 0.002     # timestep (ps)
eneout_period = 500       # energy output period
crdout_period = 500       # coordinates output period
rstout_period = 500000    # restart output period
stoptr_period = 10        # remove translational and rotational motions period

..(skip)..

The only difference is in [DYNAMICS] section, where the total number of simulation steps is set to nsteps=500000. The combination with timestep=0.002 ps results in a total length of nsteps*time step = 1000 ps = 1 ns simulation. Since the output period for coordinates is crdout_period=500, we have the total of nsteps/crdout_period = 1000 snapshots in the output trajectory.

Now, we run the simulation by the following command:

$ chmod +x run.sh
$ ./run.sh

Depending on the machine specifications and the number of MPI processes used, this step could take a couple of hours or more than a day. We can always check the progress by inspecting the standard output.

Let’s visualize the trajectory with VMD, when the production simulation finishes. From the visualization, you may get some insights into the fluctuations of the protein structure.

$ vmd -e visualize.tcl

In the analysis section(4.1 – 4.4), we will quantify the size of the fluctuations by analyzing the trajectory.


Written by Kiyoshi Yagi@RIKEN Theoretical molecular science laboratory
July, 14, 2016