### 4.3 Statistical analysis of the trajectory data by Python

Note: this tutorial is a simple introduction to using python (version 3.x) and packages such as NumPy and matplotlib to do some fundamental statistical analysis of simulation results. If you want to use python to parse MD trajectories, please try specific libraries such as MDTraj and MDAnalysis, which are beyond the scope of the current tutorial. Please refer to the website of python for a basic programming guide and get help from PyPI or anaconda about how to install packages.

#### Preparation

##### 1.2 Using the “numpy.ndarray” data structure

In this part, we show an alternative way to get the maximum/minimum, average, and standard deviation values using the numpy library. numpy is a Python “package” or “module” that provides a multidimensional array object and various functions for fast operations on arrays.

The program is in the 02_numpy.py file:

#!/usr/bin/env python3import numpy as np# analyze the 5th columni_col = 5# use array ene_data to store energy valuesene_data = np.loadtxt("./energy.log", usecols=i_col - 1)# print length of the listlen_ene_data = ene_data.sizeprint("Length of the energy list:", len_ene_data)# find out the max/minprint("Maximum of column {0:>2d}: {1:>12.4f}".format(i_col, ene_data.max()))print("Minimum of column {0:>2d}: {1:>12.4f}".format(i_col, ene_data.min()))# calculate the averageene_ave = ene_data.mean()print("Average of column {0:>2d}: {1:>12.4f}".format(i_col, ene_ave))# calculate the standard deviationene_std = ene_data.std()print("Standard deviation of column {0:>2d}: {1:>12.4f}".format(i_col, ene_std))

Compared with 01_list.py, this new script looks much simpler. Actually, in many cases, numpy can simplify the development of programs. We list out the main differences between the two versions in the following:

• import numpy as np is used to declare the usage of numpy.
• np.loadtxt() is a function used to read formatted data from file; the return value is an array (“ene_data“).
• size of an array is the number of elements (be aware that size is not a function).
• max() and min() are now used as member functions of the array ene_data.
• mean() calculates the average of an array.
• std() directly computes the standard deviation.

Let’s run this new script:

$./02_numpy.py # or: python3 02_numpy.pyLength of the energy list: 1000Maximum of column 5: 7.3745Minimum of column 5: -10.2807Average of column 5: -1.5140Standard deviation of column 5: 2.8306 As can be seen, the results are the same as the ones we computed with the 01_list.py. #### 2. Histogram analysis A histogram is an approximate representation of the distribution of numerical data. In section 2, we are going to compute and plot one-dimensional (1D) and two-dimensional (2D) histograms from the MD simulation results. ##### 2.1 1D-Histogram First, we analyze the distance data (output.dis) obtained from Tutorial 3.2 and plot a 1D histogram. There are different ways to get a histogram. For example, the data visualization package matplotlib provides a matplotlib.pyplot.hist function, which directly draws a histogram. Alternatively, numpy also has a function numpy.histogram. We can first use this function to get the result and then use matplotlib to plot it. Of course, we can also use the list data structure to construct a histogram by “manually” counting the number of data in each predefined bin. Here we show the simplest way with the matplotlib.pyplot.hist function. # Change directory to 1d-histogram analysis$ cd ../Analysis2.1$lshist1d.py output.dis The program is in the file hist1d.py: #!/usr/bin/env python3import numpy as npimport matplotlib.pyplot as plt# load data from file (the 2nd column)distance_data = np.loadtxt("output.dis", usecols=(1))# plot the histogramplt.hist(distance_data, bins=150, range=(0, 15))# set axis rangesplt.xlim(7, 13)plt.ylim(0, 80)# set labelsplt.xlabel(r"distance ($\AA$)")plt.ylabel("frequency")# save figure to png fileplt.savefig("distance_histogram.png", dpi=150) • we first load the packages numpy and matplotlib.pyplot using import. • we then read data from “output.dis” using the np.loadtxt function. Our purpose is to get the distribution of the distance, so we only have to load the second column of the data (usecols=(1)). • plt.hist() is a function used to directly calculate and draw a histogram, with options bins for the number of bins and range for the lower and upper limit of the data. • plt.xlim() and plt.ylim() set the range of the x and y-axis in the figure, respectively. • plt.xlabel() and plt.ylabel() set the label of the x and y-axis, respectively. • plt.savefig() is used to output the figure to a file. We can now go to the subdirectory Analysis2.1 and execute our script: $ ./hist1d.py# or$python3 hist1d.py The script generates a figure named distance_histogram.png as shown below: This figure shows the distance distribution and is exactly the same as the one we get in tutorial 4.2. ##### 2.2 2D-Histogram We then try to analyze the secondary structure data (output.tor) obtained in Tutorial 3.1 and draw a 2D histogram (the Ramachandran plot). As mentioned in section 2.1, there is also more than one method to make a 2D histogram plot. Here, we simply utilize the matplotlib.pyplot.hist2d function. # Change directory to 2d-histogram analysis$ cd ../Analysis2.2$lshist2d.py output.tor The program can be found in the file hist2d.py: #!/usr/bin/env python3import numpy as npimport matplotlib.pyplot as plt# load data from file (use the 1st and 2nd columns)x_data, y_data = np.loadtxt("output.tor", usecols=(1, 2), unpack=True)# plot the 2d histogram with 72 bins on each dimensionplt.hist2d(x_data, y_data, bins=[72, 72], range=[[-180, 180], [-180, 180]], cmap="Reds")# set aspect ratio to "equal"plt.gca().set_aspect('equal', 'box')# set ticks for x and y axisplt.xticks([-180, -135, -90, -45, 0, 45, 90, 135, 180])plt.yticks([-180, -135, -90, -45, 0, 45, 90, 135, 180])# set x and y axis labelsplt.xlabel(r"$\phi$($^\circ$)")plt.ylabel(r"$\psi$($^\circ$)")# plot a color barplt.colorbar(label="frequency")# save figure to fileplt.savefig("Ramachandran_plot.png", dpi=150) • np.loadtxt is used to load data from a file. Here we use the 2nd and 3rd columns of “output.tor” (usecols=(1, 2)) and store them in different arrays (unpack=True). • plt.hist2d() is a function that calculates and draws the 2D histogram. We specify the number of bins (72) and range ([-180, 180]) in each dimension, respectively. • plt.gca() is a method to “Get the Current Axis”. We then set the aspect ratio of the current axis to “equal” and fix it to the “box”, which generates a square frame of the plot, instead of a rectangle by default. • plt.xticks() and plt.yticks() set the ticks of the x and y-axis, respectively. • plt.xlabel() and plt.ylabel() set the labels of the x and y-axis, respectively. • plt.colorbar() add a color bar to explain the color map of the histogram plot. • plt.savefig() saves the current plot to a png file. By running the following commands in the subdirectory Analysis2.2: $ ./hist2d.py# or\$ python3 hist2d.py

we get the figure in the file Ramachandran_plot.png as shown in the following: In this figure, the intensity of the red color represents the frequency of the dihedral angle pair (φ, ψ).

Written by Cheng Tan@RIKEN Center for Computational Science, Computational Biophysics Research Team
October, 2021