NMR Spectroscopy Research Group

Analysis of data from molecular dynamics simulation

When the simulation has finished, it is time to continue with the analysis of the data. The analysis comprises of three stages. First, it is necessary to perform some standard checks to assess the quality of the simulations. If the results from these analyses show that the simulations are fine, then the analyses required to answer the research question asked can be performed per simulation. Finally, the results from different simulations can be combined.

NOTE: The names of the files should be self-explanatory, depending on the system you simulated. Here we assume that the default names have been used, which means that there are the following files:

topol.tpr: The run input file containing a complete description of the system at the start of the simulation
confout.gro: Structure file, contains the coordinates and velocities of the last step
traj.trr: The full precision trajectory containing the positions, velocities and forces over time
traj.xtc: A light weight trajectory, containing only coordinates in low precision (0.001 nm)
ener.edr: Energy related parameters over time
md.log: A file containing information about the simulation.

As a side note, many of the analysis tools produce graphs in the form of .xvg files. These files can be viewed with the program xmgr or xmgrace, but they can also be viewed in a terminal using the python script xvg2ascii.py.

A brief check of results

Before all else, it has to be verified that the simulations finished properly. There are many things that may cause simulations to crash, especially problems with the force field or insufficient equilibration of the system. To check whether the simulation finished properly, use the program gmxcheck:

gmxcheck -f traj.xtc

Verify that the simulation ran for 1 ns.

==Q== How many frames are in the trajectory file and what is the time resolution?

Another important source of information about the simulation is the log file. At the end of the file md.log, statistics about the simulation are given. This includes information about the use of CPU and memory resources and about the simulation time. Have a look at the end of the log file. If you use 'less' you may skip all the way to the end of the file by pressing 'G' (shift-g) Nerd tip: use sed to extract only the last part of the log file: sed -ne '/Total NODE time/,$p' md0.log

==Q== How long did the simulation run in real time (hours), what was the simulation speed (ns/day) and how long would it take to have the simulation reach a second?

==Q== Which contribution to the potential energy accounts for most of the calculations?

Don't be afraid to use the Gromacs online manual, to search in the Gromacs mailing list or even to use Google to get information about the terminology used by Gromacs.

Visualization of results

Now it's time for the fun part. Although much of the analysis will come down to extracting graphs from the trajectories, molecular dynamics is of course foremost about the motions in the system. Time to have a look at the trajectory.

First have a look with the trajectory viewer supplied with gromacs, ngmx. Although rudimentary and visually less appealing than other viewers, it has the advantage that it draws bonds based on the information in the topology file. Other viewers may infer bonds from distances, which can cause bonds that are considered too long not te be drawn or will add bonds between atoms that come too close together. This is a common source of misinterpretation of simulation results. ngmx is also lightweight enough to be used remotely, when viewing trajectories in a compute clister, for instance. Use ngmx to load the topology and trajectory:

ngmx -s topol.tpr -f traj.xtc

Have a look at the menus and try out the options. Play the movie. The view is controlled by the options on the right. For each of these options right- or left-click the mouse to change the view.

==Q== What happens if the protein diffuses over the boundary of the box?

For nicer visualization, we will now extract frames evry 10ps from the trajectory (-dt 10), leaving out the water (select Protein when asked for a selection). Moreover, we will remove the jumps over the boundaries and make a continuous trajectory (-pbc nojump). For these things there's the Swiss army knife gromacs tool trjconv, with its 1001 possible combinations of options. We use it to write a multi-model pdb file, to allow visualization in Pymol.

trjconv -s topol.tpr -f traj.xtc -o protein.pdb -pbc nojump -dt 10

Load the trajectory in Pymol.

pymol protein.pdb

After all frames have loaded, play the movie.

mplay

While the movie is playing, all other controls still work. You can rotate the system or zoom in/out with the mouse and you can change the representation of the molecule.

spectrum

show cell

If all is well, you now see the protein diffusing, tumbling and wiggling. However, we're more interested in the internal motions than the overall behaviour. In Pymol you can fit all frames in the trajectory to the first one, using the command intra_fit. Subsequently, you can set the focus to the protein with orient:

intra_fit protein and (name c,n,ca)

orient

Now all frames should be superimposed and you can see that some parts of the protein move more than other parts. This difference in motility will be quantified later on.
Of course, the protein would look better in cartoon representation. Try the following:

show cartoon

This probably gives you a thick tube for the backbone, rather than proper secondary structure elements, since there is no secondary structure information in the .pdb file. Pymol can calculate the secondary structure of a protein, but will only do it for one frame and project the results on all of them. For instance, the following will calculate the secondary structure for the first frame.

dss

By specifying a state, the frame to use for the calculations can be changed:

dss state=1000

Finally, let's look at all frames simultaneously and check the flexible and rigid regions of the protein.

set all_states=1

Feel free to play around with Pymol. Try to zoom in on flexible or rigid regions and check the side-chain conformations. Feel free to waste some (CPU) time on making an image, using 'ray' and 'png' Do mind that scenes that are too complex may cause the built-in ray-tracer of Pymol to crash, so in that case you can only get the image as you have it on screen using 'png' directly.

The following part, up to 'quality assurance', is optional and it may be best to first finish the other sections.

If you think you have plenty of time left for the remainder of the tutorial, or if you have already finished, than you may consider making a proper movie. You probably noticed that the trajectory is very noisy. That's basically thermal noise, and therefore just part of the normal behaviour, but it doesn't make very good movies. It is possible to filter out these high-frequency motions and retain only the slower and smoother global motions. For this, you can use the program g_filter:

g_filter -s topol.tpr -f traj.xtc -ol filtered.pdb -fit -nf 5

Now load the filtered trajectory in Pymol. Set the secondary structure (dss), show the secondary structure (show cartoon), hide the lines for backbone atoms (hide lines, not (name c,n,o)) and color the protein the way you want to have it. Then orient the molecule to set the scene. Now, you can start making a movie:

viewport 640,480

set ray_trace_frames,1

mpng frame_.png

Now quit Pymol (quit) and list the files in the directory (ls). As you may notice, there are quite a number of files now, including 1000 images. With 30 frames per second, that will make a movie of around 30 seconds. Download the program mpeg_encode and the parameter file movie.param and use it to create a movie from the individual frames (you may need to edit the parameter file to change the filenames):

mpeg_encode movie.param

Quality assurance

After a first visual inspection of the trajectory, it's time to perform some more thorough checks regarding quality of the simulations. This quality assurance (QA) involves tests for the convergence of thermodynamic parameters, such as the temperature, the pressure, the potential and the kinetic energy. More commonly phrased, QA tries to assess whether an equilibrium was reached. Convergence is also checked in terms of the structure, through the root mean square deviation (RMSD) against the starting structure and the average structure. Next to that, it has to be checked that there have not been interactions between adjacent periodic images, as such interactions could lead to unphysical effects. A final QA test involves the calculation of the root mean square fluctuations of atoms, which can be compared to crystallographic b-factors.

Convergence of energy terms

We start off with the extraction of some thermodynamic parameters from the energy file. The following properties will be investigated: temperature, pressure, potential energy, kinetic energy, unit cell volume, density, and the box dimensions. Most of these have been checked already for some of the preparation steps. Energy analysis is performed with the tool g_energy. This program reads in the energy file, which is produced during the simulation and has the extension .edr. g_energy will ask for a term to extract from the energy file and will produce a graph from that. Give the following command to run g_energy:

g_energy -f ener.edr -o temperature.xvg

This will give a list of the energy and related terms which are stored in the .edr energy file. In our case, there are probably about 68 terms in the energy file, each of which may be extracted and graphed. The first nine entries correspond to different energy terms in the force field. Also note the entries from 47 onward, which list the terms split over the groups Protein and Non-Protein, including the interactions between the two. To extract the temperature, type "14" and press Enter twice. Have a look at the graph with the program xmgrace and see how the temperature fluctuates around the value specified (300 K). From the fluctuations the heat capacity of the system can also be calculated if you pass the -fluct_props flag to g_energy and ask for both Temperature and Enthalpy ("Total-Energy"). The heat capacity will be given at the end of the output from g_energy.

==Q== What is the average temperature and what is the heat capacity of the system?

Now examine the temperature and enthalpy plots.

xmgrace -nxy temperature.xvg

Referencing the energy terms by name facilitates automatic processing of the energy file. Using 'echo' and a pipe ("|") to redirect output from one program to the other, the query from g_energy can be answered automatically. To extract multiple terms, these have to be separated with "\n". Copy and paste or type the following lines to extract the other terms. The energy terms can be referenced by numbers or by their names.

echo 15 | g_energy -f ener.edr -o pressure.xvg

echo -e "11\n12\n13" | g_energy -f ener.edr -o energy.xvg

echo Volume | g_energy -f ener.edr -o volume.xvg

echo Density | g_energy -f ener.edr -o density.xvg

echo -e "Box-X\nBox-Y\nBox-Z" | g_energy -f ener.edr -o box.xvg

Have a look at each of these files and see how the values converge. If the values have not converged, this indicates that the simulation has not yet reached thermal equilibrium and that it should be extended before doing further analysis. Moreover, the period over which equilibration takes place should not be included in the analysis. Here, for the sake of simplicity, we will neglect these considerations and use the results from the simulations as they are.

==Q== What are the terms plotted in the files energy.xvg and box.xvg

==Q== Estimate the plateau values for the pressure, the volume and the density.

The equilibration of some terms takes longer than that of other terms. In particular, the temperature readily converges to its equilibrium value, whereas the interaction between different parts of the system may take much longer. Have a look at the interaction energies between the protein and the solvent:

echo -e "49\n51" | g_energy -f ener.edr -o coulomb-inter.xvg

echo -e "50\n52 0" | g_energy -f ener.edr -o vanderwaals-inter.xvg

==Q== What are the terms plotted in the files coulomb-inter.xvg and vanderwaals-inter.xvg

Minimum distances between periodic images

One of the most important things to check in terms of quality assurance is that there have not been direct interactions between periodic images. Since periodic images are identical, such interactions are unphysical self-interactions. Imagine that a protein with a dipole would have direct interactions. Then the attraction between the two ends of the same molecule over the periodic boundary would affect and invalidate the results relating to the native behaviour of the protein. To verify that such interactions have not taken place, we calculate the minimal distance between periodic images at each time. This is done using g_mindist.

g_mindist -f traj.xtc -s topol.tpr -od minimal-periodic-distance.xvg -pi

==Q== What was the minimal distance between periodic images and at what time did that occur?

==Q== What happens if the minimal distance becomes shorter than the cut-off distance used for electrostatic interactions? Is it the case in your simulations?

It also matters if the small distance occurs transiently or if it is persistent. If it is persistent, it is likely affecting the protein dynamics; but if it's just transiently than it will hardly, if at all, influence.

==Q== Run now g_mindist on the C-alpha group, does it change the results? What does is mean for your system?

Not only direct interactions are of concern, but indirect effects, mediated through the water can also pose a problem. For instance, the protein can cause water ordening in the first four shells of water. That corresponds to about one nanometer. Ideally, the minimal distance should therefore not be less than two nanometers.

Root mean square fluctuations

Next to inspection of energies and such, convergence of the simulation towards equilibrium should also be inspected from the relaxation of the structure. Usually, such relaxation is only considered in terms of the Euclidean distance from the structure to a reference structure, e.g. the crystal structure. This distance is termed the root mean square deviation (RMSD). However, it is recommended also to investigate the relaxation towards the average structure, i.e. the RMSD with respect to the average. The reasons for this will be set out in the next paragraph. But to calculate the RMSD against the average structure, requires first to obtain the average. This structure can be obtained as a side product from calculation of the root mean square fluctuations (RMSF).

The RMSF captures, for each atom, the fluctuation about its average position. This gives insight into the flexibility of regions of the protein and corresponds to the crystallographic b-factors (temperature factors). Usually, one would expect similar profiles for the RMSF and the b-factors and this can be used to investigate whether the simulation results are in accordance with the crystal structure. The RMSF (and the average structure) are calculated with g_rmsf. The -oq options allows to calculate b-factors and add them to a reference structure. We're most interested in the fluctuations on a per residue basis, which is controlled by the flag -res.

g_rmsf -f traj.xtc -s topol.tpr -o rmsf-per-residue.xvg -ox average.pdb -oq bfactors.pdb -res

Have a look at the graph of the RMSF with xmgrace and identify the flexible and rigid regions.

==Q== Indicate the start and end residue for the most flexible regions and the maximum amplitudes.

Now view the two pdb files, by loading them into Pymol. Then color the structure bfactors.pdb according to b-factors and inspect the flexible regions. The average structure is an unphysical structure. Have a look at some of the side chains, and notice the effect of averaging over conformations.

pymol average.pdb bfactors.pdb

spectrum b, selection=bfactors

The colors are distributed over the range of b-factor values, with blue representing the lowest (most stable) and red the highest (most fluctuating). You can adjust the colour range by truncating the maximum value, for example setting it to 500:

Q = 500; cmd.alter("all", "q = b > Q and Q or b"); spectrum q

Convergence of RMSD

Be careful! Because your protein likely jumped out of the box, we have to rebuild the trajectory, bringing back the particles in the center periodic image. For that, run the following commands:

trjconv -f traj.xtc -o traj_nojump.xtc -pbc nojump

As the calculation of the RMSF also gave the average structure, we can now calculate the Root Mean Square Deviation (RMSD). The RMSD is commonly used as an indicator of convergence of the structure towards an equilibrium state. As mentioned above, the RMSD is merely a distance measure and is most meaningful for low values. The RMSD is calculated using the program g_rms. First calculate the RMSD for all protein atoms, using the starting structure as a reference:

g_rms -f traj_nojump.xtc -s topol.tpr -o rmsd-all-atom-vs-start.xvg

==Q== If observed, at what time and value does the RMSD reach a plateau?

Now calculate the RMSD again, but only select the backbone atoms:

g_rms -f traj_nojump.xtc -s topol.tpr -o rmsd-backbone-vs-start.xvg

This time the RMSD settles at a lower value, which is the result of excluding the, often flexible, side chain atoms. In both cases the RMSD should increase to a plateau value. This means that the structure of the protein reaches a certain distance from the reference structure and then keeps that distance more or less. However, with increasing distance, the amount of conformations available also increases. This means that, although the RMSD has reached a plateau value, the structure may still be progressing towards its equilibrium state. For this reason, it is advisable also to check the convergence towards the average structure.

echo 1 | trjconv -f traj_nojump.xtc -s topol.tpr -o protein.xtc

g_rms -f protein.xtc -s average.pdb -o rmsd-all-atom-vs-average.xvg

g_rms -f protein.xtc -s average.pdb -o rmsd-backbone-vs-average.xvg

Compare the resulting graphs to the previous ones. Note at which point the RMSD values level off.

==Q== Briefly discuss two differences between the graphs against the starting structure and against the average structure. Which is a better measure for convergence?

Convergence of radius of gyration

As a final part of the QA, we calculate the radius of gyration. This measure gives an indication of the shape of the molecule at each time. The radius of gyration compares to the experimentally obtainable hydrodynamic radius. It can be calculated using g_gyrate. This program will also give the individual components, which correspond to the eigenvalues of the matrix of inertia. This means that the first individual component corresponds to the longest axis of the molecule, while the last corresponds to the smallest. In effect, the three axes give a global indication of the shape of the molecule. Issue the command

g_gyrate -f traj.xtc -s topol.tpr -o radius-of-gyration.xvg

Have a look at the radius of gyration and the individual components and note how each of these progress to an equilibrium value.

==Q== At what time and value does the radius of gyration converge?

At this point we end the first part of the analysis, involving a visual inspection and quality assurance checks. Now time has come to dig in a bit deeper and have a look at what's going on inside the proteins. The second part of the analysis covers structural properties that can be calculated from protein configurations, such as the number of hydrogen bonds, the solvent accessible surface or specific atom-atom distances. If you're ready, advance to the next page.

Molecular Modeling Practical