Dealing with Coordinates
The primary information stored in the PDB archive consists of coordinate files for biological molecules. These files list the atoms in each protein, and their 3D location in space. These files are available in several formats (PDB, mmCIF, XML). A typical PDB formatted file includes a large "header" section of text that summarizes the protein, citation information, and the details of the structure solution, followed by the sequence and a long list of the atoms and their coordinates. The archive also contains the experimental observations that are used to determine these atomic coordinates.
When you start exploring the structures in the PDB archive, you will need to know a few things about coordinate files. The major topics are covered below.
ATOMs and HETATMs
A typical PDB format file will contain atomic coordinates for a diverse collection of proteins, small molecules, ions and water. Each atom is entered as a line of information that starts with a keyword: either ATOM or HETATM. By tradition, the ATOM keyword is used to identify protein or nucleic acid atoms, and the HETATM keyword is used to identify atoms in small molecules. Following this keyword, there is a list of information about the atom, including its name, its number in the file, the name and number of the residue it belongs to, one letter to specify the chain (in oligomeric proteins), its x, y, and z coordinates, and an occupancy and temperature factor (described in more detail below).
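The fields described above sit at fixed character columns in each ATOM or HETATM line, so they can be read by slicing the line. The following is a minimal sketch, assuming the standard fixed-column layout of the PDB format; the `AtomRecord` class, the `parse_atom_line` function, and the sample line are illustrative names and values, not part of any PDB software.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    record: str       # "ATOM" or "HETATM"
    serial: int       # atom number in the file
    name: str         # atom name, e.g. "CA"
    alt_loc: str      # alternate location indicator ("" if none)
    res_name: str     # residue name, e.g. "MET"
    chain_id: str     # one-letter chain identifier
    res_seq: int      # residue number
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float   # temperature factor


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line at its fixed columns (0-based slices)."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return AtomRecord(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        alt_loc=line[16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


# Illustrative line (made-up values), laid out in the fixed columns.
SAMPLE_LINE = "ATOM      1  N   MET A   1      24.512   8.259  -9.688  1.00 11.79           N"

atom = parse_atom_line(SAMPLE_LINE)
print(atom.record, atom.name, atom.res_name, atom.chain_id, atom.res_seq)
print(atom.x, atom.y, atom.z, atom.occupancy, atom.b_factor)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can run together without a separating space.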
This information gives you a lot of control when exploring the structure. For instance, most molecular graphics programs enable you to color identified portions of the molecule selectively: for example, to pick out all of the carbon atoms and color them green, or to pick one particular amino acid and highlight it.
Tip: By default, many molecular graphics programs do not display the water positions in a PDB file, even though they are often important to the function and interaction of biological molecules. Most of these programs have a way to display them, if you use their methods for atom selection.
Chains and Models
Biological molecules are hierarchical, building from atoms to residues to chains to assemblies. Coordinate files contain ways to organize and specify molecules at all of these levels. As described above, the atom names and residue information are included in each atom record. The higher-order information is identified by keywords that separate blocks of atom records, such as TER and MODEL.
Protein and nucleic acid chains are specified by the TER keyword, as well as a one-letter designation in the coordinate records. The chains are included one after another in the file, separated by a TER record to indicate that the chains are not physically connected to each other. Most molecular graphics programs look for this TER record so that they don't draw a bond to connect different chains.
PDB format files use the MODEL keyword to indicate multiple molecules in a single file. This was initially created to archive coordinate sets that include several different models of the same structure, like the structural ensembles obtained in NMR analysis. When you view these files, you will see dozens of similar molecules all superimposed. The MODEL keyword is now also used in biological assembly files to separate the many symmetrical copies of the molecule that are generated from the asymmetric unit (for more information, see the tutorial on biological assemblies).
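The way MODEL and TER records divide a file into blocks can be sketched in a few lines of code. This is only an illustration under simple assumptions: the function name `split_models_and_chains` is hypothetical, a file without MODEL records is treated as a single model numbered 1, and any atoms that follow the last TER record (typically ligands and waters) are collected into one trailing segment.

```python
def split_models_and_chains(pdb_lines):
    """Return {model_number: [segment, ...]}.

    Each segment is a list of ATOM/HETATM lines that ends at a TER record,
    at the end of a model, or at the end of the file.
    """
    models = {}
    model_id = 1   # assumed default when the file has no MODEL records
    segment = []

    def close_segment():
        # Store the atoms collected so far under the current model.
        nonlocal segment
        if segment:
            models.setdefault(model_id, []).append(segment)
            segment = []

    for line in pdb_lines:
        record = line[0:6].strip()
        if record == "MODEL":
            close_segment()
            model_id = int(line.split()[1])
        elif record in ("ATOM", "HETATM"):
            segment.append(line)
        elif record in ("TER", "ENDMDL"):
            close_segment()
    close_segment()
    return models
```

For example, calling `split_models_and_chains(open("example.pdb"))` on a hypothetical NMR ensemble file would give one dictionary entry per model, each holding one segment per TER-terminated chain.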
Temperature Factors
If we were able to hold an atom rigidly fixed in one place, we could observe its distribution of electrons in an ideal situation. The image would be dense towards the center with the density falling off further from the nucleus. When you look at experimental electron density distributions, however, the electrons usually have a wider distribution than this ideal. This may be due to vibration of the atoms, or differences between the many different molecules in the crystal lattice. The observed electron density will include an average of all these small motions, yielding a slightly smeared image of the molecule.
These motions, and the resultant smearing of the electron density, are incorporated into the atomic model by a B-value or temperature factor. The amount of smearing increases with the magnitude of the B-value. Values under 10 create a model of the atom that is very sharp, indicating that the atom is not moving much and is in the same position in all of the molecules in the crystal. Values greater than 50 or so indicate that the atom is moving so much that it can barely be seen. This is often the case for atoms at the surface of proteins, where long sidechains are free to wag in the surrounding water.
The picture shows the whole molecule, with the atoms colored by the temperature factors. High values, indicating lots of motion, are in red and yellow, and low values are in blue. Notice that the interior of the protein has low B-values and the amino acids on the surface have higher values.
Tip: Temperature factors are a measure of our confidence in the location of each atom. If you find an atom on the surface of a protein with a high temperature factor, keep in mind that this atom is probably moving a lot, and that the coordinates specified in the PDB file are only one possible snapshot of its location.
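To get a feel for what a B-value means in terms of distance, it can be converted to an approximate displacement. The sketch below assumes the standard isotropic relation B = 8π²⟨u²⟩, which is not stated in the text above, where ⟨u²⟩ is the mean-square displacement of the atom in Å² along one direction; the function name `rms_displacement` is illustrative.

```python
import math


def rms_displacement(b_factor: float) -> float:
    """Root-mean-square displacement in Å implied by an isotropic B-value in Å^2.

    Assumes B = 8 * pi^2 * <u^2>, so <u^2> = B / (8 * pi^2).
    """
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# The two thresholds mentioned in the text.
print(f"{rms_displacement(10):.2f}")  # 0.36
print(f"{rms_displacement(50):.2f}")  # 0.80
```

Under this assumption, a B-value of 10 corresponds to a displacement of roughly a third of an ångström, while a B-value of 50 corresponds to most of an ångström, a substantial fraction of a typical bond length.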
Occupancy and Multiple Conformations
Macromolecular crystals are composed of many individual molecules packed into a symmetrical arrangement. In some crystals, there are slight differences between each of these molecules. For instance, a sidechain on the surface may wag back and forth between several conformations, or a substrate may bind in two orientations in an active site, or a metal ion may be bound to only a few of the molecules. When researchers build the atomic model of these portions, they can use the occupancy to estimate the amount of each conformation that is observed in the crystal. For most atoms, the occupancy is given a value of 1, indicating that the atom is found in all of the molecules in the same place in the crystal. However, if a metal ion binds to only half of the molecules in the crystal, the researcher will see a weak image of the ion in the electron density map, and can assign an occupancy of 0.5 in the PDB structure file for this atom. Occupancies are also commonly used to identify sidechains or ligands that are observed in multiple conformations. The occupancy value is used to indicate the fraction of molecules that have each of the conformations. Two (or more) atom records are included for each atom, with occupancies like 0.5 and 0.5, or 0.4 and 0.6, or other fractional occupancies that sum to a total of 1.
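In PDB format files, the atom records for different conformations of the same atom are distinguished by an alternate location indicator (a letter such as A or B in column 17). The following sketch, with the hypothetical function name `altloc_occupancies`, collects the occupancy of every conformation for each such atom so that you can check whether the values sum to 1. It assumes complete, fixed-column ATOM/HETATM lines.

```python
from collections import defaultdict


def altloc_occupancies(pdb_lines):
    """Map each atom that has alternate locations to {altloc: occupancy}.

    Keys are (chain ID, residue number, insertion code, atom name).
    Atoms with a blank alternate location indicator are skipped.
    """
    sites = defaultdict(dict)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        alt_loc = line[16]
        if alt_loc == " ":
            continue  # single conformation, nothing to compare
        key = (line[21], int(line[22:26]), line[26].strip(), line[12:16].strip())
        sites[key][alt_loc] = float(line[54:60])
    return dict(sites)


def incomplete_sites(sites, tolerance=0.01):
    """Return the keys whose occupancies do not sum to 1 within the tolerance."""
    return [key for key, occ in sites.items()
            if abs(sum(occ.values()) - 1.0) > tolerance]
```

A site reported by `incomplete_sites` is not necessarily an error; for example, a partially bound ligand may be modeled with a single conformation at an occupancy below 1. The tolerance value is an arbitrary choice for this sketch.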
The picture below shows the whole myoglobin molecule, highlighting all of the amino acids that have two conformations in the file.
Alternate Conformations in Myoglobin (PDB entry 1a6m)
Tip: When dealing with PDB entries that include multiple conformations, you often need to pay close attention. It is not always possible to select just the "A" conformations and throw away the "B" conformations. You need to look carefully in each case and make sure that there are not any bad contacts between mobile sidechains.