Guide to Understanding PDB Data

Missing Coordinates and Biological Assemblies

Because of the characteristics of structure determination methods, most entries do not include coordinates for every atom in the molecule. In some cases, the experimental method does not observe certain atoms. For example, flexible regions and hydrogen atoms are usually not observed in X-ray crystallographic experiments and are therefore not included in the PDB coordinate files. In other cases, only a portion of the molecule is included in the PDB entry. For instance, in X-ray crystallographic structures of symmetrical molecules, the PDB entry often includes only one subunit of the complex, and coordinates for the full biological assembly need to be calculated from the subunit coordinates. When searching the PDB archive, it is important to consider which parts of a structure are included in each entry.

A few of the common situations you might encounter are described below.


Asymmetric and Biological Assemblies

In crystals used for X-ray crystallography, multiple copies of the protein and/or nucleic acid are symmetrically stacked in an array. Usually, the structure of the smallest unique portion of this array - called the asymmetric unit - is deposited with the PDB archive. Depending upon the symmetry in the crystal, the asymmetric unit may have one or more copies of the protein and/or nucleic acid.

The biologically relevant assembly of a molecule may be quite different from the asymmetric unit included in the PDB entry. In the case of hemoglobin, which functions as a tetramer, the asymmetric unit includes only 2 chains (half of the functional tetramer) in some PDB entries and 8 or more chains (representing several functional tetramers) in others. Icosahedral viruses are another common example: typically only a single chain is deposited, so coordinates for all 60 chains in the capsid need to be generated. The symmetry operations required to generate or select chains for the biological assemblies are provided in the entry files if you want to do the calculation yourself; alternatively, you can download coordinates for the biological assembly from the archive.
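As a rough illustration of what doing the calculation yourself involves, the sketch below applies symmetry operators of the form x' = R·x + t to a set of coordinates and collects the resulting copies. The coordinates and the two-fold operator are made-up values for illustration, not taken from any PDB entry, and the function names are hypothetical. In practice the operators would be read from the entry file (the REMARK 350 BIOMT records in PDB-format files, or the pdbx_struct_oper_list category in PDBx/mmCIF).

```python
def apply_operator(coords, rotation, translation):
    """Apply x' = R*x + t to every (x, y, z) coordinate in coords."""
    transformed = []
    for x, y, z in coords:
        transformed.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
            + translation[i]
            for i in range(3)
        ))
    return transformed


def build_assembly(coords, operators):
    """Concatenate one transformed copy of coords per (rotation, translation)."""
    assembly = []
    for rotation, translation in operators:
        assembly.extend(apply_operator(coords, rotation, translation))
    return assembly


# Hypothetical asymmetric unit containing two atoms (invented coordinates).
asym_unit = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

# Identity operator: keeps the deposited copy as it is.
IDENTITY = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])

# Hypothetical two-fold rotation about the z axis, with no translation.
TWO_FOLD_Z = ([[-1, 0, 0], [0, -1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])

assembly = build_assembly(asym_unit, [IDENTITY, TWO_FOLD_Z])
print(len(assembly), "atoms in the generated assembly")
```

Including the identity operator explicitly keeps the deposited copy in the output, which mirrors how an assembly is usually described as a list of operators applied to selected chains.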

For a detailed tutorial on biological assemblies, see the Biological Assemblies topic of the Guide to Understanding PDB Data.

Figure panels: Asymmetric Unit (top) and Biological Assembly (bottom)
PDB entry 1hho contains two chains, as shown at the top. Coordinates for the biologically-active tetramer are available in the biological assembly file, as shown at the bottom.

Tip: Coordinates for the biological assemblies are included in the "Download Files" menu.


Alpha-Carbon Coordinate Files

In some cases, experiments yield only a low-resolution image of a protein, as with structures from electron microscopy or from X-ray crystallography with crystals that are not well-ordered. In these cases, the experimental data are not sufficient to resolve every atom, and the researcher may choose to include only a single coordinate for each amino acid in the protein. Most often, the position of the alpha-carbon is included. These structures show only the folding of the protein chain.

This structure of the full-length KcsA potassium channel (PDB entry 1f6g) was solved by a number of spin-labeling and spectroscopic techniques. Since the method did not determine the location of each atom, only alpha-carbons were submitted to the PDB.

Tip: If you try to display a wireframe diagram of a PDB entry and get a blank screen or just a scattering of small dots, you may be looking at a structure with only alpha-carbons. Wireframe diagrams will typically come up blank with these files because the alpha-carbon positions are too far apart to show bonds. Instead, try using a ribbon diagram or a thick backbone tube to display the molecule. A spacefilling diagram with artificially large spheres (5 Å radius) also works well, if your molecular graphics program allows spheres that large.
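One way to check ahead of time whether an entry holds only alpha-carbons is to inspect the atom names in its ATOM records. The following is a minimal sketch that assumes a legacy PDB-format file, where the atom name occupies columns 13-16. The function names and the sample records (with invented coordinates) are illustrative only.

```python
def atom_names(pdb_text):
    """Return the atom name (columns 13-16) of every ATOM record."""
    names = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            names.append(line[12:16].strip())
    return names


def is_alpha_carbon_only(pdb_text):
    """True if the text has ATOM records and every one of them is a CA atom."""
    names = atom_names(pdb_text)
    return bool(names) and all(name == "CA" for name in names)


# Hypothetical sample records; the coordinates are invented.
CA_ONLY = "\n".join([
    "ATOM      1  CA  ALA A   1      11.104  13.207  10.000  1.00 20.00           C",
    "ATOM      2  CA  GLY A   2      14.500  13.900  11.200  1.00 20.00           C",
    "ATOM      3  CA  SER A   3      17.800  15.100  12.600  1.00 20.00           C",
])

ALL_ATOM = "\n".join([
    "ATOM      1  N   ALA A   1      10.000  12.500  11.500  1.00 20.00           N",
    "ATOM      2  CA  ALA A   1      11.104  13.207  10.000  1.00 20.00           C",
    "ATOM      3  C   ALA A   1      12.400  12.400  10.100  1.00 20.00           C",
    "ATOM      4  O   ALA A   1      12.500  11.300  10.600  1.00 20.00           O",
])

print("CA-only sample:", is_alpha_carbon_only(CA_ONLY))
print("All-atom sample:", is_alpha_carbon_only(ALL_ATOM))
```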


Missing Loops and Tails

Since X-ray crystallography relies on crystals in which many copies of the protein sit in exactly identical positions, flexible proteins cause problems. Regions in a protein that move are generally not observed in X-ray structures, so coordinates for these regions are not included in the PDB entries. You will see these as breaks in the chain, and often as missing segments at the beginning and end of the chain. Structures derived from NMR typically do not have this problem. Ensembles of NMR structures often include several very different conformations for flexible regions, so you can choose one or use them all.

Unfortunately, there is no simple solution to this problem apart from modeling coordinates for the missing portions (see the list of links for molecular modeling programs). This problem can be significant, since the flexible loops are often involved in the active site or binding site of the protein.
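Before modeling, it can help to locate where the breaks are. A minimal sketch, assuming a legacy PDB-format file with the chain identifier in column 22 and the residue number in columns 23-26, is to scan the ATOM records for jumps in residue numbering. This is only a heuristic: numbering can jump for other reasons, and missing segments at the very start or end of a chain cannot be found this way, so the entry's own list of unobserved residues (REMARK 465 in PDB-format files) is the more reliable source. The function names and sample records are hypothetical.

```python
def residue_numbers_by_chain(pdb_text):
    """Map each chain ID to its residue numbers, in file order, without repeats."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain_id = line[21]          # column 22
            res_seq = int(line[22:26])   # columns 23-26
            numbers = chains.setdefault(chain_id, [])
            if not numbers or numbers[-1] != res_seq:
                numbers.append(res_seq)
    return chains


def find_numbering_gaps(pdb_text):
    """Return (chain, first_missing, last_missing) for each jump in numbering."""
    gaps = []
    for chain_id, numbers in residue_numbers_by_chain(pdb_text).items():
        for previous, current in zip(numbers, numbers[1:]):
            if current - previous > 1:
                gaps.append((chain_id, previous + 1, current - 1))
    return gaps


# Hypothetical chain with residues 3-5 unobserved; coordinates are invented.
SAMPLE = "\n".join([
    "ATOM      1  CA  MET A   1      11.104  13.207  10.000  1.00 20.00           C",
    "ATOM      2  CA  LYS A   2      14.500  13.900  11.200  1.00 20.00           C",
    "ATOM      3  CA  GLU A   6      25.300  18.400  15.700  1.00 20.00           C",
    "ATOM      4  CA  LEU A   7      28.600  19.800  16.900  1.00 20.00           C",
])

for chain_id, first, last in find_numbering_gaps(SAMPLE):
    print(f"Chain {chain_id}: residues {first}-{last} have no coordinates")
```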

The structure of SIV protease solved without inhibitors in its active site (PDB entry 1az5) had two loops that were too flexible to be seen in the experiment (shown with stars in the upper picture). When the protein was crystallized with inhibitors, however, the loops adopted a stable structure that could be observed (PDB entry 1yti). (This picture was created with MBT Simple Viewer.)

Tip: It is often useful to search for other structures that include ligands or binding partners. In those cases, the loop may be closed around the ligand in a stable conformation, and thus will be seen in the crystallographic experiment.


Fragments and Domains

Many large proteins, especially proteins with several movable parts, have proven impossible to crystallize as a whole. In these cases, researchers have taken a piecewise approach. They cut the protein into manageable pieces, and then solved the structure of each piece. In order to obtain a picture of the whole protein, the pieces have had to be reassembled in the proper orientation.

Unfortunately, there isn't a comprehensive resource to help you piece together the functional molecule in these cases. You will need to look at sequence data along with reports of the molecular biology to sort out the overall form.

ATP synthase is composed of two molecular motors connected by an axle and a stator. It has proven impossible (at least so far) to crystallize the whole complex, but structures are available for both of the motors (PDB entries 1c17 and 1e79) and several of the connecting parts (PDB entries 1l2p and 2a7u). (This picture was created with QuickPDB.)

Tip: When searching for PDB entries, it is important to pay attention to what is actually included in each coordinate file. Watch for words like "ligand-binding domain" and "fragment" in the PDB title to give you hints that you are viewing only a portion of the functional molecule.


Where are the Hydrogen Atoms?

Most crystallographic experiments do not resolve hydrogen atoms, so most of the crystallographic coordinate files in the PDB archive include positions only for the non-hydrogen atoms. In some cases, polar hydrogen atoms (those attached to nitrogen, oxygen, and sulfur, which can participate in hydrogen bonds) are included when they were used during the refinement of the structure. NMR-determined structures, on the other hand, most often include all of the hydrogen atoms, since much of the experimental information obtained in these experiments consists of distances between hydrogen atoms.
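To see which situation applies to a given file, one can simply count the hydrogen atoms it contains. The sketch below assumes a legacy PDB-format file with the element symbol in columns 77-78, and falls back to a guess from the atom name when that field is blank. Deuterium (D) is counted as hydrogen. The function name and the sample records are illustrative, with invented coordinates.

```python
def count_hydrogens(pdb_text):
    """Return (hydrogen_count, total_atom_count) over the ATOM records."""
    hydrogens = 0
    total = 0
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        total += 1
        element = line[76:78].strip().upper()  # columns 77-78
        if not element:
            # Heuristic fallback (an assumption of this sketch): take the first
            # non-digit character of the atom name in columns 13-16.
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        if element in ("H", "D"):
            hydrogens += 1
    return hydrogens, total


# Hypothetical records for one glycine residue, including two hydrogens.
WITH_HYDROGENS = "\n".join([
    "ATOM      1  N   GLY A   1      10.000  12.500  11.500  1.00 20.00           N",
    "ATOM      2  H   GLY A   1      10.300  11.600  11.800  1.00 20.00           H",
    "ATOM      3  CA  GLY A   1      11.100  13.200  10.900  1.00 20.00           C",
    "ATOM      4  HA2 GLY A   1      11.900  12.700  10.400  1.00 20.00           H",
])

# The same residue with heavy atoms only, as in most crystallographic files.
HEAVY_ONLY = "\n".join([
    "ATOM      1  N   GLY A   1      10.000  12.500  11.500  1.00 20.00           N",
    "ATOM      2  CA  GLY A   1      11.100  13.200  10.900  1.00 20.00           C",
])

for label, text in (("with hydrogens", WITH_HYDROGENS), ("heavy only", HEAVY_ONLY)):
    n_hydrogen, n_total = count_hydrogens(text)
    print(f"{label}: {n_hydrogen} of {n_total} atoms are hydrogens")
```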

Since crystallographic experiments typically do not see hydrogen atoms, and since oxygen and nitrogen atoms have similar numbers of electrons and thus look similar in crystallographic electron density maps, it is often difficult to determine the exact identity of atoms in sidechains such as asparagines and glutamines. In some cases, if you look carefully at the hydrogen-bonding pattern with neighboring amino acids, you may find a better match by switching the nitrogen and oxygen in an amide sidechain.

These two structures of insulin were solved by different experimental techniques. The one on the top (PDB entry 2ins) was solved by X-ray crystallography, which does not provide data for the positions of hydrogen atoms. The one on the bottom (PDB entry 2hiu) was solved by NMR spectroscopy and includes hydrogen coordinates. (This picture was created with MBT Simple Viewer.)

Tip: The program Reduce is useful for adding hydrogen atoms that are missing in proteins and nucleic acids, and for determining the best hydrogen-bonding pattern in a protein.