Primary Sequences and the PDB Format
Each PDB-formatted file includes SEQRES records, which list the primary sequence of the polymeric molecules present in the entry. This sequence information is also available as a FASTA download. The listing gives the sequence of each chain of linear, covalently linked standard or modified amino acids or nucleotides. It may also include other residues that are linked to the standard backbone in the polymer. Chemical components or groups covalently linked to side chains (in peptides) or to sugars and/or bases (in nucleic acid polymers) are not listed here.
Here is an example from PDB entry 2dgc, which includes a protein chain and a DNA chain:
SEQRES 1 B 19 DT DG DG DA DG DA DT DG DA DC DG DT DC
SEQRES 2 B 19 DA DT DC DT DC DC
SEQRES 1 A 63 MET ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS
SEQRES 2 A 63 ARG ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA
SEQRES 3 A 63 ARG LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL
SEQRES 4 A 63 GLU GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU
SEQRES 5 A 63 VAL ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
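To make the layout of these records concrete, here is a minimal Python sketch that gathers the residue names for each chain. The function name `parse_seqres` is hypothetical, and the sketch assumes whitespace-separated fields as in the listing above. Real PDB files use fixed columns, so splitting on whitespace would fail for entries with a blank chain identifier.

```python
from collections import defaultdict

def parse_seqres(lines):
    """Collect residue names per chain from SEQRES lines (hypothetical helper)."""
    residues = defaultdict(list)
    declared_length = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        # Fields: record name, serial number, chain ID, residue count, residues...
        fields = line.split()
        chain_id = fields[2]
        declared_length[chain_id] = int(fields[3])
        residues[chain_id].extend(fields[4:])
    return dict(residues), declared_length

# The two SEQRES lines for chain B of entry 2dgc, as shown above.
records = [
    "SEQRES 1 B 19 DT DG DG DA DG DA DT DG DA DC DG DT DC",
    "SEQRES 2 B 19 DA DT DC DT DC DC",
]
residues, declared_length = parse_seqres(records)
print(len(residues["B"]), declared_length["B"])
```

The number of residues collected for a chain should equal the residue count repeated on each of its SEQRES lines, which gives a quick consistency check.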
In many cases, the coordinates presented in the ATOM records of a PDB file do not exactly match the sequence in the SEQRES records. The ends of chains and mobile loops are often not observed in crystallographic experiments, so no coordinates for them are included as ATOM records in the file. These residues are usually still listed in the SEQRES records, since that portion of the chain was present during the experiment. In these cases, a "REMARK 465" entry is included in the header of the PDB file to identify each missing residue. For all PDB entries, the file https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz notes regions of the molecule that have not been observed (i.e. residues that exist in the originally studied molecule, as shown in the SEQRES records, but not in the observed structure/coordinates).
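One rough way to spot such gaps is to compare the residue count declared in SEQRES with the number of distinct residues that actually have ATOM records. The sketch below does this for a small, made-up five-residue chain; the sample lines and coordinates are invented for illustration and do not come from a real entry. It reads the chain identifier, residue number, and insertion code from their fixed columns in the ATOM record (columns 22, 23-26, and 27).

```python
def declared_and_observed(lines):
    """Return (declared SEQRES length, observed residue count) per chain."""
    declared = {}
    observed = {}
    for line in lines:
        if line.startswith("SEQRES"):
            fields = line.split()
            declared[fields[2]] = int(fields[3])
        elif line.startswith("ATOM"):
            chain_id = line[21]                            # column 22
            residue_key = (int(line[22:26]), line[26])     # resSeq + insertion code
            observed.setdefault(chain_id, set()).add(residue_key)
    return declared, {chain: len(keys) for chain, keys in observed.items()}

# Invented example: 5 residues in SEQRES, but only residues 2-4 were observed.
sample = [
    "SEQRES   1 A    5  MET ALA GLY SER LYS",
    "ATOM      1  N   ALA A   2      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   2      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  N   GLY A   3      12.530   4.250  -3.810  1.00  0.00           N",
    "ATOM      4  N   SER A   4      13.912   2.870  -2.224  1.00  0.00           N",
]
declared, observed = declared_and_observed(sample)
for chain_id in declared:
    missing = declared[chain_id] - observed.get(chain_id, 0)
    print(chain_id, "residues without coordinates:", missing)
```

This only counts residues; it does not say which ones are missing. For that, the REMARK 465 entries in the file header are the authoritative source.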
You may also notice some differences with sequences in other databases. For example, a researcher may change or mutate particular residues to see the effect this will have on the overall structure, or a particular portion of it. The DBREF record provides cross-reference links between PDB sequences (as they appear in the SEQRES records) and a corresponding database sequence. The SEQADV record identifies differences between the sequence in the SEQRES records of the PDB entry and the sequence database entry given in DBREF.
Also, structural biologists often work with fragments of macromolecules, because they are more amenable to study than the full macromolecule. Thus, the SEQRES and ATOM records may include only a portion of the molecule, not the whole protein. Residue numbering can add a further complication. In some cases, researchers number the ATOM records based on the numbering of the whole protein, while in other cases, they number the chain based on the fragment. Any integer (negative, zero, or positive) can be used.
Amino Acid and Nucleotide Nomenclature
In the SEQRES records, the standard 3-character code is used for standard amino acids, and standard nucleotides are specified by 1 or 2 characters (two for deoxyribonucleotides, one for ribonucleotides):
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP, TYR
DA, DC, DG, DT, DI
A, C, G, U, I
Other codes are used for modified amino acids (such as MSE for selenomethionine) and for modified nucleotides (such as CBR for bromocytosine). You can see all of the codes used by browsing the wwPDB's Chemical Component Dictionary with Ligand Expo.
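As a small illustration of how these codes relate to the one-letter sequences found in FASTA files, the following Python sketch converts a list of residue names into a one-letter string. The function name `to_one_letter` is hypothetical, and mapping any unrecognized code (for example a modified residue such as MSE) to "X" is an assumption made here for simplicity, not a rule stated by the format.

```python
# Standard amino acid codes and their one-letter equivalents.
AMINO_ACIDS = {
    "ALA": "A", "CYS": "C", "ASP": "D", "GLU": "E", "PHE": "F",
    "GLY": "G", "HIS": "H", "ILE": "I", "LYS": "K", "LEU": "L",
    "MET": "M", "ASN": "N", "PRO": "P", "GLN": "Q", "ARG": "R",
    "SER": "S", "THR": "T", "VAL": "V", "TRP": "W", "TYR": "Y",
}
# Deoxyribonucleotides (two characters) and ribonucleotides (one character).
NUCLEOTIDES = {
    "DA": "A", "DC": "C", "DG": "G", "DT": "T", "DI": "I",
    "A": "A", "C": "C", "G": "G", "U": "U", "I": "I",
}

def to_one_letter(residue_names):
    """Convert SEQRES residue names to a one-letter string; unknown codes become 'X'."""
    lookup = {**AMINO_ACIDS, **NUCLEOTIDES}
    return "".join(lookup.get(name, "X") for name in residue_names)

# First SEQRES line of chain A and the start of chain B from entry 2dgc.
protein = "MET ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS".split()
dna = "DT DG DG DA DG DA".split()
print(to_one_letter(protein))  # MIVPESSDPAALK
print(to_one_letter(dna))      # TGGAGA
```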
Several additional records are included in the PDB format to define modifications as they appear in the ATOM records:
|MODRES||Modifications to standard residues|
|HET||Nonstandard residues (as well as ligands, ions and water)|
|HETNAM||Full chemical name of the residue|
|HETSYN||Synonyms for the residue|
|FORMUL||Chemical formula of the residue|
As an example, here are the records that describe HYP (hydroxyproline, a modified version of PRO, or proline) in the ATOM records for collagen entry 1cag:
MODRES 1CAG HYP A 2 PRO 4-HYDROXYPROLINE
HET HYP A 2 8
HETNAM HYP 4-HYDROXYPROLINE
HETSYN HYP HYDROXYPROLINE
FORMUL 1 HYP 30(C5 H9 N O3)
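The MODRES record is what links a modified residue back to its standard parent. The sketch below pulls those fields out of the MODRES line shown above. The function name `parse_modres` and the dictionary keys are hypothetical, and the sketch assumes whitespace-separated fields with a non-blank chain identifier and no insertion code. Real files use fixed columns, so a column-based parser would be more robust.

```python
def parse_modres(line):
    """Split a MODRES line into named fields (hypothetical helper)."""
    # Fields: record name, entry ID, residue name, chain ID,
    # sequence number, standard residue name, free-text description.
    fields = line.split(None, 6)
    return {
        "entry_id": fields[1],
        "res_name": fields[2],
        "chain_id": fields[3],
        "seq_num": int(fields[4]),
        "std_res": fields[5],
        "comment": fields[6].strip() if len(fields) > 6 else "",
    }

modres = parse_modres("MODRES 1CAG HYP A 2 PRO 4-HYDROXYPROLINE")
print(modres["res_name"], "is a modified", modres["std_res"])
```

A mapping built this way (here HYP to PRO) can be used to substitute the parent residue when a modified residue needs to be represented in a one-letter sequence.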
Complete file format documentation is available at http://www.wwpdb.org/documentation/file-format.
Access to Sequence Information
Primary sequences are presented in several ways on the RCSB PDB site. They are available directly in the PDB entry, which is easily accessed using the "Display Files" menu on each Structure Summary. A more detailed presentation is available under the "Sequence" tab (example: 1cag). There, the sequence from UniProtKB is presented, along with the sequence of residues that are included in the PDB file. This page also includes a schematic of secondary structures from DSSP and a variety of other annotations.
Several methods may be used to search for PDB entries based on their primary sequence, including the Sequence Search option from the left-hand menu. Advanced Search offers additional options for a more targeted search. For instance, you can specify sequence motifs using wildcards, repeated residues or alternative residues, and you can specify chain lengths or location in a genome. These options are available in Advanced Search under Sequence Features.
Sequences may also be used on the RCSB PDB site to find entries with similar sequences. In the "Seq. Similarity" tab of a Structure Summary page (example: 2dcg), clusters of structures grouped by percentage of sequence similarity are presented. Using the tools on this page, you can align sequences in these clusters using a variety of alignment methods.