Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format

Introduction and History

The PDBx/mmCIF file format and data dictionary is the basis of wwPDB data deposition, annotation, and archiving of PDB data from all supported experimental methods.

This PDB-101 resource is intended as an introductory guide.  The primary PDBx/mmCIF resource mmcif.wwpdb.org contains relevant data dictionaries and documentation, as well as a detailed description of the format's development and history.

The initial CIF (Crystallographic Information File) format and dictionary were developed for archiving small molecule crystallographic experiments. In 1997, the dictionary was expanded (mmCIF) to include data items relevant to macromolecular crystallographic experiments (PDBx/mmCIF). This format overcomes limitations of the legacy PDB file format and supports data representing large structures, complex chemistry, and new and hybrid experimental methods.  The legacy PDB file format is no longer modified or extended to support new content. As the PDBx/mmCIF format continues to evolve, PDB format files will become outdated.

PDBx/mmCIF explicitly documents the relationships between common data items (e.g. atom and residue identifiers). This permits software applications to evaluate and validate referential integrity within any PDB entry, and it maps information between the residue sequences of the experimental sample and the model coordinates. The mmCIF/PDBx Exchange Dictionary also provides metadata for each data item (e.g. data types, allowed ranges, controlled vocabularies).

The format is supported by visualization applications such as Jmol, Chimera, and OpenRasMol, and by structure determination systems such as CCP4 and Phenix.


Basics of Syntax and Format

PDBx/mmCIF format utilizes the ASCII character set.

All data items are identified by name, begin with the underscore character and are composed of a category name and an attribute name. The category name is separated from the attribute name by a period:

_citation.year

This combination of category and attribute may be termed an mmCIF token.
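The naming rule above can be illustrated with a few lines of Python. This is only a minimal sketch: the function name split_token is hypothetical and is not part of any PDBx/mmCIF library.

```python
def split_token(token: str) -> tuple[str, str]:
    """Split an mmCIF token such as '_citation.year' into (category, attribute)."""
    if not token.startswith("_"):
        raise ValueError(f"not a data item name: {token!r}")
    # Drop the leading underscore, then split at the first period.
    category, sep, attribute = token[1:].partition(".")
    if not sep or not category or not attribute:
        raise ValueError(f"expected _category.attribute, got {token!r}")
    return category, attribute


print(split_token("_citation.year"))  # ('citation', 'year')
```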

Data categories are presented in two styles: key-value and tabular.

In the key-value style, the mmCIF token is followed directly by a corresponding value.  The following example shows the unit cell parameters from entry 4hhb:

_cell.entry_id           4HHB
_cell.length_a           63.150
_cell.length_b           83.590
_cell.length_c           53.800
_cell.angle_alpha        90.00
_cell.angle_beta         99.34
_cell.angle_gamma        90.00
_cell.Z_PDB              4
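As a rough illustration, the key-value style can be read into a dictionary with plain string handling. The sketch below assumes each item fits on one line and that values contain no quotes or multi-line text fields, both of which real files may use. The helper name parse_key_value is hypothetical.

```python
CELL_TEXT = """\
_cell.entry_id           4HHB
_cell.length_a           63.150
_cell.length_b           83.590
_cell.length_c           53.800
_cell.angle_alpha        90.00
_cell.angle_beta         99.34
_cell.angle_gamma        90.00
_cell.Z_PDB              4
"""


def parse_key_value(text):
    """Return {token: value} for simple one-line key-value items."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        token, value = line.split(None, 1)  # split at the first run of whitespace
        items[token] = value.strip()
    return items


cell = parse_key_value(CELL_TEXT)
# Values are stored as text; convert them when numbers are needed.
lengths = [float(cell[f"_cell.length_{axis}"]) for axis in "abc"]
print(cell["_cell.entry_id"], lengths)
```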

The tabular style is used when there are multiple values for each token.  In this style, a loop_ token is followed by rows of data item names and then white-space delimited data values.  The following example shows the beginning of the coordinate records from entry 4hhb.  Here, data items in the atom_site category are used to describe the identities and atomic coordinates of the atoms in the entry:

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? 1   VAL A N   1
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? 1   VAL A CA  1
ATOM   3    C  C   . VAL A 1 1   ? 8.504   17.378  4.797   1.00 24.80 ? 1   VAL A C   1
ATOM   4    O  O   . VAL A 1 1   ? 8.805   17.011  5.943   1.00 37.68 ? 1   VAL A O   1
ATOM   5    C  CB  . VAL A 1 1   ? 6.369   19.044  5.810   1.00 72.12 ? 1   VAL A CB  1
ATOM   6    C  CG1 . VAL A 1 1   ? 7.009   20.127  5.418   1.00 61.79 ? 1   VAL A CG1 1
ATOM   7    C  CG2 . VAL A 1 1   ? 5.246   18.533  5.681   1.00 80.12 ? 1   VAL A CG2 1

The first data item name corresponds to the first data value, the second item to the next, and so on in each line of data. For example, the third data item _atom_site.type_symbol corresponds to the atom type given in the third column.  The list of data items is then “looped through” for each line of data values.

The hash symbol (#) at the beginning of a line is used to indicate a comment or to separate categories.
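The pairing of data item names with values in each row can be sketched in Python as follows. This is a simplified reader, not a full CIF parser: it assumes one row per line and values without quotes or embedded white space, which holds for the atom_site rows shown here but not for every category. The function name parse_loop is hypothetical.

```python
LOOP_TEXT = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? 1   VAL A N   1
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? 1   VAL A CA  1
"""


def parse_loop(text):
    """Return one dict per data row, keyed by data item name."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue
        if line.startswith("_"):
            names.append(line)  # header section: one data item name per line
            continue
        values = line.split()  # data section: white-space delimited values
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(values)}")
        rows.append(dict(zip(names, values)))
    return rows


atoms = parse_loop(LOOP_TEXT)
first = atoms[0]
print(first["_atom_site.type_symbol"], first["_atom_site.label_atom_id"])
print(float(first["_atom_site.Cartn_x"]))
```

Because each row becomes a dictionary keyed by the full data item name, a program does not need to know the column order in advance.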

Syntax is described in detail at the PDBx/mmCIF Resource site.


Format Examples

One of the main benefits of PDBx/mmCIF format is that it imposes no limits on the number of atoms, residues or chains that can be represented in a single PDB entry.

All data items in the current PDB format have corresponding data items in the PDBx/mmCIF format and each data item is precisely defined in the PDBx Exchange Data Dictionary. Correspondences between records in PDB file format and data items defined in the PDBx/mmCIF dictionary are described in detail in the documentation at the PDBx/mmCIF resource site (mmcif.wwpdb.org).

For example, the PDB record that contains the authors of the deposition, AUTHOR:

AUTHOR   G.FERMI,M.F.PERUTZ

is represented in PDBx/mmCIF with data items in the audit_author category:

loop_
_audit_author.name
_audit_author.pdbx_ordinal
'Fermi, G.'    1
'Perutz, M.F.' 2

PDBx/mmCIF organizes information in categories containing related data items. In the example above, the audit_author category contains the data items .name and .pdbx_ordinal. The .name data item contains each deposition author's last name and initials. The .pdbx_ordinal data item defines the position of the author's name in the list of authors.
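The correspondence between the AUTHOR record and the audit_author category can be sketched as a small conversion. This is an illustration only: the function name is hypothetical, and the sketch assumes a single-line record in which every name has the form INITIALS.SURNAME. Names with suffixes, hyphens, or multi-word surnames would need more careful handling.

```python
def author_record_to_rows(record):
    """Convert a one-line AUTHOR record into (name, ordinal) pairs."""
    if not record.startswith("AUTHOR"):
        raise ValueError("not an AUTHOR record")
    rows = []
    names = record[6:].strip().split(",")
    for ordinal, raw in enumerate(names, start=1):
        # 'M.F.PERUTZ' -> initials 'M.F', surname 'PERUTZ'
        initials, _, surname = raw.strip().rpartition(".")
        if initials:
            name = f"{surname.capitalize()}, {initials}."
        else:
            name = surname.capitalize()
        rows.append((name, ordinal))
    return rows


rows = author_record_to_rows("AUTHOR   G.FERMI,M.F.PERUTZ")
print("loop_")
print("_audit_author.name")
print("_audit_author.pdbx_ordinal")
for name, ordinal in rows:
    print(f"'{name}' {ordinal}")
```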

A category is a tabular data structure where data items are the columns and the stored information forms the rows:

audit_author
.name           .pdbx_ordinal
Fermi, G.       1
Perutz, M.F.    2

If a data item or group of data items in the same category has multiple rows of values, the category is preceded by a loop_ token. The list of data item names can then be followed by repeated rows of data values.

For example, the JRNL records of the PDB file for structure 4HHB include a primary citation with four authors:

JRNL        AUTH   G.FERMI,M.F.PERUTZ,B.SHAANAN,R.FOURME
JRNL        TITL   THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 A
JRNL        TITL 2 RESOLUTION
JRNL        REF    J.MOL.BIOL.                   V. 175   159 1984
JRNL        REFN                   ISSN 0022-2836
JRNL        PMID   6726807
JRNL        DOI    10.1016/0022-2836(84)90472-8

The PDB format file then continues with several additional references such as:

REMARK   1 REFERENCE 1
REMARK   1  AUTH   M.F.PERUTZ,S.S.HASNAIN,P.J.DUKE,J.L.SESSLER,J.E.HAHN
REMARK   1  TITL   STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN
REMARK   1  REF    NATURE                        V. 295   535 1982
REMARK   1  REFN                   ISSN 0028-0836

The _citation_author category loops through the authors of the various references:

loop_
_citation_author.citation_id
_citation_author.name
_citation_author.ordinal
primary 'Fermi, G.'     1
primary 'Perutz, M.F.'  2
primary 'Shaanan, B.'   3 
primary 'Fourme, R.'    4 
1       'Perutz, M.F.'  5 
1       'Hasnain, S.S.' 6 
1       'Duke, P.J.'    7 
1       'Sessler, J.L.' 8 
1       'Hahn, J.E.'    9 
2       'Fermi, G.'     10
2       'Perutz, M.F.'  11
3       'Perutz, M.F.'  12
4       'Teneyck, L.F.' 13
4       'Arnone, A.'    14
5       'Fermi, G.'     15
6       'Muirhead, H.'  16
6       'Greer, J.'     17
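To show how such a loop might be consumed, the sketch below groups author names by _citation_author.citation_id, using the first rows of the table above. It relies on Python's shlex module to split quoted values, which is only an approximation: CIF quoting rules differ from shell quoting (for example, a name containing an apostrophe), so a real parser should be used for production work.

```python
import shlex

ROWS = """\
primary 'Fermi, G.'     1
primary 'Perutz, M.F.'  2
primary 'Shaanan, B.'   3
primary 'Fourme, R.'    4
1       'Perutz, M.F.'  5
1       'Hasnain, S.S.' 6
1       'Duke, P.J.'    7
1       'Sessler, J.L.' 8
1       'Hahn, J.E.'    9
2       'Fermi, G.'     10
2       'Perutz, M.F.'  11
"""

authors = {}  # citation_id -> list of author names, in ordinal order
for line in ROWS.splitlines():
    if not line.strip():
        continue
    # Columns follow the loop_ header: citation_id, name, ordinal.
    citation_id, name, ordinal = shlex.split(line)
    authors.setdefault(citation_id, []).append(name)

for citation_id, names in authors.items():
    print(citation_id, "; ".join(names))
```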

The rest of the information contained in the JRNL records is included in related _citation category data items such as:

_citation.title
_citation.journal_abbrev
_citation.journal_volume
_citation.page_first
_citation.page_last
_citation.year

Categories have explicit relationships with one another. A category group is a named collection of related categories. For instance, all of the mmCIF categories containing bibliographic information are members of the citation_group category group. Included in this group are the citation, citation_author, and citation_editor categories.


Entities

One concept on which the PDBx/mmCIF format relies is that of entities. An entity is a chemically distinct part of a structure as represented in the PDBx/mmCIF data file. Data items in the _entity category describe the chemistry and identity of the molecules under investigation. In any particular entry, there may be multiple copies of a given entity.

For example, structure 4hhb contains two copies of the hemoglobin alpha chain (chains A and C) and two copies of the beta chain (chains B and D). The entry also contains four heme groups. In the PDBx/mmCIF file, the two alpha chains are considered one entity, the two beta chains are another, and the heme groups a third. Phosphate ions and water make up the fourth and fifth entities:

loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.pdbx_ec
_entity.pdbx_mutation
_entity.pdbx_fragment
_entity.details
1 polymer     man 'HEMOGLOBIN (DEOXY) (ALPHA CHAIN)' 15150.353 2   ? ? ? ?
2 polymer     man 'HEMOGLOBIN (DEOXY) (BETA CHAIN)'  15890.198 2   ? ? ? ?
3 non-polymer syn 'PROTOPORPHYRIN IX CONTAINING FE'  616.487   4   ? ? ? ?
4 non-polymer syn 'PHOSPHATE ION'                    94.971    2   ? ? ? ?
5 water       nat water                              18.015    221 ? ? ? ?

Each entity is assigned a unique numerical identifier (the _entity.id data item).
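As a small worked example, the entity rows for 4hhb can be tallied to count the molecules of each entity type. The sketch below again uses shlex as a stand-in for proper CIF tokenizing, which is an assumption that holds for these rows but not for CIF quoting in general.

```python
import shlex

ENTITY_ITEMS = [
    "_entity.id", "_entity.type", "_entity.src_method",
    "_entity.pdbx_description", "_entity.formula_weight",
    "_entity.pdbx_number_of_molecules", "_entity.pdbx_ec",
    "_entity.pdbx_mutation", "_entity.pdbx_fragment", "_entity.details",
]

ENTITY_ROWS = """\
1 polymer     man 'HEMOGLOBIN (DEOXY) (ALPHA CHAIN)' 15150.353 2   ? ? ? ?
2 polymer     man 'HEMOGLOBIN (DEOXY) (BETA CHAIN)'  15890.198 2   ? ? ? ?
3 non-polymer syn 'PROTOPORPHYRIN IX CONTAINING FE'  616.487   4   ? ? ? ?
4 non-polymer syn 'PHOSPHATE ION'                    94.971    2   ? ? ? ?
5 water       nat water                              18.015    221 ? ? ? ?
"""

# One dict per entity, keyed by data item name.
entities = [
    dict(zip(ENTITY_ITEMS, shlex.split(line)))
    for line in ENTITY_ROWS.splitlines()
    if line.strip()
]

# Sum the number of molecules for each entity type.
copies = {}
for entity in entities:
    count = int(entity["_entity.pdbx_number_of_molecules"])
    copies[entity["_entity.type"]] = copies.get(entity["_entity.type"], 0) + count

print(copies)
```

The two polymer entities account for four chains in total, which matches the description of two alpha and two beta chains.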


Parent-child Relationships

When data items occur in multiple categories, a parent-child relationship is created. This most commonly occurs for labels and identifiers which are reused throughout the dictionary. For instance, the entity identifier _entity.id defined in category ENTITY is the parent definition of this item. This identifier is reused in the ATOM_SITE category as item _atom_site.label_entity_id. In this case, the data item in the ATOM_SITE category is defined as a child of the data item in the ENTITY category.
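A parent-child relationship of this kind is what makes referential integrity checks possible: every value of the child item should also appear as a value of the parent item. A minimal sketch of such a check is given below. The function name and the sample rows are hypothetical, and the last row is deliberately broken to show a failure.

```python
def find_orphans(parent_ids, child_rows, child_item):
    """Return child rows whose child_item value is absent from parent_ids."""
    parents = set(parent_ids)
    return [row for row in child_rows if row[child_item] not in parents]


# Parent values: _entity.id for the five entities of 4hhb.
entity_ids = ["1", "2", "3", "4", "5"]

atom_rows = [
    {"_atom_site.id": "1", "_atom_site.label_entity_id": "1"},
    {"_atom_site.id": "2", "_atom_site.label_entity_id": "1"},
    # Hypothetical broken row, added only to show a failed check.
    {"_atom_site.id": "3", "_atom_site.label_entity_id": "9"},
]

orphans = find_orphans(entity_ids, atom_rows, "_atom_site.label_entity_id")
print([row["_atom_site.id"] for row in orphans])  # ['3']
```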

Chemical Component Dictionary

Chemical descriptions of all of the monomers and ligands in PDB structures are provided in PDBx/mmCIF format in the Chemical Component Dictionary. The collection of PDBx/mmCIF data categories used in the Chemical Component Dictionary is in the CHEM_COMP_DICTIONARY category group.

For example, in the PDBx/mmCIF definition for the chemical component HEM ("PROTOPORPHYRIN IX CONTAINING FE"), the component is defined:

_chem_comp.id                                    HEM
_chem_comp.name                                  "PROTOPORPHYRIN IX CONTAINING FE"
_chem_comp.type                                  NON-POLYMER
_chem_comp.pdbx_type                             HETAIN
_chem_comp.formula                               "C34 H32 Fe N4 O4"
_chem_comp.mon_nstd_parent_comp_id               ?
_chem_comp.pdbx_synonyms                         HEME
_chem_comp.pdbx_formal_charge                    0
_chem_comp.pdbx_initial_date                     1999-07-08
_chem_comp.pdbx_modified_date                    2016-01-20
_chem_comp.pdbx_ambiguous_flag                   Y
_chem_comp.pdbx_release_status                   REL
_chem_comp.pdbx_replaced_by                      ?
_chem_comp.pdbx_replaces                         MHM
_chem_comp.formula_weight                        616.487
_chem_comp.one_letter_code                       ?
_chem_comp.three_letter_code                     HEM
_chem_comp.pdbx_model_coordinates_details        ?
_chem_comp.pdbx_model_coordinates_missing_flag   N
_chem_comp.pdbx_ideal_coordinates_details        Corina
_chem_comp.pdbx_ideal_coordinates_missing_flag   N
_chem_comp.pdbx_model_coordinates_db_code        3IA3
_chem_comp.pdbx_subcomponent_list                ?
_chem_comp.pdbx_processing_site                  RCSB

This definition is then followed by data items describing the identity and ideal positions of each atom in the group.
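As a consistency check on the definition above, the value of _chem_comp.formula_weight can be approximately reproduced from _chem_comp.formula. The sketch below assumes the formula is a space-separated list of element symbols with optional counts, and it uses standard atomic masses rounded to three decimals. The result is therefore expected to agree with the stored value only to within a small rounding difference.

```python
import re

# Standard atomic masses (rounded), covering only the elements needed here.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "Fe": 55.845, "N": 14.007, "O": 15.999}


def formula_weight(formula):
    """Sum atomic masses for a formula such as 'C34 H32 Fe N4 O4'."""
    total = 0.0
    for part in formula.split():
        match = re.fullmatch(r"([A-Z][a-z]?)(\d*)", part)
        if match is None:
            raise ValueError(f"unrecognized formula part: {part!r}")
        symbol = match.group(1)
        count = int(match.group(2) or 1)  # a missing count means one atom
        total += ATOMIC_MASS[symbol] * count
    return total


weight = formula_weight("C34 H32 Fe N4 O4")
print(round(weight, 1))  # close to the stored formula_weight of 616.487
```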


Examples of software packages generating PDBx/mmCIF format files include

PDBx/mmCIF format files can be generated as output by the following software packages:

  • CCP4
    • CCP4/REFMAC: Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355-367.
    • Dedicated PDB deposition tasks in CCP4i2: Potterton, L., Agirre, J., Ballard, C., Cowtan, K., Dodson, E., Evans, P. R., Jenkins, H. T., Keegan, R., Krissinel, E., Stevenson, K., Lebedev, A., McNicholas, S. J., Nicholls, R. A., Noble, M., Pannu, N. S., Roth, C., Sheldrick, G., Skubak, P., Turkenburg, J., Uski, V., von Delft, F., Waterman, D., Wilson, K., Winn, M. & Wojdyr, M. (2018). Acta Cryst. D74, 68-84.
    • CCP4 Cloud: Krissinel, E., Uski, V., Lebedev, A., Winn, M. & Ballard, C. (2018). Acta Cryst. D74, 143-151.
  • PHENIX: Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213-221.
  • Global Phasing BUSTER: Bricogne, G., Blanc, E., Brandl, M., Flensburg, C., Keller, P., Paciorek, W., Roversi, P., Sharff, A., Smart, O. S., Vonrhein, C. & Womack, T. O. (2009). BUSTER, Global Phasing Ltd., Cambridge, UK.

Examples of visualization software applications supporting PDBx/mmCIF include

  • CCP4
    • CCP4mg: Potterton, L., McNicholas, S., Krissinel, E., Gruber, J., Cowtan, K., Emsley, P., Murshudov, G. N., Cohen, S., Perrakis, A. & Noble, M. (2004). Acta Cryst. D60, 2288-2294.
    • Coot: Brown, A., Long, F., Nicholls, R. A., Toots, J., Emsley, P. & Murshudov, G. (2015). Acta Cryst. D71, 136-153.
  • Chimera: Goddard, T. D., Huang, C. C., Meng, E. C., Pettersen, E. F., Couch, G. S., Morris, J. H. & Ferrin, T. E. (2018). Protein Sci. 27, 14-25.
  • Jmol/JSmol: Hanson, R. M. (2010). J. Appl. Cryst. 43, 1250-1260; Hanson, R. M., Prilusky, J., Renjian, Z., Nakane, T. & Sussman, J. L. (2013). Isr. J. Chem. 53, 207-216.
  • LiteMol: Sehnal, D., Deshpande, M., Vařeková, R. S., Mir, S., Berka, K., Midlik, A., Pravda, L., Velankar, S. & Koča, J. (2017). Nat. Methods, 14, 1121-1122.
  • Molmil: Bekker, G. J., Nakamura, H. & Kinjo, A. R. (2016). J. Cheminform, 8, 42, 1-5.
  • NGL: Rose, A. S., Bradley, A. R., Valasatava, Y., Duarte, J. M., Prlić, A. & Rose, P. W. (2018). Bioinformatics, bty419.
  • OpenRasMol: Bernstein, H. J. (2000). Trends Biochem. Sci. 25, 453-455.
  • PyMOL: DeLano, W. (2002). The PyMOL Molecular Graphics System.
  • VMD: Humphrey, W., Dalke, A. & Schulten, K. (1996). J. Mol. Graph. 14, 33-38.

Author

Rachel Kramer Green