Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format
Introduction and History
The PDBx/mmCIF file format and data dictionary is the basis of wwPDB data deposition, annotation, and archiving of PDB data from all supported experimental methods.
This PDB-101 resource is intended as an introductory guide. The primary PDBx/mmCIF resource mmcif.wwpdb.org contains relevant data dictionaries and documentation, as well as a detailed description of the format's development and history.
The initial CIF (Crystallographic Information File) format and dictionary was developed for archiving small molecule crystallographic experiments. In 1997, the dictionary was expanded (mmCIF) to include data items relevant to macromolecular crystallographic experiments (PDBx/mmCIF). This format overcomes limitations of the legacy PDB file format and supports data representing large structures, complex chemistry, and new and hybrid experimental methods. The legacy PDB file format is no longer modified or extended to support new content. As the PDBx/mmCIF format continues to evolve, PDB format files will become outdated.
PDBx/mmCIF explicitly documents the relationships between common data items (e.g., atom and residue identifiers). This permits software applications to evaluate and validate referential integrity within any PDB entry and to map information between the residue sequences of the experimental sample and the model coordinates. The PDBx Exchange Dictionary provides metadata for each data item (e.g., data types, allowed ranges, controlled vocabularies).
The PDBx/mmCIF format is supported by visualization applications such as Jmol, Chimera, and OpenRasMol, and by structure determination systems such as CCP4 and Phenix.
Basics of Syntax and Format
PDBx/mmCIF format utilizes the ASCII character set.
All data items are identified by name, begin with the underscore character and are composed of a category name and an attribute name. The category name is separated from the attribute name by a period:
_citation.year
This combination of category and attribute may be termed an mmCIF token.
Data categories are presented in two styles: key-value and tabular.
In the key-value style, the mmCIF token is followed directly by a corresponding value. The following example shows the unit cell parameters from entry 4hhb:
_cell.entry_id 4HHB
_cell.length_a 63.150
_cell.length_b 83.590
_cell.length_c 53.800
_cell.angle_alpha 90.00
_cell.angle_beta 99.34
_cell.angle_gamma 90.00
_cell.Z_PDB 4
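As a minimal sketch of how the key-value style can be read, the Python below splits each mmCIF token into its category and attribute and stores the value under both. The function name parse_key_value is hypothetical, and the sketch assumes one unquoted value per line; quoted and multi-line values are not handled.

```python
# Sample data copied from the unit cell example above.
KEY_VALUE_TEXT = """\
_cell.entry_id 4HHB
_cell.length_a 63.150
_cell.length_b 83.590
_cell.length_c 53.800
_cell.angle_alpha 90.00
_cell.angle_beta 99.34
_cell.angle_gamma 90.00
_cell.Z_PDB 4
"""


def parse_key_value(text):
    """Return {category: {attribute: value}} for simple one-line items.

    Hypothetical helper: values are kept as strings, and quoted or
    multi-line values are outside the scope of this sketch.
    """
    categories = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/separator lines
        token, value = line.split(None, 1)
        # Drop the leading underscore, then split at the first period.
        category, attribute = token[1:].split(".", 1)
        categories.setdefault(category, {})[attribute] = value
    return categories


cell = parse_key_value(KEY_VALUE_TEXT)["cell"]
print(cell["entry_id"], float(cell["length_a"]))
```

Values are left as strings on purpose: the data type of each item is defined in the dictionary, not in the data file itself.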
The tabular style is used when there are multiple values for each token. In this style, a loop_ token is followed by rows of data item names and then white-space delimited data values. The following example shows the beginning of the coordinate records from entry 4hhb. Here, data items in the atom_site category are used to describe the identities and atomic coordinates of the atoms in the entry:
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . VAL A 1 1 ? 6.204 16.869 4.854 1.00 49.05 ? 1 VAL A N 1
ATOM 2 C CA . VAL A 1 1 ? 6.913 17.759 4.607 1.00 43.14 ? 1 VAL A CA 1
ATOM 3 C C . VAL A 1 1 ? 8.504 17.378 4.797 1.00 24.80 ? 1 VAL A C 1
ATOM 4 O O . VAL A 1 1 ? 8.805 17.011 5.943 1.00 37.68 ? 1 VAL A O 1
ATOM 5 C CB . VAL A 1 1 ? 6.369 19.044 5.810 1.00 72.12 ? 1 VAL A CB 1
ATOM 6 C CG1 . VAL A 1 1 ? 7.009 20.127 5.418 1.00 61.79 ? 1 VAL A CG1 1
ATOM 7 C CG2 . VAL A 1 1 ? 5.246 18.533 5.681 1.00 80.12 ? 1 VAL A CG2 1
The first data item name corresponds to the first data value, the second name to the second value, and so on across each row of data. For example, the third data item, _atom_site.type_symbol, corresponds to the atom type given in the third column. The list of data items is then “looped through” for each row of data values.
The hash symbol (#) at the beginning of a line is used to indicate a comment or to separate categories.
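The tabular style can be read with a similar sketch: collect the data item names that follow loop_, pool the white-space delimited values, and cut the pool into rows as wide as the name list. The function name parse_loop is hypothetical, the sample is abbreviated to seven of the atom_site data items shown above, and Python's shlex module is used as an approximation of mmCIF quoting rules, which differ in some details. Production code would normally rely on an established mmCIF parser instead.

```python
import shlex

# Abbreviated sample based on the 4hhb atom_site example above.
# The second row is wrapped onto two lines to show that values are pooled.
LOOP_TEXT = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N N 6.204 16.869 4.854
ATOM 2 C CA 6.913
17.759 4.607
#
"""


def parse_loop(text):
    """Return one dict per row of a single loop_ block (hypothetical helper)."""
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue  # blank lines, comments/separators, and the loop_ keyword
        if line.startswith("_") and not values:
            names.append(line)  # still inside the list of data item names
        else:
            values.extend(shlex.split(line))  # keeps quoted values together
    if not names or len(values) % len(names):
        raise ValueError("value count is not a multiple of the item count")
    width = len(names)
    return [
        dict(zip(names, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


rows = parse_loop(LOOP_TEXT)
for row in rows:
    print(row["_atom_site.id"], row["_atom_site.type_symbol"], row["_atom_site.Cartn_x"])
```

Because values are pooled before being cut into rows, a row that wraps onto a second line is still assigned to the correct data items.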
Syntax is described in detail at the PDBx/mmCIF Resource site.
Format Examples
One of the main benefits of PDBx/mmCIF format is that it imposes no limitations for the number of atoms, residues or chains that can be represented in a single PDB entry.
All data items in the current PDB format have corresponding data items in the PDBx/mmCIF format, and each data item is precisely defined in the PDBx Exchange Data Dictionary. Correspondences between records in the PDB file format and data items defined in the PDBx/mmCIF dictionary are described in detail in the PDB to PDBx/mmCIF Data Item Correspondences resource listed under Helpful Links.
For example, the PDB record that contains the authors of the deposition, AUTHOR:
AUTHOR G.FERMI,M.F.PERUTZ
is represented in PDBx/mmCIF with data items in the audit_author category:
loop_
_audit_author.name
_audit_author.pdbx_ordinal
'Fermi, G.' 1
'Perutz, M.F.' 2
PDBx/mmCIF organizes information in categories containing related data items. In the example above, the audit_author category contains the data items .name and .pdbx_ordinal. The .name data item contains a deposition author's last name and initials. The .pdbx_ordinal data item defines the position of that author's name in the list of authors.
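The correspondence between the AUTHOR record and the audit_author category can be sketched in Python as follows. The function names are hypothetical, and the sketch makes simplifying assumptions: the record is treated as the keyword followed by a comma-separated list, the fixed-column layout and continuation lines of the PDB format are ignored, and each name is assumed to be initials followed by a single-word surname.

```python
def pdb_author_to_mmcif(name):
    """Convert 'M.F.PERUTZ' to 'Perutz, M.F.' (hypothetical helper).

    Assumes the initials end at the last period; suffixes and multi-word
    surnames are not handled.
    """
    initials, _, surname = name.strip().rpartition(".")
    if not initials:
        return surname.capitalize()  # no initials found, keep surname only
    return f"{surname.capitalize()}, {initials}."


def author_record_to_loop(record):
    """Build an audit_author loop from a one-line AUTHOR record."""
    names = record.split(None, 1)[1].split(",")
    lines = ["loop_", "_audit_author.name", "_audit_author.pdbx_ordinal"]
    for ordinal, name in enumerate(names, start=1):
        # Names contain a space, so they are wrapped in single quotes.
        lines.append(f"'{pdb_author_to_mmcif(name)}' {ordinal}")
    return "\n".join(lines)


audit_author_loop = author_record_to_loop("AUTHOR G.FERMI,M.F.PERUTZ")
print(audit_author_loop)
```

The ordinal is generated from the position in the list, which mirrors how .pdbx_ordinal records the order of the authors.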
A category is a tabular data structure. In the table below, the data items are the rows and the stored values are the columns:
audit_author
.name | Fermi, G. | Perutz, M.F.
.pdbx_ordinal | 1 | 2
If there are multiple columns within a data item or group of data items in the same category, the category is preceded by a loop_ token. The list of data item names can then be followed by repeated rows of data values.
For example, the JRNL records of the PDB file for structure 4HHB include a primary citation with four authors:
JRNL AUTH G.FERMI,M.F.PERUTZ,B.SHAANAN,R.FOURME
JRNL TITL THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 A
JRNL TITL 2 RESOLUTION
JRNL REF J.MOL.BIOL. V. 175 159 1984
JRNL REFN ISSN 0022-2836
JRNL PMID 6726807
JRNL DOI 10.1016/0022-2836(84)90472-8
The PDB format file then continues with several additional references such as:
REMARK 1 REFERENCE 1
REMARK 1 AUTH M.F.PERUTZ,S.S.HASNAIN,P.J.DUKE,J.L.SESSLER,J.E.HAHN
REMARK 1 TITL STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN
REMARK 1 REF NATURE V. 295 535 1982
REMARK 1 REFN ISSN 0028-0836
The _citation_author category loops through the authors of the various references:
loop_
_citation_author.citation_id
_citation_author.name
_citation_author.ordinal
primary 'Fermi, G.' 1
primary 'Perutz, M.F.' 2
primary 'Shaanan, B.' 3
primary 'Fourme, R.' 4
1 'Perutz, M.F.' 5
1 'Hasnain, S.S.' 6
1 'Duke, P.J.' 7
1 'Sessler, J.L.' 8
1 'Hahn, J.E.' 9
2 'Fermi, G.' 10
2 'Perutz, M.F.' 11
3 'Perutz, M.F.' 12
4 'Teneyck, L.F.' 13
4 'Arnone, A.' 14
5 'Fermi, G.' 15
6 'Muirhead, H.' 16
6 'Greer, J.' 17
The rest of the information contained in the JRNL records is included in related _citation category data items such as:
_citation.title
_citation.journal_abbrev
_citation.journal_volume
_citation.page_first
_citation.page_last
_citation.year
Categories have explicit relationships with one another. A category group is a named collection of related categories. For instance, all of the mmCIF categories containing bibliographic information are members of the citation_group category group. Included in this group are the citation, citation_author, and citation_editor categories.
Entities
One concept on which the PDBx/mmCIF format relies is that of entities. An entity is a chemically distinct part of a structure as represented in the PDBx/mmCIF data file. Data items in the _entity category describe the chemistry and identity of the molecules under investigation. In any particular entry, there may be multiple copies of a given entity.
For example, structure 4hhb contains two copies of the hemoglobin alpha chain (chains A and C) and two copies of the beta chain (chains B and D). The entry also contains four heme groups. In the PDBx/mmCIF file, the two alpha chains are considered one entity, the two beta chains another, and the heme groups a third. Phosphate ions and water make up the fourth and fifth entities:
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.pdbx_ec
_entity.pdbx_mutation
_entity.pdbx_fragment
_entity.details
1 polymer man 'HEMOGLOBIN (DEOXY) (ALPHA CHAIN)' 15150.353 2 ? ? ? ?
2 polymer man 'HEMOGLOBIN (DEOXY) (BETA CHAIN)' 15890.198 2 ? ? ? ?
3 non-polymer syn 'PROTOPORPHYRIN IX CONTAINING FE' 616.487 4 ? ? ? ?
4 non-polymer syn 'PHOSPHATE ION' 94.971 2 ? ? ? ?
5 water nat water 18.015 221 ? ? ? ?
Each entity is assigned a unique numerical identifier (the _entity.id data item).
Parent-child Relationships
When data items occur in multiple categories a parent-child relationship is created. This most commonly occurs for labels and identifiers which are reused throughout the dictionary. For instance, the entity identifier _entity.id defined in category ENTITY is the parent definition of this item. This identifier is reused in the ATOM_SITE category as item _atom_site.label_entity_id. In this case, the data item in the ATOM_SITE category is defined as a child of the data item in the ENTITY category.
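A parent-child relationship lets software check referential integrity: every child value should also appear as a parent value. The Python sketch below illustrates the idea for _atom_site.label_entity_id and _entity.id. The function name find_orphans is hypothetical, the rows are simplified to a few attributes, and the last atom_site row is a deliberately invalid, made-up row that is not part of entry 4hhb.

```python
# Parent rows: the five entities of 4hhb, reduced to id and type.
ENTITY_ROWS = [
    {"id": "1", "type": "polymer"},
    {"id": "2", "type": "polymer"},
    {"id": "3", "type": "non-polymer"},
    {"id": "4", "type": "non-polymer"},
    {"id": "5", "type": "water"},
]

# Child rows: two real atoms plus one invented row with a bad entity id.
ATOM_SITE_ROWS = [
    {"id": "1", "label_atom_id": "N", "label_entity_id": "1"},
    {"id": "2", "label_atom_id": "CA", "label_entity_id": "1"},
    {"id": "9999", "label_atom_id": "X", "label_entity_id": "9"},  # hypothetical
]


def find_orphans(child_rows, child_key, parent_rows, parent_key):
    """Return child rows whose key value has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in parent_ids]


orphans = find_orphans(ATOM_SITE_ROWS, "label_entity_id", ENTITY_ROWS, "id")
print([row["id"] for row in orphans])
```

In practice the parent and child pairs would be read from the dictionary rather than hard-coded, since the dictionary is where these relationships are defined.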
Chemical Component Dictionary
Chemical descriptions of all of the monomers and ligands in PDB structures are provided in PDBx/mmCIF format in the Chemical Component Dictionary. The collection of PDBx/mmCIF data categories used in the Chemical Component Dictionary is in the CHEM_COMP_DICTIONARY category group.
For example, in the PDBx/mmCIF definition for the chemical component HEM ("PROTOPORPHYRIN IX CONTAINING FE"), the group is defined as follows:
_chem_comp.id HEM
_chem_comp.name "PROTOPORPHYRIN IX CONTAINING FE"
_chem_comp.type NON-POLYMER
_chem_comp.pdbx_type HETAIN
_chem_comp.formula "C34 H32 Fe N4 O4"
_chem_comp.mon_nstd_parent_comp_id ?
_chem_comp.pdbx_synonyms HEME
_chem_comp.pdbx_formal_charge 0
_chem_comp.pdbx_initial_date 1999-07-08
_chem_comp.pdbx_modified_date 2016-01-20
_chem_comp.pdbx_ambiguous_flag Y
_chem_comp.pdbx_release_status REL
_chem_comp.pdbx_replaced_by ?
_chem_comp.pdbx_replaces MHM
_chem_comp.formula_weight 616.487
_chem_comp.one_letter_code ?
_chem_comp.three_letter_code HEM
_chem_comp.pdbx_model_coordinates_details ?
_chem_comp.pdbx_model_coordinates_missing_flag N
_chem_comp.pdbx_ideal_coordinates_details Corina
_chem_comp.pdbx_ideal_coordinates_missing_flag N
_chem_comp.pdbx_model_coordinates_db_code 3IA3
_chem_comp.pdbx_subcomponent_list ?
_chem_comp.pdbx_processing_site RCSB
This definition is then followed by data items describing the identity and ideal positions of each atom in the group.
Examples of software packages generating PDBx/mmCIF format files include
PDBx/mmCIF format files can be generated by the following software packages:
- CCP4
  - CCP4/REFMAC: Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355-367.
  - Dedicated PDB deposition tasks in CCP4i2: Potterton, L., Agirre, J., Ballard, C., Cowtan, K., Dodson, E., Evans, P. R., Jenkins, H. T., Keegan, R., Krissinel, E., Stevenson, K., Lebedev, A., McNicholas, S. J., Nicholls, R. A., Noble, M., Pannu, N. S., Roth, C., Sheldrick, G., Skubak, P., Turkenburg, J., Uski, V., von Delft, F., Waterman, D., Wilson, K., Winn, M. & Wojdyr, M. (2018). Acta Cryst. D74, 68-84.
  - CCP4 Cloud: Krissinel, E., Uski, V., Lebedev, A., Winn, M. & Ballard, C. (2018). Acta Cryst. D74, 143-151.
- PHENIX: Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213-221.
- Global Phasing BUSTER: Bricogne, G., Blanc, E., Brandl, M., Flensburg, C., Keller, P., Paciorek, W., Roversi, P., Sharff, A., Smart, O. S., Vonrhein, C. & Womack, T. O. (2009). BUSTER, Global Phasing Ltd., Cambridge, UK.
Examples of visualization software applications supporting PDBx/mmCIF include
- CCP4
  - CCP4mg: Potterton, L., McNicholas, S., Krissinel, E., Gruber, J., Cowtan, K., Emsley, P., Murshudov, G. N., Cohen, S., Perrakis, A. & Noble, M. (2004). Acta Cryst. D60, 2288-2294.
  - Coot: Brown, A., Long, F., Nicholls, R. A., Toots, J., Emsley, P. & Murshudov, G. (2015). Acta Cryst. D71, 136-153.
- Chimera: Goddard, T. D., Huang, C. C., Meng, E. C., Pettersen, E. F., Couch, G. S., Morris, J. H. & Ferrin, T. E. (2018). Protein Sci. 27, 14-25.
- Jmol/JSmol: Hanson, R. M. (2010). J. Appl. Cryst. 43, 1250-1260; Hanson, R. M., Prilusky, J., Renjian, Z., Nakane, T. & Sussman, J. L. (2013). Isr. J. Chem. 53, 207-216.
- LiteMole: Sehnal, D., Deshpande, M., Vařeková, R. S., Mir, S., Berka, K., Midlik, A., Pravda, L., Velankar, S. & Koča, J. (2017). Nat. Methods, 14, 1121-1122.
- Molmil: Bekker, G. J., Nakamura, H. & Kinjo, A. R. (2016). J. Cheminform, 8, 42, 1-5.
- NGL: Rose, A. S., Bradley, A. R., Valasatava, Y., Duarte, J. M., Prlić, A. & Rose, P. W. (2018). Bioinformatics, bty419.
- OpenRasMol: Bernstein, H. J. (2000). Trends Biochem. Sci. 25, 453-455.
- PyMOL: DeLano, W. (2002). The PyMOL Molecular Graphics System.
- VMD: Humphrey, W., Dalke, A. & Schulten, K. (1996). J. Mol. Graph. 14, 33-38.
Helpful Links
- PDBx/mmCIF Dictionary Resources
- PDB to PDBx/mmCIF Data Item Correspondences
- Glossary
- Chemical Component Dictionary
- PDBx/mmCIF Dictionary Search
- PDBx/mmCIF General FAQ
- PDBx/mmCIF Syntax
- Data Items Describing Molecular Entities
- Data Items Describing Atomic Positions
- Article Announcing mandatory submission of PDBx/mmCIF format files for crystallographic depositions to the Protein Data Bank (PDB)
- PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology (parser information included)