Introduction to RCSB PDB APIs

RCSB PDB APIs build on the data available in the PDB archive and on additional internal and external annotations. They power the RCSB.org website. Because they are publicly available, they can also be used without restrictions by external resources.

This tutorial is intended for RCSB PDB users who are familiar with PDB data but new to programming, and who are looking for a starting place to perform complicated or repetitive searches and/or searches with large result sets. More detailed user guide documentation is available, with links to specific API tutorials and references for more advanced users and developers.

An API (Application Programming Interface) contains data or features that other applications can access and implement.

Most web APIs, including the RCSB PDB APIs, exchange data in JavaScript Object Notation, or JSON. An API query is constructed as a series of key-value pairs, as shown below with examples of key-value pairs from PDB data. What JSON defines as a ‘key’ will be referred to as an ‘attribute’ in this tutorial.

The JSON format is constructed as a series of key-value pairs. Examples of key-value pairs from RCSB API are shown on the left.
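The key-value structure can also be explored in a few lines of Python. The sketch below is only an illustration: the attribute names follow the PDBx/mmCIF naming used elsewhere in this tutorial, the values are placeholders rather than data retrieved from the archive, and list_attributes is a hypothetical helper written for this example.

```python
import json

# Illustrative JSON text. Attribute names follow PDBx/mmCIF conventions;
# the values are placeholders for this sketch, not retrieved from the archive.
record_text = """
{
  "rcsb_id": "4HHB",
  "exptl": [{"method": "X-RAY DIFFRACTION"}],
  "rcsb_accession_info": {"major_revision": 1}
}
"""

record = json.loads(record_text)


def list_attributes(obj, prefix=""):
    """Flatten nested key-value pairs into (dotted attribute name, value) tuples."""
    pairs = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            pairs.extend(list_attributes(value, name))
    elif isinstance(obj, list):
        # A list holds repeated values for the same attribute.
        for item in obj:
            pairs.extend(list_attributes(item, prefix))
    else:
        pairs.append((prefix, obj))
    return pairs


attributes = list_attributes(record)
for name, value in attributes:
    print(f"{name} = {value}")
```

Each printed line pairs one attribute (the JSON key, written with dots for nested objects) with its value.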

The RCSB.org website is powered by APIs (also known as web services; see User Guide). RCSB PDB exposes these APIs to the public, so that they can be used in custom software or web-based applications by anyone around the world.

There are two main APIs:

Data API contains static data for each structure. The information seen on the Structure Summary Pages and Custom Reports is derived from the Data API (tutorial).
Search API combines attributes in a query and then retrieves the PDB IDs that fulfill these criteria (tutorial).

Search API powers the search features provided on the RCSB.org website. After finding a group of entries, the results are displayed in various layouts using information from the data API.

Additional APIs include the sequence coordinate service (for alignments between structural and sequence databases and protein positional features) and the volume server (for subsets of volumetric data).

The Advanced Search GUI on RCSB.org is powered by the Search API. The search attributes are parsed into a JSON object that returns the list of PDB IDs that fulfill the search criteria. The search results are then displayed with customizable layouts that pull additional information from the Data API.

Why APIs?

The advantages of using APIs include:

APIs follow a well-defined data model and provide a standardized way to access data
APIs offer flexibility in data retrieval by allowing developers to specify the exact data they need
APIs allow different software systems and applications to interoperate with PDB data, making it easier for developers to build complex data pipelines, extend their services, and create new applications

The PDB archive contains a vast amount of information about biomolecular structures. In order for the PDB to be a useful tool for research, it is important that there be a way to find and retrieve specific information quickly and easily. It is also important to be able to search the archive in the context of other life-science data that is available through external annotations.

The RCSB PDB APIs are designed to efficiently retrieve and deliver data. APIs employ data retrieval mechanisms, such as pagination and caching, to enhance performance and ensure efficient data processing even with large volumes of data.

For example, a search for all entries from a human source determined by X-ray crystallography returns over 50,000 entries. Manually reviewing all the information in any particular data field would be overwhelming. However, the APIs facilitate the automation of data processing. Developers can build workflows or scripts that utilize APIs to retrieve, process, and store a large number of entries automatically.
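As a rough sketch of how such a search could be automated, the Python below builds a Search API query for human-source X-ray entries and splits the retrieval into pages. Several details are assumptions not stated in this tutorial: the endpoint URL, the attribute names, the layout of the request and response JSON, and the helper names (terminal, build_query, page_requests, extract_ids). Check them against the Search API documentation before relying on them.

```python
# Assumed Search API endpoint (not given in this tutorial).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def terminal(attribute, value, operator="exact_match"):
    """One search condition on a single attribute."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(start=0, rows=100):
    """Query for human-source X-ray entries, returning one page of results."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                # Assumed attribute name for the source organism.
                terminal("rcsb_entity_source_organism.taxonomy_lineage.name", "Homo sapiens"),
                terminal("exptl.method", "X-RAY DIFFRACTION"),
            ],
        },
        "return_type": "entry",
        "request_options": {"paginate": {"start": start, "rows": rows}},
    }


def page_requests(total, rows=100):
    """Yield one query per page until `total` results are covered."""
    for start in range(0, total, rows):
        yield build_query(start, rows)


def extract_ids(response):
    """Pull PDB IDs out of a response shaped like the assumed result format."""
    return [hit["identifier"] for hit in response.get("result_set", [])]


# Hand-written response in the assumed format, used only to show the parsing step.
sample_response = {"total_count": 2, "result_set": [{"identifier": "4HHB"}, {"identifier": "1CRN"}]}
print(extract_ids(sample_response))
print(len(list(page_requests(total=250, rows=100))), "pages needed for 250 results")
```

Each query produced by page_requests would be sent to the server in turn, so a result set of any size can be processed in manageable pieces.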

The data available in the PDB archive follows the PDBx/mmCIF dictionary data schema with attributes defining the 3D coordinates of each atom comprising the structure and additional data categories about the structure, for example, the molecular weight, the experimental method, the expression organism, or the details of the data collection and structure determination experiment.

The RCSB PDB Data API follows the PDBx/mmCIF dictionary. The PDBx/mmCIF data model is highly structured, defining a rich collection of biological, molecular, chemical, and structural data, as well as data quality features. The Data API does not include raw atomic-level coordinates (the Model Server API provides access to atomic-level coordinate data; RCSB.org also hosts a tool to help download multiple data files in batches). The Data API includes only a subset of available primary (PDBx/mmCIF) data content that is well-populated across the entire archive. In addition, various internal and external annotations are integrated into the API for exploring the PDB archive in the greater context of biological information.

In the Data API, PDBx/mmCIF data is organized to reflect the underlying macromolecular structure hierarchy progressing from atoms through amino acids and chains to assemblies of interacting macromolecules and ligands. Features describing a particular level in the macromolecular hierarchy are grouped into core objects.

Data API core objects with examples of specific attributes listed.

Attributes exposed by the Data API are available as a list divided by core objects. Any attribute that is not part of the public PDBx/mmCIF dictionary is preceded by the rcsb_ prefix, e.g., rcsb_accession_info.major_revision, indicating that this data is computed or integrated by the RCSB PDB.

Search API operates on the many attributes of the Data API, which allows for extremely granular queries.

Search-relevant attributes from mmCIF files are parsed into the Data API. The Search API uses these attributes, together with attributes from external annotations and the search parameters, to find corresponding structures in the PDB archive.

Search API attributes for macromolecular structures and for small molecules are provided together with the API documentation.

Accessing Data API through REST services and GraphQL

Two different interfaces can be used to retrieve data using the RCSB PDB Data API.

The REST API permits the retrieval of all data for one core object at a time.
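A minimal sketch of the REST pattern follows, assuming the base URL and path layout shown in the code (neither is given in this tutorial). The function names core_object_url and fetch_core_object are hypothetical helpers for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for Data API core objects.
DATA_REST_BASE = "https://data.rcsb.org/rest/v1/core"


def core_object_url(object_type, entry_id, sub_id=None):
    """Build the URL for one core object, e.g. an entry or one of its entities."""
    parts = [DATA_REST_BASE, object_type, quote(entry_id)]
    if sub_id is not None:
        # Entities and assemblies are addressed by entry ID plus their own ID.
        parts.append(quote(str(sub_id)))
    return "/".join(parts)


def fetch_core_object(url):
    """Download and parse one core object (network call; not executed here)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


print(core_object_url("entry", "4HHB"))
print(core_object_url("polymer_entity", "4HHB", 1))
print(core_object_url("assembly", "4HHB", 1))
```

Because each URL addresses exactly one core object, collecting data across several levels of the hierarchy with this interface takes one request per object.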

The GraphQL interface offers more flexible data retrieval, essentially making it possible to grab any piece of data from any level of the hierarchy in a single query. Additional information on data organization in the GraphQL API is available.
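To illustrate the idea of reaching several levels of the hierarchy in one query, the sketch below prepares a GraphQL request that asks for entry-level and polymer-entity-level fields together. The endpoint URL and the field names in the query are assumptions and should be verified in the GraphiQL editor; build_graphql_request is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Assumed GraphQL endpoint for the Data API.
GRAPHQL_URL = "https://data.rcsb.org/graphql"


def build_graphql_request(entry_ids):
    """Prepare (but do not send) a POST request carrying a GraphQL query."""
    # json.dumps renders the ID list as ["4HHB", "1CRN"], which is also a
    # valid GraphQL list literal for plain alphanumeric IDs.
    id_list = json.dumps(list(entry_ids))
    query = f"""
    {{
      entries(entry_ids: {id_list}) {{
        rcsb_id
        struct {{ title }}
        exptl {{ method }}
        polymer_entities {{
          rcsb_id
          rcsb_polymer_entity {{ pdbx_description }}
        }}
      }}
    }}
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(["4HHB", "1CRN"])
print(request.get_method(), request.full_url)
print(json.loads(request.data)["query"])
```

Passing the prepared request to urllib.request.urlopen would submit it. The response is expected to mirror the shape of the query, with only the requested fields included.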

More detailed information about accessing the Data API is available.

PDB holdings data provides listings of all current, unreleased, and removed PDB IDs in JSON format, making it possible to access data from the entire archive.
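One way the holdings listings might be used is to download the full list of current IDs and then process it in batches. In the sketch below, the holdings URL pattern and the assumption that the response is a plain JSON array of ID strings are not taken from this tutorial, and holdings_url, fetch_entry_ids, and batches are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumed URL pattern for the holdings listings.
HOLDINGS_BASE = "https://data.rcsb.org/rest/v1/holdings"
STATUSES = ("current", "unreleased", "removed")


def holdings_url(status):
    """URL of the ID listing for one holdings category."""
    if status not in STATUSES:
        raise ValueError(f"unknown holdings status: {status!r}")
    return f"{HOLDINGS_BASE}/{status}/entry_ids"


def fetch_entry_ids(status):
    """Download the ID list (network call; not executed here)."""
    with urlopen(holdings_url(status), timeout=60) as response:
        return json.load(response)


def batches(ids, size):
    """Split a long ID list into chunks suitable for follow-up queries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs standing in for a downloaded listing.
example_ids = ["4HHB", "1CRN", "2PGH"]
for chunk in batches(example_ids, size=2):
    print(chunk)
```

Batching keeps each follow-up Data API request small, which matters when the list covers the entire archive.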

Accessing RCSB.org APIs with the Python API Client

The rcsb-api package provides a Python interface to the RCSB PDB Search and Data APIs (an overview has been published in the Journal of Molecular Biology). Use the rcsbapi.search module to fetch lists of PDB IDs corresponding to advanced query searches, and the rcsbapi.data module to fetch data about a given set of structure IDs. RCSB PDB maintains the current version of this package on GitHub and will continue to develop it to add support for other RCSB.org APIs.

Website Access to the API

Query editors are provided for both the Data API and the Search API, allowing users to build and try different queries in a graphical and interactive way.

When a search is performed using the Advanced Search on the RCSB PDB website, the query editor can be accessed by selecting the “JSON” button at the top right:

Advanced Search Query Builder with the JSON button circled in red

Selecting the JSON button launches the query editor as shown below. You can modify the query, copy it as a JSON object, and use it to build API requests. The GET and POST HTTP (Hypertext Transfer Protocol) methods provide communication between a client (such as a web browser or software program) and a Search API server. A JSON query can be used as the URL parameter in a GET request or in the body of a POST request.

API Search for entries determined by electron microscopy (left); pressing the “play” icon will display the results in JSON (right)
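The two ways of submitting a JSON query described above, as a URL parameter in a GET request or in the body of a POST request, can be sketched in Python as follows. The endpoint URL, the json parameter name, and the query layout are assumptions to be checked against the Search API documentation; as_get and as_post are hypothetical helpers. The requests are only constructed, not sent.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Assumed Search API endpoint.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# Query for entries determined by electron microscopy (assumed attribute name).
em_query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "exptl.method",
            "operator": "exact_match",
            "value": "ELECTRON MICROSCOPY",
        },
    },
    "return_type": "entry",
}


def as_get(query):
    """Carry the JSON query as a URL-encoded 'json' parameter."""
    return Request(f"{SEARCH_URL}?{urlencode({'json': json.dumps(query)})}", method="GET")


def as_post(query):
    """Carry the JSON query in the request body."""
    return Request(
        SEARCH_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


get_request = as_get(em_query)
post_request = as_post(em_query)
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

A GET URL is convenient for sharing or bookmarking a query, while POST avoids URL length limits for long queries.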

This query editor can also be opened directly, independently of the Advanced Search page, to create Search API queries and view the results.

Custom Report displaying ID and Structure Author information for a query results set with link to GraphQL Query in red.

Selecting the “View GraphQL Query” link launches the query editor as shown below. Press the “play” icon in the query editor to retrieve the data.

To build a custom report in JSON, the GraphiQL editor displays the request for data (author name and entry ID for a given set of entries, shown left); selecting the “play” button displays the requested data (right).

Similarly, Data API queries can be constructed independently of the website by accessing the editor directly via the GraphiQL web interface.

Note: Currently, the Firefox web browser parses the JSON data for easy exploration. Chrome users may need to use an extension tool for easier data exploration.

Authors

Rachel Kramer Green, Yana Rose, and Maria Voigt