RCSB PDB: Data API Documentation

    RCSB PDB Data API

    The RCSB PDB offers two ways to access data through application programming interfaces (APIs):

    REST-based API

    The REST-based API supports the HTTP GET method to access the PDB data through a set of endpoints (URLs). See the Data Organization section for more information on the underlying data organization.

    GET request

    The path of each endpoint starts with https://data.rcsb.org/rest/v1/core, followed by the type of the resource (e.g. entry, polymer_entity) and the identifier. Note that the parts of compound identifiers, such as entity ID, assembly ID, and entity instance ID (or chain ID), are passed as path parameters.

    Example request endpoints:
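
    A minimal sketch of how such endpoint paths can be assembled from the base path, the resource type, and the identifier parts. The helper name rest_url is hypothetical, and the identifiers used (entry 4HHB, entity 1, assembly 1, chain A) are illustrative values that do not come from the text above.

```python
from urllib.parse import quote

BASE_URL = "https://data.rcsb.org/rest/v1/core"


def rest_url(resource, *identifiers):
    """Join the resource type and the identifier parts into an endpoint URL."""
    parts = [resource, *identifiers]
    # Each part becomes one path segment, so escape anything unsafe in a path.
    return "/".join([BASE_URL] + [quote(str(part), safe="") for part in parts])


# Illustrative identifiers (assumed, not taken from the original text).
print(rest_url("entry", "4HHB"))
print(rest_url("polymer_entity", "4HHB", "1"))
print(rest_url("assembly", "4HHB", "1"))
print(rest_url("polymer_entity_instance", "4HHB", "A"))
```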

    Response

    For any given request, if the data is found on the server, the API will return HTTP response code 200 (OK) along with the response body in JSON format. For more information on the response schema, see the REST-API documentation or refer to the Data Schema section of this tutorial.

    If the data is NOT found on the server (e.g. https://data.rcsb.org/rest/v1/core/entry/xxxx) or the requested endpoint does not exist (e.g. https://data.rcsb.org/rest/v1/core/foo), the API will return HTTP response code 404 (Not Found).
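
    The two outcomes described above (200 with a JSON body, 404 otherwise) could be handled on the client side roughly as follows. This is a sketch using only the Python standard library. The function name fetch_rest and its opener parameter are assumptions made for illustration, the latter so the function can be exercised without network access.

```python
import json
import urllib.error
import urllib.request


def fetch_rest(url, opener=urllib.request.urlopen):
    """Return the parsed JSON body on 200 (OK), or None on 404 (Not Found)."""
    try:
        with opener(url) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # data or endpoint was not found on the server
        raise  # any other status is not covered by this sketch


# Example call (requires network access, so it is left commented out):
# entry = fetch_rest("https://data.rcsb.org/rest/v1/core/entry/4HHB")
```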

    GraphQL-based API

    The GraphQL server operates on a single URL/endpoint, https://data.rcsb.org/graphql, and all GraphQL requests for this service should be directed at this endpoint. The GraphQL HTTP server handles the HTTP GET and POST methods.

    GET request

    If the "query" is passed in the URL as a query parameter, the request will be parsed and handled as an HTTP GET request. For example, to execute the following GraphQL query:

    This query string should be sent via an HTTP GET request like so:
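
    As a hedged illustration, the sketch below URL-encodes a GraphQL query into the query parameter of a GET request. The query itself (experimental method for entry 4HHB) and the helper name graphql_get_url are assumptions chosen for the example.

```python
from urllib.parse import urlencode

GRAPHQL_URL = "https://data.rcsb.org/graphql"

# Assumed example query: the arguments are written inside the query string.
QUERY = '{ entry(entry_id: "4HHB") { exptl { method } } }'


def graphql_get_url(query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return GRAPHQL_URL + "?" + urlencode({"query": query})


url = graphql_get_url(QUERY)
print(url)
```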

    In the example above, the query arguments are written inside the query string. The query arguments can also be passed as dynamic values that are called variables. The variable definition looks like ($id: String!) in the example below. It lists a variable, prefixed by $, followed by its type, in this case String (! indicates that a non-null argument is required).

    The following is equivalent to the previous query:

    With the variable defined like so:

    Query variables, in this case, should be sent as a URL-encoded string in an additional query parameter called variables.
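
    A sketch of the same idea with variables: the query declares ($id: String!) and the value travels separately as a JSON object, URL-encoded in the variables parameter. The entry ID 4HHB, the selected fields, and the helper name are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

GRAPHQL_URL = "https://data.rcsb.org/graphql"

# The variable $id is declared with its type and then used as an argument.
QUERY = "query($id: String!) { entry(entry_id: $id) { exptl { method } } }"
VARIABLES = {"id": "4HHB"}  # illustrative value


def graphql_get_url(query, variables):
    """Encode the query and its JSON-serialized variables as URL parameters."""
    params = {"query": query, "variables": json.dumps(variables)}
    return GRAPHQL_URL + "?" + urlencode(params)


url = graphql_get_url(QUERY, VARIABLES)
print(url)
```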

    POST request

    The GraphQL server accepts POST requests with a JSON-encoded body. A valid GraphQL POST request should use the application/json content type, must include query, and may include variables. Here's an example of a valid body for a POST request:
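
    One way to assemble such a request with the Python standard library is sketched below. The function name graphql_post_request is hypothetical and the query is an assumed example. The request object is only built here, not sent.

```python
import json
import urllib.request

GRAPHQL_URL = "https://data.rcsb.org/graphql"


def graphql_post_request(query, variables=None):
    """Build a POST request whose JSON body holds 'query' and optional 'variables'."""
    body = {"query": query}
    if variables is not None:
        body["variables"] = variables
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = graphql_post_request(
    "query($id: String!) { entry(entry_id: $id) { exptl { method } } }",
    {"id": "4HHB"},  # illustrative value
)
# To send it (requires network access): urllib.request.urlopen(request)
```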

    Response

    Regardless of the method by which the query and variables were sent, the response is returned in JSON format. A query might result in some data and some errors. The successful response will be returned in the form of:
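
    Under the GraphQL specification, a response is a JSON object with a "data" member and, when something went wrong, an "errors" member. The sketch below separates the two. The sample payload is made up for illustration and is not actual server output.

```python
import json


def split_response(text):
    """Parse a GraphQL response body into (data, errors)."""
    payload = json.loads(text)
    # "errors" is only present when at least one error occurred.
    return payload.get("data"), payload.get("errors", [])


# Made-up sample of a successful response body.
sample = '{"data": {"entry": {"exptl": [{"method": "X-RAY DIFFRACTION"}]}}}'
data, errors = split_response(sample)
print(data["entry"]["exptl"][0]["method"], errors)
```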

    Error Handling

    Error handling in REST is straightforward: check the HTTP status code of the response. The status code (200 or 404) tells you whether the request succeeded and, if not, what went wrong. The GraphQL server, on the other hand, will always respond with a 200 OK status code. When an error occurs while processing a GraphQL query, the complete error message is sent to the client in the response body. Below is a sample of a typical GraphQL error message returned when requesting a field that is not defined in the GraphQL schema:
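
    Because the status code alone does not reveal a GraphQL failure, a client has to inspect the response body. The following sketch raises an exception when an "errors" member is present. The class and function names are hypothetical, and the error payload is an illustrative stand-in rather than the server's actual wording.

```python
class GraphQLError(Exception):
    """Raised when a GraphQL response carries an 'errors' member."""


def unwrap(payload):
    """Return payload['data'], or raise GraphQLError listing the error messages."""
    errors = payload.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise GraphQLError(messages)
    return payload.get("data")


# Illustrative error payload (assumed shape, following the GraphQL specification).
ILLUSTRATIVE_ERROR = {
    "errors": [
        {"message": "Field 'foo' is undefined", "locations": [{"line": 1, "column": 3}]}
    ]
}

try:
    unwrap(ILLUSTRATIVE_ERROR)
except GraphQLError as exc:
    print("query failed:", exc)
```

    Note that this sketch discards any partial data that accompanies the errors. A client that wants partial results would need to return both.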

    Using GraphQL vs REST API

    The REST API offers a simple way to fetch data and returns a fixed data structure. If you need the full set of fields for a given object in the macromolecular data hierarchy, the REST API may be a better fit. GraphQL enables declarative data fetching and lets you request exactly the data that is needed. A GraphQL query also allows you to traverse the entire hierarchy of the macromolecular data in a single request. With the REST API, by contrast, multiple round trips are needed to fetch data from different levels of the hierarchy.

    No matter which method is used, the data returned by the REST API and the GraphQL query will be identical as they query the same source.

    Data Organization

    Biological molecules have a natural structural hierarchy, building from atoms to residues to chains to assemblies. The following definitions are relevant to the way the atomic coordinates, experimental data, and metadata are organized for each PDB structure:

    Level | Description
    Entry | Annotations pertaining to a particular PDB structure (entry), designated with a 4-character alphanumeric identifier (PDB ID; e.g., 1Q2W). Annotations include the title of the entry, list of depositors, date of deposition, date of release, experimental details, etc.
    Entity | Annotations describing the distinct (chemically unique) molecules present in PDB entries. Three types of entities are differentiated:
        • polymer_entity - protein (polypeptides), DNA (polydeoxyribonucleotide), and RNA (polyribonucleotide) identified by amino acids and nucleotides covalently linked in the order defined by the polymer sequence.
        • branched_entity - either linear or branched carbohydrates (sugars and oligosaccharides) that are composed of saccharide units covalently linked via one or more glycosidic bonds.
        • nonpolymer_entity - small chemicals (enzyme cofactors, ligands, ions, etc).
    Entity Instance | Entity instances (also referred to as "chains") are distinct copies of entities present in PDB structures. There can be multiple instances of a given entity. Entity instance data contains information that can differ for each instance, for example structural connectivity, secondary structure, validation data, etc. Note that information common to all copies of the same molecule is stored at the entity level. As with entity data, three types of entity instances are differentiated: polymer_entity_instance, branched_entity_instance, nonpolymer_entity_instance.
    Assembly | Annotations describing the structural elements that form a biological assembly (also sometimes referred to as the biological unit), such as the transformations required to generate the biological assembly, the evidence for the assembly, the symmetry of polymeric subunits, etc.
    Chemical Component | Chemical components describe all residues and small molecules found in PDB entries. The annotations at this level include chemical descriptors (SMILES & InChI), chemical formula, systematic chemical names, etc.

    Data Schema

    All data stored in the PDB archive conform to the PDBx/mmCIF data dictionary. This data is augmented with annotations coming from external resources and internally added fields. The RCSB PDB data representation, powered by the JSON Schema language, is connected to the data hierarchy. Such data organisation groups annotations in objects defined as follows:

    Typically, integrated data will be added as additional fields to any of the objects above. Some data, however, overlaps substantially with the source data in terms of content. Such data appears as a separate object with a dedicated schema, where the original semantics are preserved as much as possible:

    The relationships between these objects are explicitly implemented through attributes in a dedicated container object: rcsb_[...]_container_identifiers, where [...] should be replaced with the type of the object, e.g. entry, polymer_entity, assembly.

    For example, rcsb_entry_container_identifiers contains polymer_entity_ids, branched_entity_ids, non_polymer_entity_ids attributes that hold corresponding entity IDs.
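
    To illustrate how these identifier attributes link an entry to its entities, the sketch below reads them from an entry object and derives the REST endpoints of the related entities. The function name entity_urls is hypothetical, and the sample entry is a made-up fragment rather than real API output.

```python
BASE_URL = "https://data.rcsb.org/rest/v1/core"

# Attribute in rcsb_entry_container_identifiers -> REST resource type.
ID_ATTRIBUTES = {
    "polymer_entity_ids": "polymer_entity",
    "branched_entity_ids": "branched_entity",
    "non_polymer_entity_ids": "nonpolymer_entity",
}


def entity_urls(entry_id, entry):
    """List the REST endpoints of all entities referenced by an entry object."""
    identifiers = entry.get("rcsb_entry_container_identifiers") or {}
    urls = []
    for attribute, resource in ID_ATTRIBUTES.items():
        # An attribute may be absent or null when the entry has no such entities.
        for entity_id in identifiers.get(attribute) or []:
            urls.append(f"{BASE_URL}/{resource}/{entry_id}/{entity_id}")
    return urls


# Made-up fragment of an entry object, for illustration only.
sample_entry = {
    "rcsb_entry_container_identifiers": {
        "polymer_entity_ids": ["1", "2"],
        "non_polymer_entity_ids": ["3"],
    }
}
for url in entity_urls("4HHB", sample_entry):
    print(url)
```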

    GraphQL Schema

    All GraphQL queries are validated and executed against the GraphQL schema. The GraphQL schema contains nodes and edges: nodes are objects that represent the macromolecular data hierarchy, and edges are the relationships between those objects. See Nodes and Edges for more details.

    You can use GraphiQL, which is a "graphical interactive in-browser GraphQL IDE", to explore GraphQL schema. It lets you try different queries, helps with auto completion and built-in validation. The collapsible Docs panel (Documentation Explorer) on the right side of the page allows you to navigate through the schema definitions. Click on the root Query link to start exploring the GraphQL schema.

    Root Queries

    Root queries define entry-points from where you can start traversing the data hierarchy. You can start your query from any object in the hierarchy and visit adjacent objects through bi-directional links (edges) connecting nodes. See Nodes and Edges for more details.

    Root queries have parameters and accept either a single identifier for the requested object (e.g. entry ID, entity ID, etc.) or multiple identifiers supplied as a list. The following example shows how to fetch the experimental method name for multiple PDB entries:
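
    A sketch of building such a multi-entry query programmatically. It assumes a root query named entries that takes an entry_ids list, and the entry IDs shown are illustrative. The helper name is hypothetical.

```python
import json


def entries_method_query(entry_ids):
    """Build a query with the IDs written inline as a GraphQL list literal."""
    # A JSON array of plain strings is also a valid GraphQL list of strings.
    id_list = json.dumps(list(entry_ids))
    return "{ entries(entry_ids: %s) { rcsb_id exptl { method } } }" % id_list


query = entries_method_query(["4HHB", "12CA", "3PQR"])
print(query)
```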

    When requesting data for multiple objects, compound identifiers should follow the format:

    For example:
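
    The sketch below shows one way to compose compound identifiers. The separators are an assumption based on the convention commonly seen with this service (underscore for entities, dot for entity instances, hyphen for assemblies) and should be checked against the official documentation. The function names are hypothetical.

```python
def entity_id(pdb_id, entity):
    """Assumed format: [pdb_id]_[entity_id]."""
    return f"{pdb_id}_{entity}"


def instance_id(pdb_id, asym_id):
    """Assumed format: [pdb_id].[asym_id]."""
    return f"{pdb_id}.{asym_id}"


def assembly_id(pdb_id, assembly):
    """Assumed format: [pdb_id]-[assembly_id]."""
    return f"{pdb_id}-{assembly}"


print(entity_id("4HHB", 1), instance_id("4HHB", "A"), assembly_id("4HHB", 1))
```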

    Nodes and Edges

    A node is an object that holds all fields for a given level in the data hierarchy. Nodes have fields that can be complex objects or scalar values. GraphQL queries are built by specifying fields within fields (also called nested subfields) until only scalars are returned.

    Edges represent connections between nodes. Through edges, the API allows you to traverse the data hierarchy by visiting adjacent data objects, e.g. from entry to polymer_entity, from polymer_entity to polymer_entity_instance, etc. Traversing up the hierarchy is also possible. For example, you can fetch an organism name for a given polymer entity using the polymer_entity root query and, in the same query, fetch the experimental method name, which resides at the entry level, using the entry edge.
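
    The upward traversal described above might look like the query in this sketch, with a small helper that reads both levels from the response. The field names (rcsb_entity_source_organism, ncbi_scientific_name, exptl, method) and the argument names are assumptions about the schema, and the sample response is made up.

```python
# Assumed query: start at a polymer entity and follow the 'entry' edge upward.
QUERY = """
{
  polymer_entity(entry_id: "4HHB", entity_id: "1") {
    rcsb_entity_source_organism { ncbi_scientific_name }
    entry { exptl { method } }
  }
}
"""


def organism_and_method(data):
    """Extract organism names (entity level) and methods (entry level)."""
    entity = data["polymer_entity"]
    organisms = [
        o["ncbi_scientific_name"]
        for o in entity.get("rcsb_entity_source_organism") or []
    ]
    methods = [e["method"] for e in entity["entry"].get("exptl") or []]
    return organisms, methods


# Made-up response data, for illustration only.
sample = {
    "polymer_entity": {
        "rcsb_entity_source_organism": [{"ncbi_scientific_name": "Homo sapiens"}],
        "entry": {"exptl": [{"method": "X-RAY DIFFRACTION"}]},
    }
}
print(organism_and_method(sample))
```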

    Data Attributes

    The RCSB PDB data available through the APIs includes only commonly used annotations, rather than supporting all metadata available in the PDBx/mmCIF data dictionary. Refer to the Data Attributes page for a full list of objects and their attributes.

    Examples

    This section contains additional examples for using the GraphQL-based RCSB PDB Data API.

    Query Entries

    Fetch information about structure title and experimental method for PDB entries:
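
    A possible form of this query is sketched below, together with a helper that flattens the result into rows. The field names (struct.title, exptl.method), the entry IDs, and the sample response are assumptions made for illustration.

```python
# Assumed query: title and experimental method for several entries.
QUERY = """
{
  entries(entry_ids: ["1STP", "2JEF", "1CDG"]) {
    rcsb_id
    struct { title }
    exptl { method }
  }
}
"""


def summarize(data):
    """Return one (id, title, methods) row per entry in the response data."""
    rows = []
    for entry in data["entries"]:
        title = (entry.get("struct") or {}).get("title")
        methods = ", ".join(e["method"] for e in entry.get("exptl") or [])
        rows.append((entry["rcsb_id"], title, methods))
    return rows


# Made-up response data with a placeholder title.
sample = {
    "entries": [
        {
            "rcsb_id": "1STP",
            "struct": {"title": "placeholder title"},
            "exptl": [{"method": "X-RAY DIFFRACTION"}],
        }
    ]
}
print(summarize(sample))
```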

    Query Primary Citation

    Fetch primary citation information (structure authors, PubMed ID, DOI) and release date for PDB entries:
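
    The sketch below shows what such a citation query could look like and how the nested result might be reduced to a flat record. All field names are assumptions about the schema, and the sample values are placeholders, not real citation data.

```python
# Assumed query: authors, citation identifiers, and release date.
QUERY = """
{
  entries(entry_ids: ["1STP", "2JEF", "1CDG"]) {
    rcsb_id
    rcsb_accession_info { initial_release_date }
    audit_author { name }
    rcsb_primary_citation { pdbx_database_id_PubMed pdbx_database_id_DOI }
  }
}
"""


def citation_record(entry):
    """Flatten one entry object from the response into a simple dictionary."""
    citation = entry.get("rcsb_primary_citation") or {}
    accession = entry.get("rcsb_accession_info") or {}
    return {
        "id": entry["rcsb_id"],
        "authors": [a["name"] for a in entry.get("audit_author") or []],
        "pubmed_id": citation.get("pdbx_database_id_PubMed"),
        "doi": citation.get("pdbx_database_id_DOI"),
        "released": accession.get("initial_release_date"),
    }


# Placeholder values, for illustration only.
sample_entry = {
    "rcsb_id": "1STP",
    "rcsb_accession_info": {"initial_release_date": "2000-01-01T00:00:00Z"},
    "audit_author": [{"name": "Author, A."}],
    "rcsb_primary_citation": None,  # an entry may lack a primary citation
}
print(citation_record(sample_entry))
```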

    Query Polymer Entities

    Fetch taxonomy information and information about membership in the sequence clusters for polymer entities:
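
    A hedged sketch of such a query and of reading its result. The root query name polymer_entities, the compound entity IDs, the field names, and the sample numbers are all assumptions made for illustration.

```python
# Assumed query: source organism and sequence cluster membership.
QUERY = """
{
  polymer_entities(entity_ids: ["2CPK_1", "3WHM_1", "2D5Z_1"]) {
    rcsb_id
    rcsb_entity_source_organism { ncbi_taxonomy_id ncbi_scientific_name }
    rcsb_cluster_membership { cluster_id identity }
  }
}
"""


def taxonomy_and_clusters(entity):
    """Return ({taxonomy_id: name}, {identity: cluster_id}) for one entity."""
    taxa = {
        o["ncbi_taxonomy_id"]: o["ncbi_scientific_name"]
        for o in entity.get("rcsb_entity_source_organism") or []
    }
    clusters = {
        m["identity"]: m["cluster_id"]
        for m in entity.get("rcsb_cluster_membership") or []
    }
    return taxa, clusters


# Made-up values, for illustration only.
sample_entity = {
    "rcsb_id": "2CPK_1",
    "rcsb_entity_source_organism": [
        {"ncbi_taxonomy_id": 10090, "ncbi_scientific_name": "Mus musculus"}
    ],
    "rcsb_cluster_membership": [
        {"cluster_id": 12, "identity": 100},
        {"cluster_id": 7, "identity": 30},
    ],
}
print(taxonomy_and_clusters(sample_entity))
```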

    Query Polymer Instances

    Fetch information about the domain assignments for polymer entity instances:

    Note that label_asym_id is used to identify polymer entity instances.
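
    The instance query described above could be written along the lines of this sketch, which also groups the returned annotations by type. The root query name, the instance ID format (PDB ID, dot, label_asym_id), the field names, and the sample annotations are assumptions.

```python
from collections import defaultdict

# Assumed query: annotations (including domain assignments) per instance.
QUERY = """
{
  polymer_entity_instances(instance_ids: ["4HHB.A", "12CA.A", "3PQR.A"]) {
    rcsb_id
    rcsb_polymer_instance_annotation { annotation_id name type }
  }
}
"""


def annotations_by_type(instance):
    """Group the annotation IDs of one instance by annotation type."""
    grouped = defaultdict(list)
    for annotation in instance.get("rcsb_polymer_instance_annotation") or []:
        grouped[annotation["type"]].append(annotation["annotation_id"])
    return dict(grouped)


# Made-up annotations with placeholder IDs and types.
sample_instance = {
    "rcsb_id": "4HHB.A",
    "rcsb_polymer_instance_annotation": [
        {"annotation_id": "id-1", "name": "domain one", "type": "TYPE_A"},
        {"annotation_id": "id-2", "name": "domain two", "type": "TYPE_A"},
        {"annotation_id": "id-3", "name": "domain three", "type": "TYPE_B"},
    ],
}
print(annotations_by_type(sample_instance))
```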

    Query Carbohydrates

    Query branched entities (sugars or oligosaccharides) for commonly used linear descriptors:
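
    A sketch of a descriptor query for branched entities and of collecting the descriptors from the result. The root query name branched_entities, the entity IDs, the field names, and the sample descriptor strings are assumptions, with placeholders standing in for real descriptors.

```python
# Assumed query: branch type and linear descriptors for branched entities.
QUERY = """
{
  branched_entities(entity_ids: ["5FMB_2", "6L63_3"]) {
    rcsb_id
    pdbx_entity_branch { type }
    pdbx_entity_branch_descriptor { type descriptor }
  }
}
"""


def descriptors(entity):
    """Return (descriptor type, descriptor string) pairs for one entity."""
    # A list of pairs is used because several descriptors may share a type.
    return [
        (d["type"], d["descriptor"])
        for d in entity.get("pdbx_entity_branch_descriptor") or []
    ]


# Placeholder descriptor values, for illustration only.
sample_entity = {
    "rcsb_id": "5FMB_2",
    "pdbx_entity_branch": {"type": "oligosaccharide"},
    "pdbx_entity_branch_descriptor": [
        {"type": "FORMAT_A", "descriptor": "placeholder-1"},
        {"type": "FORMAT_B", "descriptor": "placeholder-2"},
    ],
}
print(descriptors(sample_entity))
```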

    Acknowledgements

    To cite this service, please reference:

    Contact Us

    Contact info@rcsb.org with questions or feedback about this service.