Introduction

This document is still under development and incomplete.

CREDO is a relational database storing all pairwise atomic interactions of inter- as well as intra-molecular contacts between small- and macromolecules found in experimentally-determined structures from the Protein Data Bank (PDB).

Features

Structural Interactomics

CREDO contains the interactions between the atoms of surface-exposed residues of all molecules inside macromolecular structures from the Protein Data Bank (PDB). These molecules include proteins, nucleic acids, carbohydrates as well as small molecules.

Biological assemblies

Only top-ranking stable predictions from PDBe PISA are used to generate biological assemblies. The asymmetric unit (ASU) is used only if no stable prediction can be found or no prediction exists at all.

Structural Interaction Fingerprints (SIFts)

Interactions between atoms are stored as Structural Interaction Fingerprints (SIFts), which were first described by Deng et al. CREDO currently implements 13 different interaction types, such as hydrogen bonds, halogen bonds and carbonyl interactions. A few interaction types in CREDO are visualised below.

Interactions between aromatic rings

Interactions between the aromatic rings of PDB residues are recorded separately and classified into nine different interaction geometries.

Atom-aromatic ring interactions

Interactions between single atoms and aromatic rings can be energetically favourable and are frequently found in protein structures and their complexes. All occurrences of atoms within a certain distance of an aromatic ring's centroid are recorded together with their interaction geometry and are classified where possible. Examples of known atom-aromatic ring interactions are visualised below.
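The geometry of an atom relative to an aromatic ring can be sketched as follows. This is a minimal illustration, not CREDO's implementation: the function name, the choice of the angle between ring normal and centroid-to-atom vector as the geometric descriptor, and the assumption of a planar ring are all assumptions. The distance cutoff and the classification criteria are not given in the text and are therefore left out.

```python
import math

def ring_geometry(ring_atoms, atom):
    """Return (distance, theta) of `atom` relative to a planar ring.

    distance: atom to ring centroid, in the units of the input coordinates.
    theta: angle in degrees between the ring normal and the centroid-to-atom
    vector, folded into 0..90 because the sign of the normal is arbitrary.
    Assumes the atom does not sit exactly on the centroid.
    """
    n = len(ring_atoms)
    centroid = tuple(sum(a[i] for a in ring_atoms) / n for i in range(3))
    # Ring normal from the cross product of two centroid-to-ring-atom vectors.
    u = tuple(ring_atoms[0][i] - centroid[i] for i in range(3))
    v = tuple(ring_atoms[1][i] - centroid[i] for i in range(3))
    normal = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    w = tuple(atom[i] - centroid[i] for i in range(3))
    distance = math.sqrt(sum(c * c for c in w))
    norm_n = math.sqrt(sum(c * c for c in normal))
    cos_theta = abs(sum(a * b for a, b in zip(normal, w))) / (norm_n * distance)
    theta = math.degrees(math.acos(min(1.0, cos_theta)))
    return distance, theta

# Idealised benzene-like hexagon in the xy plane (radius 1.39 Å).
hexagon = [
    (1.39 * math.cos(math.radians(60 * k)), 1.39 * math.sin(math.radians(60 * k)), 0.0)
    for k in range(6)
]
above = ring_geometry(hexagon, (0.0, 0.0, 3.5))    # atom on the ring axis
in_plane = ring_geometry(hexagon, (4.0, 0.0, 0.0))  # atom in the ring plane
```

A small theta means the atom sits above the ring face, while a theta close to 90 degrees means it approaches the ring edge-on.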

Sequence-to-Structure

All polypeptide residues in CREDO are mapped onto protein sequences from UniProt through a complete sequence-to-structure mapping where possible, using data from the PDBe Structure integration with function, taxonomy and sequence (SIFTS) initiative. This mapping allows the transfer of information from the sequence onto the structure (or vice versa), including cross-references to other databases. For example, information from UniProt is used to identify modified, non-standard or mutated peptides in PDB structures.

Structural Variations

Structural variations from EnsEMBL Variation are mapped onto all protein structures in CREDO through the sequence-to-structure mapping. EnsEMBL Variation contains variation data from the most important sources, including dbSNP, COSMIC and UniProt as well as information about (disease) phenotypes that can be linked to variations occurring in protein structures. This means that phenotypes can be linked directly to ligand binding sites or protein-protein interfaces.
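The idea of carrying variations over to structure residues can be illustrated with a small lookup. This is a sketch under stated assumptions: the dictionary layout, the function name and the toy accession, positions and variant names are all hypothetical and do not reflect CREDO's schema or real data.

```python
def map_variants_to_structure(variants, seq_to_struct, binding_site):
    """Attach each sequence variant to its structure residue, if one is mapped.

    variants: dicts with 'name', 'accession' and 'position' (sequence numbering).
    seq_to_struct: {(accession, position): (chain, residue_number)}.
    binding_site: set of (chain, residue_number) lining a ligand binding site.
    Returns tuples of (variant name, residue, is_in_binding_site).
    """
    mapped = []
    for variant in variants:
        key = (variant["accession"], variant["position"])
        residue = seq_to_struct.get(key)
        if residue is None:
            continue  # position has no counterpart in the structure
        mapped.append((variant["name"], residue, residue in binding_site))
    return mapped

# Hypothetical toy data, for illustration only.
seq_to_struct = {("X00001", 10): ("A", 10), ("X00001", 11): ("A", 11)}
variants = [
    {"name": "G10V", "accession": "X00001", "position": 10},
    {"name": "R50W", "accession": "X00001", "position": 50},
]
binding_site = {("A", 10)}
hits = map_variants_to_structure(variants, seq_to_struct, binding_site)
```

Variants at positions without a mapped residue are skipped here; a real pipeline might instead report them as unresolved.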

Examples found in the PDB

Crystal structure of T3-bound thyroid hormone receptor (PDB entry: 3GWS) with structural variations highlighted in magenta. The G345V polymorphism (dbSNP: rs28999970) causes inappropriately elevated thyroid-stimulating hormone (TSH) levels, which ultimately leads to generalized thyroid hormone resistance (GTHR), as demonstrated in functional assays by Parrilla et al. (1991). Interestingly, this mutation is in the ligand binding site, where the main chain of the wild-type glycine forms strong interactions with an iodine atom of T3. This polymorphism is also part of the SNPedia collection.

Human phytanoyl-CoA dioxygenase in complex with iron and 2-oxoglutarate (PDB entry: 2AIX). This enzyme is required for alpha-oxidation of certain branched fatty acids in peroxisomes, and its deficiency is the major cause of Refsum's disease, which results in the malformation of myelin sheaths around nerve cells. Highlighted in magenta are the polymorphisms affecting binding site-lining residues, particularly R275W (rs28939671). Only ionic interactions are shown in this representation. The substitution is also defined in SNPedia as a cause of Refsum's disease.

C-KIT tyrosine kinase in complex with Imatinib (PDB entry: 1T46) with the position of the T670I Imatinib-resistant mutation highlighted.

Small Molecules & Cheminformatics

Interactions between proteins and ligands play an important role in CREDO.

Descriptors

Physico-chemical properties

Conformation-independent physicochemical descriptors are calculated for all chemical components in the chemical component dictionary of the PDB. These descriptors are important to evaluate drug-likeness and filter molecules. All properties are calculated using the OEChem and OEMolProp toolkits from OpenEye.

Boolean descriptors

Cheminformatics

A variety of cheminformatics methods are supported in CREDO through two PostgreSQL extensions (cartridges): one from RDKit and an internal extension based on the OpenEye C++ toolkits. Structural queries are supported in the form of sub-/superstructure searches as well as SMARTS patterns. For example, you can fetch all compounds that contain 4-(3-pyridyl)pyrimidine by accessing this resource.

2D Similarity methods

Chemical components can also be retrieved by topological similarity to a query structure. The RDKit cartridge supports circular, atom-pair and torsion fingerprints. The OpenEye cartridge supports circular, path, tree and MACCS166 fingerprints. The links will return a page containing the results of a similarity search for each method with Imatinib as query.
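A topological similarity search boils down to comparing fingerprint bit sets. The sketch below assumes the Tanimoto coefficient, a common choice for this purpose; the text does not state which coefficient CREDO uses. The function names, the threshold and the toy bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def rank_by_similarity(query_fp, library, threshold=0.5):
    """Return (name, similarity) pairs at or above threshold, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical toy fingerprints (sets of on-bit indices).
library = {
    "cmpd_a": {1, 2, 3, 4},
    "cmpd_b": {2, 3, 4},
    "cmpd_c": {7, 8, 9},
}
ranked = rank_by_similarity({1, 2, 3}, library)
```

In a database cartridge the same comparison is usually backed by an index, so that not every fingerprint has to be scored against the query.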

3D Shape-similarity methods

3D similarity searching is supported through Ultrafast Shape Recognition With CREDO Atom Types (USRCAT), an extension of the USR method that includes pharmacophoric information. An example with Imatinib as query can be seen here.
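The underlying USR idea can be sketched in a few lines: a conformer is described by moments of its atomic distance distributions from four reference points, and two conformers are compared through the distance between their descriptor vectors. The code below covers plain USR only and omits the atom-type (pharmacophoric) part that USRCAT adds. The exact moment normalisation differs between implementations, so the one used here is an assumption, as are the function names and toy coordinates.

```python
import math

def _moments(values):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    second = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, math.sqrt(second), math.copysign(abs(third) ** (1.0 / 3.0), third)]

def usr_descriptor(coords):
    """12-element shape descriptor from four reference points."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    cst = min(coords, key=lambda c: math.dist(c, ctd))  # atom closest to centroid
    fct = max(coords, key=lambda c: math.dist(c, ctd))  # atom farthest from centroid
    ftf = max(coords, key=lambda c: math.dist(c, fct))  # atom farthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptor

def usr_similarity(desc_a, desc_b):
    """Similarity in (0, 1]; 1.0 means identical descriptors."""
    diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b))
    return 1.0 / (1.0 + diff / len(desc_a))

# Hypothetical toy coordinates (Å); the second set is a translated copy.
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.2, 0.0), (3.4, 1.0, 0.8), (4.0, 2.5, 1.1)]
shifted = [(x + 10.0, y - 5.0, z + 3.0) for x, y, z in mol]
stretched = [(2.0 * x, y, z) for x, y, z in mol]
```

Because only internal distances enter the descriptor, no structural alignment is needed before comparing two molecules, which is what makes the method fast.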

RECAP-fragmentation of PDB chemical components

Chemical components found in the PDB are fragmented using a modified version of the RECAP algorithm with additional rules for natural products. An example fragmentation is shown below for chemical component J07.



CREDO uses an alternative fragmentation strategy rather than the one-pass approach of the original method. Chemical components are fragmented exhaustively, i.e. every cleavage rule is applied to every fragment; hence the number of resulting fragments is much larger. The fragmentation result is stored in the database as a hierarchy.
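The difference between exhaustive and one-pass fragmentation can be shown with a deliberately simplified toy model. Here a "molecule" is a string of building blocks joined by single-character bond symbols, and each cleavage rule is one such symbol. This involves no real chemistry and is not the RECAP rule set; the function name and representation are hypothetical.

```python
def fragment_exhaustively(molecule, rules):
    """Apply every cleavage rule to every fragment and record the hierarchy.

    molecule: string such as "A-B~C" (blocks joined by bond symbols).
    rules: iterable of bond symbols that may be cleaved.
    Returns (fragments, edges), where edges holds (parent, child) pairs.
    """
    fragments = set()
    edges = set()
    stack = [molecule]
    while stack:
        parent = stack.pop()
        if parent in fragments:
            continue  # already processed via another cleavage path
        fragments.add(parent)
        for bond in rules:
            for i, symbol in enumerate(parent):
                if symbol != bond:
                    continue
                # Cleave one bond at a time, keeping both halves as children.
                for child in (parent[:i], parent[i + 1:]):
                    edges.add((parent, child))
                    stack.append(child)
    return fragments, edges

fragments, edges = fragment_exhaustively("A-B~C", "-~")
```

In this toy case a one-pass cleavage of all bonds would yield only the three blocks, whereas the exhaustive variant also keeps the intermediate fragments and the parent-child links between them.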

Data Validation

One of the design decisions made for CREDO was to keep as much data from the PDB as possible. Different users have different needs, and a low-resolution structure that is of no interest to drug discovery scientists might be important to others. Therefore, data in CREDO is annotated with additional information that can be used to assess the quality of a macromolecular complex.

Disordered regions

It is fairly common in protein X-ray crystallography that an electron density map does not reveal the position of all atoms, residues or even secondary structure elements. Several factors can contribute to this: experimental procedures or, more interestingly, dynamic properties of the protein itself. The latter is called intrinsic protein disorder and occurs if parts of the protein structure exist as dynamic ensembles with significantly different atomic coordinates, making these regions very difficult to solve. Missing regions are identified in all protein structures in the PDB.

Visualisation of disordered regions in PDB entry 2P33. The chain breaks caused by missing residues are connected with a dashed red line to indicate the missing sequence.
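Finding the chain breaks described above amounts to locating positions of the expected sequence range that have no observed residue. The sketch below assumes that observed residue numbers and the expected first and last positions are already available in a common numbering; how CREDO derives them is not stated in the text, and the function name is hypothetical.

```python
def find_missing_regions(observed, first, last):
    """Return inclusive (start, end) ranges in first..last that are not observed."""
    present = set(observed)
    gaps = []
    start = None
    for position in range(first, last + 1):
        if position not in present:
            if start is None:
                start = position  # a new gap opens here
        elif start is not None:
            gaps.append((start, position - 1))  # gap closed by an observed residue
            start = None
    if start is not None:
        gaps.append((start, last))  # gap runs to the end of the range
    return gaps

gaps = find_missing_regions([1, 2, 3, 7, 8, 12], first=1, last=14)
```

Each returned range corresponds to one dashed connection in a visualisation such as the one described above, or to a missing terminus.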

Diffraction-component precision index

The quality of a protein crystal structure is commonly assessed by both nominal resolution and `R_{free}`. The resolution of a crystal structure is merely a quantitative measure of the coalesced data, not an indicator of how well the fitted model agrees with the experimental data. The `R_{free}` value measures the agreement between observed and calculated structure factor amplitudes for a test set of reflections that is omitted during the refinement process. Hence, it is a good indicator of model quality that distinguishes well-fitted models from poorly fitted ones.

An indicator of structure quality that takes `R_{free}` into account and does not require an electron density map for its calculation is the diffraction-component precision index (DPI), which was introduced by Cruickshank to estimate the uncertainty of atomic coordinates obtained by structural refinement of protein diffraction data. The use of the DPI as a metric to assess structure quality was introduced to the virtual screening community by Goto et al., whose formula to calculate the DPI is shown below:

$$\sigma(r,B_{avg})=2.2N_{atoms}^{1/2}V_{a}^{1/3}N_{obs}^{-5/6}R_{free}$$

`N_{atoms}` is the number of atoms in the unit cell, `V_{a}` its volume and `N_{obs}` the number of unique crystallographic reflections. Blow also rearranged the formula to display the relationship between nominal resolution and atom coordinate precision (here with the coefficient of Goto et al.):

$$\sigma(r,B_{avg})=0.22(1+s)^{1/2}V_{m}^{-1/2}C^{-5/6}R_{free}d_{min}^{5/2}$$

In this arrangement `s` is the percent solvent present in the crystal, `V_{m}` the asymmetric unit volume to molecular weight ratio, `C` the completeness of the data and `d_{min}` the nominal resolution. With the help of this formula it is possible to calculate a theoretical minimum DPI value, i.e. the lowest attainable uncertainty of the atomic coordinates for a given structure. For this purpose, the solvent content `s` was set to zero, the completeness of the data `C` to 1.0 (100%) and `V_{m}` was assumed to be 2.4 Å³/Da.

Info: The DPI and the theoretical minimum DPI are calculated for all structures in CREDO if the required structure factors are available.
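The rearranged formula translates directly into code. The following sketch evaluates it with the defaults stated above for the theoretical minimum (`s` = 0, `C` = 1.0, `V_{m}` = 2.4); the function and parameter names are hypothetical, and the input values in the example are made up for illustration.

```python
def dpi_from_resolution(r_free, d_min, solvent=0.0, v_m=2.4, completeness=1.0):
    """DPI from the rearranged formula with the 0.22 coefficient.

    r_free: free R value as a fraction (e.g. 0.25).
    d_min: nominal resolution in Å.
    solvent: the solvent term s of the formula.
    v_m: asymmetric unit volume to molecular weight ratio.
    completeness: data completeness C as a fraction (1.0 = 100%).
    """
    return (
        0.22
        * (1.0 + solvent) ** 0.5
        * v_m ** -0.5
        * completeness ** (-5.0 / 6.0)
        * r_free
        * d_min ** 2.5
    )

# Theoretical minimum: the defaults correspond to s = 0, C = 1.0, V_m = 2.4.
minimum = dpi_from_resolution(r_free=0.25, d_min=2.0)
# Lower completeness increases the estimated coordinate uncertainty.
incomplete = dpi_from_resolution(r_free=0.25, d_min=2.0, completeness=0.8)
```

The steep `d_{min}^{5/2}` dependence means that the estimated coordinate uncertainty grows quickly as the resolution worsens, even at constant `R_{free}`.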

Other methods used in CREDO

Besides the DPI, a number of other validation methods are used. The chemical component dictionary allows experimental coordinates to be compared with ideal ones, which makes it possible to identify incomplete residues. As a general rule of thumb, atoms that are not covalently bound to each other according to the PDB connection table are labelled as clashing if they are found within 1.2 Å of each other.
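The clash rule of thumb can be sketched as a pairwise distance check that skips bonded pairs. This is an illustration only: the function name and data layout are hypothetical, the cutoff is treated as inclusive by assumption, and the brute-force loop over all pairs would be replaced by a spatial index for real structures.

```python
import math
from itertools import combinations

def find_clashes(coords, bonds, cutoff=1.2):
    """Return index pairs of non-bonded atoms lying within `cutoff` Å.

    coords: list of (x, y, z) tuples.
    bonds: iterable of (i, j) index pairs taken from the connection table.
    """
    bonded = {frozenset(pair) for pair in bonds}
    clashes = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if frozenset((i, j)) in bonded:
            continue  # covalently bound atoms are expected to be close
        if math.dist(a, b) <= cutoff:
            clashes.append((i, j))
    return clashes

# Hypothetical toy coordinates (Å): atoms 0 and 1 are bonded, atoms 1 and 2 clash.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
clashes = find_clashes(coords, bonds=[(0, 1)])
```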