CATH Database Downloads
=======================

## What is CATH?

The CATH database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, and continues to be developed by the Orengo group at University College London.

## How is CATH-Gene3D created?

Experimentally-determined protein three-dimensional structures are obtained from the Protein Data Bank and split into their consecutive polypeptide chains, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation.

The domains are then classified within the CATH structural hierarchy:

* at the Class (C) level, domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure;
* at the Architecture (A) level, information on the secondary structure arrangement in three-dimensional space is used for assignment;
* at the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used;
* assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution, i.e. they are homologous.

Additional sequence data for domains with no experimentally determined structures are provided by our sister resource, Gene3D, and are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments.

## Contact us

If you have any comments/suggestions/criticisms, please let us know:

http://www.cathdb.info/support/contact

## CATH Releases

We aim to provide official releases of the CATH classification every 12 months.
This release process is important because it allows us to provide internal validation, extra annotations and analysis. However, it can mean that there is a time delay between new structures appearing in the PDB and the latest official CATH release. To address this issue, CATH-B provides a limited amount of information on the very latest domain annotations (e.g. domain boundaries and superfamily classifications).

The latest release of CATH-Gene3D (v4.1) was released in July 2016 and consists of:

* 500,238 structural protein domain entries
* 151,013,797 non-structural protein domain entries
* 6,631 homologous superfamily entries
* 212,872 functional family entries

All CATH data can be downloaded via the FTP site:

ftp://ftp.biochem.ucl.ac.uk/pub/cath

or via the download section of the CATH website:

http://www.cathdb.info/download

The root directory contains this ```README```, a second README detailing known differences in file content across the different versions (```README-known-version-differences.txt```), and two immediate subdirectories: ```releases``` and ```supplementary-materials```. Data within these directories are described below.

## Accessing data files through the REST API

The individual PDB files for domains, chains and whole proteins can be downloaded through the REST API.
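
As a rough illustration of how such a download might be scripted, the Python sketch below builds a request URL from the `v4_3_0` REST pattern used in this README and fetches the file. It assumes that the domain, chain or PDB identifier is placed directly before the `.pdb` suffix. The names `pdb_url` and `fetch_pdb` are hypothetical helpers and not part of any CATH software.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base of the REST "id" endpoint, copied from the URL pattern in this README.
REST_ID_BASE = "http://www.cathdb.info/version/v4_3_0/api/rest/id"


def pdb_url(identifier: str) -> str:
    """Build the REST URL for one domain, chain or PDB identifier.

    ASSUMPTION: the identifier sits directly before the ".pdb" suffix.
    """
    return f"{REST_ID_BASE}/{quote(identifier, safe='')}.pdb"


def fetch_pdb(identifier: str, timeout: float = 30.0) -> str:
    """Download the PDB text for one identifier (performs a network request)."""
    with urlopen(pdb_url(identifier), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `fetch_pdb("1cukA01")` would request the file for a domain with that identifier; the identifier is used here purely for illustration.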
### Accessing PDB files for a particular domain/chain/PDB id

The files for a particular domain/chain/PDB id can be accessed via the following URLs:

```http://www.cathdb.info/version/v4_3_0/api/rest/id/.pdb```

```http://www.cathdb.info/version/v4_3_0/api/rest/id/.pdb```

```http://www.cathdb.info/version/v4_3_0/api/rest/id/.pdb```

### Accessing PDB files for all domains in CATH

To access the files for all domain ids in all superfamilies, first obtain the entire list of domain ids from the latest official version from the file:

```http://download.cathdb.info/cath/releases/all-releases/latest_release/cath-classification-data/cath-domain-list.txt```

Or, to access the files for all domain ids in a particular superfamily, use:

```http://www.cathdb.info/version/v4_3_0/api/rest/cathtree/from_cath_id_to_depth//```

e.g. to list all domains (i.e. CATH hierarchy depth=9) in the superfamily 1.10.8.10:

```http://www.cathdb.info/version/v4_3_0/api/rest/cathtree/from_cath_id_to_depth/1.10.8.10/9```

For each domain id retrieved, access its PDB file via the following URL:

```http://www.cathdb.info/version/v4_3_0/api/rest/id/.pdb```

## Directory Structure

### ```releases/```

This directory contains all of the CATH-Gene3D database releases, from the first to the latest. All previous releases are in ```releases/previous-releases/``` and the latest release is in ```releases/latest-release/```.

#### ```releases/daily-release/```

This directory provides summary information of protein domains putatively classified in CATH since the last release.
For each date with a CATH-B entry, there should be five files, e.g.:

* cath-b-20170519-all.gz : combination of the 'latest release' and 'putative entries' files
* cath-b-20170519-latest-release.gz : all domains that were in the latest release of CATH
* cath-b-20170519-putative.gz : the domains assigned/rechopped/reassigned in CATH since the latest release
* cath-b-20170519-names-all.gz : name description of each node in the CATH hierarchy. A combination of the 'latest release' and 'putative entries' files
* cath-b-20170519-s35-all.gz : all domain ids in CATH-B with their S35 cluster id and domain boundary information.

**Notes:**

* These are compressed files; once downloaded you should uncompress them with a suitable program (e.g. gunzip) before use.
* Please note that the 'latest release' and 'putative' files have no domains in common.
* The first three files use one line per domain, in the following format: ```domain_id status putative_superfamily_id putative_chopping```
* The names file follows the file format described in ```./README-cath-names-file-format.txt``` (a.k.a. CNF format).

##### ```./archive/```

This subdirectory contains all of the CATH-B files, except the files for the current day.

##### ```./newest/```

This subdirectory contains the five CATH-B files for the current day:

* cath-b-newest-all.gz
* cath-b-newest-latest-release.gz
* cath-b-newest-putative.gz
* cath-b-newest-names-all.gz
* cath-b-s35-newest.gz

#### ```releases/latest-release/```

This subdirectory contains the latest release of CATH. Please note that these files do not contain a version number.

#### ```releases/all-releases/```

This directory contains all of the CATH releases; each subdirectory is named according to its version number:

v2_0 v2_4 v2_5 v2_5_1 v2_5_3 v2_6_0 v3_0_0 v3_1_0 v3_3_0 v3_4_0 v3_5_0 v4_0_0 v4_1_0 v4_2_0 v4_3_0

##### ```releases///cath-classification-data/```

Files within this directory contain data describing the CATH classification.

cath-chain-list-.txt : Lists all of the PDB chain IDs in CATH, whether they are chopped into domains or not. For file format description see ```./README-cath-list-file-format.txt``` (a.k.a. CLF format).

* e.g. cath-chain-list-v4_3_0.txt

cath-domain-boundaries-*-.txt : Description of domain and segment boundaries for domains classified into CATH. For file format description see ```./README-domain-boundaries-file-format.txt``` (a.k.a. CDF format).

* e.g. cath-domain-boundaries-v4_3_0.txt
* e.g. cath-domain-boundaries-seqreschopping-v4_3_0.txt

cath-domain-description-file-.txt : Description of each protein domain in CATH (see README.CDDF_FORMAT_2.0 for more details). For file format description see ```./README-cath-domain-desc-file-format.txt``` (a.k.a. CDDF format).

* e.g. cath-domain-description-file-v4_3_0.txt

cath-domain-list--.txt : Lists of domains classified into CATH. For file format description see ```./README-cath-list-file-format.txt``` (a.k.a. CLF format).

* e.g. cath-domain-list-S35-v4_3_0.txt

cath-domain-pdb-*-.txt : Description of each domain PDB classified into CATH.

* e.g. cath-domain-pdb-v4_3_0.tgz
* e.g. cath-domain-pdb-S35-v4_3_0.tgz
* These are compressed files; once downloaded you should uncompress them with a suitable program (e.g. gunzip) before use.

cath-names-.txt : Name description of each node in the CATH hierarchy, along with an example domain. For file format description see ```./README-cath-names-file-format.txt``` (a.k.a. CNF format).

* e.g. cath-names-v4_3_0.txt

cath-superfamily-list-.txt : List of all the superfamilies in the CATH hierarchy.

* e.g. cath-superfamily-list-v4_3_0.txt

cath-unclassified-list-.txt : List of all unclassified protein chains and domains that are still being processed. For file format description see ```./README-cath-list-file-format.txt``` (a.k.a. CLF format).

* e.g. cath-unclassified-list-v4_3_0.txt

##### ```releases///non-redundant-data-sets/```

Each non-redundant data set contains a subset of CATH domains that:

* has no pair of domains (according to BLAST) with >= 20 or 40% sequence identity (depending on the data set chosen), over 60% overlap (over the longer sequence)
* is otherwise as large as we could make it.

**Files**

* cath-dataset-nonredundant-S[20|40]-v4_1_0.atom.fa : The ATOM sequences of the domains in the dataset (which only contain residues that have ATOM records in the PDB file)
* cath-dataset-nonredundant-S[20|40]-v4_3_0.fa : The sequences of the domains in the dataset
* cath-dataset-nonredundant-S[20|40]-v4_3_0.list : A list of the domains in the dataset; one domain ID per line
* cath-dataset-nonredundant-S[20|40]-v4_3_0.pdb.tgz : (A gzipped tar file containing) the PDB files of the domains in the data set

**Method of Construction**

The sequence comparisons are performed with an all-against-all BLAST of our domain sequences. We then use these results to identify any links with:

* \>= 40% sequence identity (i.e. pident \>= 40) and
* \>= 60% overlap over the longer sequence (i.e. 100.0 * length / max(slen, qlen) \>= 60)

We use this to form a list of domains that contains no pair of linked entries. In an effort to make the list as large as possible, we build the list by iteratively choosing each domain to add to the list, ensuring that a domain is only added if it has as few linked neighbours as any other domain. This means the algorithm should nibble as many edges off a cluster as possible, rather than taking a small number of domains at the cluster's centre.

##### ```releases///sequence-data/```

This directory contains protein domain sequence-based data.

cath-domain-seqs-*-.fa : Sequences for each CATH domain.

* e.g. cath-domain-seqs-S35-v4_3_0.fa

Hidden Markov model (HMM) libraries are provided for the S35 rep sequence clusters and the functional families (FunFams).
An HMM is generated for each S35 sequence cluster and for each functional family using ```hmmbuild``` from the HMMER3 software package. All of the S35 sequence cluster HMMs and functional family HMMs are concatenated to create these two HMM library files:

* cath-S35--hmm3.lib.gz
* funfam-hmm3-.lib.gz

These are compressed files; once downloaded you should uncompress them with a suitable program (e.g. gunzip) before use. The program ```hmmpress``` should then be run on each of these files to construct binary compressed data files.

###### ```./sequence-by-superfamily/```

cath-superfamily-seqs--.fa : Sequences for each CATH superfamily in FASTA format. The file names have this form:

* e.g. cath-superfamily-seqs-1.10.10.10-v4_3_0.fa

###### ```./supplementary-files/```

This directory contains any supplementary files that are associated with a particular release.

### ```supplementary-materials/```

This directory contains supplementary material for published work from the group. Each subdirectory represents a different publication.

#### ```./2015_nar_cath-funfhmmer-web-server/```

* FunFHMMER-web-server-supplementary-table.xls

#### ```./2016_ploscompbiol_functionally-classifying-and-characterising-serine-beta-lactamases/```

* 151-types-uniprot-cath-gene3d.dat
* SSPA-mutant-positions-extended-spectrum-resistance.dat
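
As an illustration of the "Method of Construction" described for the non-redundant data sets above, the Python sketch below shows one possible reading of that procedure. It is not the code used to build the released data sets. It assumes that linked-neighbour counts are recomputed among the domains still under consideration, that ties are broken by domain id, and that choosing a domain discards all of its linked neighbours. The function names `is_link` and `greedy_nonredundant` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def is_link(pident: float, length: int, qlen: int, slen: int) -> bool:
    """Link criterion from the text: >= 40% identity and >= 60% overlap
    measured over the longer of the two sequences."""
    return pident >= 40 and 100.0 * length / max(slen, qlen) >= 60


def greedy_nonredundant(
    domains: Iterable[str], links: Iterable[Tuple[str, str]]
) -> List[str]:
    """Greedily pick domains so that no two chosen domains are linked."""
    # Build an undirected adjacency map, ignoring self-links and unknown ids.
    neighbours: Dict[str, Set[str]] = {d: set() for d in domains}
    for a, b in links:
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    remaining: Set[str] = set(neighbours)
    chosen: List[str] = []
    while remaining:
        # ASSUMPTION: take the candidate with the fewest linked neighbours
        # still remaining, breaking ties by domain id for reproducibility.
        best = min(remaining, key=lambda d: (len(neighbours[d] & remaining), d))
        chosen.append(best)
        # The chosen domain and everything linked to it leave the pool.
        remaining -= neighbours[best] | {best}
    return chosen
```

On a star-shaped cluster, where one central domain is linked to several outer ones, this sketch keeps the outer domains and drops the centre, which matches the stated aim of taking domains from the edges of a cluster.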