# Domain Annotation Pipeline: BFVD

This repository demonstrates the output of the [UCL Orengo Group domain-annotation-pipeline](https://github.com/UCLOrengoGroup/domain-annotation-pipeline). The pipeline integrates predicted protein structures with domain boundary detection, structural annotations, and functional classification.

---

## Changelog

- v0.1 -- Draft results

---

## Overview of the Pipeline

1. **Input sequences**
   Viral protein sequences (from UniProt / BFVD) are used as the starting point.

2. **Structure prediction**
   Structures are predicted with **AlphaFold**, yielding per-residue confidence scores (pLDDT).

3. **Domain boundary prediction**
   Three independent algorithms are applied to predict structural domain boundaries. A **consensus domain definition** is derived from agreement between methods.

4. **Domain extraction**
   Consensus domains are **chopped** from the full AlphaFold models and saved as individual PDB fragments.

5. **Domain annotation**
   Structural and functional features are calculated for each consensus domain, including:
   - Secondary structure content
   - Packing density and shape descriptors
   - Average per-domain pLDDT
   - Fold classification (via **Foldseek** and **CATH**)
   - Domain quality metrics

---

## Example Results File

The example results are provided as a **tab-separated file** (`.tsv`). Each row corresponds to a **chopped consensus domain** from one UniProt sequence.

### Columns

| Column | Description |
|--------|-------------|
| **uniprot_id** | UniProt identifier with domain index suffix (e.g. `A0A2H4U2D6_01`) |
| **md5_domain** | MD5 checksum of the chopped PDB domain (unique identifier) |
| **consensus_level** | Confidence in domain boundary assignment (`high`, `med`, etc.) |
| **chopping** | Residue range(s) of the domain (e.g. `3-89` or `5-78_168-197` for multi-segment domains) |
| **nres_domain** | Number of residues in the domain |
| **num_segments** | Number of continuous segments in the chopped domain |
| **num_helix_strand_turn** | Total secondary structure elements (helix + strand + turn) |
| **num_helix** | Number of α-helices |
| **num_strand** | Number of β-strands |
| **num_helix_strand** | Helix + strand count |
| **num_turn** | Number of turns |
| **packing_density** | Packing density (measure of structural compactness) |
| **normed_radius_gyration** | Radius of gyration normalized by domain length |
| **avg_plddt** | Average AlphaFold pLDDT score for the domain (0–100) |
| **proteome_id** | Identifier for the source proteome (from UniProt/NCBI taxonomy) |
| **tax_common_name** | Common name of the source species/virus |
| **tax_scientific_name** | Scientific name |
| **tax_lineage** | Full taxonomic lineage string |
| **foldseek_match_id** | Foldseek best match ID (e.g. PDB chain) |
| **foldseek_evalue** | Foldseek E-value for best match |
| **foldseek_tmscore** | Foldseek TM-score (0–1, higher = more confident structural similarity) |
| **cath_label** | Assigned **CATH domain classification** (if available) |
| **foldseek_match_type** | Type of Foldseek match (`H` = homologous, `T` = topological, `N` = no confident match) |
| **foldseek_query_cov** | Fraction of query covered by Foldseek match |
| **foldseek_target_cov** | Fraction of target covered |
| **Q_score** | Domain quality score |
| **Dom_Domain_Count** | Number of domains assigned by "Dom" algorithm (W. Taylor) (should be '1') |
| **DomQual** | Composite **domain quality metric** (0–1, higher = better consensus confidence) |

---
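
As a minimal sketch of how the results file might be consumed, the Python below reads tab-separated rows, keeps domains above a pLDDT cutoff, and splits the `chopping` string into residue segments. Several things here are assumptions rather than part of the pipeline: the helper names (`parse_chopping`, `count_residues`, `confident_domains`) are hypothetical, the rows in `EXAMPLE_TSV` are invented for illustration, the cutoff of 70 is arbitrary, and residue ranges are assumed to be inclusive at both ends with `_` separating segments, as the `chopping` examples in the table suggest.

```python
import csv
import io


def parse_chopping(chopping):
    """Turn '5-78_168-197' into [(5, 78), (168, 197)].

    Assumes '_' separates segments and '-' separates start from end,
    as in the examples given for the `chopping` column.
    """
    segments = []
    for part in chopping.split("_"):
        start, end = part.split("-")
        segments.append((int(start), int(end)))
    return segments


def count_residues(segments):
    """Count residues, assuming both ends of each range are inclusive."""
    return sum(end - start + 1 for start, end in segments)


# Invented rows for illustration only; not taken from the real results file.
EXAMPLE_TSV = (
    "uniprot_id\tchopping\tavg_plddt\n"
    "EXAMPLE1_01\t3-89\t88.5\n"
    "EXAMPLE2_01\t5-78_168-197\t64.0\n"
)


def confident_domains(tsv_text, min_plddt=70.0):
    """Return rows whose avg_plddt is at least min_plddt (an assumed cutoff)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["avg_plddt"]) >= min_plddt]


for row in confident_domains(EXAMPLE_TSV):
    segments = parse_chopping(row["chopping"])
    print(row["uniprot_id"], segments, count_residues(segments))
```

If the inclusive-range assumption holds, `len(parse_chopping(...))` and `count_residues(...)` could be compared against the `num_segments` and `nres_domain` columns as a quick consistency check on a real file. To read from disk instead of a string, pass an open file handle to `csv.DictReader` in place of `io.StringIO`.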