# Domain Annotation Pipeline: BFVD

This repository demonstrates the output of the [UCL Orengo Group domain-annotation-pipeline](https://github.com/UCLOrengoGroup/domain-annotation-pipeline).  
The pipeline integrates predicted protein structures with domain boundary detection, structural annotations, and functional classification.

---

## Changelog

- v0.1 -- Draft results

---

## Overview of the Pipeline

1. **Input sequences**  
   Viral protein sequences (from UniProt / BFVD) are used as the starting point.

2. **Structure prediction**  
   Structures are predicted with **AlphaFold**, yielding per-residue confidence scores (pLDDT).

3. **Domain boundary prediction**  
   Three independent algorithms are applied to predict structural domain boundaries.  
   A **consensus domain definition** is derived from agreement between methods.

4. **Domain extraction**  
   Consensus domains are **chopped** from the full AlphaFold models and saved as individual PDB fragments.

5. **Domain annotation**  
   Structural and functional features are calculated for each consensus domain, including:
   - Secondary structure content  
   - Packing density and shape descriptors  
   - Average per-domain pLDDT  
   - Fold classification (via **Foldseek** and **CATH**)  
   - Domain quality metrics  
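
Steps 4 and 5 can be illustrated with a minimal sketch, which is not the pipeline's actual implementation. It assumes a model in PDB format with the fixed-column `ATOM` layout and pLDDT stored in the B-factor field, as in AlphaFold models. The function names `chop_domain`, `mean_plddt`, and `make_ca_line` are hypothetical, and the input lines are synthetic.

```python
def chop_domain(pdb_lines, segments):
    """Keep ATOM records whose residue number falls in any (start, end) segment."""
    kept = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        resnum = int(line[22:26])  # resSeq, PDB columns 23-26
        if any(start <= resnum <= end for start, end in segments):
            kept.append(line)
    return kept


def mean_plddt(atom_lines):
    """Average the B-factor field (pLDDT in AlphaFold models) over C-alpha atoms."""
    values = [
        float(line[60:66])  # tempFactor, PDB columns 61-66
        for line in atom_lines
        if line[12:16].strip() == "CA"
    ]
    return sum(values) / len(values)


def make_ca_line(resnum, plddt):
    """Build a synthetic C-alpha ATOM record (illustration only, not real data)."""
    return "ATOM  %5d  CA  ALA A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        resnum, resnum, 0.0, 0.0, 0.0, 1.0, plddt
    )


# Ten synthetic residues with made-up pLDDT values of 55, 60, ..., 100.
model = [make_ca_line(i, 50 + 5 * i) for i in range(1, 11)]

# A two-segment domain covering residues 3-5 and 8-9.
domain = chop_domain(model, [(3, 5), (8, 9)])
print(len(domain), mean_plddt(domain))
```

The sketch ignores insertion codes, alternate locations, and multiple chains, all of which a real chopping step would need to handle.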

---

## Example Results File

The example results are provided as a **tab-separated file** (`.tsv`).  
Each row corresponds to a **chopped consensus domain** from one UniProt sequence.
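
A minimal sketch of reading the file with the Python standard library is given here. It assumes the file starts with a header row carrying the column names documented in this README. The function name `read_domain_rows` is hypothetical, and the in-memory sample stands in for the real file.

```python
import csv
import io


def read_domain_rows(handle):
    """Yield one dict per chopped consensus domain from a tab-separated handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row


# Illustrative stand-in for the results file, using only three of the columns.
example_tsv = (
    "uniprot_id\tchopping\tnres_domain\n"
    "A0A2H4U2D6_01\t3-89\t87\n"
)

rows = list(read_domain_rows(io.StringIO(example_tsv)))
print(rows[0]["uniprot_id"], rows[0]["chopping"])
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`. All values come back as strings, so numeric columns need explicit conversion.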

### Columns

| Column | Description |
|--------|-------------|
| **uniprot_id** | UniProt identifier with domain index suffix (e.g. `A0A2H4U2D6_01`) |
| **md5_domain** | MD5 checksum of the chopped PDB domain (unique identifier) |
| **consensus_level** | Confidence in domain boundary assignment (`high`, `med`, etc.) |
| **chopping** | Residue range(s) of the domain (e.g. `3-89` or `5-78_168-197` for multi-segment domains) |
| **nres_domain** | Number of residues in the domain |
| **num_segments** | Number of continuous segments in the chopped domain |
| **num_helix_strand_turn** | Total secondary structure elements (helix + strand + turn) |
| **num_helix** | Number of α-helices |
| **num_strand** | Number of β-strands |
| **num_helix_strand** | Helix + strand count |
| **num_turn** | Number of turns |
| **packing_density** | Packing density (measure of structural compactness) |
| **normed_radius_gyration** | Radius of gyration normalized by domain length |
| **avg_plddt** | Average AlphaFold pLDDT score for the domain (0–100) |
| **proteome_id** | Identifier for the source proteome (from UniProt/NCBI taxonomy) |
| **tax_common_name** | Common name of the source species/virus |
| **tax_scientific_name** | Scientific name of the source species/virus |
| **tax_lineage** | Full taxonomic lineage string |
| **foldseek_match_id** | Foldseek best match ID (e.g. PDB chain) |
| **foldseek_evalue** | Foldseek E-value for best match |
| **foldseek_tmscore** | Foldseek TM-score (0–1, higher = greater structural similarity) |
| **cath_label** | Assigned **CATH domain classification** (if available) |
| **foldseek_match_type** | Type of Foldseek match (`H` = homologous, `T` = topological, `N` = no confident match) |
| **foldseek_query_cov** | Fraction of query covered by Foldseek match |
| **foldseek_target_cov** | Fraction of target covered by Foldseek match |
| **Q_score** | Domain quality score |
| **Dom_Domain_Count** | Number of domains assigned by the "Dom" algorithm (W. Taylor); expected to be `1` |
| **DomQual** | Composite **domain quality metric** (0–1, higher = better consensus confidence) |
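
The `chopping` format described in the table can be parsed with a short sketch like the one below. It assumes that segments are joined by `_`, that each segment is `start-end` with non-negative residue numbers, and that `nres_domain` equals the summed segment lengths. The helper names `parse_chopping` and `row_is_consistent` are hypothetical and not part of the pipeline.

```python
def parse_chopping(chopping):
    """Turn '5-78_168-197' into [(5, 78), (168, 197)]."""
    segments = []
    for part in chopping.split("_"):
        start, end = part.split("-")
        segments.append((int(start), int(end)))
    return segments


def row_is_consistent(row):
    """Check nres_domain and num_segments against the chopping string."""
    segments = parse_chopping(row["chopping"])
    nres = sum(end - start + 1 for start, end in segments)
    return (
        nres == int(row["nres_domain"])
        and len(segments) == int(row["num_segments"])
    )


# Illustrative row built from the multi-segment example in the table.
row = {"chopping": "5-78_168-197", "nres_domain": "104", "num_segments": "2"}
print(parse_chopping(row["chopping"]), row_is_consistent(row))
```
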

---