PhyVirus Dataset


The PhyVirus dataset is a phylogenetic dataset of single-stranded RNA viruses, described in the manuscript titled:

Biased mutation and selection in RNA viruses by Talia Kustin and Adi Stern.

The PhyVirus dataset contains 65,951 viral coding sequences that span a wide range of viral families and hosts (see figure below).


Sequences were obtained primarily from the NIAID Virus Pathogen Database and Analysis Resource (ViPR) and were augmented with Influenza sequences from the NIAID Influenza Research Database (IRD).

To generate multiple sequence alignments and their associated phylogenetic trees, we implemented an in-house computational pipeline (for more details, see the Materials and Methods section of the manuscript):

  1. We clustered homologous sequences using MegaBLAST.

  2. We aligned the clustered sequences using PRANK and reconstructed phylogenies using PhyML.

  3. We implemented an iterative scheme in which we “cut” phylogenies into two or more parts at branches whose length was larger than 0.5.
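
The cutting idea in step 3 can be illustrated with a minimal sketch. This is not the pipeline's actual code: the Node class, the function name cut_long_branches and the toy tree are hypothetical, and the sketch performs a single pass over an already-built rooted tree, whereas the scheme described in the manuscript is iterative. It only shows how removing every branch longer than a threshold partitions the sequences into groups.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `length` is the branch leading to this node."""
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def cut_long_branches(root: Node, threshold: float = 0.5) -> List[List[str]]:
    """Group leaf names by the subtrees left after removing long branches.

    A branch is removed only when its length is strictly larger than
    `threshold`. Groups that contain no leaves are dropped.
    """
    groups: List[List[str]] = []

    def visit(node: Node, current: List[str]) -> None:
        if not node.children:
            current.append(node.name)
        for child in node.children:
            if child.length > threshold:
                # Long branch: the child starts a new, separate group.
                new_group: List[str] = []
                groups.append(new_group)
                visit(child, new_group)
            else:
                visit(child, current)

    top: List[str] = []
    groups.append(top)
    visit(root, top)
    return [sorted(group) for group in groups if group]


# Toy tree (made up): the branch leading to clade (B, C) has length 0.7.
toy_tree = Node(children=[
    Node("A", 0.1),
    Node(length=0.7, children=[Node("B", 0.1), Node("C", 0.2)]),
    Node("D", 0.3),
])
clusters = cut_long_branches(toy_tree)
```

In this toy case the long branch separates B and C from A and D, so two groups are returned. In the workflow described above, each resulting group would presumably be processed again until no branch exceeds the threshold; see the manuscript for the actual procedure.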


Each codon alignment and its corresponding midpoint-rooted phylogeny represents one viral gene. Note that the same viral gene in a specific viral family may have several associated phylogenies.

Below, you can download the full PhyVirus dataset, or the alignments and phylogenies for a specific Baltimore classification or viral family.

Please pay attention to the metadata files:

  1. Phylogeny metadata - contains information about each phylogeny.

  2. Sequence metadata - contains information about all sequences in the PhyVirus dataset.
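
Because one viral gene in one family may have several phylogenies, it can be useful to group the phylogeny metadata by family and gene. The sketch below shows one way to do that. It assumes the metadata is a delimited text file, and the column names phylogeny_id, family and gene are hypothetical placeholders, as are the example rows; check the real file header and adjust the names before use.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple


def phylogenies_per_gene(
    handle: TextIO, delimiter: str = ","
) -> Dict[Tuple[str, str], List[str]]:
    """Map (family, gene) to the list of phylogeny identifiers.

    The column names used here are assumptions, not the real header.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        groups[(row["family"], row["gene"])].append(row["phylogeny_id"])
    return dict(groups)


# Made-up rows standing in for the real metadata file.
example = io.StringIO(
    "phylogeny_id,family,gene\n"
    "p1,FamilyA,gene1\n"
    "p2,FamilyA,gene1\n"
    "p3,FamilyB,gene2\n"
)
table = phylogenies_per_gene(example)
```

With a real file, the same function can be given an open file handle, for example `open(path, newline="")`, together with the delimiter the file actually uses.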

If you have any questions, please contact Talia Kustin (taliakustin at or Adi Stern (sternadi at

