Professional Certificate in Molecular Epidemiology · Guide

Phylogenetics and Evolution

6 min read Updated 6 May 2026

Phylogenetics is the study of evolutionary relationships among organisms. It seeks to construct and interpret branching diagrams, known as phylogenetic trees, that illustrate the evolutionary history of a group of organisms. These trees represent the inferred pattern of evolutionary descent: each internal node represents the common ancestor of the lineages that descend from it, and the branches indicate the evolutionary pathways leading to the present-day organisms.

In molecular epidemiology, phylogenetics is used to investigate the evolution and spread of pathogens, by comparing the DNA or RNA sequences of different isolates. This can provide valuable insights into the origins and transmission dynamics of infectious diseases, and inform public health interventions.

Here are some key terms and vocabulary related to phylogenetics and evolution in the context of molecular epidemiology:

* **Homology**: Homology refers to similarity of features (e.g., DNA sequences, anatomical structures) between different organisms that results from their descent from a common ancestor. In molecular epidemiology, homologous DNA sequences are used to infer evolutionary relationships between pathogen isolates.
* **Multiple sequence alignment**: Multiple sequence alignment (MSA) is the process of aligning three or more DNA or protein sequences so that identical or similar residues fall in the same columns. This is a crucial step in phylogenetic analysis, as it allows homologous sites to be identified and evolutionary distances between sequences to be estimated.
* **Evolutionary distance**: Evolutionary distance is a measure of the amount of genetic change that has occurred between two sequences since they diverged from a common ancestor. Because multiple substitutions can occur at the same site, the observed proportion of differing sites underestimates the true amount of change. Substitution models such as Jukes-Cantor, Kimura 2-parameter, and Tamura-Nei correct for this.
* **Phylogenetic tree**: A phylogenetic tree is a branching diagram that represents the evolutionary history of a group of organisms. The branches represent the inferred evolutionary pathways, and the internal nodes represent the common ancestors of the lineages that descend from them. A tree is rooted if the position of the most recent common ancestor of all sequences is specified, and unrooted if it is not.
* **Maximum likelihood**: Maximum likelihood (ML) is a statistical method for estimating the parameters of a model, given some observed data. In phylogenetics, ML is used to find the tree topology and branch lengths under which the observed sequence data are most probable, given a model of sequence evolution.
* **Bootstrap analysis**: Bootstrap analysis is a resampling method used to evaluate confidence in the branches of a phylogenetic tree. It involves creating multiple replicates of the original dataset by randomly sampling alignment columns (sites) with replacement, and then estimating a phylogenetic tree for each replicate. The support for each branch is calculated as the percentage of replicate trees that contain that branch.
* **Parsimony**: Parsimony is an approach to inferring phylogenetic trees that seeks to minimize the number of evolutionary changes required to explain the observed data. It is based on the principle of parsimony, which states that the simplest explanation is usually the best.
* **Maximum parsimony**: Maximum parsimony (MP) is the tree-building method that applies this principle by searching for the tree that requires the fewest evolutionary changes to explain the observed data. It is often used as a baseline method for phylogenetic analysis. Scoring a single tree is fast, but finding the most parsimonious tree becomes computationally hard as the number of sequences grows, so heuristic searches are used in practice.
* **Neighbor-joining**: Neighbor-joining (NJ) is a distance-based method for inferring phylogenetic trees. It starts from a matrix of pairwise evolutionary distances and iteratively joins the pair of nodes that minimizes the estimated total branch length of the tree, which is not always the pair with the smallest distance, until all sequences are joined.
* **Bayesian inference**: Bayesian inference is a statistical method for estimating the posterior probability of a model, given some observed data. In phylogenetics, Bayesian inference is used to estimate the posterior probability of a phylogenetic tree, given the observed sequence data and a prior probability distribution over trees.
* **Markov Chain Monte Carlo**: Markov Chain Monte Carlo (MCMC) is a computational method for sampling from a probability distribution. In phylogenetics, MCMC is used to sample trees and model parameters from the posterior distribution, so that the posterior probability of a tree or branch can be estimated from how often it appears in the sample.
* **Molecular clock**: A molecular clock is the assumption that the rate of molecular evolution is approximately constant over time and across lineages. This allows the timing of evolutionary events, such as the divergence of different lineages, to be estimated.
* **Relaxed molecular clock**: A relaxed molecular clock is a more realistic assumption that allows the rate of molecular evolution to vary among lineages and over time. This variation can be modeled in several ways, for example with autocorrelated or uncorrelated rate models.
* **Divergence time estimation**: Divergence time estimation is the process of estimating the timing of evolutionary events, such as the divergence of different lineages, using a molecular clock or relaxed molecular clock model.
* **Coalescent theory**: Coalescent theory is a mathematical framework for describing the genealogical relationships among a sample of sequences. Looking backward in time, it models the random process of coalescence, whereby two lineages merge into a common ancestor, and can be used to infer the effective population size and demographic history of a population.
* **Recombination**: Recombination is the process by which genetic material from two different parental genomes is combined to form a new genome. It can occur through various mechanisms, such as genetic crossover or gene conversion. Recombination complicates phylogenetic analysis because different regions of the genome may then have different evolutionary histories, which produces conflicting signals in the data.
* **Reticulation**: Reticulation refers to the intermingling of evolutionary pathways as a result of processes such as recombination or hybridization. It leads to complex, network-like phylogenetic relationships rather than simple tree-like ones.
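Two of the definitions above, evolutionary distance and bootstrap analysis, are easier to follow with a small worked example. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to be a list of already-aligned DNA strings of equal length, and the Jukes-Cantor correction is used as the simplest of the substitution models mentioned. Real analyses would normally rely on an established phylogenetics package.

```python
import math
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned DNA sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p).

    The correction is undefined once p reaches 0.75 (saturation),
    in which case infinity is returned.
    """
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling alignment columns.

    Columns (sites) are drawn with replacement; every sequence is kept,
    so the replicate has the same dimensions as the original alignment.
    """
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


if __name__ == "__main__":
    # Hypothetical aligned sequences, for illustration only.
    alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
    print(jukes_cantor_distance(alignment[0], alignment[1]))
    print(bootstrap_alignment(alignment, random.Random(42)))
```

In a full bootstrap analysis, a tree would be estimated from each replicate, and branch support would be the percentage of replicate trees containing that branch.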

Here are some examples and practical applications of phylogenetics and evolution in molecular epidemiology:

* **Tracking the spread of HIV**: HIV is a highly diverse and rapidly evolving virus, with a high mutation rate and a short generation time. Phylogenetic analysis of HIV sequences can provide valuable insights into the origins and transmission dynamics of the virus, and inform public health interventions. For example, a phylogenetic analysis of HIV sequences from men who have sex with men (MSM) in Amsterdam identified several clusters of closely related sequences, suggesting ongoing transmission within the MSM community.
* **Investigating foodborne outbreaks**: Foodborne outbreaks can be caused by a variety of pathogens, including bacteria, viruses, and parasites. Phylogenetic analysis of foodborne pathogens can provide valuable insights into the source and transmission of the outbreak, and inform public health interventions. For example, a phylogenetic analysis of Listeria monocytogenes sequences from a multi-state outbreak in the US identified a common source of contamination, and led to a recall of contaminated products.
* **Monitoring antimicrobial resistance**: Antimicrobial resistance (AMR) is a major global health threat, with increasing rates of resistance to commonly used antibiotics. Phylogenetic analysis of AMR pathogens can provide valuable insights into the evolution and spread of resistance, and inform strategies for controlling AMR. For example, a phylogenetic analysis of methicillin-resistant Staphylococcus aureus (MRSA) sequences from a hospital outbreak in the UK identified a common clone of MRSA, and led to the implementation of infection control measures to prevent further spread.
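The idea of "clusters of closely related sequences" in the examples above can be illustrated with a simple threshold-based grouping. This is a minimal sketch and not the method used in any of the studies mentioned: the isolate names, sequences, and threshold are hypothetical, and published analyses typically define clusters on a phylogenetic tree with branch support rather than on raw pairwise differences.

```python
from itertools import combinations


def hamming(seq_a, seq_b):
    """Number of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def cluster_by_threshold(sequences, max_differences):
    """Single-linkage clustering of isolates.

    Two isolates end up in the same cluster if they are connected by a
    chain of pairs that each differ at no more than `max_differences`
    sites. `sequences` maps isolate name -> aligned sequence.
    """
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent links to the cluster representative (union-find).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sequences, 2):
        if hamming(sequences[a], sequences[b]) <= max_differences:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, names sorted within each cluster.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    # Hypothetical isolates, for illustration only.
    isolates = {
        "iso1": "ACGTACGTAC",
        "iso2": "ACGTACGTAT",
        "iso3": "ACGTACGAAT",
        "iso4": "TTGTCCGTGA",
    }
    print(cluster_by_threshold(isolates, max_differences=1))
```

Single linkage was chosen here because it mirrors how a transmission chain can connect isolates that are not all directly close to one another; the choice of threshold strongly affects the result and depends on the pathogen and the genomic region sequenced.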

Here are some challenges and limitations of phylogenetics and evolution in molecular epidemiology:

* **Sequencing errors**: Sequencing errors can introduce noise into the data, and lead to erroneous inferences about evolutionary relationships. Careful quality control and validation of sequencing data are essential for accurate phylogenetic analysis.
* **Recombination**: Recombination can complicate phylogenetic analysis, as it can lead to conflicting signals in the data. Methods for detecting and accounting for recombination are an active area of research in molecular evolution.
* **Incomplete lineage sorting**: Incomplete lineage sorting occurs when different loci within a genome have different evolutionary histories, due to the random sampling of alleles during speciation. This can lead to apparent incongruence between gene trees, and complicate the inference of species relationships.
* **Computational complexity**: Phylogenetic analysis can be computationally intensive, particularly for large datasets with many sequences. Efficient algorithms and heuristics are an active area of research in computational biology.
* **Model misspecification**: Phylogenetic analysis relies on models of molecular evolution, which may be misspecified or oversimplified. Careful evaluation of the chosen model is therefore needed.
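As one illustration of the quality control mentioned under sequencing errors, the sketch below screens sequences by their fraction of ambiguous base calls before they enter an alignment. The function names and the 5% cutoff are assumptions made for this example, not recommended values; appropriate thresholds depend on the sequencing platform and the study.

```python
def ambiguous_fraction(sequence):
    """Fraction of non-gap characters that are not A, C, G, T or U."""
    bases = [c for c in sequence.upper() if c != "-"]
    if not bases:
        return 1.0  # an empty or all-gap sequence is treated as unusable
    ambiguous = sum(1 for c in bases if c not in "ACGTU")
    return ambiguous / len(bases)


def filter_sequences(sequences, max_ambiguous=0.05):
    """Split sequences into those passing and failing the ambiguity cutoff.

    `sequences` maps isolate name -> sequence string.
    `max_ambiguous` is a hypothetical cutoff chosen for illustration.
    """
    kept = {}
    rejected = {}
    for name, seq in sequences.items():
        if ambiguous_fraction(seq) <= max_ambiguous:
            kept[name] = seq
        else:
            rejected[name] = seq
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    reads = {"good": "ACGT-ACGTA", "noisy": "ACGTNNACGT"}
    kept, rejected = filter_sequences(reads)
    print(sorted(kept), sorted(rejected))
```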

Key takeaways

* Phylogenetic trees represent the inferred pattern of evolutionary descent: internal nodes stand for common ancestors, and branches for the evolutionary pathways leading to present-day organisms.
* In molecular epidemiology, phylogenetics is used to investigate the evolution and spread of pathogens by comparing the DNA or RNA sequences of different isolates.
* Divergence time estimation dates evolutionary events, such as the divergence of different lineages, using a molecular clock or relaxed molecular clock model.
* A phylogenetic analysis of HIV sequences from men who have sex with men (MSM) in Amsterdam identified several clusters of closely related sequences, suggesting ongoing transmission within the MSM community.
* Incomplete lineage sorting occurs when different loci within a genome have different evolutionary histories, due to the random sampling of alleles during speciation.
