Using Distance Matrices to Build Evolutionary Trees

A first attempt at defining a computational problem for tree reconstruction

The scientific problem that we wish to solve has now come into focus.

Tree Reconstruction Problem

Input: A collection of species.

Output: A tree that best represents the evolutionary relationships among the species.

However, although this problem makes sense as a high-level scientific problem, a computer scientist looks at a problem like this with some puzzlement. What data do we have for each of the species in the input? Can we define a tree precisely? And what does it mean for this tree to “best represent” the relationships among the species?

The pitfalls of reconstructing trees from characters

Although these questions are all important, they were far from the mind of the 19th century biologists, who classified organisms using characters, or physical and anatomical characteristics that divide the organisms into groups.

For example, the predominant evolutionary division of dinosaurs is neither number of legs, nor dietary preferences, but rather hip bone shape. In 1888, Harry Seeley proposed this division after noticing that some dinosaur fossils had a pubis bone pointing downward and forward like modern reptiles, while others had a pubis pointing backward like modern birds. Seeley named the former group Saurischia (meaning “lizard-hipped”) and the latter group Ornithischia (meaning “bird-hipped), a division shown in the simplified figure below that has remained a cornerstone of paleontology for over a century.

Figure: Seeley’s use of hip bone shape as a character separates the evolutionary tree of dinosaurs into Saurischia and Ornithischia.
Note: Paradoxically, most paleontologists now believe that modern birds evolved from the lizard-hipped Saurischia, reiterating that age-old maxim that in biology, nothing makes sense.
STOP: Say that we are forming an evolutionary tree of all stick insects. What character might you use to divide the insects?

As students will reliably reply that Christopher Columbus (not Eratosthenes) discovered that the world is round, they will universally respond to the preceding question, “Wings!” And yet when we look at a published evolutionary tree of stick insects (figure below), our hypothesis that wings cleanly separate these species into two groups disintegrates.

Figure: A stick insect evolutionary tree consisting of winged species (blue) and wingless (red), showing seven transitions between winged and wingless species (shown as stars).

Because using characters for tree reconstruction is flawed, we should transition toward an approach based on genetic data (i.e., DNA and proteins). This approach will prove much superior when it comes to analyzing viruses; after all, no microscope that has ever been manufactured would allow us to differentiate a coronavirus causing a pandemic from one that is harmless to humans.

Distance matrices provide a rich source of input data for tree construction

Fortunately, we live in an era in which genomic data, like those that we analyzed in Chapter 1, are plentiful. One common method for comparing n species is to form an n × n distance matrix D in which entry D[i, j] represents the distance between the i-th and j-th species. We should be comfortable with quantifying the distance between biological objects, since in Chapter 1’s practice exercises, we quantified the “beta diversity” between two metagenomic samples.

When building a distance matrix for a collection of species having known genomes, biologists often start with a multiple alignment of the same gene constructed from these species. Each row corresponds to a gene from one species, allowing us to analyze specific differences across the columns of the multiple alignment. The figure below shows a toy multiple alignment.

Figure: A toy multiple alignment for a collection of four species. Real genes range from a few hundred to several million nucleotides long, depending on the organism and gene.
Note: If you are interested in studying algorithms for building multiple alignments of genes, then please check out our Bioinformatics Algorithms platform, where we devote two chapters to alignment algorithms.

Once we are equipped with a multiple alignment, we can infer a distance matrix from it in a number of ways. One simple method is to count the number of differing symbols between corresponding columns. The distance matrix resulting from this computation for the toy multiple alignment above is shown in the figure below.

Figure: The distance matrix D accompanying the multiple alignment from the previous figure, in which D[i, j] is computed by taking the number of differing symbols between the sequences in rows i and j of the alignment. The values D[0, 1] and D[1, 0] are highlighted, which are both equal to 3 because the number of differing symbols between the sequences representing “Chimp” and “Human” in rows 0 and 1 (shown in red) is equal to 3.

We now can reformulate our Tree Reconstruction Problem to include a distance matrix, as shown below.

Distance-Based Tree Reconstruction Problem

Input: A distance matrix from a collection of species.

Output: A tree corresponding to the distance matrix that represents the evolutionary relationships between the species.

We have now resolved the concerns about the input of this problem, but we must focus on the output as well. How can a tree “correspond” to a distance matrix? And what exactly is a tree, anyway?

Scroll to Top
Programming for Lovers banner no background
programming for lovers logo cropped

Join our community!

programming for lovers logo cropped
Programming for Lovers banner no background

Join our community!