Recursion and Completing our UPGMA Implementation in Python

In this code along, we learn about recursion — a programming technique in which a function calls itself. We use recursion to finish our UPGMA implementation by counting the leaves of a subtree, and then to convert a tree into Newick format, the standard way of representing a phylogenetic tree as a string. Once everything is in place, we run UPGMA on two real biological datasets and visualize the resulting trees.

Recursion

In the previous code-along, we wrote the full UPGMA algorithm but could not run it because we had not yet implemented count_leaves() on our Node class. To do that, we first need to understand recursion.

A recursive function is one that calls itself. At first this sounds circular, but every valid recursive function has two ingredients: a base case that stops the recursion, and an inductive step that reduces the problem to a smaller version of itself.

Create a new file called recursion.py and add the following function, which computes 1 + 2 + ... + n.

def summing_integers(n: int) -> int:
    """
    Recursively compute 1 + 2 + ... + n.
    Args:
        n (int): A non-negative integer.
    Returns:
        int: The sum of all integers from 1 to n inclusive.
    """
    # Base case: the sum of integers up to 0 is 0
    if n == 0:
        return 0
    # Inductive step: assume summing_integers(n-1) is correct,
    # then the answer for n is just n more
    return n + summing_integers(n - 1)
def main():
    print(summing_integers(5))   # 15
    print(summing_integers(10))  # 55
    print(summing_integers(0))   # 0
if __name__ == "__main__":
    main()

Trace through summing_integers(3) by hand. Python will evaluate 3 + summing_integers(2), which requires 2 + summing_integers(1), which requires 1 + summing_integers(0), which hits the base case and returns 0. The results then bubble back up: 1 + 0 = 1, then 2 + 1 = 3, then 3 + 3 = 6.

Exercise: Write two more recursive functions in recursion.py: factorial(n) that computes n! (with factorial(0) = 1), and fibonacci(n) that returns the nth Fibonacci number (with fibonacci(0) = 0 and fibonacci(1) = 1).

Here are the solutions.

def factorial(n: int) -> int:
    """
    Recursively compute n!.
    Args:
        n (int): A non-negative integer.
    Returns:
        int: n factorial.
    """
    if n == 0:
        return 1
    return n * factorial(n - 1)
def fibonacci(n: int) -> int:
    """
    Recursively compute the nth Fibonacci number.
    Args:
        n (int): A non-negative integer.
    Returns:
        int: The nth Fibonacci number.
    """
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)

Try calling fibonacci(40) or fibonacci(50). You will notice that it is very slow. Why? Each call to fibonacci(n) makes two recursive calls, and each of those makes two more, leading to an exponential number of function calls. For example, fibonacci(50) ends up computing fibonacci(1) billions of times. This is sometimes called exponential recursion, and it is one reason why naive recursive implementations of Fibonacci are impractical for large inputs. Fortunately, our count_leaves() function will not have this problem — each node in the tree is visited exactly once.

Counting leaves

Now we are ready to add recursion to our Node class. Open datatypes.py and add two methods: is_leaf(), which returns True when a node has no children, and count_leaves(), which recursively counts the leaves in the subtree rooted at this node.

Let’s think through the logic of count_leaves() before writing code.

count_leaves(node):
    base case: if node is a leaf, return 1
    inductive step: return count_leaves(child1) + count_leaves(child2)

Now add both methods inside the Node class in datatypes.py.

    def is_leaf(self) -> bool:
        """Return True if this node has no children (i.e., it is a leaf)."""
        if self.child1 is None and self.child2 is None:
            return True
        return False
    def count_leaves(self) -> int:
        """
        Count the number of leaf nodes in the subtree rooted at this node.
        Returns:
            int: 1 if this node is a leaf; otherwise the sum of leaf counts
                 of its children.
        """
        # Base case: this node is a leaf
        if self.is_leaf():
            return 1
        # Inductive step: sum leaves under each child
        leaves = 0
        if self.child1 is not None:
            leaves += self.child1.count_leaves()
        if self.child2 is not None:
            leaves += self.child2.count_leaves()
        return leaves

Test count_leaves() in main() by constructing a small tree by hand.

def main():
    leaf1 = Node(label="A")
    leaf2 = Node(label="B")
    leaf3 = Node(label="C")
    internal1 = Node(child1=leaf1, child2=leaf2)
    root = Node(child1=internal1, child2=leaf3)
    print(root.count_leaves())    # 3
    print(internal1.count_leaves())  # 2
    print(leaf1.count_leaves())   # 1

Newick format

To visualize our tree, we need to write it to a file. The standard format for phylogenetic trees is Newick format, which encodes the tree as a nested string of parentheses. Each leaf is represented by its label, and each internal node is represented by a pair of parentheses surrounding a comma-separated list of its children. Branch lengths follow a colon after each child’s subtree.

For example, a tree with leaves A, B, and C — where A and B merge first — looks like this in Newick format.

((A:0.5,B:0.5)Ancestor:1.0,C:1.5);

The branch length after each child is parent.age - child.age, which gives the height of each branch in the tree. The semicolon at the very end is part of the Newick standard.

We implement this as a recursive method on Node called to_newick_ages(). Add it to the Node class in datatypes.py.

    def to_newick_ages(self) -> str:
        """
        Recursively serialize this subtree to Newick format with branch lengths.
        Each branch length is computed as parent.age - child.age.
        Returns:
            str: Newick string for this subtree (no trailing semicolon).
        """
        # Base case: leaf node — just return the label
        if self.is_leaf():
            return self.label
        parts: list[str] = []
        if self.child1 is not None:
            bl1 = max(self.age - self.child1.age, 0.0)
            parts.append(f"{self.child1.to_newick_ages()}:{bl1:.6f}")
        if self.child2 is not None:
            bl2 = max(self.age - self.child2.age, 0.0)
            parts.append(f"{self.child2.to_newick_ages()}:{bl2:.6f}")
        inside = ",".join(parts)
        return f"({inside}){self.label}" if self.label else f"({inside})"

The top-level to_newick() function in io_util.py calls this method on the root (the last node in the tree) and adds the final semicolon.

def to_newick(tree: Tree) -> str:
    """
    Convert an entire tree to a Newick-formatted string.
    Args:
        tree (Tree): The phylogenetic tree (root is at tree[-1]).
    Returns:
        str: The Newick representation of the tree, terminated with a semicolon.
    """
    return f"({tree[-1].to_newick_ages()});"

Running our code

With all the pieces in place, open main.py. It already imports upgma, read_matrix_from_file, and write_newick_to_file. Let’s add a function to build and save the hemoglobin tree.

from functions import upgma
from io_util import read_matrix_from_file, write_newick_to_file
def hemoglobin_upgma() -> None:
    filename = "Data/HBA1/hemoglobin.mtx"
    species_names, mtx = read_matrix_from_file(filename)
    t = upgma(mtx, species_names)
    write_newick_to_file(t, "Output/HBA1", "hba1.tre")
    print("Hemoglobin tree written to Output/HBA1/hba1.tre")
def main() -> None:
    hemoglobin_upgma()
if __name__ == "__main__":
    main()

Run the program. If everything is implemented correctly, you should see the confirmation message and find a .tre file in your output directory.

Hemoglobin tree written to Output/HBA1/hba1.tre

Now add a second function for the SARS-CoV-2 dataset and call both from main().

def sars2_upgma() -> None:
    filename = "Data/SARS2/sars2.mtx"
    species_names, mtx = read_matrix_from_file(filename)
    t = upgma(mtx, species_names)
    write_newick_to_file(t, "Output/SARS2", "sars2.tre")
    print("SARS-CoV-2 tree written to Output/SARS2/sars2.tre")
def main() -> None:
    hemoglobin_upgma()
    sars2_upgma()
if __name__ == "__main__":
    main()

Visualizing our trees

Open Newick.R and run it to visualize the .tre files we just created. The script reads each Newick file and plots the resulting tree. You should see the species grouped in a way that reflects their evolutionary distances — closely related species appear as neighbors in the tree.

Time to practice!

After you have the chance to check your work below, you are ready to apply what you have learned to the chapter’s exercises.

Scroll to Top
Programming for Lovers banner no background
programming for lovers logo cropped

Join our community!

programming for lovers logo cropped
Programming for Lovers banner no background

Join our community!