Data for Trees in Python

In this code along, we will work across five files that together form our UPGMA project. Here is a quick overview of each.

  • datatypes.py: starts blank — we will fill it in during this lesson with our core data type definitions.
  • io_util.py: already contains helper functions for reading a distance matrix from a CSV file and writing a tree to a file in Newick format.
  • main.py: contains starter code that reads a distance matrix from a file, calls our UPGMA function, and writes the resulting tree to a file.
  • functions.py: contains function stubs that we will implement in the next code along.
  • Newick.R: an R script we will use to visualize the trees we produce.

Dataclasses

By now, you have written __init__() and __repr__() methods for every class we’ve defined. Writing these by hand every time is tedious and error-prone — and because they follow a predictable pattern, Python can generate them for us automatically.

Python’s @dataclass decorator, from the standard library's dataclasses module, reads the class-level attribute declarations you write and generates __init__() and __repr__() automatically. The declarations look more like those in other languages: you write the attribute name, its type annotation, and (optionally) a default value at the class level, with no self. assignments inside __init__. Here is how we would rewrite our Rectangle and Circle classes using @dataclass.

import math
from dataclasses import dataclass

@dataclass
class Rectangle:
    """
    Represents a rectangle in 2D space.

    Attributes:
        width (float): The width of the rectangle (default: 1.0).
        height (float): The height of the rectangle (default: 1.0).
        rotation (float): Rotation angle in degrees (default: 0.0).
        x1 (float): X-coordinate of the lower-left corner (default: 0.0).
        y1 (float): Y-coordinate of the lower-left corner (default: 0.0).
    """

    width: float = 1.0
    height: float = 1.0
    rotation: float = 0.0
    x1: float = 0.0
    y1: float = 0.0

    def area(self) -> float:
        """Return the area of the rectangle."""
        return self.width * self.height


@dataclass
class Circle:
    """
    Represents a circle in 2D space.

    Attributes:
        radius (float): The radius of the circle (default: 1.0).
        x1 (float): X-coordinate of the circle center (default: 0.0).
        y1 (float): Y-coordinate of the circle center (default: 0.0).
    """

    radius: float = 1.0
    x1: float = 0.0
    y1: float = 0.0

    def area(self) -> float:
        """Return the area of the circle (pi * r^2)."""
        return math.pi * self.radius ** 2

Let’s test our dataclasses in def main(). Notice that print(r) and print(c) now produce nicely formatted output automatically, with no __repr__ written by hand.

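def main():
    # Custom rectangle (rotation, x1, and y1 keep their defaults)
    r = Rectangle(width=3.0, height=4.0)
    print("Custom rectangle area:", r.area())

    # Custom circle (x1 and y1 keep their defaults)
    c = Circle(radius=5.0)
    print("Custom circle area:", c.area())

    # Printing r and c uses the generated __repr__
    print(r)
    print(c)


if __name__ == "__main__":
    main()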
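
To see what @dataclass saves us, the sketch below places a hand-written class next to its dataclass counterpart. RectangleByHand is a hypothetical name used only for this comparison, and its methods approximate what the decorator generates rather than reproduce the generated code exactly.

```python
from dataclasses import dataclass


class RectangleByHand:
    """Hand-written version (hypothetical name, for comparison only)."""

    def __init__(self, width: float = 1.0, height: float = 1.0,
                 rotation: float = 0.0, x1: float = 0.0, y1: float = 0.0):
        self.width = width
        self.height = height
        self.rotation = rotation
        self.x1 = x1
        self.y1 = y1

    def __repr__(self) -> str:
        return (f"RectangleByHand(width={self.width!r}, height={self.height!r}, "
                f"rotation={self.rotation!r}, x1={self.x1!r}, y1={self.y1!r})")


@dataclass
class Rectangle:
    """Same fields; __init__ and __repr__ are generated for us."""
    width: float = 1.0
    height: float = 1.0
    rotation: float = 0.0
    x1: float = 0.0
    y1: float = 0.0


by_hand = RectangleByHand(width=3.0, height=4.0)
generated = Rectangle(width=3.0, height=4.0)

print(by_hand)
print(generated)  # Rectangle(width=3.0, height=4.0, rotation=0.0, x1=0.0, y1=0.0)

# @dataclass also generates __eq__, so objects with equal fields compare equal.
print(generated == Rectangle(3.0, 4.0))  # True
```

The two classes print in the same format, but the dataclass version needs only the five field declarations.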

Establishing data for trees

Now let’s apply @dataclass to build the data structures we need for UPGMA. We will put all of our type definitions in datatypes.py. Our first definition is a type alias for a distance matrix — a two-dimensional list of floats representing the pairwise distances between species. A type alias simply gives a meaningful name to an existing type so that our function signatures are easier to read.

DistanceMatrix = list[list[float]]
"""A two-dimensional list of floats representing pairwise distances between species."""
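
As a small illustration of the alias in use, the sketch below annotates a made-up three-species matrix and a function parameter with DistanceMatrix. The species names, the distances, and the helper is_valid_distance_matrix are all hypothetical; they are not part of the project files.

```python
DistanceMatrix = list[list[float]]

# Hypothetical distances between three species, in the order given by `species`.
species: list[str] = ["A", "B", "C"]
mtx: DistanceMatrix = [
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 6.0],
    [6.0, 6.0, 0.0],
]


def is_valid_distance_matrix(mtx: DistanceMatrix) -> bool:
    """Return True if mtx is square, symmetric, and zero on the main diagonal."""
    n = len(mtx)
    # Every row must have n entries, and each species is at distance 0 from itself.
    for i in range(n):
        if len(mtx[i]) != n or mtx[i][i] != 0.0:
            return False
    # The distance from i to j must equal the distance from j to i.
    for i in range(n):
        for j in range(i + 1, n):
            if mtx[i][j] != mtx[j][i]:
                return False
    return True


print(mtx[0][2])                      # 6.0, the distance between A and C
print(is_valid_distance_matrix(mtx))  # True
```

The alias does not change how the list behaves at runtime; it only makes the signature of a function such as is_valid_distance_matrix easier to read.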

We also need a type to represent a tree. For our UPGMA implementation, a tree can be stored as a flat list of Node objects, where the root is always the last element. We therefore define Tree as a type alias for list[Node]. Because this alias refers to Node, it must be placed after the Node declaration in the file.

Tree = list[Node]
"""A list of Node objects representing a phylogenetic tree structure."""

Next, we define the Node class using @dataclass. A node in a phylogenetic tree needs to track its numeric identifier, its age (how far it is from the leaves, used as the branch height), a label (the species name for leaf nodes, or an ancestor name for internal nodes), and references to its two children.

from dataclasses import dataclass

@dataclass
class Node:
    """
    Represents a node in a phylogenetic tree.

    Attributes:
        num (int): Numeric identifier for the node (e.g., index in a tree list).
        age (float): Age (or height) of the node, typically half the distance between clusters.
        label (str): Label of the node, usually the species name for leaves.
        child1: The first child node, or None if this node is a leaf.
        child2: The second child node, or None if this node is a leaf.
    """

    num: int = 0
    age: float = 0.0
    label: str = ""
    child1: Node = None
    child2: Node = None

Let’s test our Node class in def main().

def main():
    v = Node(num=2, age=3.0, label="New Node")
    print(v)


if __name__ == "__main__":
    main()

Unfortunately, when we run our code, Python raises an error.

NameError: name 'Node' is not defined. Did you mean: 'None'?

STOP: Why do you think there is an issue here?

The problem is that inside the class body of Node, the name Node has not yet been defined: Python has not finished constructing the class when it encounters child1: Node = None. We are asking Python to use Node as a type annotation before Node exists. This is called a forward reference. Python versions through 3.13 evaluate each annotation as soon as they reach it, so the bare name Node fails here.

To fix this, we import Self from Python’s typing module (available in Python 3.11 and later). Self is a special type that refers to the class in which it appears, so we never have to name Node inside its own body. We annotate child1 and child2 as Self | None, which says: this field holds either an instance of this same class, or None. The | None part uses Python’s union type syntax, and = None sets the default value to None, so every new node is a leaf by default.

from dataclasses import dataclass
from typing import Self

@dataclass
class Node:
    """
    Represents a node in a phylogenetic tree.

    Attributes:
        num (int): Numeric identifier for the node (e.g., index in a tree list).
        age (float): Age (or height) of the node, typically half the distance between clusters.
        label (str): Label of the node, usually the species name for leaves.
        child1 (Self | None): The first child node, or None if this node is a leaf.
        child2 (Self | None): The second child node, or None if this node is a leaf.
    """

    num: int = 0
    age: float = 0.0
    label: str = ""
    child1: Self | None = None
    child2: Self | None = None
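
Self is one way around the forward reference. Another is to write the annotation as a string, which Python stores without evaluating it. The sketch below shows that alternative for comparison; it is not what this lesson's datatypes.py uses.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Same fields as above, but naming the class inside a quoted annotation."""
    num: int = 0
    age: float = 0.0
    label: str = ""
    child1: "Node | None" = None  # quoted, so Python does not look up Node here
    child2: "Node | None" = None


v = Node(num=2, age=3.0, label="New Node")
print(v)  # Node(num=2, age=3.0, label='New Node', child1=None, child2=None)
```

Both spellings give a type checker the same information. Self has the advantage that the annotation does not need to change if the class is renamed.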

Looking ahead

We now have all the data structures we need: a Node class that can represent both leaves and internal nodes of a phylogenetic tree, a DistanceMatrix type alias for the pairwise distance data we will read from files, and a Tree type alias for the list of nodes that our algorithm will build. In the next code along, we will implement the UPGMA algorithm itself.
