In this code along, we will work across five files that together form our UPGMA project. Here is a quick overview of each.
- datatypes.py: starts blank — we will fill it in during this lesson with our core data type definitions.
- io_util.py: already contains helper functions for reading a distance matrix from a CSV file and writing a tree to a file in Newick format.
- main.py: contains starter code that reads a distance matrix from a file, calls our UPGMA function, and writes the resulting tree to a file.
- functions.py: contains function stubs that we will implement in the next code along.
- Newick.R: an R script we will use to visualize the trees we produce.
Dataclasses
By now, you have written __init__() and __repr__() methods for every class we’ve defined. Writing these by hand every time is tedious and error-prone — and because they follow a predictable pattern, Python can generate them for us automatically.
Python’s @dataclass decorator, from the standard library's dataclasses module, reads the class-level attribute declarations you write and generates __init__() and __repr__() automatically. These declarations resemble field declarations in other languages: you write the attribute name, its type annotation, and optionally a default value at the class level, without any self. assignments inside __init__(). Here is how we would rewrite our Rectangle and Circle classes using @dataclass.
import math
from dataclasses import dataclass


@dataclass
class Rectangle:
    """
    Represents a rectangle in 2D space.

    Attributes:
        width (float): The width of the rectangle (default: 1.0).
        height (float): The height of the rectangle (default: 1.0).
        rotation (float): Rotation angle in degrees (default: 0.0).
        x1 (float): X-coordinate of the lower-left corner (default: 0.0).
        y1 (float): Y-coordinate of the lower-left corner (default: 0.0).
    """
    width: float = 1.0
    height: float = 1.0
    rotation: float = 0.0
    x1: float = 0.0
    y1: float = 0.0

    def area(self) -> float:
        """Return the area of the rectangle."""
        return self.width * self.height


@dataclass
class Circle:
    """
    Represents a circle in 2D space.

    Attributes:
        radius (float): The radius of the circle (default: 1.0).
        x1 (float): X-coordinate of the circle center (default: 0.0).
        y1 (float): Y-coordinate of the circle center (default: 0.0).
    """
    radius: float = 1.0
    x1: float = 0.0
    y1: float = 0.0

    def area(self) -> float:
        """Return the area of the circle."""
        # Use math.pi rather than the rough approximation 3.0.
        return math.pi * self.radius ** 2
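To make concrete what the decorator writes for us, the sketch below places a hand-written class next to its @dataclass counterpart. The two Point classes are hypothetical stand-ins and are not part of the project files, and the hand-written methods only approximate what the decorator generates.

```python
from dataclasses import dataclass


class PointByHand:
    """Hypothetical example class with hand-written boilerplate."""

    def __init__(self, x: float = 0.0, y: float = 0.0):
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        return f"PointByHand(x={self.x!r}, y={self.y!r})"


@dataclass
class Point:
    """Same fields; __init__ and __repr__ are generated by @dataclass."""
    x: float = 0.0
    y: float = 0.0


by_hand = PointByHand(1.0, 2.0)
generated = Point(1.0, 2.0)

print(by_hand)    # PointByHand(x=1.0, y=2.0)
print(generated)  # Point(x=1.0, y=2.0)
```

The dataclass version states each field once, while the hand-written version repeats every field name in the parameter list, in the assignments, and again in __repr__().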
Let’s test our dataclasses in def main(). Notice that print(r) and print(c) now produce nicely formatted output automatically, with no __repr__ written by hand.
def main():
    # Custom rectangle
    r = Rectangle(width=3.0, height=4.0)
    print("Custom rectangle area:", r.area())

    # Circle with a custom radius and the default center
    c = Circle(radius=5.0)
    print("Custom circle area:", c.area())

    # Printing r and c uses the generated __repr__
    print(r)
    print(c)
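For reference, here is a small self-contained sketch of what the generated output looks like. It repeats only the fields of Rectangle (the area method is left out) so that it runs on its own. It also shows that @dataclass generates an __eq__() method by default, which compares two objects field by field.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: float = 1.0
    height: float = 1.0
    rotation: float = 0.0
    x1: float = 0.0
    y1: float = 0.0


# Fields we do not pass keep their declared defaults.
r = Rectangle(width=3.0, height=4.0)
print(r)
# Rectangle(width=3.0, height=4.0, rotation=0.0, x1=0.0, y1=0.0)

# Positional arguments follow the order in which the fields are declared.
same = Rectangle(3.0, 4.0)

# The generated __eq__ compares field values, not object identity.
print(r == same)  # True
print(r is same)  # False
```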
Establishing data for trees
Now let’s apply @dataclass to build the data structures we need for UPGMA. We will put all of our type definitions in datatypes.py. Our first definition is a type alias for a distance matrix — a two-dimensional list of floats representing the pairwise distances between species. A type alias simply gives a meaningful name to an existing type so that our function signatures are easier to read.
DistanceMatrix = list[list[float]]
"""A two-dimensional list of floats representing pairwise distances between species."""
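As a rough illustration of how the alias makes a signature easier to read, the sketch below uses DistanceMatrix as a parameter type. The function is_valid_distance_matrix and the sample values are hypothetical and are not part of the project files. The check assumes that a distance matrix should be square, have zeros on the diagonal, and be symmetric.

```python
DistanceMatrix = list[list[float]]


def is_valid_distance_matrix(mtx: DistanceMatrix) -> bool:
    """Hypothetical check: square, zero diagonal, and symmetric."""
    n = len(mtx)
    # Every row must have as many entries as there are rows.
    for row in mtx:
        if len(row) != n:
            return False
    for i in range(n):
        # The distance from a species to itself is zero.
        if mtx[i][i] != 0.0:
            return False
        # The distance from i to j must equal the distance from j to i.
        for j in range(i + 1, n):
            if mtx[i][j] != mtx[j][i]:
                return False
    return True


# Made-up distances between three species.
example: DistanceMatrix = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]

print(is_valid_distance_matrix(example))  # True
```

At runtime the alias is the same object as list[list[float]]; it adds no checking of its own and exists for readers and type checkers.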
We also need a type to represent a tree. For our UPGMA implementation, a tree can be stored as a flat list of Node objects, where the root is always the last element. We therefore define Tree as a type alias for list[Node]. Because this alias refers to Node, it must be placed after the Node declaration in the file.
Tree = list[Node]
"""A list of Node objects representing a phylogenetic tree structure."""
Next, we define the Node class using @dataclass. A node in a phylogenetic tree needs to track its numeric identifier, its age (how far it is from the leaves, used as the branch height), a label (the species name for leaf nodes, or an ancestor name for internal nodes), and references to its two children.
from dataclasses import dataclass


@dataclass
class Node:
    """
    Represents a node in a phylogenetic tree.

    Attributes:
        num (int): Numeric identifier for the node (e.g., index in a tree list).
        age (float): Age (or height) of the node, typically half the distance between clusters.
        label (str): Label of the node, usually the species name for leaves.
        child1: The first child node, or None if this node is a leaf.
        child2: The second child node, or None if this node is a leaf.
    """
    num: int = 0
    age: float = 0.0
    label: str = ""
    child1: Node = None
    child2: Node = None
Let’s test our Node class in def main().
def main():
    v = Node(num=2, age=3, label="New Node")
    print(v)
Unfortunately, when we run our code, Python raises an error.
NameError: name 'Node' is not defined. Did you mean: 'None'?
STOP: Why do you think there is an issue here?
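The problem is that inside the class body of Node, the name Node does not exist yet: Python binds the name only after the whole class body has finished running, but it evaluates the annotation in child1: Node = None as soon as it reaches that line. We are asking Python to use Node before Node exists. An annotation that names something not yet defined is called a forward reference, and a bare (unquoted) forward reference fails in Python versions that evaluate annotations eagerly, which is what produces the NameError above. Common workarounds are to write the annotation as the string "Node" or to add from __future__ import annotations at the top of the file; here we will use a third option.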
To fix this, we import Self from Python’s typing module (available in Python 3.11 and later). Self is a special type that refers to the enclosing class, so we never have to write the name Node inside its own body, and the forward reference problem disappears. We annotate child1 and child2 as Self | None, which says: this field holds either an instance of this same class, or None. The | None part uses Python’s union type syntax (Python 3.10 and later), and = None sets the default value to None, so every new node is a leaf by default.
from dataclasses import dataclass
from typing import Self


@dataclass
class Node:
    """
    Represents a node in a phylogenetic tree.

    Attributes:
        num (int): Numeric identifier for the node (e.g., index in a tree list).
        age (float): Age (or height) of the node, typically half the distance between clusters.
        label (str): Label of the node, usually the species name for leaves.
        child1 (Self | None): The first child node, or None if this node is a leaf.
        child2 (Self | None): The second child node, or None if this node is a leaf.
    """
    num: int = 0
    age: float = 0.0
    label: str = ""
    child1: Self | None = None
    child2: Self | None = None
Looking ahead
We now have all the data structures we need: a Node class that can represent both leaves and internal nodes of a phylogenetic tree, a DistanceMatrix type alias for the pairwise distance data we will read from files, and a Tree type alias for the list of nodes that our algorithm will build. In the next code along, we will implement the UPGMA algorithm itself.