Data for Trees in Python

Setup

Create a new directory python/src/trees and add a new main.py file there. Copy the following starter code into your file.

from dataclasses import dataclass, field

def main() -> None:
    pass

if __name__ == "__main__":
    main()

Starting trees: idea 1 for representing data

We will be building an algorithm for constructing evolutionary trees in the coming lessons. As a first step, let’s think about how to represent a tree using classes. One natural approach is to define separate classes for nodes, edges, and the tree itself.

@dataclass
class Node:
    label: str = "Unlabeled"
    age: int = 0

@dataclass
class Edge:
    parent: Node
    child: Node

@dataclass
class Tree:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

Let’s construct a small tree in def main() and see how the data structure behaves.

def main() -> None:
    n1 = Node("A", age=5)
    n2 = Node("B", age=10)
    t = Tree(nodes=[n1, n2], edges=[Edge(n1, n2)])

    t.nodes[0] = Node("A'", age=0)  # replace with a new Node object

    print(t.nodes[0])         # Node(label="A'", age=0)
    print(t.edges[0].parent)  # Node(label='A', age=5) -- still the old one!

When we wrote t.nodes[0] = Node("A'", age=0), we replaced the entry in the node list with a brand-new object. But the Edge still holds a reference to the original Node("A", age=5), because we never updated it. The two prints show different nodes, and the data structure is now inconsistent.

But inconsistency is really a symptom of a deeper problem: this data design is redundant. The edges already hold references to the nodes, so the separate nodes list stores information that is already captured elsewhere: the same node appears in two places and can fall out of sync. This echoes what we saw earlier with shapes: just as adding an area field to Rectangle would be redundant (area is always computable from width and height), storing nodes separately when the edges already reference them is unnecessary. Minimalism is king.

Idea 2

Here is another approach: instead of a separate edge list, each node stores direct references to its children. Let’s build the design up in three steps.

Step 1. Our first attempt uses Node directly as the type annotation for child1 and child2.

@dataclass
class Node:
    label: str = "Unlabeled"
    child1: Node
    child2: Node

@dataclass
class Tree:
    nodes: list[Node] = field(default_factory=list)

Running this raises NameError: name 'Node' is not defined before we even reach def main(). Python evaluates the annotation Node immediately as it reads through the class body, but the class is not finished being defined yet, so the name does not exist.

Step 2. Wrapping Node in quotes tells Python to store the annotation as a string and resolve it later, after the class is fully defined.

@dataclass
class Node:
    label: str = "Unlabeled"
    child1: 'Node'
    child2: 'Node'

def main() -> None:
    n = Node(label="Axolotl")

The NameError is gone, but now we hit a different error at class-definition time: TypeError: non-default argument 'child1' follows default argument. Because label has a default value ("Unlabeled"), it becomes an optional parameter in the __init__ that @dataclass generates. But child1 and child2 have no defaults, making them required parameters — and Python does not allow required parameters to follow optional ones. This is the same rule that makes def f(a=1, b) illegal in any Python function.

Step 3. The fix is = None, which makes the children optional, and | None in the annotation, which tells Python that None is a genuinely valid value for these attributes, not just a workaround.

@dataclass
class Node:
    label: str = "Unlabeled"
    child1: 'Node | None' = None
    child2: 'Node | None' = None

@dataclass
class Tree:
    nodes: list[Node] = field(default_factory=list)

def main() -> None:
    n = Node(label="Axolotl")
    print(n)  # Node(label='Axolotl', child1=None, child2=None)

We now have a base case: a leaf node with no children simply leaves child1 and child2 as None. This is the data structure we will use when we implement the UPGMA phylogenetic tree-building algorithm in the next code-along.

Looking ahead

In this lesson, we designed two approaches to representing evolutionary trees in Python and arrived at a clean recursive Node dataclass. In the next code-along, we will put this data structure to use and implement the full UPGMA phylogenetic tree-building algorithm.

Scroll to Top
Programming for Lovers banner no background
programming for lovers logo cropped

Join our community!

programming for lovers logo cropped
Programming for Lovers banner no background

Join our community!