Finding a replication origin in a bacterial genome

When a computer scientist looks at a strand of DNA, they see that it can be represented by the order of its nucleotides. To represent a strand of DNA, we will use a string, an ordered sequence of symbols joined into a contiguous “word”. Symbols and strings are built-in variable types in most programming languages, just like integers, decimal numbers, and boolean variables.

We also note that if we know one strand of DNA, then we will automatically know the complementary strand because of base pairing: adenine always pairs with thymine, and cytosine always pairs with guanine. As a result, we only need one string to represent a double-stranded DNA molecule. To be precise, we use the term DNA string to refer to a string of nucleotides from the four-letter alphabet {A,C,G,T}.
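The base-pairing rule above can be sketched in a few lines of Python. This is only an illustration, not code from the chapter: the names `COMPLEMENT`, `is_dna_string`, `complement`, and `reverse_complement` are hypothetical. The sketch also relies on one fact the text does not state: the two strands run in opposite directions, so the complementary strand is conventionally written by reversing the string as well as swapping each letter.

```python
# Hypothetical helper names; a minimal sketch of the base-pairing rule.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_dna_string(text: str) -> bool:
    """Return True if every symbol of text is one of A, C, G, T."""
    return all(symbol in COMPLEMENT for symbol in text)


def complement(dna: str) -> str:
    """Swap each nucleotide for its pairing partner, keeping the order."""
    if not is_dna_string(dna):
        raise ValueError("expected a string over the alphabet {A, C, G, T}")
    return "".join(COMPLEMENT[symbol] for symbol in dna)


def reverse_complement(dna: str) -> str:
    """Return the complementary strand written in its own reading direction."""
    return complement(dna)[::-1]
```

For example, `reverse_complement("AAAACCCGGT")` returns `"ACCGGGTTTT"`. Because either strand can be recovered from the other in this way, storing one string is enough.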

Determining the DNA string making up an organism’s genome, known as genome sequencing, is a very challenging problem in its own right. (See Chapter 3 of Bioinformatics Algorithms if you are interested.) Sequencing the genome of an organism like E. coli, whose genome consists of 4.6 million nucleotides, was a substantial achievement in 1997. Three years later came the first draft of a human genome (3 billion nucleotides), which cost about $3 billion and led to a boom in vastly cheaper sequencing technologies.

Still, a computer scientist might not imagine that DNA replication has any computational interest — we only need to take a string corresponding to a DNA strand and return two copies of it! Yet if we take the time to review the underlying biological process, we will be amazed at the complex symphony coordinating genome replication, as well as how computation can help us answer biological questions about replication.

Throughout this chapter, we will consider bacterial genomes, which consist of a single circular chromosome. Bacterial genome replication begins in a single genomic region called the replication origin (denoted ori) and is performed by molecular copy machines called DNA polymerases that attach free-floating nucleotides to the growing strand of DNA, in keeping with the semiconservative hypothesis.
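Because a bacterial chromosome is circular, a region of interest may straddle the point where a linear string representation happens to start and end. The sketch below shows one way to read such a region from an ordinary string. The function name `circular_window` and its parameters are assumptions made for illustration, not part of the chapter.

```python
def circular_window(genome: str, start: int, length: int) -> str:
    """Return `length` symbols of a circular genome, beginning at `start`.

    The genome is stored as a linear string, so a window that runs past
    the end of the string wraps around to the beginning.
    """
    n = len(genome)
    if n == 0:
        raise ValueError("genome must be non-empty")
    if not 0 <= length <= n:
        raise ValueError("length must be between 0 and len(genome)")
    start %= n  # allow any integer position, including negative ones
    end = start + length
    if end <= n:
        return genome[start:end]
    # The window crosses the end of the string: join the tail and the head.
    return genome[start:] + genome[: end - n]
```

For example, `circular_window("ACGTTG", 4, 4)` returns `"TGAC"`: the last two symbols followed by the first two.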

Our goal, then, is to determine where ori is hiding in the genome of a bacterium like E. coli, which consists of around 4.6 million nucleotides. As we did in Chapter 0, we could state this as a problem in terms of input and output.

Origin of Replication Problem

Input: A DNA string genome.

Output: The location of ori in genome.

To a laboratory biologist, the Origin of Replication Problem has a straightforward solution: hack out one short segment from the genome at a time until we find a region where replication is disrupted. However, these types of “knockout” experiments take time to design and implement, and they are not guaranteed to be accurate. What happens, for instance, if we find several regions of the genome whose deletion disrupts replication?

Yet a computer scientist shakes their head and points out that we don’t have a clearly defined computational problem because we haven’t defined precisely what qualifies a region as ori. In other words, what characterizes the replication origin precisely enough that we could program a computer to find it?

Finding hidden messages in a known replication origin

Our plan for finding ori in bacterial genomes is to begin with a bacterium in which the location of ori has been found experimentally, and then to determine what makes this genomic region special so that we can design a computational approach for finding ori in other bacteria. The species we will use, whose ori has been verified experimentally, is Vibrio cholerae, the bacterium that causes cholera. The nucleotide sequence appearing in its ori is shown below:


How does the bacterial cell know to begin replication exactly in this short region within the much larger Vibrio cholerae genome, which consists of 1,108,250 nucleotides? There must be some “hidden message” in the ori region telling the cell, “Begin replication here!” The question is how to find this hidden message without knowing what it looks like in advance.

Hidden Message Problem

Input: A string text (representing the replication origin of a genome).

Output: A hidden message in text.

Unfortunately, the Hidden Message Problem is still not a computational problem. The notion of a “hidden message” may make sense in human language, but it has not been defined precisely in terms of concepts that we can use to program a computer.

We now have two biological problems to solve as the central goals of this chapter.

  1. Given a shorter ori within a longer genome, what is the hidden message indicating that replication should start in this region?
  2. Given a bacterial genome, where is ori?

We hope to formulate these two biological problems as computational problems and to develop algorithms that can quickly and accurately find where ori is lurking in the genomes of thousands of bacterial species, along with identifying the hidden message making each ori special. In so doing, we will give you a glimpse into the rapidly growing field of computational biology, in which computers are answering questions about biology that we could once only have dreamed of answering experimentally.

