Solving the Minimum Skew Problem now provides us with an approximate location of ori at position 3923620 in E. coli. It is remarkable that such a simple analysis of the nucleotide frequencies in a genome would lead us to such a precise biological hypothesis.

In an attempt to confirm this hypothesis, let’s look for a hidden message representing a potential DnaA box near this location. Solving the Frequent Words Problem in a window of length 500 starting at position 3923620 (shown below) reveals no 9-mers (along with their reverse complements) that appear three or more times! Even if we have located ori in E. coli, it appears that we still have not found the DnaA boxes that jump-start replication in this bacterium …


Before we give up, let’s examine the ori of Vibrio cholerae one more time to see if it provides us with any insights on how to alter our algorithm to find DnaA boxes in E. coli. You may have noticed that in addition to the three occurrences of "ATGATCAAG" and three occurrences of its reverse complement "CTTGATCAT", the Vibrio cholerae ori contains additional occurrences of "ATGATCAAC" and "CATGATCAT", which differ from "ATGATCAAG" and "CTTGATCAT" in only a single nucleotide:


Finding eight approximate occurrences of our target 9-mer and its reverse complement in a short region is even more statistically surprising than finding the six exact occurrences of "ATGATCAAG" and its reverse complement "CTTGATCAT" that we stumbled upon in the beginning of our investigation. Furthermore, the discovery of these approximate 9-mers makes sense biologically, since DnaA can bind not only to “perfect” DnaA boxes but to their slight variations as well.
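To make the idea of an "approximate occurrence" concrete, here is a minimal Python sketch that counts the positions where a k-mer appears with at most d mismatches. The function names (`hamming_distance`, `reverse_complement`, `approximate_count`) are our own illustrative choices, and the sketch assumes the text contains only the letters A, C, G, and T.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(p, q))


def reverse_complement(pattern):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(pattern))


def approximate_count(text, pattern, d):
    """Count windows of text within Hamming distance d of pattern."""
    k = len(pattern)
    return sum(
        1
        for i in range(len(text) - k + 1)
        if hamming_distance(text[i:i + k], pattern) <= d
    )


# The two variant 9-mers mentioned above each differ from a DnaA box
# (or its reverse complement) in exactly one nucleotide.
print(hamming_distance("ATGATCAAG", "ATGATCAAC"))  # 1
print(hamming_distance("CTTGATCAT", "CATGATCAT"))  # 1
print(reverse_complement("ATGATCAAG"))             # CTTGATCAT
```

To score a 9-mer together with its reverse complement, one could add `approximate_count(text, pattern, 1)` and `approximate_count(text, reverse_complement(pattern), 1)`.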

Let’s cross our fingers and identify the most frequent 9-mers (with 1 mismatch and reverse complements) within a window of length 500 starting at position 3923620 of the E. coli genome. Bingo! The experimentally confirmed DnaA box in E. coli ("TTATCCACA") is a most frequent 9-mer with 1 mismatch, along with its reverse complement "TGTGGATAA":


We were fortunate that the DnaA boxes of E. coli are captured in the window that we chose. Moreover, while "TTATCCACA" represents a most frequent 9-mer with 1 mismatch and reverse complements in this 500-nucleotide window, it is not the only one: "GGATCCTGG", "GATCCCAGC", "GTTATCCAC", "AGCTGGGAT", and "CTGGGATCA" also appear four times with 1 mismatch and reverse complements.
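The search described above can be sketched in Python as follows. This is one possible approach, not necessarily the implementation used to produce the results in this section: for every k-mer in the window it generates all strings within d mismatches (its "neighbors") and credits both each neighbor and that neighbor's reverse complement. The function names are hypothetical, and the sketch assumes k is at least 1 and the text contains only A, C, G, and T.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(pattern):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


def neighbors(pattern, d):
    """All DNA strings within Hamming distance d of pattern."""
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    for suffix in neighbors(pattern[1:], d):
        mismatches = sum(a != b for a, b in zip(pattern[1:], suffix))
        if mismatches < d:
            # Room for one more mismatch: any first letter is allowed.
            result.update(base + suffix for base in "ACGT")
        else:
            # Mismatch budget is spent: the first letter must match.
            result.add(pattern[0] + suffix)
    return result


def frequent_words_mismatches_rc(text, k, d):
    """Most frequent k-mers, counting up to d mismatches and reverse complements."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        for neighbor in neighbors(text[i:i + k], d):
            counts[neighbor] += 1                      # approximate match of neighbor
            counts[reverse_complement(neighbor)] += 1  # match of its reverse complement
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)


print(frequent_words_mismatches_rc("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ['ACAT', 'ATGT']
```

Assuming the E. coli genome has been loaded into a string named `genome` (a hypothetical variable) and that positions are 0-based, the window analyzed in this section would correspond to `frequent_words_mismatches_rc(genome[3923620:3923620 + 500], 9, 1)`.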

STOP: Every time we find ori, we seem to find some other surprisingly frequent 9-mers. Why do you think this is?

We do not know what purpose — if any — these other 9-mers serve in the E. coli genome, but we do know that there are many different types of hidden messages in genomes. These hidden messages have a tendency to cluster within a genome, and most of them have nothing to do with replication. However, even providing biologists with a small collection of 9-mers as candidate DnaA boxes is a great aid as long as one of these 9-mers is correct.

The moral is that existing approaches to ori prediction remain imperfect and sometimes inconclusive. Furthermore, even though computational predictions can be powerful, computational biologists should still collaborate with experimental biologists where possible. Yet at the same time, it is clear that a computational revolution has struck biology: in an era in which we are bombarded by large-scale biological datasets, computational methods are already helping to answer some of the big unresolved questions of this field.

It’s time to code

Now that we have learned more about string algorithms and shown how we can apply them to analyze bacterial genomes, let’s implement what we have learned in a specific language.

