Looking for hidden messages in multiple genomes

We should not jump to the conclusion that "ATGATCAAG"/"CTTGATCAT" is a hidden message for all bacterial genomes without first checking whether it even appears in known ori regions from other bacteria. After all, maybe the clumping effect of "ATGATCAAG"/"CTTGATCAT" in the ori region of Vibrio cholerae is simply a statistical fluke that has nothing to do with replication. Or maybe different bacteria have different DnaA boxes . . .

Let’s check the proposed ori region of Thermotoga petrophila, a bacterium that thrives in extremely hot environments; its name derives from its discovery in the water beneath oil reservoirs, where temperatures can exceed 80° Celsius.


This region does not contain a single occurrence of "ATGATCAAG" or "CTTGATCAT"! Thus, different bacteria may use different DnaA boxes as “hidden messages” to the DnaA protein.

Applying our algorithm for finding frequent words to the ori region above reveals that the following six 9-mers appear in this region 3 or more times:


Something peculiar must be happening because it is extremely unlikely that six different 9-mers will occur so frequently within a short region in a random string. We will cheat a little and consult Ori-Finder, a software tool for finding replication origins in DNA sequences. This software chooses "CCTACCACC" (along with its reverse complement "GGTGGTAGG") as a working hypothesis for the DnaA box in Thermotoga petrophila. Together, these two complementary 9-mers appear five times in the replication origin:


The Clump Finding Problem

We now have a good sense of how to solve our first big scientific question of identifying the hidden message in a given replication origin indicating that replication should begin. Yet it remains to answer our second, larger question, which is where this replication origin lives within a much longer bacterial genome consisting of millions of nucleotides.

Imagine that you are trying to find ori in a newly sequenced bacterial genome. Searching for “clumps” of "ATGATCAAG"/"CTTGATCAT" or "CCTACCACC"/"GGTGGTAGG" is unlikely to help, since this new genome may use a completely different hidden message! Before we lose all hope, let’s change our computational focus: instead of finding clumps of a specific k-mer, let’s try to find every k-mer that forms a clump in the genome. Hopefully, the locations of these clumps will shed light on the location of ori.

Our plan is to slide a window of fixed length L along the genome, looking for a region where a k-mer appears several times in short succession. The parameter value L = 500 reflects the typical length of ori in bacterial genomes.

More formally, given integers L and t, a k-mer pattern forms an (L, t)- clump inside a (longer) string genome if there is an interval of genome of length L in which this k-mer appears at least t times. (This definition assumes that the k-mer completely fits within the interval.) For example, "TGCA" forms a (25,3)-clump in the following string:


From our previous examples of ori regions, we say that "ATGATCAAG" forms a (500,3)-clump in the Vibrio cholerae genome, and that "CCTACCACC" forms a (500,3)-clump in the Thermotoga petrophila genome. We are now ready to formulate the following problem.

Clump Finding Problem

Input: A string text, and integers k, L, and t.

Output: All distinct k-mers forming (L, t)-clumps in text.

The Clump Finding Problem is a more complex problem than we have encountered thus far, and writing a function solving it from scratch would be difficult. However, this is where modularity in writing programs is so helpful. We already have a FrequencyTable() function that will produce a frequency table for a given window of a string of length L. If we apply it to a given window, then we simply need to check if there are any string keys in the table whose values are at least equal to t. We will append any such keys that we have not already seen in some other window of text to a growing list of strings. In the end, this list of strings will contain the (L, t)-clumps of text. This is handled by the following FindClumps() function.

FindClumps(text, k, L, t)
    patterns ← an array of strings of length 0
    n ← length(text)
    for every integer i between 0 and n − L
        window ← text[i, i + L]
        freqMap ← FrequencyTable(window, k)
        for every key s in freqMap
            if freqMap[s] ≥ t and Contains(patterns, s) = false
                patterns ← append(patterns, s)
    return patterns

FindClumps() relies on a subroutine Contains(), which takes as input an array of strings patterns and a string s. This function returns true if s is contained in patterns and false if it is not. It works by ranging over every string in patterns; if it finds a match, then it returns true, and if it makes it through every string in patterns without finding a match, it returns false by default.

Contains(patterns, s)
    for every string pattern in patterns
        if s = pattern
            return true
    return false

Let’s use FindClumps() to look for clumps in the Escherichia coli (E. coli) genome, the workhorse of bacterial genomics. We find hundreds of different 9-mers forming (500, 3)-clumps in the E. coli genome, and it is absolutely unclear which of these 9-mers might represent a DnaA box in the bacterium’s ori region.

At this point, an unseasoned researcher might give up, since it appears that we do not have enough information to locate ori in E. coli. But a veteran computational biologist would dig into the biological details of replication in the hope that they provide new algorithmic insights into finding ori.


Love P4❤️? Join us and help share our journey!

Page Index