Learning Objectives
In the previous code along, we used dictionaries to implement a frequency table in order to find the most frequent words of given length in a text. Now that we are comfortable with dictionaries, we will return to the problem of finding clumps in a genome, a problem that can help us scan the genome for areas that have surprisingly frequent k-mers, which may imply that these k-mers serve as “hidden messages” (i.e., that they are involved in protein-DNA binding).
Recall from the core text that we say that a k-mer pattern forms an (L, t)-clump inside a (longer) string genome if there is an interval of genome of length L in which this k-mer appears at least t times. We now restate the Clump Finding Problem.
Clump Finding Problem
Input: A string text, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in text.
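The definition above can be checked directly for a single k-mer. The following is a minimal sketch rather than part of the code along itself; the name forms_clump() is a hypothetical helper introduced only for illustration, and it assumes that overlapping occurrences are counted.

```python
def forms_clump(genome: str, pattern: str, L: int, t: int) -> bool:
    """Return True if pattern occurs at least t times in some length-L interval of genome.

    Hypothetical helper for illustration; overlapping occurrences are counted.
    """
    k = len(pattern)
    # slide an interval of length L across genome
    for start in range(len(genome) - L + 1):
        window = genome[start:start + L]
        # count the starting positions in this interval at which pattern occurs
        count = sum(1 for j in range(L - k + 1) if window[j:j + k] == pattern)
        if count >= t:
            return True
    return False


# "AA" occurs three times in the interval "AAAA", so this prints True
print(forms_clump("AAAACGTCGAAAAA", "AA", 4, 2))
# the two copies of "CG" never fit in one interval of length 4, so this prints False
print(forms_clump("AAAACGTCGAAAAA", "CG", 4, 2))
```

Checking every k-mer this way would be slow, which is why the algorithm below builds one frequency table per window instead.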
We then introduced pseudocode for an algorithm called FindClumps(), which ranges over every substring window of length L of a given string text. For each such string window, it builds a frequency table for all substrings of length k. Any substrings in this table occurring t or more times are clumps, and we will add them to some list patterns if they do not already occur in patterns.
FindClumps(text, k, L, t)
    patterns ← an array of strings of length 0
    n ← length(text)
    for every integer i between 0 and n − L
        window ← text[i, i + L]
        freqMap ← FrequencyTable(window, k)
        for every key s in freqMap
            if freqMap[s] ≥ t and Contains(patterns, s) = false
                patterns ← append(patterns, s)
    return patterns

Contains(patterns, s)
    for every string pattern in patterns
        if s = pattern
            return true
    return false
In this lesson, we will implement FindClumps() along with the Contains() subroutine that identifies whether a given list of strings patterns contains a string s. We will then apply FindClumps() to the E. coli genome to identify all short strings that occur frequently in short regions and may serve as hidden messages to the cell.
Code along summary
Setup
Create a folder called clumps in your python/src directory and create a text file called main.py in the python/src/clumps folder. We will edit main.py, which should have the following starter code.
def main():
    print("Clumps with Python!")


if __name__ == "__main__":
    main()
A first implementation of finding clumps
We begin with implementing find_clumps(). We start with some parameter checking. We will also change the window length L from the pseudocode into a more descriptive parameter window_length.
def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    # to fill in
Because we don’t know the number of strings that will form clumps in advance, we first create a list of strings patterns of length zero, to which we will append any clumps that we find. We will eventually return patterns.
def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # To fill in...

    return patterns
We next will fill in the meat of the find_clumps() function. To do so, we will need to range over every possible “window” substring of text having length window_length, forming the frequency table for each such window. We know from our work with ranging over all substrings of a text of given length that the starting positions of all these windows will range from 0 to n - window_length, inclusively, where n is the length of text. We can now fill in the beginning of our for loop.
def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(0, n - window_length + 1):
        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

        # To fill in...

    return patterns
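The code above calls frequency_table(), which comes from the previous code along and is not repeated here. For readers without that file at hand, the following is a minimal sketch of what such a function might look like, assuming that it returns a dictionary mapping each k-mer of its input to the number of times that k-mer occurs (overlaps included). Your own version from the previous code along may differ in its details.

```python
def frequency_table(text: str, k: int) -> dict[str, int]:
    """Map each k-mer of text to its number of (possibly overlapping) occurrences.

    Assumed stand-in for the helper written in the previous code along.
    """
    freq_map: dict[str, int] = {}
    n = len(text)
    for i in range(n - k + 1):
        pattern = text[i:i + k]
        # a k-mer seen for the first time starts at 0; then count this occurrence
        freq_map[pattern] = freq_map.get(pattern, 0) + 1
    return freq_map


print(frequency_table("BANANA", 3))  # {'BAN': 1, 'ANA': 2, 'NAN': 1}
```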
We now will consider what to do once we have formed the frequency table of a given window. We want to range through this dictionary freq_map and grab all k-mer keys whose corresponding integer values are at least equal to t, the threshold value for considering a string to be a clump. If we find such a string, and it does not already occur in patterns, then we should append this string to patterns. We now have completed our find_clumps() function.
def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(0, n - window_length + 1):
        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

        # what occurs frequently (i.e., t or more times)?
        for s, val in freq_map.items():
            # append s to patterns if s occurs frequently and s doesn't already appear in patterns
            if val >= t and not contains(patterns, s):
                patterns.append(s)

    return patterns
As for contains(), this type of subroutine might start feeling comfortable now. To implement this function, we range over the values of the list patterns. If we ever find a value that is equal to our query string s, then we return True, and if we range over all values of patterns without finding any value that is equal to s, then we return False.
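Written as an explicit loop, the subroutine might look like the following sketch, which mirrors the Contains() pseudocode from earlier in this lesson.

```python
def contains(patterns: list[str], s: str) -> bool:
    """Return True if s equals some element of patterns, and False otherwise."""
    for pattern in patterns:
        if s == pattern:
            # we found s, so there is no need to look any further
            return True
    # we ranged over every element without finding s
    return False


print(contains(["AA", "CG"], "CG"))  # True
print(contains(["AA", "CG"], "TT"))  # False
```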
However, we don’t need to write contains() at all, because in the previous code along, we introduced the in operator to check whether an element is a key within a dictionary. In this case, we can use the same operator to determine if an element is a value in a list (in this case, we are checking if it is not in the list). This allows us to update find_clumps() as follows, without using a contains() subroutine.
def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(0, n - window_length + 1):
        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

        # what occurs frequently (i.e., t or more times)?
        for s, val in freq_map.items():
            # append s to patterns if s occurs frequently and s doesn't already appear in patterns
            if val >= t and s not in patterns:
                patterns.append(s)

    return patterns
Running the clump finding algorithm
We would like to apply find_clumps() to a real genome. First, let’s apply it to a smaller sample dataset just to make sure that everything appears to be in order. Using the following dataset, we should print only a single occurrence of "AA".
def main():
    print("Clumps with Python!")

    text = "AAAACGTCGAAAAA"
    k = 2
    window_length = 4
    t = 2

    print(find_clumps(text, k, window_length, t))  # Should print ['AA'].
STOP: Open a terminal, and navigate to the clumps directory using the command cd python/src/clumps. Then, run your code using the command python3 main.py (on macOS) or python main.py (on Windows).
Applying the clump finding algorithm to a bacterial genome, and installing modules
We are ready to scale up our work to a real bacterial genome to see how many strings appear surprisingly frequently in short regions of this genome. We will choose E. coli, the most commonly studied bacterium. You can view the E. coli genome as a .txt file at the Bioinformatics Algorithms website. However, the genome has over 4.6 million nucleotides, and copying it into main() would pose a struggle. Instead, let’s write some code to read this genome from the URL over the internet.
First, however, we need to install the requests module, which will allow us to access a web page’s contents. To do so, we will use pip, the Python package installer, which is included in the installation of Python. Open a command-line terminal and execute pip3 install requests (macOS/Linux) or pip install requests (Windows).
Then, include the following import at the top of main.py.
import requests
We next define the address that we wish to access from above as a string url, and we pass this string into a function from the requests module called get() that attempts to access the address. This function returns a “response” object (we will say more about objects later in this course).
def main():
    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
    response = requests.get(url)
We next ensure that there was no issue with opening the web page. To do so, we call a function raise_for_status() that will give an error if there was any problem. If not, we proceed to the next line, where we access response.text, which is a string representing the content of the web page; this string is exactly what we want, and so we declare a variable genome equal to this string.
def main():
    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
    response = requests.get(url)
    response.raise_for_status()  # give us an error if there was a problem

    genome = response.text
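One detail worth checking before using a downloaded string: a text file often ends with a newline or contains line breaks, and any such character would be counted by len() and could end up inside a k-mer. We have not verified whether this particular file contains whitespace, so the following is only a defensive sketch; clean_genome() is a hypothetical helper, not part of the code along.

```python
def clean_genome(raw: str) -> str:
    """Remove all whitespace (spaces, tabs, line breaks) from a downloaded sequence.

    Hypothetical helper for illustration.
    """
    # str.split() with no arguments splits on any run of whitespace
    return "".join(raw.split())


print(clean_genome("ACGT\nAC GT\n"))  # ACGTACGT
```

If you choose to use it, one option is to set genome = clean_genome(response.text) in main().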
We now will print the length of genome to ensure that it has been read in, which will show that it has over 4.6 million nucleotides. After we have done so, we establish some parameters, call find_clumps() on genome with these parameters, and print the number of resulting patterns that we found.
def main():
    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
    response = requests.get(url)
    response.raise_for_status()  # give us an error if there was a problem

    genome = response.text
    print("The number of nucleotides in E. coli genome is", len(genome))

    k = 9
    window_length = 500
    t = 3

    clumps = find_clumps(genome, k, window_length, t)
    print("Found " + str(len(clumps)) + " total patterns as clumps.")
STOP: Run your code. Be patient! It may take a few minutes to complete. What can you conclude about the number of surprisingly frequent patterns in the E. coli genome?
In our code, we found nearly 2,000 9-mers that are (500, 3)-clumps in the E. coli genome, meaning that a great many strings appear surprisingly frequently in short regions and may be involved as “hidden messages”. Yet before returning to the main text, we would like to observe that we can make find_clumps() faster.
Optimizing the clump finding algorithm by avoiding many subroutine calls
About 97% of the time, premature optimization is the root of all evil.
In light of Knuth’s famous quotation, we should always be careful pursuing optimizations before we have a working solution to our problems. Yet in this particular case, we already have an intuitive solution, and so our desire to optimize it is not premature.
The critical insight is that the frequency tables of consecutive substring “windows” in any string are nearly identical. For example, say that we are working with the text "BANANASPLIT", the window length is equal to 6, and k is equal to 3. Then the frequency table of the first window ("BANANA") is shown in the table below.
| Pattern | Count |
| --- | --- |
| "BAN" | 1 |
| "ANA" | 2 |
| "NAN" | 1 |
The frequency table of "BANANA" with k equal to 3.
The frequency table of the second window ("ANANAS") is shown in the table below. The only difference between this frequency table and the one above is that we lost one occurrence of "BAN", the first k-mer in "BANANA", and we gained one occurrence of "NAS", the final k-mer in "ANANAS".
| Pattern | Count |
| --- | --- |
| "ANA" | 2 |
| "NAN" | 1 |
| "NAS" | 1 |
The frequency table of "ANANAS" with k equal to 3.
STOP: How might this observation help us speed up our clump-finding algorithm?
In general, say that window_1 and window_2 are consecutive length-L substrings of a string text. The frequency table of window_1 will be the same as that of window_2, except that the former will include one additional occurrence of the first k-mer in window_1, and the latter will include one additional occurrence of the final k-mer in window_2.
Note: You may find the remainder of this section tricky. If so, no worries! It’s an advanced discussion that isn’t necessary for following the remainder of this chapter.
To use this observation to optimize find_clumps(), we note that after generating the frequency table of the first window of text, we do not need to call frequency_table() to generate the frequency table of the second window of text. We need only to modify the frequency table of text[0:window_length] by decreasing the value associated with text[0:k] by 1, and increasing the value associated with text[window_length + 1 - k:window_length + 1] by 1.
More generally, after calling frequency_table() to form the frequency table of text[0:window_length], we will range an integer i between 1 and n - window_length, inclusively (to access all remaining windows). For each such i, we update the current frequency table by decreasing the value in the frequency table associated with text[i - 1:i - 1 + k] and increasing the value associated with text[i + window_length - k:i + window_length].
This idea is implemented with the updated function find_clumps_faster() below. First, we perform some parameter checks and declare our list patterns (which we will eventually return) as well as our dictionary found_patterns.
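To see the update rule in action on the "BANANASPLIT" example, the sketch below slides a window of length 6 across the text with k equal to 3 and confirms at every step that the updated table matches one built from scratch. The frequency_table() defined here is an assumed stand-in for the helper from the previous code along.

```python
def frequency_table(text: str, k: int) -> dict[str, int]:
    """Stand-in for the helper from the previous code along (assumed behavior)."""
    freq_map: dict[str, int] = {}
    for i in range(len(text) - k + 1):
        pattern = text[i:i + k]
        freq_map[pattern] = freq_map.get(pattern, 0) + 1
    return freq_map


text = "BANANASPLIT"
k = 3
window_length = 6

# build the table of the first window once
freq_map = frequency_table(text[0:window_length], k)

for i in range(1, len(text) - window_length + 1):
    # remove one occurrence of the first k-mer of the preceding window
    old_pattern = text[i - 1:i - 1 + k]
    freq_map[old_pattern] -= 1
    if freq_map[old_pattern] == 0:
        del freq_map[old_pattern]

    # add one occurrence of the final k-mer of the current window
    new_pattern = text[i + window_length - k:i + window_length]
    freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

    # the updated table should match one built from scratch
    window = text[i:i + window_length]
    assert freq_map == frequency_table(window, k)
    print(window, freq_map)
```

Each pass through the loop touches only two dictionary entries, whereas rebuilding the table would examine every k-mer in the window.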
def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}

    # to fill in

    return patterns
In the first window, every k-mer is being seen for the first time, so no check is needed as to whether we have encountered it; if a substring occurs t or more times within the window, then we add it to patterns.
def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}

    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # append any patterns we find to the patterns list
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found_patterns[s] = True

    # range over all remaining possible windows of text
    # to fill in

    return patterns
We then range over all remaining windows. In each window, we first decrease by 1 the value associated with the first substring of length k in the preceding window, which is text[i - 1:i - 1 + k]. If the value associated with this string in freq_map becomes zero as a result, then we clean up the dictionary by using Python’s built-in del statement to delete this pattern from freq_map.
Then it is simply a matter of incrementing the value in the frequency table associated with the substring at the end of the current window, which is text[i + window_length - k:i + window_length]. Once we have done so, we are ready to check any k-mers that occurred at least t times in the window, and add them to patterns if we have not yet seen them.
def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.

    Parameters:
    - text (str): An input string.
    - k (int): The length of the k-mers.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t times.

    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """
    if len(text) == 0:
        raise ValueError("text must not be empty")
    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")
    if k > len(text):
        return []
    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}

    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # append any patterns we find to the patterns list
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found_patterns[s] = True

    # range over all remaining possible windows of text
    for i in range(1, n - window_length + 1):
        # decrease by 1 the value associated with the first substring of length k in the preceding window
        old_pattern = text[i - 1:i - 1 + k]
        freq_map[old_pattern] -= 1

        # clean up the map if the frequency of old_pattern was 1
        if freq_map[old_pattern] == 0:
            del freq_map[old_pattern]

        # add the pattern from the end of the current window
        new_pattern = text[i + window_length - k:i + window_length]
        freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

        # I have updated the frequency map :)
        for s, freq in freq_map.items():
            is_key = found_patterns.get(s, False)
            if freq >= t and not is_key:
                patterns.append(s)
                found_patterns[s] = True

    return patterns
STOP: When we encounter the first for loop appending frequent words to patterns, we don’t check that found_patterns[s] is False. Why do you think this is the case?
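A possible further refinement, offered here only as a sketch and not as part of the code along: after a window slides, the only k-mer whose count has increased is the one added at the end of the window, so it is the only k-mer that can newly reach the threshold t. The variant below therefore checks just that one k-mer instead of ranging over the whole frequency table. The name find_clumps_incremental() is hypothetical, the parameter handling is simplified (it returns an empty list rather than raising errors), the frequency_table() shown is an assumed stand-in for the helper from the previous code along, and a set replaces the found_patterns dictionary.

```python
def frequency_table(text: str, k: int) -> dict[str, int]:
    """Stand-in for the helper from the previous code along (assumed behavior)."""
    freq_map: dict[str, int] = {}
    for i in range(len(text) - k + 1):
        pattern = text[i:i + k]
        freq_map[pattern] = freq_map.get(pattern, 0) + 1
    return freq_map


def find_clumps_incremental(text: str, k: int, window_length: int, t: int) -> list[str]:
    """Hypothetical variant: after each slide, test only the k-mer whose count rose."""
    n = len(text)
    patterns: list[str] = []

    # simplified parameter handling for this sketch
    if k <= 0 or t <= 0 or window_length < k or window_length > n:
        return patterns

    # set of k-mers already appended to patterns
    found: set[str] = set()

    # the first window is handled exactly as in find_clumps_faster()
    freq_map = frequency_table(text[:window_length], k)
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found.add(s)

    for i in range(1, n - window_length + 1):
        # remove one occurrence of the first k-mer of the preceding window
        old_pattern = text[i - 1:i - 1 + k]
        freq_map[old_pattern] -= 1
        if freq_map[old_pattern] == 0:
            del freq_map[old_pattern]

        # add one occurrence of the final k-mer of the current window
        new_pattern = text[i + window_length - k:i + window_length]
        freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

        # only new_pattern can have newly reached the threshold
        if freq_map[new_pattern] >= t and new_pattern not in found:
            patterns.append(new_pattern)
            found.add(new_pattern)

    return patterns


print(find_clumps_incremental("AAAACGTCGAAAAA", 2, 4, 2))  # ['AA']
```

We have not timed this variant on the E. coli genome, so any speedup over find_clumps_faster() is something to measure rather than assume.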
Timing the optimized approach
Now that we have optimized our clump finding algorithm, let’s see if it truly is any faster by timing both find_clumps() and find_clumps_faster(). On our machine, the former runs in about 5 minutes, whereas the latter completes in under 50 seconds. Once again, we encounter the paradigm that we introduced in Chapter 0 that efficient algorithms are particularly important when applying them to large datasets.
import time  # place this with the other imports at the top of main.py


def main():
    # code for reading in genome from file ...
    # (assume genome, k, window_length, t are defined above)

    start = time.time()
    _ = find_clumps(genome, k, window_length, t)
    elapsed1 = time.time() - start
    print(f"find_clumps {elapsed1:.6f} seconds")

    start = time.time()
    clumps = find_clumps_faster(genome, k, window_length, t)
    elapsed2 = time.time() - start
    print(f"find_clumps_faster {elapsed2:.6f} seconds")

    print(f"Speedup: {elapsed1/elapsed2:.2f}x faster")
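The same measurement can be factored into a small helper so that the start and stop bookkeeping is written only once. The sketch below uses time.perf_counter() from the standard library, a clock intended for measuring elapsed intervals; time_call() is a hypothetical name introduced for illustration.

```python
import time
from typing import Any, Callable


def time_call(func: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Run func(*args) once and return its result with the elapsed seconds.

    Hypothetical helper for illustration.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# a tiny demonstration using the built-in sum() function
total, seconds = time_call(sum, [1, 2, 3])
print(total, f"{seconds:.6f} seconds")
```

With this helper, timing either clump finder could be written as clumps, elapsed = time_call(find_clumps_faster, genome, k, window_length, t).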
STOP: How would increasing or decreasing each of the parameters k, window_length, and t affect the number of clumps we find? Run find_clumps_faster() multiple times with different values of k, window_length, and t. How does changing these parameter values affect the number of clumps that you find?
Looking ahead
Unfortunately, although our clump finding algorithm finds candidate frequent words across the entire genome, it finds too many clumps, which means that if we are looking for a specific region like the origin of replication, we will need a targeted approach. We will introduce just such a method in the next code along.
Check your work from the code along
We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:
find_clumps()