Code Along - Finding Clumps in Python

Code along video

Although we strongly suggest coding along with us by following the video above, you can find completed code from the code along in our course code repository.

Learning Objectives

In the previous code along, we used dictionaries to implement a frequency table in order to find the most frequent words of given length in a text. Now that we are comfortable with dictionaries, we will return to the problem of finding clumps in a genome, a problem that can help us scan the genome for areas that have surprisingly frequent k-mers, which may imply that these k-mers serve as “hidden messages” (i.e., that they are involved in protein-DNA binding).

Recall from the core text that we say that a k-mer pattern forms an (L, t)- clump inside a (longer) string genome if there is an interval of genome of length L in which this k-mer appears at least t times. We now restate the Clump Finding Problem.

Clump Finding Problem

Input: A string text, and integers k, L, and t.

Output: All distinct k-mers forming (L, t)-clumps in text.

We then introduced pseudocode for an algorithm called FindClumps(), which ranges over every substring window of length L of a given string text. For each such string window, it builds a frequency table for all substrings off length k. Any substrings in this table occurring t or more times are clumps, and we will add them to some list patterns if they do not already occur in patterns.

FindClumps(text, k, L, t)
    patterns ← an array of strings of length 0
    n ← length(text)
    for every integer i between 0 and n − L
        window ← text[i, i + L]
        freqMap ← FrequencyTable(window, k)
        for every key s in freqMap
            if freqMap[s] ≥ t and Contains(patterns, s) = false
                patterns ← append(patterns, s)
    return patterns

Contains(patterns, s)
    for every string pattern in patterns
        if s = pattern
            return true
    return false

In this lesson, we will implement FindClumps() along with the Contains() subroutine that identifies whether a given list of strings patterns contains a string s. We will then apply FindClumps() to the E. coli genome to identify all short strings that occur frequently in short regions and may serve as hidden messages to the cell.

Code along summary

Setup

Create a folder called clumps in your python/src directory and create a text file called main.py in the python/src/clumps folder. We will edit main.py, which should have the following starter code.

def main():
    print("Clumps with Python!")


if __name__ == "__main__":
    main()

A first implementation of finding clumps

We begin with implementing find_clumps(). We start with some parameter checking. We will also change the window length L from the pseudocode into a more descriptive parameter window_length.

def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    # to fill in

Because we don’t know the number of strings that will form clumps in advance, we first create a list of strings patterns of length zero, to which we will append any clumps that we find. We will eventually return this list patterns.

def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # To fill in...

    return patterns

We next will fill in the meat of the find_clumps() function. To do so, we will need to range over every possible “window” substring of text having length window_length, forming the frequency table for each such window, which allows us to call frequency_table() from a previous code along. We know from our work with ranging over all substrings of a text of given length that the starting positions of all these windows will range from 0 to n - window_length, inclusively, where n is the length of text. We can now fill in the beginning of our for loop.

def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(n - window_length + 1):

        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

    # To fill in...

    return patterns

We now will consider what to do once we have formed the frequency table freq_map of a given window. We want to range through this dictionary and grab all k-mer keys whose corresponding integer values are at least equal to t, the threshold value for considering a string to be a clump. If we find such a string, and it does not already occur in patterns, then we should append this string to patterns. We now have completed our find_clumps() function.

def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(n - window_length + 1):

        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

        # what occurs frequently (i.e., t or more times)?
        for s, val in freq_map.items():

            # append s to patterns if s occurs frequently and s doesn't already appear in patterns
            if val >= t and contains(patterns, s) == False:
                patterns.append(s)

    return patterns

As for contains(), this type of subroutine might start feeling comfortable now. To implement this function, we can use our in operator to find if a an element is contained in a list. If we ever find a value of the list patterns that is equal to our query string s, then we return true, and if we range over all values of patterns without finding any value that is equal to s, then we return false.

However, we don’t need to write contains() at all, because in the previous code along, we introduced the in operator to check whether an element is a key within a dictionary. In this case, we can use the same operator to determine if an element is a value in a list (in this case, we are checking if it is not in the list). This allows us to update find_clumps() as follows, without using a contains() subroutine.

def find_clumps(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Output:
    - list: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # range over all possible windows of text
    for i in range(n - window_length + 1):

        # set the current window
        window = text[i:i + window_length]

        # let's make the frequency table for this window
        freq_map = frequency_table(window, k)

        # what occurs frequently (i.e., t or more times)?
        for s, val in freq_map.items():

            # append s to patterns if s occurs frequently and s doesn't already appear in patterns
            if val >= t and not (s in patterns):
                patterns.append(s)

    return patterns

Importing our own code

We already wrote frequency_table() in an earlier code along, and one thing we could do is to copy it into main.py. To indicate that this function is helper, code, we will place this function into a second file named helper.py belonging to the same folder as main.py.

Then, in main.py, we will import the function by adding an import statement that only imports the frequency_table() function.

from helper import frequency_table

Note: We could just as well have used the import statement import helper, in which case we would need to call the function using helper.frequency_table().

Running the clump finding algorithm

We would like to apply find_clumps() to a real genome. First, let’s apply it to a smaller sample dataset just to make sure that everything appears to be in order. Using the following dataset, we should print only a single occurrence of "AA".

def main():

    print("Clumps with Python!")

    text = "AAAACGTCGAAAAA"
    k = 2
    window_length = 4
    t = 2

    print(find_clumps(text, k, window_length, t)) # Should print "AA".

Click Run 👇 to try it!

STOP: Open a terminal, and navigate to the clumps directory using the command cd python/src/clumps. Then, run your code using the command python3 main.py (on Mac OS) or python main.py (on Windows).

Applying the clump finding algorithm to a bacterial genome, and installing modules

We are ready to scale up our work to a real bacterial genome to see how many strings appear surprisingly frequently in short regions of this genome. We will choose E. coli, the most commonly studied bacterium. You can view the E. coli genome as a .txt file at the Bioinformatics Algorithms website (click here to view). However, the genome has over 4.5 million nucleotides, and copying it into main() would pose a struggle. Instead, let’s write some code to read this genome from the URL over the internet.

First, however, we need to install the requests module, which will allow us to access a web page’s contents. To do so, we will use pip, the Python package installer, which is included in the installation of Python. To do so, open a command-line terminal and execute pip3 install requests (macOS/Linux) or pip install requests (Windows).

Then, include the following import at the top of main.py.

import requests

We next define the address that we wish to access from above as a string url, and we pass this string into a function from the requests module called get() that attempts to access the address. This function returns a “response” object (we will say more about objects about later in this course).

def main():

    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    response = requests.get(url)

We next ensure that there was no issue with opening the web page. To do so, we call a function raise_for_status() that will give an error if there was any problem. If not, we proceed to the next line, where we access response.text, which is a string representing the content of the web page; this string is exactly what we want, and so we declare a variable genome equal to this string.

def main():

    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    response = requests.get(url)

    response.raise_for_status() # give us an error if there was a problem 

    genome = response.text

We now will print the length of genome to ensure that it has been read in, which will show that it has over 4.6 million nucleotides. After we have done so, we establish some parameters, call find_clumps() on genome with these parameters, and print the number of resulting patterns that we found.

def main():

    # Code omitted for clarity

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    response = requests.get(url)

    response.raise_for_status() # give us an error if there was a problem 

    genome = response.text

    print("The number of nucleotides in E. coli genome is", len(genome))

    k = 9
    window_length = 500
    t = 3

    clumps = find_clumps(genome, k, window_length, t)
    print("Found " + str(len(clumps)) + " total patterns as clumps.")

Click Run 👇 to try it!

STOP: Compile and run your code. Be patient! It may take a few minutes to complete. What can you conclude about the number of surprisingly frequent patterns in the E. coli genome?

In our code, we found nearly 2,000 9-mers that are (500, 3)-clumps in the E. coli genome, meaning that a great deal of strings appear surprisingly frequently in short regions and may be involved as “hidden messages”. Yet before returning to the main text, we would like to observe that we can make find_clumps() faster. This optimization is important if we want to extend our work from one to multiple genomes.

Optimizing the clump finding algorithm by avoiding many subroutine calls

About 97% of the time, premature optimization is the root of all evil.
Donald Knuth

In light of Knuth’s famous quotation, we should always be careful pursuing optimizations before we have a working solution to our problems. Yet in this particular case, we already have an intuitive solution, and so our desire to optimize it is not premature.

The critical insight is that the frequency tables of consecutive substring “windows” in any string text are nearly identical. For example, say that we are working with the text "BANANASPLIT", the window length is equal to 6, and k is equal to 3. Then the frequency table of the first window ("BANANA") is shown in the table below.

Pattern	Count
`"BAN"`	1
`"ANA"`	2
`"NAN"`	1

The frequency table for the text "BANANA" with k equal to 3.

The frequency table of the second window ("ANANAS") is shown in the table below. The only difference between this frequency table and the one above is that we lost one occurrence of "BAN", the first k-mer in "BANANA", and we gained one occurrence of "NAS", the final k-mer in "ANANAS".

Pattern	Count
`"ANA"`	2
`"NAN"`	1
`"NAS"`	1

The frequency table for the text "ANANAS" with k equal to 3.

STOP: How might this observation help us speed up our clump-finding algorithm?

In general, say that window_1 and window_2 are consecutive length-L substrings of a string text. The frequency table of window_1 will be the same as that of window_2, except that the former will include one additional occurrence of the first k-mer in window_1, and the latter will include one additional occurrence of the final k-mer in window_2.

Note: You may find the remainder of this section tricky. If so, no worries! It’s an advanced discussion that isn’t necessary for following the remainder of this chapter.

To use this observation to optimize find_clumps(), we note that after generating the frequency table of the first window of text, we do not need to call frequency_table() to generate the frequency table of the second window of text. We need only to modify the frequency table of text[0:window_length] by decreasing the value associated with text[0:k] by 1, and increasing the value associated with text[window_length + 1 - k:window_length + 1] by 1.

More generally, after calling frequency_table() to form the frequency table of text[0:window_length], we will range an integer i between 1 and n - window_length + 1 (to access all remaining windows). For each such i, we update the current frequency table by decreasing the value in the frequency table associated with text[i - 1:i - 1 + k] and increasing the value associated with text[i + window_length - k:i + window_length].

This idea is implemented with the updated find_clumps_faster() function below. First, we perform some parameter checks and declare our list patterns (which we will eventually return) as well as our dictionary found_patterns.

def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns: dict[str, bool] = {}

    # to fill in

    return patterns

In the first window, every k-mer is being seen for the first time, so no check is needed as to whether we have encountered it; if a substring occurs t or more times within the window, then we add it to patterns.

def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}
    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # append any patterns we find to our list
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found_patterns[s] = True

    # range over all remaining possible windows of text

    # to fill in

    return patterns

We then range over all remaining windows. In each window, we first decrease by 1 the value associated with the first substring of length k in the preceding window, which is text[i - 1:i - 1 + k]. If the value associated with this string in freq_map becomes zero as a result, then we clean up the dictionary by using Python’s built-in del operator to delete this pattern from freq_map.

Then it is simply a matter of incrementing the frequency table associated with the substring at the end of the current window, which is text[i + window_length - k:i + window_length]. Once we have done so, we are ready to check any k-mers that occurred at least t times in the window, and add them to patterns if we have not yet seen them.

def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}
    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # append any patterns we find to patterns list
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found_patterns[s] = True

    # range over all remaining possible windows of text
    for i in range(1, n - window_length + 1):

        # decrease by 1 the value associated with the first substring of length k in the preceding window
        old_pattern = text[i - 1:i - 1 + k]
        freq_map[old_pattern] -= 1

        # clean up the map if the frequency of old_pattern was 1
        if freq_map[old_pattern] == 0:
            del freq_map[old_pattern]

        # add the pattern from the end of the current window
        new_pattern = text[i + window_length - k:i + window_length]
        freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

        # I have updated the frequency map :)
        for s, freq in freq_map.items():
            is_key = found_patterns.get(s, False)
            if freq >= t and is_key == False:
                patterns.append(s)
                found_patterns[s] = True

    return patterns

STOP: When we encounter the first for loop appending frequent words to patterns, we don’t check that found_pattern[s] is False. Why do you think this is the case?

Yet we can make our algorithm a bit faster with a final clever trick. When we slide the current window over by a single letter, the only k-mer that can occur more times in the new frequency table compared to the previous frequency table is new_pattern. As a result, we don’t need to range over the entire frequency table to look for frequent patterns: we only need to check the frequency of new_pattern (and whether we have identified it as frequent in a previous window).

def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    patterns: list[str] = []
    n = len(text)

    # map to track whether I have added a string to patterns yet
    found_patterns = {}
    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # append any patterns we find to patterns list
    for s, freq in freq_map.items():
        if freq >= t:
            patterns.append(s)
            found_patterns[s] = True

    # range over all remaining possible windows of text
    for i in range(1, n - window_length + 1):

        # decrease by 1 the value associated with the first substring of length k in the preceding window
        old_pattern = text[i - 1:i - 1 + k]
        freq_map[old_pattern] -= 1

        # clean up the map if the frequency of old_pattern was 1
        if freq_map[old_pattern] == 0:
            del freq_map[old_pattern]

        # add the pattern from the end of the current window
        new_pattern = text[i + window_length - k:i + window_length]
        freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

        # the only pattern that wasn't found to be frequent in the previous frequency table is new_pattern
        is_key = found_patterns.get(new_pattern, False)
        if freq_map[new_pattern] >= t and is_key == False:
            patterns.append(new_pattern)
            found_patterns[new_pattern] = True

    return patterns

Timing the optimized approach

Now that we have optimized our clump finding algorithm, let’s see if it truly is any faster by timing both find_clumps() and find_clumps_faster(). On our machine, the former runs in about 5 minutes, whereas the latter completes in under 50 seconds. Once again, we encounter the paradigm that we introduced in Chapter 0 that efficient algorithms are particularly important when applying them to large datasets.

def main():

    # code for reading in genome from file ...
    # (assume genome, k, window_length, t are defined above)

    start = time.time()
    find_clumps(genome, k, window_length, t)
    elapsed1 = time.time() - start
    print(f"find_clumps took {elapsed1:.6f} seconds")

    start = time.time()
    find_clumps_faster(genome, k, window_length, t)
    elapsed2 = time.time() - start
    print(f"find_clumps_faster took {elapsed2:.6f} seconds")

    print(f"Speedup: {elapsed1/elapsed2:.2f}x faster")

Click Run 👇 to try it!

STOP: How would increasing or decreasing each of the parameters k, window_length, and t affect the number of clumps we find? Run find_clumps_faster() multiple times with different values of k, window_length, and t. How does changing these parameter values affect the number of clumps that you find?

Using a set for clump finding

In clump finding, our goal is to output which k-mers form clumps, not how many times we discovered them while sliding the window. But when we scan many overlapping windows, the same k-mer can become “frequent” in many different windows. If we store results in a list and keep appending, we’ll get duplicates, and we’ll need extra logic (or extra time) to deduplicate later.

Python’s built-in set data structure is the perfect tool here:

it stores unique elements only.
Adding an element that’s already present is harmless (it does nothing).
Membership and insertion are fast.

So instead of juggling a second dictionary like found_patterns to remember whether we’ve already appended something, we can store clumps directly in a set, which will require a few changes.

Change 1: Replace the output list with a set

In the original version, we accumulated results in a list:

patterns: list[str] = []

That makes it easy to accidentally append the same k-mer many times (because the same k-mer can be frequent in multiple windows). Instead, we create a set:

clumps: set[str] = set()

A set stores each k-mer at most once. If we “add” the same k-mer again later, nothing changes, which is exactly what we want for clump finding.

Change 2: Remove the found_patterns dictionary

The list-based version needed a found_patterns dictionary to remember whether a pattern had already been appended, so we could avoid duplicates. Once we use a set, this bookkeeping becomes unnecessary. The set itself is the bookkeeping: it enforces uniqueness automatically. As a result, we do not need found_patterns.

Change 3: When we discover a frequent k-mer, use the add() function

In the first window, the list-based code did two things:

patterns.append(s)
found_patterns[s] = True

With a set, those two lines become one:

clumps.add(s)

Change 4: In the sliding loop, replace found_pattern with our set

The original implementation of find_clumps_faster() looked up whether new_pattern had already been identified before adding it to our list of patterns. With a set, if new_pattern occurs at least t times in the current window, then we simply call

clumps.add(new_pattern)

This will add new_pattern to the set only if it is not already present.

Change 5: Convert the set to a list at return time

To ensure that we return a list, at the end of the function we sort the strings in the set and return the sorted list:

return sorted(clumps)

The updated find_clumps_faster() function using a set is shown below.

def find_clumps_faster(text: str, k: int, window_length: int, t: int) -> list[str]:
    """
    Finds a list of strings representing all k-mers that appear at least t times
    in a window of given length in the string.
    Parameters:
    - text (str): An input string.
    - k (int): k-mer's of size k.
    - window_length (int): the size of substrings of text in which we are looking for clumps
    - t (int): The k-mers must appear at least t amount of times.
    Returns:
    - list[str]: A list of k-mers that occur at least t times in a window of length window_length in text.
    """

    if len(text) == 0:
        raise ValueError("text must not be empty")

    if k <= 0 or window_length <= 0 or t <= 0:
        raise ValueError("k, window_length, and t must be positive")

    if k > len(text):
        return []

    if window_length < k:
        raise ValueError("window length must be at least equal to k")

    n = len(text)

    # We'll store clump-forming patterns in a set so we never add duplicates.
    clumps: set[str] = set()

    # Build the frequency table for the first window.
    first_window = text[:window_length]
    freq_map = frequency_table(first_window, k)

    # Any k-mer that appears at least t times in the first window is a clump.
    for s, freq in freq_map.items():
        if freq >= t:
            clumps.add(s)

    # Slide the window across the text.
    for i in range(1, n - window_length + 1):

        # Remove the k-mer that fell out of the window on the left.
        old_pattern = text[i - 1:i - 1 + k]
        freq_map[old_pattern] -= 1
        if freq_map[old_pattern] == 0:
            del freq_map[old_pattern]

        # Add the new k-mer that entered the window on the right.
        new_pattern = text[i + window_length - k:i + window_length]
        freq_map[new_pattern] = freq_map.get(new_pattern, 0) + 1

        if freq_map[new_pattern] >= t:
            clumps.add(new_pattern)

    # Return as a list (sorted for nicer output).
    return sorted(clumps)

Looking ahead

Unfortunately, although our clump finding algorithm finds candidate frequent words across the entire genome, it finds too many clumps, which means that if we are looking for a specific region like the origin of replication, we will need a targeted approach. We will introduce just such a method in the next code along.

Next lesson

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

find_clumps()

Finding Clumps in Python

Code along video

Learning Objectives

Code along summary

Setup

A first implementation of finding clumps

Importing our own code

Running the clump finding algorithm

Applying the clump finding algorithm to a bacterial genome, and installing modules

Optimizing the clump finding algorithm by avoiding many subroutine calls

Timing the optimized approach

Using a set for clump finding

Change 1: Replace the output list with a set

Change 2: Remove the found_patterns dictionary

Change 3: When we discover a frequent k-mer, use the add() function

Change 4: In the sliding loop, replace found_pattern with our set

Change 5: Convert the set to a list at return time

Looking ahead

Check your work from the code along

Join our community!

Join our community!