Code Along - Finding Frequent Words in Python

Code along video

Although we strongly suggest coding along with us by following the video above, you can find completed code from the code along in our course code repository.

Learning objectives

In the previous code along, we introduced dictionaries in Python, and we are now ready to use what we have learned to find the most frequent strings (of some fixed length) within a text. This task is useful when hunting through a short region of a genome to find a frequently occurring “hidden message” of interest (such as a protein binding site), and it relies on the following computational problem.

Frequent Words Problem

Input: A string text and an integer k.

Output: All most frequent k-mers in text.

In the core text, we saw that we could solve this problem quickly using a dictionary (which in the core text we often called a “map”). To do so, given a string text and an integer k, we first use a function FrequencyTable() to generate a map that associates each substring of length k identified in text to its number of occurrences in text.

FrequencyTable(text, k)
    freqMap ← empty map
    n ← length(text)
    for every integer i between 0 and n − k
        pattern ← text[i, i + k]
        if freqMap[pattern] doesn't exist
            freqMap[pattern] = 1
        else
            freqMap[pattern]++
    return freqMap

We can now use this function as a subroutine when solving the Frequent Words Problem using the following BetterFrequentWords() function.

BetterFrequentWords(text, k)
    frequentPatterns ← an array of strings of length 0
    freqMap ← FrequencyTable(text, k)
    max ← MaxMapValue(freqMap)
    for all strings pattern in freqMap
        if freqMap[pattern] = max
            frequentPatterns ← append(frequentPatterns, pattern)
    return frequentPatterns

This function also relies on a subroutine MaxMapValue() that identifies the maximum value in a given dictionary mapping strings to integers, which we wrote as follows.

MaxMapValue(dict)
    m ← 0
    firstTime = true
    for every key pattern in dict
        if firstTime = true or dict[pattern] > m
            firstTime = false
            m ← dict[pattern]
    return m

In this lesson, we will apply what we learned about dictionaries in Python to implement all these functions and find frequent words in a real bacterial genome.

Code along summary

Setup

Create a folder called frequent_words in your python/src directory, and create a text file called main.py in the python/src/frequent_words folder. We will edit main.py, which should have the following starter code.

def main():
    print("Finding frequent words in Python!")

if __name__ == "__main__":
    main()

Finding frequent words

To implement find_frequent_words(), we start with a parameter check, and then we declare our list of strings that we eventually want to return; we do not know in advance how many elements we will return because text may have multiple most frequent k-mers.

def find_frequent_words(text: str, k: int) -> list[str]:
    """
    find_frequent_words returns a list containing the most frequent k-mers occurring 
    in text, including overlaps.

    Parameters:
    - text (str): A given text for the function.
    - k (int): The size of the k-mers.

    Returns:
    - The list of the most frequent k-mers occurring in text, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return []

    freq_patterns = []

    # To fill in...

    return freq_patterns

We then build the frequency table for text and k, which we pass to a subroutine.

def find_frequent_words(text: str, k: int) -> list[str]:
    """
    find_frequent_words returns a list containing the most frequent k-mers occurring 
    in text, including overlaps.

    Parameters:
    - text (str): A given text for the function.
    - k (int): The size of the k-mers.

    Returns:
    - The list of the most frequent k-mers occurring in text, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return []

    freq_patterns = []

    # Build frequency table.
    freq_map = frequency_table(text, k)

    # To fill in...

    return freq_patterns

Now that we have the frequency table, we can identify its maximum value max_val, a task that we pass to another subroutine.

def find_frequent_words(text: str, k: int) -> list[str]:
    """
    find_frequent_words returns a list containing the most frequent k-mers occurring 
    in text, including overlaps.

    Parameters:
    - text (str): A given text for the function.
    - k (int): The size of the k-mers.

    Returns:
    - The list of the most frequent k-mers occurring in text, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return []

    freq_patterns = []

    # Build frequency table.
    freq_map = frequency_table(text, k)

    # Once I have the frequency table, find the maximum value.
    max_val = max_map_value(freq_map)

    # To fill in...

    return freq_patterns

Because we have relied on subroutines, find_frequent_words() is nearly complete. We just need to range over the dictionary, identifying which keys have values equal to max, and appending these strings to freq_patterns. To do so, we can range over the keys and values of the dictionary using the ranging trick that we learned in the previous code along.

def find_frequent_words(text: str, k: int) -> list[str]:
    """
    find_frequent_words returns a list containing the most frequent k-mers occurring 
    in text, including overlaps.

    Parameters:
    - text (str): A given text for the function.
    - k (int): The size of the k-mers.

    Returns:
    - The list of the most frequent k-mers occurring in text, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return []

    freq_patterns = []

    # Build frequency table.
    freq_map = frequency_table(text, k)
     
    # Once I have the frequency table, find the maximum value.
    max_val = max_map_value(freq_map)

    # what keys of this table reach the max value?
    for pattern, val in freq_map.items():
        if val == max_val:
            # append!
            freq_patterns.append(pattern)

    return freq_patterns

Great! Now we just need to write the max_map_value() and frequency_table() subroutines.

Exercise: Try implementing max_map_value() and frequency_table().

Finding the maximum value in a map

We start with max_map_value() and declare a variable m to be equal to zero, which will hold our maximum value, and which we will eventually return.

def max_map_value(map: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - map: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """
    if len(map) == 0:
        raise ValueError("Error: empty map given as input.")

    m = 0 # will hold the maximum value

    # to be filled in

    return m

This type of function should hopefully start to feel natural. We seemingly just need to range over the values of the map, and if we find one that is larger than m, then we will update m to be equal to this value. Remember from the previous code along that we can range over the keys and values of a dictionary dict using the notation for key, val in dict.items(), where key and val are variable names.

The only difference in this case is that we are not interested in the keys of the dictionary, only their values. Accordingly, we use an underscore (_) to refer to the current key of the map when we perform the ranging.

def max_map_value(dictionary: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - dictionary: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """

    if len(map) == 0:
        raise ValueError("Error: empty map given as input.")

    m = 0  # will hold the maximum value

    for _, val in dictionary.items():
        if val > m:
            m = val

    return m

STOP: This function has a small bug. What is it?

The astute learner will recall that when implementing a function max_integer_array() to find the minimum value in a given slice of integers, we needed to be careful to consider the case when every element in the array was positive.

One workaround is that we could apply what we learned in the previous code along, namely that dict.values() gives the collection of values in a dictionary. As a result, we can simply range over list(dictionary.values()), updating m if we are at the initial value of dict.values() or if the current value in dictionary.values() is larger than the current value of m.

def max_map_value(dictionary: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - dictionary: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """

    if len(dictionary) == 0:
        raise ValueError("Error: empty map given as input.")

    vals = list(dictionary.values())
    m = vals[0]  # will hold the maximum value

    for val in vals:
        if val > m:
            m = val

    return m

One implementation of max_integer_array() is shown below, which handles this case by updating the current maximum value if we are considering the first element of the slice.

def max_integer_array(integers: list[int]) -> int:
    """
    max_integer_array finds the maximum value of a given integer list.

    Parameters:
    - integers (list): A list of integers.

    Returns:
    - int: The maximum value in the given list.
    """

    if len(integers) == 0:
        raise ValueError("list is empty")

    m = 0

    # Range over the list and if current element is 0-th element or larger than the current maximum,
    # update m to be this value.
    for i in range(len(integers)):
        if i == 0 or integers[i] > m:
            m = integers[i]
    
    return m

We should also handle the case that all the values in the dictionary given to max_map_value() are negative (even though this won’t be the case for frequency tables). The problem with extending the above idea to implement max_map_value() is that dictionaries do not have a set order, and so there is no such thing as the initial element of a map.

STOP: How can we circumvent this problem and fix max_map_value()?

The workaround that we will use is to declare a boolean variable, which we call first_time_through, and set it equal to True before the loop. Rather than check if i is equal to zero, we will check if first_time_through is equal to False.

def max_map_value(dictionary: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - dictionary: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """

    if len(dictionary.keys()) == 0:
        raise ValueError("dictionary is empty")

    m = 0
    first_time_through = True    
    for _, val in dictionary.items():  # range over map and update m every time I find a value that is larger
        if first_time_through == True or val > m:
            m = val
            first_time_through = False
    return m

We make one more simplification. Because first_time_through is boolean, we don’t need to check whether it is equal to True; we can simply use it as is.

def max_map_value(dictionary: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - dictionary: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """

    if len(dictionary.keys()) == 0:
        raise ValueError("dictionary is empty")

    m = 0
    first_time_through = True    
    for _, val in dictionary.items():  # range over map and update m every time I find a value that is larger
        if first_time_through or val > m:
            m = val
            first_time_through = False
    return m

Although the above way of writing our code will work effectively, Python provides a shortcut. Recall from the previous code along that we can access a list containing the keys of a given dictionary dict by calling dict.keys(). We can also obtain a list of the values of such a dictionary by calling dict.values(). In this case, we are looking for the maximum value of that list, which we can identify using the built-in function max(). As a result, we could just as easily write max_map_value() as a single-line function that returns max(dict.values()).

def max_map_value(dictionary: dict[str, int]) -> int:
    """
    max_map_value finds the maximum value of a given dictionary.

    Parameters:
    - dictionary: A dictionary that has integers as values.

    Returns:
    - int: The maximum value in the given dictionary.
    """

    if len(dictionary.keys()) == 0:
        raise ValueError("dictionary is empty")

    return max(dictionary.values())

Constructing the frequency table of a string

Our second subroutine to implement is frequency_table(). We will start with declaring the map freq_map that will hold the frequency table, which we will eventually return.

def frequency_table(text: str, k: int) -> dict[str, int]:
    """
    frequency_table finds the frequencies of each k-mer occurring in a given text, 
    including overlaps.

    Parameters:
    - text (str): The string text to search for k-mers.
    - k (int): The size of the k-mers.

    Returns:
    - dict[str, int]: The dictionary of k-mers to their frequencies in the given
    text string, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return {}

    freq_map = {}

    # To fill in...

    return freq_map

We then will range over all k-mer substrings of text, which we already know how to do from our work with counting the number of occurrences of a k-mer in text using pattern_count(). Each time through the loop, we will check whether the current k-mer pattern appears as a key in the frequency table. If so, then we will increase freq_map[pattern] by 1; if not, we will set freq_map[pattern] equal to 1.

def frequency_table(text: str, k: int) -> dict[str, int]:
    """
    frequency_table finds the frequencies of each k-mer occurring in a given text, 
    including overlaps.

    Parameters:
    - text (str): The string text to search for k-mers.
    - k (int): The size of the k-mers.

    Returns:
    - dict[str, int]: The dictionary of k-mers to their frequencies in the given
    text string, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return {}

    freq_map = {}
    n = len(text)

    # Range over all substrings of length k
    for i in range(n - k + 1):
        # Grab current pattern
        pattern = text[i:i + k]

        # Does pattern exist in freq_map?
        # To fill in...

    return freq_map

The question is how to check whether freq_map[pattern] exists. As we saw briefly in the previous code along, we can check if a key is in a dictionary or an element is in a list using the in operator. We can use this fact to identify if freq_map[pattern] exists; if so, we increment it, and otherwise, we set it equal to 1.

def frequency_table(text: str, k: int) -> dict[str, int]:
    """
    frequency_table finds the frequencies of each k-mer occurring in a given text, 
    including overlaps.

    Parameters:
    - text (str): The string text to search for k-mers.
    - k (int): The size of the k-mers.

    Returns:
    - dict[str, int]: The dictionary of k-mers to their frequencies in the given
    text string, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return {}

    freq_map = {}
    n = len(text)

    # Range over all substrings of length k
    for i in range(n - k + 1):
        # Grab current pattern
        pattern = text[i:i + k]

        # updating the value of freq_map associated with pattern
        if pattern in freq_map:
            # Pattern has been encountered.
            freq_map[pattern] += 1
        else:
            # Pattern has not been encountered.
            freq_map[pattern] = 1

    return freq_map

In this particular case, we will have a shortcut approach. Rather than check whether freq_map[pattern] exists, we can simply use the built-in function freq_map.get(). The get() function takes two parameters: the key to retrieve, and a default value to assign to this key if it does not currently exist in the dictionary. As a result, we can consolidate the above if/else statement into the one-line statement freq_map[pattern] = freq_map.get(pattern, 0) + 1.

def frequency_table(text: str, k: int) -> dict[str, int]:
    """
    frequency_table finds the frequencies of each k-mer occurring in a given text, 
    including overlaps.

    Parameters:
    - text (str): The string text to search for k-mers.
    - k (int): The size of the k-mers.

    Returns:
    - dict[str, int]: The dictionary of k-mers to their frequencies in the given
    text string, including overlaps.
    """

    if k <= 0:
        raise ValueError("k must be positive")
    if k > len(text):
        return {}

    freq_map = {}
    n = len(text)

    # Range over all substrings of length k.
    for i in range(n - k + 1):
        # Grab current pattern
        pattern = text[i:i + k]

        # updating the value of freq_map associated with pattern
        freq_map[pattern] = freq_map.get(pattern, 0) + 1
        # if pattern is a key, this is what we want
        # if it's not, freq_map[pattern] gets created,
        # gets set equal to zero, then incremented.

    return freq_map

Testing our frequent words finder and applying it to a real dataset

We now will run a brief test of our find_frequent_words() function. We will leave tests of its two subroutines to you in the “check your work” section below.

def main():
    print("Finding frequent words in Python!")

    # We take a string of text.
    text = "ACGTTTTGAGACGTTTACGC"
    k = 3

    print(find_frequent_words(text, k))

STOP: Open a terminal, and navigate to the frequent_words directory using the command cd python/src/frequent_words. Then, run your code using the command python3 main.py (on Mac) or python main.py (on Windows). You should see [ACG, TTT] printed to the console, since each of these strings occurs three times in text.

Exercise: Update text to be equal to the verified replication origin of Vibrio cholerae, "ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCATGTTTCCTTAACCCTCTATTTTTTACGGAAGAATGATCAAGCTGCTGCTCTTGATCATCGTTTC". Run find_frequent_words(text, k) with k ranging between 3 and 9, inclusively, to see if you can reproduce the table from the core text, shown below.

**Figure:** The most frequent k-mers in the *ori* region of *Vibrio cholerae* for k from 3 to 9, along with the number of times that each k-mer occurs.

Click Run 👇 to try it!

Looking ahead

Now that we can find frequent words in a small region of a text, we are ready to adapt this algorithm to the broader problem of finding frequently occurring words in all short regions of a longer genome. Join us in the next code along!

Next lesson