Finding Frequent Words in Go

Learning objectives

In the previous code along, we introduced maps in Go, and we are now ready to use what we have learned to find the most frequent strings (of some fixed length) within a text. This task is useful when hunting through a short region of a genome to find a frequently occurring “hidden message” of interest (such as a protein binding site), and it relies on the following computational problem.

Frequent Words Problem

Input: A string text and an integer k.

Output: All most frequent k-mers in text.

In the core text, we saw that we could solve this problem quickly using a map (also known as a dictionary). First, given a string text and an integer k, we use a function FrequencyTable() to generate a map that associates each substring of length k identified in text to its number of occurrences in text.

FrequencyTable(text, k)
    freqMap ← empty map
    n ← length(text)
    for every integer i between 0 and n − k
        pattern ← text[i, i + k]
        if freqMap[pattern] doesn't exist
            freqMap[pattern] = 1
        else
            freqMap[pattern]++
    return freqMap

We can now use this function as a subroutine when solving the Frequent Words Problem using the following BetterFrequentWords() function.

BetterFrequentWords(text, k)
    frequentPatterns ← an array of strings of length 0
    freqMap ← FrequencyTable(text, k)
    max ← MaxMapValue(freqMap)
    for all strings pattern in freqMap
        if freqMap[pattern] = max
            frequentPatterns ← append(frequentPatterns, pattern)
    return frequentPatterns

This function also relies on a subroutine MaxMapValue() that identifies the maximum value in a given dictionary mapping strings to integers, which we wrote as follows.

MaxMapValue(dict)
    m ← 0
    firstTime = true
    for every key pattern in dict
        if firstTime = true or dict[pattern] > m
            firstTime= false
            m ← dict[pattern]
    return m

In this lesson, we will learn how to work with maps in Go, and we will then return to implement all of these functions.

Setup

Create a folder called frequentWords in your go/src directory and create a text file called main.go in the go/src/frequentWords folder. We will edit main.go, which should have the following starter code.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Finding frequent words.")
}

Code along video

Beneath the video, we provide a detailed summary of the topics and code covered in the code along.

At the bottom of this page, you will have the opportunity to validate your work via auto-graded assessments that evaluate the functions covered in the code along.

Although we strongly suggest completing the code along on your own, you can find completed code from the code along in our course code repository.

Code along summary

Implementing FrequentWords()

To implement FindFrequentWords(), we start by declaring our list of strings that we eventually want to return; we do not know in advance how many elements we will return because text may have multiple most frequent k-mers.

func FindFrequentWords(text string, k int) []string {
    freqPatterns := make([]string, 0)

    // to be filled in

    return freqPatterns
}

We then build the frequency table for text and k, which we pass to a subroutine.

//FindFrequentWords takes as input a string text and an integer k.
//It returns a slice containing the most frequent k-mers occurring in text, including overlaps.
func FindFrequentWords(text string, k int) []string {
    freqPatterns := make([]string, 0)
    freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences

    // to be filled in

    return freqPatterns
}

Now that we have the frequency table, we can identify its maximum value max, a task that we pass to another subroutine.

//FindFrequentWords takes as input a string text and an integer k.
//It returns a slice containing the most frequent k-mers occurring in text, including overlaps.
func FindFrequentWords(text string, k int) []string {
    freqPatterns := make([]string, 0)
    freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences
    
    max := MaxMapValue(freqMap)

    // to be filled in

    return freqPatterns
}

Because we have relied on subroutines, FindFrequentWords() is nearly complete. We just need to range over the map, identifying which keys have values equal to max, and appending these strings to freqPatterns.

//FindFrequentWords takes as input a string text and an integer k.
//It returns a slice containing the most frequent k-mers occurring in text, including overlaps.
func FindFrequentWords(text string, k int) []string {
    freqPatterns := make([]string, 0)
    freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences
    
    max := MaxMapValue(freqMap)

    //range over map, looking for strings (keys) whose corresponding values achieve the maximum
    for pattern, count := range freqMap {
        if count == max {
            //append the pattern to freqPatterns
            freqPatterns = append(freqPatterns, pattern)
        }
    }
    
    return freqPatterns
}

Great! Now we just need to write the MaxMapValue() and FrequencyTable() subroutines.

Exercise: Try implementing MaxMapValue() and FrequencyTable().

Implementing MaxMapValue()

We start with MaxMapValue() and declare a variable m to be equal to zero, which will hold our maximum value, and which we will eventually return.

//MaxMapValue takes as input a map of strings to integers.
//It returns the maximum value in the map.
func MaxMapValue(dict map[string]int) int {
    m := 0 // will hold the maximum value

    // to be filled in

    return m
}

This type of function should hopefully start to feel natural. We seemingly just need to range over the values of the map, and if we find one that is larger than m, then we will update m to be equal to this value. Note in the code below that we use an underscore (_) to refer to the current key of the map because this key is not needed.

//MaxMapValue takes as input a map of strings to integers.
//It returns the maximum value in the map.
func MaxMapValue(dict map[string]int) int {
    m := 0 // will hold the maximum value

    for _, value := range dict {
        if value > m {
            m = value
        }
    }

    return m
}
STOP: This function has a bug. What is it?

The astute learner will recall that when implementing a function MaxIntegerArray() to find the minimum value in a given slice of integers, we needed to be careful to consider the case when every element in the array was positive. One implementation of MaxIntegerArray() is shown below, which handles this case by updating the current maximum value if we are considering the first element of the slice.

//MaxIntegerArray takes as input a slice of integers.
//It returns the maximum value in the slice.
func MaxIntegerArray(a []int) int {
    m := 0

    //range over slice, and if current element is 0-th element or larger than the current maximum,
    //update m to be this value
    for i := range a {
        if i == 0 || a[i] > m {
            a[i] = m
        }
    }

    return m
}

The problem with extending this idea to implement MaxMapValue() is that maps do not have a set order, and so there is no such thing as the initial element of a map.

STOP: How can we circumvent this problem and fix MaxMapValue()?

The workaround that we will use is to declare a boolean variable, which we call firstTimeThrough, and set it equal to true before the loop. Rather than check if i is equal to zero, we will check if firstTimeThrough is equal to false.

//MaxMapValue takes as input a map of strings to integers.
//It returns the maximum value in the map.
func MaxMapValue(dict map[string]int) int {
    m := 0 // will hold the maximum value

    firstTimeThrough := true // turn on a Boolean variable to indicate our "first" element

    for _, value := range dict {
        if firstTimeThrough == true {
            m = value
            firstTimeThrough = false // turn off our "first time" switch, will never again be true
        } else if value > m {
            m = value
        }
    }

    return m
}

We make one more simplification. Because firstTimeThrough is boolean, we don’t need to check whether it is equal to true. We can simply use it as the first part of our condition.

//MaxMapValue takes as input a map of strings to integers.
//It returns the maximum value in the map.
func MaxMapValue(dict map[string]int) int {
    m := 0 // will hold the maximum value

    firstTimeThrough := true // turn on a Boolean variable to indicate our "first" element

    for _, value := range dict {
        if firstTimeThrough { // this is equivalent to if firstTimeThrough == true
            m = value
            firstTimeThrough = false // turn off our "first time" switch, will never again be true
        } else if value > m {
            m = value
        }
    }

    return m
}

Implementing FrequencyTable()

Our second subroutine to implement is FrequencyTable(). We will start with declaring the map freqMap that will hold the frequency table, which we will eventually return.

//FrequencyTable takes as input a string text and an integer k.
//It returns the frequency table (as a map) of this string,
//mapping each k-mer occurring in text to its number of occurrences (with overlaps).
func FrequencyTable(text string, k int) map[string]int {
    freqMap := make(map[string]int)

    // to be filled in

    return freqMap
}

We then will range over all k-mer substrings of text, which we already know how to do from our work with counting the number of occurrences of a k-mer in text using PatternCount(). Each time through the loop, we will check whether the current k-mer pattern appears as a key in the frequency table. If so, we will increase freqMap[pattern] by 1; if not, we will set freqMap[pattern] equal to 1.

//FrequencyTable takes as input a string text and an integer k.
//It returns the frequency table (as a map) of this string,
//mapping each k-mer occurring in text to its number of occurrences (with overlaps).
func FrequencyTable(text string, k int) map[string]int {
    freqMap := make(map[string]int)

    //range over all k-mer substrings of text
    for j := 0; j < len(text) - k + 1; j++ {
        pattern := text[j:j+k]
        //does freqMap[pattern] exist?
    
        // to be filled in
    }

    return freqMap
}

The question is how to check whether freqMap[pattern] exists. In Go, accessing freqMap[pattern] technically gives two values: the current value of freqMap[pattern] if pattern is a key of the map, and an optional second boolean value that is true if pattern is a key of the map and false otherwise. We can use this fact to identify if freqMap[pattern] exists by creating a variable exists to hold this boolean value.

//FrequencyTable takes as input a string text and an integer k.
//It returns the frequency table (as a map) of this string,
//mapping each k-mer occurring in text to its number of occurrences (with overlaps).
func FrequencyTable(text string, k int) map[string]int {
    freqMap := make(map[string]int)

    //range over all k-mer substrings of text
    for j := 0; j < len(text) - k + 1; j++ {
        pattern := text[j:j+k]
        _, exists := freqMap[pattern] // exists is true if pattern is a key of freqMap and false otherwise
        if exists == false { // need a new key
            freqMap[pattern] = 1
        } else { // key already is present
            freqMap[pattern]++
        }
    }

    return freqMap
}

In this particular case, we will have a shortcut approach. Rather than check whether freqMap[pattern] exists, and if so add one to it, we can simply use freqMap[pattern]++. This command will add one to freqMap[pattern] if it exists, and if it does not exist, it will create freqMap[pattern] with the default value of zero, and then add one to this value, which in this case is what we desire.

//FrequencyTable takes as input a string text and an integer k.
//It returns the frequency table (as a map) of this string,
//mapping each k-mer occurring in text to its number of occurrences (with overlaps).
func FrequencyTable(text string, k int) map[string]int {
    freqMap := make(map[string]int)

    //range over all k-mer substrings of text
    for j := 0; j < len(text) - k + 1; j++ {
        pattern := text[j:j+k]
        freqMap[pattern]++
        // if freqMap[pattern] already exists, then this is the desired behavior.
        // if freqMap[pattern] does not exist, then Go creates freqMap[pattern],
        //sets it equal to zero, and adds one to it, as desired.
    }

    return freqMap
}

Testing FindFrequentWords() and applying it to a real dataset

We now will run a brief test of our FindFrequentWords() function. We will leave tests of its two subroutines to you in the “check your work” section below.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Maps and frequent words.")

    // ...

    text := "ACGTTTTGACGACGTTT"
    k := 3
    fmt.Println(FindFrequentWords(text, k)) // should print ACG TTT

}
Exercise: Update text to be equal to the verified replication origin of Vibrio cholerae, "ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCATGTTTCCTTAACCCTCTATTTTTTACGGAAGAATGATCAAGCTGCTGCTCTTGATCATCGTTTC". Run FindFrequentWords(text, k) with k ranging between 3 and 9, inclusively, to see if you can reproduce the table from the main text, shown below.
The most frequent k-mers in the ori region of Vibrio cholerae for k from 3 to 9, along with the number of times that each k-mer occurs. 

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

  • FindFrequentWords()
  • FrequencyTable()
  • MaxMapValue()
Page Contents
Scroll to Top