Learning objectives
In the previous code along, we introduced maps in Go, and we are now ready to use what we have learned to find the most frequent strings (of some fixed length) within a text. This task is useful when hunting through a short region of a genome to find a frequently occurring “hidden message” of interest (such as a protein binding site), and it relies on the following computational problem.
Frequent Words Problem
Input: A string text and an integer k.
Output: All most frequent k-mers in text.
In the core text, we saw that we could solve this problem quickly using a map (also known as a dictionary). First, given a string text
and an integer k
, we use a function FrequencyTable()
to generate a map that associates each substring of length k
identified in text
to its number of occurrences in text
.
FrequencyTable(text, k) freqMap ← empty map n ← length(text) for every integer i between 0 and n − k pattern ← text[i, i + k] if freqMap[pattern] doesn't exist freqMap[pattern] = 1 else freqMap[pattern]++ return freqMap
We can now use this function as a subroutine when solving the Frequent Words Problem using the following BetterFrequentWords()
function.
BetterFrequentWords(text, k) frequentPatterns ← an array of strings of length 0 freqMap ← FrequencyTable(text, k) max ← MaxMapValue(freqMap) for all strings pattern in freqMap if freqMap[pattern] = max frequentPatterns ← append(frequentPatterns, pattern) return frequentPatterns
This function also relies on a subroutine MaxMapValue()
that identifies the maximum value in a given dictionary mapping strings to integers, which we wrote as follows.
MaxMapValue(dict) m ← 0 firstTime = true for every key pattern in dict if firstTime = true or dict[pattern] > m firstTime= false m ← dict[pattern] return m
In this lesson, we will learn how to work with maps in Go, and we will then return to implement all of these functions.
Setup
Create a folder called frequentWords
in your go/src
directory and create a text file called main.go
in the go/src/frequentWords
folder. We will edit main.go
, which should have the following starter code.
package main import ( "fmt" ) func main() { fmt.Println("Finding frequent words.") }
Code along video
Beneath the video, we provide a detailed summary of the topics and code covered in the code along.
At the bottom of this page, you will have the opportunity to validate your work via auto-graded assessments that evaluate the functions covered in the code along.
Although we strongly suggest completing the code along on your own, you can find completed code from the code along in our course code repository.
Code along summary
Implementing FrequentWords()
To implement FindFrequentWords()
, we start by declaring our list of strings that we eventually want to return; we do not know in advance how many elements we will return because text
may have multiple most frequent k-mers.
func FindFrequentWords(text string, k int) []string { freqPatterns := make([]string, 0) // to be filled in return freqPatterns }
We then build the frequency table for text
and k
, which we pass to a subroutine.
//FindFrequentWords takes as input a string text and an integer k. //It returns a slice containing the most frequent k-mers occurring in text, including overlaps. func FindFrequentWords(text string, k int) []string { freqPatterns := make([]string, 0) freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences // to be filled in return freqPatterns }
Now that we have the frequency table, we can identify its maximum value max
, a task that we pass to another subroutine.
//FindFrequentWords takes as input a string text and an integer k. //It returns a slice containing the most frequent k-mers occurring in text, including overlaps. func FindFrequentWords(text string, k int) []string { freqPatterns := make([]string, 0) freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences max := MaxMapValue(freqMap) // to be filled in return freqPatterns }
Because we have relied on subroutines, FindFrequentWords()
is nearly complete. We just need to range over the map, identifying which keys have values equal to max
, and appending these strings to freqPatterns
.
//FindFrequentWords takes as input a string text and an integer k. //It returns a slice containing the most frequent k-mers occurring in text, including overlaps. func FindFrequentWords(text string, k int) []string { freqPatterns := make([]string, 0) freqMap := FrequencyTable(text, k) // will map k-mer substrings of text to their number of occurrences max := MaxMapValue(freqMap) //range over map, looking for strings (keys) whose corresponding values achieve the maximum for pattern, count := range freqMap { if count == max { //append the pattern to freqPatterns freqPatterns = append(freqPatterns, pattern) } } return freqPatterns }
Great! Now we just need to write the MaxMapValue()
and FrequencyTable()
subroutines.
Exercise: Try implementingMaxMapValue()
andFrequencyTable()
.
Implementing MaxMapValue()
We start with MaxMapValue()
and declare a variable m
to be equal to zero, which will hold our maximum value, and which we will eventually return.
//MaxMapValue takes as input a map of strings to integers. //It returns the maximum value in the map. func MaxMapValue(dict map[string]int) int { m := 0 // will hold the maximum value // to be filled in return m }
This type of function should hopefully start to feel natural. We seemingly just need to range over the values of the map, and if we find one that is larger than m
, then we will update m
to be equal to this value. Note in the code below that we use an underscore (_
) to refer to the current key of the map because this key is not needed.
//MaxMapValue takes as input a map of strings to integers. //It returns the maximum value in the map. func MaxMapValue(dict map[string]int) int { m := 0 // will hold the maximum value for _, value := range dict { if value > m { m = value } } return m }
STOP: This function has a bug. What is it?
The astute learner will recall that when implementing a function MaxIntegerArray()
to find the minimum value in a given slice of integers, we needed to be careful to consider the case when every element in the array was positive. One implementation of MaxIntegerArray()
is shown below, which handles this case by updating the current maximum value if we are considering the first element of the slice.
//MaxIntegerArray takes as input a slice of integers. //It returns the maximum value in the slice. func MaxIntegerArray(a []int) int { m := 0 //range over slice, and if current element is 0-th element or larger than the current maximum, //update m to be this value for i := range a { if i == 0 || a[i] > m { a[i] = m } } return m }
The problem with extending this idea to implement MaxMapValue()
is that maps do not have a set order, and so there is no such thing as the initial element of a map.
STOP: How can we circumvent this problem and fixMaxMapValue()
?
The workaround that we will use is to declare a boolean variable, which we call firstTimeThrough
, and set it equal to true
before the loop. Rather than check if i
is equal to zero, we will check if firstTimeThrough
is equal to false
.
//MaxMapValue takes as input a map of strings to integers. //It returns the maximum value in the map. func MaxMapValue(dict map[string]int) int { m := 0 // will hold the maximum value firstTimeThrough := true // turn on a Boolean variable to indicate our "first" element for _, value := range dict { if firstTimeThrough == true { m = value firstTimeThrough = false // turn off our "first time" switch, will never again be true } else if value > m { m = value } } return m }
We make one more simplification. Because firstTimeThrough
is boolean, we don’t need to check whether it is equal to true
. We can simply use it as the first part of our condition.
//MaxMapValue takes as input a map of strings to integers. //It returns the maximum value in the map. func MaxMapValue(dict map[string]int) int { m := 0 // will hold the maximum value firstTimeThrough := true // turn on a Boolean variable to indicate our "first" element for _, value := range dict { if firstTimeThrough { // this is equivalent to if firstTimeThrough == true m = value firstTimeThrough = false // turn off our "first time" switch, will never again be true } else if value > m { m = value } } return m }
Implementing FrequencyTable()
Our second subroutine to implement is FrequencyTable()
. We will start with declaring the map freqMap
that will hold the frequency table, which we will eventually return.
//FrequencyTable takes as input a string text and an integer k. //It returns the frequency table (as a map) of this string, //mapping each k-mer occurring in text to its number of occurrences (with overlaps). func FrequencyTable(text string, k int) map[string]int { freqMap := make(map[string]int) // to be filled in return freqMap }
We then will range over all k-mer substrings of text, which we already know how to do from our work with counting the number of occurrences of a k-mer in text using PatternCount()
. Each time through the loop, we will check whether the current k-mer pattern
appears as a key in the frequency table. If so, we will increase freqMap[pattern]
by 1; if not, we will set freqMap[pattern]
equal to 1.
//FrequencyTable takes as input a string text and an integer k. //It returns the frequency table (as a map) of this string, //mapping each k-mer occurring in text to its number of occurrences (with overlaps). func FrequencyTable(text string, k int) map[string]int { freqMap := make(map[string]int) //range over all k-mer substrings of text for j := 0; j < len(text) - k + 1; j++ { pattern := text[j:j+k] //does freqMap[pattern] exist? // to be filled in } return freqMap }
The question is how to check whether freqMap[pattern]
exists. In Go, accessing freqMap[pattern]
technically gives two values: the current value of freqMap[pattern]
if pattern
is a key of the map, and an optional second boolean value that is true
if pattern
is a key of the map and false
otherwise. We can use this fact to identify if freqMap[pattern]
exists by creating a variable exists
to hold this boolean value.
//FrequencyTable takes as input a string text and an integer k. //It returns the frequency table (as a map) of this string, //mapping each k-mer occurring in text to its number of occurrences (with overlaps). func FrequencyTable(text string, k int) map[string]int { freqMap := make(map[string]int) //range over all k-mer substrings of text for j := 0; j < len(text) - k + 1; j++ { pattern := text[j:j+k] _, exists := freqMap[pattern] // exists is true if pattern is a key of freqMap and false otherwise if exists == false { // need a new key freqMap[pattern] = 1 } else { // key already is present freqMap[pattern]++ } } return freqMap }
In this particular case, we will have a shortcut approach. Rather than check whether freqMap[pattern]
exists, and if so add one to it, we can simply use freqMap[pattern]++
. This command will add one to freqMap[pattern]
if it exists, and if it does not exist, it will create freqMap[pattern]
with the default value of zero, and then add one to this value, which in this case is what we desire.
//FrequencyTable takes as input a string text and an integer k. //It returns the frequency table (as a map) of this string, //mapping each k-mer occurring in text to its number of occurrences (with overlaps). func FrequencyTable(text string, k int) map[string]int { freqMap := make(map[string]int) //range over all k-mer substrings of text for j := 0; j < len(text) - k + 1; j++ { pattern := text[j:j+k] freqMap[pattern]++ // if freqMap[pattern] already exists, then this is the desired behavior. // if freqMap[pattern] does not exist, then Go creates freqMap[pattern], //sets it equal to zero, and adds one to it, as desired. } return freqMap }
Testing FindFrequentWords() and applying it to a real dataset
We now will run a brief test of our FindFrequentWords()
function. We will leave tests of its two subroutines to you in the “check your work” section below.
package main import ( "fmt" ) func main() { fmt.Println("Maps and frequent words.") // ... text := "ACGTTTTGACGACGTTT" k := 3 fmt.Println(FindFrequentWords(text, k)) // should print ACG TTT }
Exercise: Updatetext
to be equal to the verified replication origin of Vibrio cholerae,"ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCATGTTTCCTTAACCCTCTATTTTTTACGGAAGAATGATCAAGCTGCTGCTCTTGATCATCGTTTC"
. RunFindFrequentWords(text, k)
withranging between 3 and 9, inclusively, to see if you can reproduce the table from the main text, shown below.
k
Check your work from the code along
We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:
FindFrequentWords()
FrequencyTable()
MaxMapValue()