Finding Clumps in Go

Learning objectives

In the previous code along, we used maps to implement a frequency table in order to find the most frequent words of given length in a text. Now that we are comfortable with maps, we will return to the problem of finding clumps in a genome, a problem that can help us scan the genome for areas that have surprisingly frequent k-mers, which may imply that these k-mers serve as “hidden messages” (i.e., that they are involved in protein-DNA binding).

Recall from the core text that we say that a k-mer pattern forms an (L, t)- clump inside a (longer) string genome if there is an interval of genome of length L in which this k-mer appears at least t times. We now restate the Clump Finding Problem.

Clump Finding Problem

Input: A string text, and integers k, L, and t.

Output: All distinct k-mers forming (L, t)-clumps in text.

We then introduced pseudocode for an algorithm called

FindClumps()
FindClumps(), which ranges over every substring
window
window of length
L
L of a given string
text
text. For each such string
window
window, it builds a frequency table for all substrings of length
k
k. Any substrings in this table occurring
t
t or more times are clumps, and we will add them to some slice
patterns
patterns if they do not already occur in
patterns
patterns.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
FindClumps(text, k, L, t)
patterns ← an array of strings of length 0
n ← length(text)
for every integer i between 0 and n − L
window ← text[i, i + L]
freqMap ← FrequencyTable(window, k)
for every key s in freqMap
if freqMap[s] ≥ t and Contains(patterns, s) = false
patterns ← append(patterns, s)
return patterns
Contains(patterns, s)
for every string pattern in patterns
if s = pattern
return true
return false
FindClumps(text, k, L, t) patterns ← an array of strings of length 0 n ← length(text) for every integer i between 0 and n − L window ← text[i, i + L] freqMap ← FrequencyTable(window, k) for every key s in freqMap if freqMap[s] ≥ t and Contains(patterns, s) = false patterns ← append(patterns, s) return patterns Contains(patterns, s) for every string pattern in patterns if s = pattern return true return false
FindClumps(text, k, L, t)
    patterns ← an array of strings of length 0
    n ← length(text)
    for every integer i between 0 and n − L
        window ← text[i, i + L]
        freqMap ← FrequencyTable(window, k)
        for every key s in freqMap
            if freqMap[s] ≥ t and Contains(patterns, s) = false
                patterns ← append(patterns, s)
    return patterns

Contains(patterns, s)
    for every string pattern in patterns
        if s = pattern
            return true
    return false

In this lesson, we will implement

FindClumps()
FindClumps() along with the
Contains()
Contains() subroutine that identifies whether a given slice of strings
patterns
patterns contains a string
s
s. We will then apply
FindClumps()
FindClumps() to the E. coli genome to identify all short strings that occur frequently in short regions and may serve as hidden messages to the cell.

Setup

Create a folder called clumps in your go/src directory and create a text file called main.go in the go/src/clumps folder. We will edit main.go, which should have the following starter code.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
)
func main() {
fmt.Println("Finding clumps.")
}
package main import ( "fmt" ) func main() { fmt.Println("Finding clumps.") }
package main

import (
    "fmt"
)

func main() {
    fmt.Println("Finding clumps.")
}

Code along video

Beneath the video, we provide a detailed summary of the topics and code covered in the code along.

At the bottom of this page, you will have the opportunity to validate your work via auto-graded assessments that evaluate the functions covered in the code along.

Although we strongly suggest completing the code along on your own, you can find completed code from the code along in our course code repository.

Code along summary

A first implementation of finding clumps

We begin with implementing

FindClumps()
FindClumps(). Because we don’t know the number of strings that will form clumps in advance, we first create a slice of strings
patterns
patterns of length zero, to which we will append any clumps that we find. We will eventually return this slice
patterns
patterns.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
patterns := make([]string, 0)
n := len(text)
// to fill in
return patterns
}
//FindClumps takes as input a string, an integer k, an integer L, and an integer t. //It returns a slice of strings representing all k-mers that appear at least t times //in a window of length L in the input string. func FindClumps(text string, k, L, t int) []string { patterns := make([]string, 0) n := len(text) // to fill in return patterns }
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
    patterns := make([]string, 0)
    n := len(text)

    // to fill in

    return patterns
}

We next will fill in the meat of the

FindClumps()
FindClumps() function. To do so, we will need to range over every possible “window” substring of
text
text having length
L
L, forming the frequency table for each such window. We know from our work with ranging over all substrings of a text of given length that the starting positions of all these windows will range from 0 to
n-L
n-L, inclusively, where
n
n is the length of
text
text. We can now fill in the beginning of our for loop.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
patterns := make([]string, 0)
n := len(text)
//range over every window of text of length L
for i := 0; i < n-L+1; i++ {
window := text[i : i+L]
freqMap := FrequencyTable(window, k)
// to fill in
}
return patterns
}
//FindClumps takes as input a string, an integer k, an integer L, and an integer t. //It returns a slice of strings representing all k-mers that appear at least t times //in a window of length L in the input string. func FindClumps(text string, k, L, t int) []string { patterns := make([]string, 0) n := len(text) //range over every window of text of length L for i := 0; i < n-L+1; i++ { window := text[i : i+L] freqMap := FrequencyTable(window, k) // to fill in } return patterns }
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
    patterns := make([]string, 0)
    n := len(text)

    //range over every window of text of length L
    for i := 0; i < n-L+1; i++ {
        window := text[i : i+L]
        freqMap := FrequencyTable(window, k)
        
        // to fill in
    }

    return patterns
}

We now will consider what to do once we have formed the frequency table

freqMap
freqMap of a given window. We want to range through this map and grab all k-mer keys whose corresponding integer values are at least equal to
t
t, the threshold value for considering a string to be a clump. If we find such a string, and it does not already occur in
patterns
patterns, then we should append this string to
patterns
patterns. We now have completed our
FindClumps()
FindClumps() function.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
patterns := make([]string, 0)
n := len(text)
//range over every window of text of length L
for i := 0; i < n-L+1; i++ {
window := text[i : i+L]
freqMap := FrequencyTable(window, k)
for s := range freqMap {
//add frequently occurring k-mers s to patterns if they don't appear already
if freqMap[s] >= t && Contains(patterns, s) == false {
patterns = append(patterns, s)
}
}
}
return patterns
}
//FindClumps takes as input a string, an integer k, an integer L, and an integer t. //It returns a slice of strings representing all k-mers that appear at least t times //in a window of length L in the input string. func FindClumps(text string, k, L, t int) []string { patterns := make([]string, 0) n := len(text) //range over every window of text of length L for i := 0; i < n-L+1; i++ { window := text[i : i+L] freqMap := FrequencyTable(window, k) for s := range freqMap { //add frequently occurring k-mers s to patterns if they don't appear already if freqMap[s] >= t && Contains(patterns, s) == false { patterns = append(patterns, s) } } } return patterns }
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
    patterns := make([]string, 0)
    n := len(text)

    //range over every window of text of length L
    for i := 0; i < n-L+1; i++ {
        window := text[i : i+L]
        freqMap := FrequencyTable(window, k)
        for s := range freqMap {
            //add frequently occurring k-mers s to patterns if they don't appear already
            if freqMap[s] >= t && Contains(patterns, s) == false {
                patterns = append(patterns, s)
            }
        }
    }

    return patterns
}

As for

Contains()
Contains(), this type of subroutine might start feeling comfortable now. To implement this function, we can use our shorthand for ranging over the values of a slice. If we ever find a value of the slice
patterns
patterns that is equal to our query string
s
s, then we return
true
true, and if we range over all values of
patterns
patterns without finding any value that is equal to
s
s
, then we return
false
false.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
//Contains takes as input a slice of strings patterns and a string s.
//It returns true if s is in the slice, and false otherwise.
func Contains(patterns []string, s string) bool {
for _, val := range patterns {
if val == s {
return true
}
}
return false
}
//Contains takes as input a slice of strings patterns and a string s. //It returns true if s is in the slice, and false otherwise. func Contains(patterns []string, s string) bool { for _, val := range patterns { if val == s { return true } } return false }
//Contains takes as input a slice of strings patterns and a string s.
//It returns true if s is in the slice, and false otherwise.
func Contains(patterns []string, s string) bool {
	for _, val := range patterns {
		if val == s {
			return true
		}
	}
	return false
}

Using a map instead of a Contains() subroutine

In fact, we don’t need a

Contains()
Contains() subroutine. Now that we have learned a little bit about maps, we can use a map
foundPatterns
foundPatterns to keep track of whether we have encountered a given string
s
s. The map
foundPatterns
foundPatterns maps strings to booleans, and we will set
foundPatterns[s]
foundPatterns[s] equal to
true
true the first time that we add
s
s to
patterns
patterns. This way, rather than check if
Contains(patterns, s)
Contains(patterns, s) is equal to
true
true, we can instead just check if
foundPatterns[s]
foundPatterns[s] is equal to
true
true.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
patterns := make([]string, 0)
n := len(text)
//map to track patterns that have already been found
foundPatterns := make(map[string]bool)
//range over every window of text of length L
for i := 0; i < n-L+1; i++ {
window := text[i : i+L]
freqMap := FrequencyTable(window, k)
for s := range freqMap {
//add frequently occurring k-mers s to patterns if they don't appear already
if freqMap[s] >= t && foundPatterns[s] == false {
patterns = append(patterns, s)
foundPatterns[s] = true // make sure to set this!
}
}
}
return patterns
}
//FindClumps takes as input a string, an integer k, an integer L, and an integer t. //It returns a slice of strings representing all k-mers that appear at least t times //in a window of length L in the input string. func FindClumps(text string, k, L, t int) []string { patterns := make([]string, 0) n := len(text) //map to track patterns that have already been found foundPatterns := make(map[string]bool) //range over every window of text of length L for i := 0; i < n-L+1; i++ { window := text[i : i+L] freqMap := FrequencyTable(window, k) for s := range freqMap { //add frequently occurring k-mers s to patterns if they don't appear already if freqMap[s] >= t && foundPatterns[s] == false { patterns = append(patterns, s) foundPatterns[s] = true // make sure to set this! } } } return patterns }
//FindClumps takes as input a string, an integer k, an integer L, and an integer t.
//It returns a slice of strings representing all k-mers that appear at least t times
//in a window of length L in the input string.
func FindClumps(text string, k, L, t int) []string {
    patterns := make([]string, 0)
    n := len(text)

    //map to track patterns that have already been found
    foundPatterns := make(map[string]bool)

    //range over every window of text of length L
    for i := 0; i < n-L+1; i++ {
        window := text[i : i+L]
        freqMap := FrequencyTable(window, k)
        for s := range freqMap {
            //add frequently occurring k-mers s to patterns if they don't appear already
            if freqMap[s] >= t && foundPatterns[s] == false {
                patterns = append(patterns, s)
                foundPatterns[s] = true // make sure to set this!
            }
        }
    }
    return patterns
}
Note: You might be curious what happens if we try to access
foundPatterns[s]
foundPatterns[s] if we have not yet come across
s
s. After all,
s
s will not be a key of
foundPatterns
foundPatterns! In Go, if you access a map value whose key does not exist, the value returned will be the default value of that type (which in this case is
false
false since it is a boolean value).

Applying the clump finding algorithm to a real bacterial genome

We would like to apply FindClumps() to a real genome. First, let’s apply it to a smaller sample dataset just to make sure that everything appears to be in order. Using the following dataset, we should print only a single occurrence of "AA".

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
)
func main() {
fmt.Println("Finding clumps.")
genome := "AAAACGTCGAAAAA"
k := 2
L := 4
t := 2
fmt.Println(FindClumps(genome, k, L, t)) // should print "AA"
}
package main import ( "fmt" ) func main() { fmt.Println("Finding clumps.") genome := "AAAACGTCGAAAAA" k := 2 L := 4 t := 2 fmt.Println(FindClumps(genome, k, L, t)) // should print "AA" }
package main

import (
    "fmt"
)

func main() {
    fmt.Println("Finding clumps.")

    genome := "AAAACGTCGAAAAA"
    k := 2
    L := 4
    t := 2

    fmt.Println(FindClumps(genome, k, L, t)) // should print "AA"
}
STOP: Open a terminal, and navigate to the clumps directory using the command cd go/src/clumps. Then, compile your code using the command go build and run your code using the command ./clumps (Mac) or clumps (Windows).

Delete this dataset from

func main()
func main(). We are ready to scale up our work to a real bacterial genome to see how many strings appear surprisingly frequently in short regions of this genome. We will choose E. coli, the most commonly studied bacterium. You can view the E. coli genome as a .txt file at the Bioinformatics Algorithms website (click here to view). However, the genome has over 4.5 million nucleotides, and copying it into
func main()
func main() would pose a struggle. Instead, let’s write some code to read this genome from the URL over the internet.

We first expand the collection of package imports to include three additional packages:

  1. "net/http": needed for accessing URLs.
  2. "io": needed to read (and write) to files.
  3. "log": needed to create logs, including error logs.
Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")
}

We next define the address that we wish to access from above as a string

url
url, and we input url into a function from the
"net/http"
"net/http"
package called
http.Get()
http.Get() that attempts to access the address. This function returns a “response” object (which we will say more about later in this course) along with an error. If the error is
nil
nil, then we know that we were able to access the address. Recall that we introduced checking error messages in our introduction to strings code along when we introduced the
"strconv"
"strconv" package.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
resp, err := http.Get(url)
if err != nil {
panic(err)
}
// to fill in ...
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt" resp, err := http.Get(url) if err != nil { panic(err) } // to fill in ... }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")

    url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }

    // to fill in ...

}

We next make sure that the “status code” of the response from the website is normal, so that we can send requests (which in this case means reading the contents of the file). To do so, we check its status code against a “normal” status (which is typically 200). If the status is not normal, then we will log the status and exit the program to prevent a crash later.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
resp, err := http.Get(url)
if err != nil {
panic(err)
}
if resp.StatusCode != http.StatusOK {
log.Fatalf("Received non-OK status: %v", resp.Status)
}
// to fill in ...
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt" resp, err := http.Get(url) if err != nil { panic(err) } if resp.StatusCode != http.StatusOK { log.Fatalf("Received non-OK status: %v", resp.Status) } // to fill in ... }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")

    url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }


    if resp.StatusCode != http.StatusOK {
        log.Fatalf("Received non-OK status: %v", resp.Status)
    }

    // to fill in ...

}

If all has gone well, then we are ready to read in the contents of the file. We can do so with the function

io.ReadAll()
io.ReadAll(), which takes
resp.Body
resp.Body as an input and returns a slice
genomeSymbols
genomeSymbols of
byte
byte variables containing every symbol present in the string within the file. Once again, if an error was obtained, then we call
panic()
panic(). Otherwise, we will print the length of the genome by printing
len(genomeSymbols)
len(genomeSymbols).

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
resp, err := http.Get(url)
if err != nil {
panic(err)
}
if resp.StatusCode != http.StatusOK {
log.Fatalf("Received non-OK status: %v", resp.Status)
}
genomeSymbols, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))
// to fill in ...
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt" resp, err := http.Get(url) if err != nil { panic(err) } if resp.StatusCode != http.StatusOK { log.Fatalf("Received non-OK status: %v", resp.Status) } genomeSymbols, err := io.ReadAll(resp.Body) if err != nil { panic(err) } fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols)) // to fill in ... }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")

    url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }


    if resp.StatusCode != http.StatusOK {
        log.Fatalf("Received non-OK status: %v", resp.Status)
    }

    genomeSymbols, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))


    // to fill in ...

}

Now that the genome has been read, we are finished with the file, and should close the connection.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
resp, err := http.Get(url)
if err != nil {
panic(err)
}
if resp.StatusCode != http.StatusOK {
log.Fatalf("Received non-OK status: %v", resp.Status)
}
genomeSymbols, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))
resp.Body.Close() // closes connection to site
// to fill in ...
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt" resp, err := http.Get(url) if err != nil { panic(err) } if resp.StatusCode != http.StatusOK { log.Fatalf("Received non-OK status: %v", resp.Status) } genomeSymbols, err := io.ReadAll(resp.Body) if err != nil { panic(err) } fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols)) resp.Body.Close() // closes connection to site // to fill in ... }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")

    url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }


    if resp.StatusCode != http.StatusOK {
        log.Fatalf("Received non-OK status: %v", resp.Status)
    }

    genomeSymbols, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))

    resp.Body.Close() // closes connection to site


    // to fill in ...

}

All that remains is for us to convert the slice of symbols in

genomeSymbols
genomeSymbols into a string that we can input into our
FindClumps()
FindClumps() function. Fortunately, Go provides the built-in function
string()
string() just for this task. Once we have set
genome
genome equal to
string(genomeSymbols)
string(genomeSymbols), we are ready to set
k
k,
L
L, and
t
t equal to their values from the core text, and then call
FindClumps(genome, k, L, t)
FindClumps(genome, k, L, t).

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
package main
import (
"fmt"
"net/http" //needed for accessing URLs
"io" //needed to read (and write) to files
"log" //needed to create logs, including errors
)
func main() {
fmt.Println("Finding clumps.")
url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"
resp, err := http.Get(url)
if err != nil {
panic(err)
}
if resp.StatusCode != http.StatusOK {
log.Fatalf("Received non-OK status: %v", resp.Status)
}
genomeSymbols, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))
resp.Body.Close() // closes connection to site
EcoliGenome := string(genomeSymbols)
k := 9
L := 500
t := 3
clumps := FindClumps(EcoliGenome, k, L, t)
fmt.Println("Found", len(clumps), "total patterns.")
}
package main import ( "fmt" "net/http" //needed for accessing URLs "io" //needed to read (and write) to files "log" //needed to create logs, including errors ) func main() { fmt.Println("Finding clumps.") url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt" resp, err := http.Get(url) if err != nil { panic(err) } if resp.StatusCode != http.StatusOK { log.Fatalf("Received non-OK status: %v", resp.Status) } genomeSymbols, err := io.ReadAll(resp.Body) if err != nil { panic(err) } fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols)) resp.Body.Close() // closes connection to site EcoliGenome := string(genomeSymbols) k := 9 L := 500 t := 3 clumps := FindClumps(EcoliGenome, k, L, t) fmt.Println("Found", len(clumps), "total patterns.") }
package main

import (
    "fmt"
    "net/http" //needed for accessing URLs
    "io" //needed to read (and write) to files
    "log" //needed to create logs, including errors
)

func main() {
    fmt.Println("Finding clumps.")

    url := "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }


    if resp.StatusCode != http.StatusOK {
        log.Fatalf("Received non-OK status: %v", resp.Status)
    }

    genomeSymbols, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println("The number of nucleotides in the genome is", len(genomeSymbols))

    resp.Body.Close() // closes connection to site

    EcoliGenome := string(genomeSymbols)

    k := 9
    L := 500
    t := 3

    clumps := FindClumps(EcoliGenome, k, L, t)
    fmt.Println("Found", len(clumps), "total patterns.")
}
Note: In practice, we can use the statement
defer resp.Body.Close()
defer resp.Body.Close() immediately after the call to
http.Get()
http.Get(). When applying the keyword
defer
defer before a command, Go will wait until the end of the current function (in this case
func main()
func main()) to execute the command.
STOP: Compile and run your code. Be patient! It may take a few minutes to complete. What can you conclude about the number of surprisingly frequent patterns in the E. coli genome?

In our code, we found nearly 2,000 9-mers that are (500, 3)-clumps in the E. coli genome, meaning that a great deal of strings appear surprisingly frequently in short regions and may be involved as “hidden messages”. Yet before returning to the main text, we would like to observe that we can make

FindClumps()
FindClumps() faster.

Optimizing the clump finding algorithm by avoiding many subroutine calls

“Premature optimization is the root of all evil.”

Donald Knuth

In the light of Knuth’s famous quotation, we should always be careful pursuing optimizations before we have a working solution to our problems. Yet in this particular case, we already have an intuitive solution, and so our desire to optimize it is not premature.

The critical insight is that the frequency tables of consecutive substring “windows” of length L in any string

text
text are nearly identical. For example, say that we are working with the text
"BANANASPLIT"
"BANANASPLIT", the window length L is equal to 6, and k is equal to 3. Then the frequency table of the first window (
"BANANA"
"BANANA") is shown in the table below.

PatternCount
"BAN"
"BAN"
1
"ANA"
"ANA"
2
"NAN"
"NAN"
1
The frequency table for the text
"BANANA"
"BANANA" with k equal to 3.

The frequency table of the second window (

"ANANAS"
"ANANAS") is shown in the table below. The only difference between this frequency table and the one above is that we lost one occurrence of
"BAN"
"BAN", the first k-mer in
"BANANA"
"BANANA", and we gained one occurrence of
"NAS"
"NAS", the final k-mer in
"ANANAS"
"ANANAS".

PatternCount
"ANA"
"ANA"
2
"NAN"
"NAN"
1
"NAS"
"NAS"
1
The frequency table for the text
"ANANAS"
"ANANAS" with k equal to 3.

STOP: How might this observation help us speed up our clump finding algorithm?

In general, say that

window1
window1 and
window2
window2 are consecutive length-L substrings of a string
text
text. The frequency table of
window1
window1 will be the same as that of
window2
window2, except that the former will include one additional occurrence of the first k-mer in
window1
window1, and the latter will include one additional occurrence of the final k-mer in
window2
window2.

Note: You may find the remainder of this section tricky. If so, no worries! It’s an advanced discussion that isn’t necessary for following the remainder of this chapter.

To use this observation to optimize

FindClumps()
FindClumps(), we note that after generating the frequency table of the first window of
text
text, we do not need to call
FrequencyTable()
FrequencyTable() to generate the frequency table of the second window of text. We need only to modify the frequency table of
text[0:L]
text[0:L] by decreasing the value associated with
text[0:k]
text[0:k] by 1, and increasing the value associated with
text[L+1-k: L+1]
text[L+1-k: L+1] by 1.

More generally, after calling

FrequencyTable()
FrequencyTable() to form the frequency table of
text[0:L]
text[0:L], we will range an integer i between 1 and
n-L+1
n-L+1 (to access all remaining windows). For each such
i
i, we update the current frequency table by decreasing the value in the frequency table associated with
text[i-1 : i-1+k]
text[i-1 : i-1+k] and increasing the value associated with
text[i+L-k : i+L]
text[i+L-k : i+L].

This is implemented with the updated

FindClumpsFaster()
FindClumpsFaster() function below. The only thing present in this function that we haven’t yet discussed is that it may be that decreasing the value of the frequency table associated with
text[i-1 : i-1+k]
text[i-1 : i-1+k] reduces that value to zero. If that is the case, then we will clean up the map by using Go’s built-in
delete()
delete() function that takes as input a map and the key that we wish to delete. In this case, where
oldPattern
oldPattern is equal to
text[i-1 : i-1+k]
text[i-1 : i-1+k], we call
delete(freqMap, oldPattern)
delete(freqMap, oldPattern). This function also uses the trick that we learned about in a previous code along introducing maps that if
freqMap[pattern]
freqMap[pattern] does not exist, we can create it and increase it by 1 with the shorthand
freqMap[pattern]++
freqMap[pattern]++.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
// FindClumpsFaster takes as input a string, an integer k, an integer L, and an integer t.
// It returns a slice of strings representing all k-mers that appear at least t times
// in a window of length L in the input string.
func FindClumpsFaster(text string, k, L, t int) []string {
patterns := make([]string, 0)
n := len(text)
// Map to track patterns that have already been found
foundPatterns := make(map[string]bool)
// Initial frequency map for the first window
freqMap := FrequencyTable(text[:L], k)
// Append any frequent patterns to patterns
for s, freq := range freqMap {
if freq >= t {
patterns = append(patterns, s)
foundPatterns[s] = true
}
}
for i := 1; i < n-L+1; i++ {
// Remove the pattern from the beginning of the last window
oldPattern := text[i-1 : i-1+k]
freqMap[oldPattern]--
//clean up the map
if freqMap[oldPattern] == 0 {
delete(freqMap, oldPattern)
}
// Add the pattern from the end of the current window
newPattern := text[i+L-k : i+L]
freqMap[newPattern]++
// Update patterns according to current map
for s, freq := range freqMap {
if freq >= t && !foundPatterns[s] {
patterns = append(patterns, s)
foundPatterns[s] = true
}
}
}
return patterns
}
// FindClumpsFaster takes as input a string, an integer k, an integer L, and an integer t. // It returns a slice of strings representing all k-mers that appear at least t times // in a window of length L in the input string. func FindClumpsFaster(text string, k, L, t int) []string { patterns := make([]string, 0) n := len(text) // Map to track patterns that have already been found foundPatterns := make(map[string]bool) // Initial frequency map for the first window freqMap := FrequencyTable(text[:L], k) // Append any frequent patterns to patterns for s, freq := range freqMap { if freq >= t { patterns = append(patterns, s) foundPatterns[s] = true } } for i := 1; i < n-L+1; i++ { // Remove the pattern from the beginning of the last window oldPattern := text[i-1 : i-1+k] freqMap[oldPattern]-- //clean up the map if freqMap[oldPattern] == 0 { delete(freqMap, oldPattern) } // Add the pattern from the end of the current window newPattern := text[i+L-k : i+L] freqMap[newPattern]++ // Update patterns according to current map for s, freq := range freqMap { if freq >= t && !foundPatterns[s] { patterns = append(patterns, s) foundPatterns[s] = true } } } return patterns }
// FindClumpsFaster takes as input a string, an integer k, an integer L, and an integer t.
// It returns a slice of strings representing all k-mers that appear at least t times
// in a window of length L in the input string.
func FindClumpsFaster(text string, k, L, t int) []string {
	patterns := make([]string, 0)
	n := len(text)

	// Map to track patterns that have already been found
	foundPatterns := make(map[string]bool)

	// Initial frequency map for the first window
	freqMap := FrequencyTable(text[:L], k)

	// Append any frequent patterns to patterns
	for s, freq := range freqMap {
		if freq >= t {
			patterns = append(patterns, s)
			foundPatterns[s] = true
		}
	}

	for i := 1; i < n-L+1; i++ {
		// Remove the pattern from the beginning of the last window
		oldPattern := text[i-1 : i-1+k]
		freqMap[oldPattern]--

		//clean up the map
		if freqMap[oldPattern] == 0 {
			delete(freqMap, oldPattern)
		}

		// Add the pattern from the end of the current window
		newPattern := text[i+L-k : i+L]
		freqMap[newPattern]++

		// Update patterns according to current map
		for s, freq := range freqMap {
			if freq >= t && !foundPatterns[s] {
				patterns = append(patterns, s)
				foundPatterns[s] = true
			}
		}

	}
	
    return patterns
}
STOP: When we encounter the first for loop appending frequent words to
patterns
patterns, we don’t check that
foundPattern[s]
foundPattern[s] is
false
false. Why do you think this is the case?

Timing the optimized approach

Now that we have optimized our clump finding algorithm, let’s see if it truly is any faster by timing both

FindClumps()
FindClumps() and
FindClumpsFaster()
FindClumpsFaster(). On our machine, the former runs in about 2.5 minutes, whereas the latter completes in under 30 seconds. Once again, we encounter the paradigm that we introduced in Chapter 0 that efficient algorithms are particularly important when applying them to large datasets.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
func main() {
fmt.Println("Finding clumps.")
// code for reading in genome from file ...
start := time.Now()
FindClumps(substring, k, L, t)
elapsed := time.Since(start)
log.Printf("FindClumps() took %s", elapsed)
start2 := time.Now()
FindClumpsFaster(substring, k, L, t)
elapsed2 := time.Since(start2)
log.Printf("FindClumpsFaster() took %s", elapsed2)
}
func main() { fmt.Println("Finding clumps.") // code for reading in genome from file ... start := time.Now() FindClumps(substring, k, L, t) elapsed := time.Since(start) log.Printf("FindClumps() took %s", elapsed) start2 := time.Now() FindClumpsFaster(substring, k, L, t) elapsed2 := time.Since(start2) log.Printf("FindClumpsFaster() took %s", elapsed2) }
func main() {
    fmt.Println("Finding clumps.")

    // code for reading in genome from file ...

    start := time.Now()
    FindClumps(substring, k, L, t)
    elapsed := time.Since(start)
    log.Printf("FindClumps() took %s", elapsed)

    start2 := time.Now()
    FindClumpsFaster(substring, k, L, t)
    elapsed2 := time.Since(start2)
    log.Printf("FindClumpsFaster() took %s", elapsed2)

}

Unfortunately, although our clump finding algorithm finds candidate frequent words across the entire genome, it finds too many clumps, which means that if we are looking for a specific region like the origin of replication, we will need a targeted approach. We will introduce just such a method in the next code along.

STOP: How would increasing or decreasing each of the the parameters
k
k,
L
L, and
t
t affect the number of clumps we find? Run
FindClumpsFaster()
FindClumpsFaster() multiple times with different values of k, L, and
t
t. How does changing these parameter values affect the number of clumps that you find?

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

  • FindClumps()
    FindClumps()

close

Love P4❤️? Join us and help share our journey!

Page Contents