Code Along - Substrings and Subslices in Go

Learning objectives

In this lesson, we will return to the problem that we introduced in the main text of identifying the number of occurrences of one string as a substring as another, which we stated as the substring counting problem.

Substring Counting Problem

Input: A string pattern and a longer string text.

Output: The number of times that pattern occurs as a substring of text.

We then introduced one solution for this problem in the form of the following function PatternCount().

PatternCount(pattern, text)
    count ← 0
    n ← length(text)
    k ← length(pattern)
    for every integer i between 0 and n − k
        if text[i, i + k] = pattern
            count ← count + 1
    return count

In the following lesson, we encountered the more general problem of identifying the starting positions of all the occurrences of pattern in text, which we stated as the following computational problem.

Pattern Matching Problem

Input: Strings pattern and genome.

Output: All starting positions in genome where pattern appears as a substring.

We solved this problem using the following function StartingIndices().

StartingIndices(pattern, text)
    positions ← array of integers of length 0
    n ← length(text)
    k ← length(pattern)
    for every integer i between 0 and n − k
        if text[i, i + k] = pattern
            positions ← append(positions, i)
    return positions

We then noticed that the Pattern Matching Problem is essentially a more general version of the Substring Counting Problem, since if we can identify the starting positions of pattern in text, then the number of these positions solves the Substring Counting Problem. Accordingly, we wrote a much shorter version of PatternCount() that calls StartingIndices() as a subroutine, thus further demonstrating the power of modularity.

PatternCount(pattern, text)
    positions ← StartingIndices(pattern, text)
    return length(positions)

In this code along, we will implement all of these algorithms. To do so, we will need to learn more about how Go implements substrings.

Setup

Create a folder called substrings in your go/src directory and create a text file called main.go in the go/src/substrings folder. We will edit main.go, which should have the following starter code.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Substrings (and subslices).")
}

Code along video

Beneath the video, we provide a detailed summary of the topics and code covered in the code along.

At the bottom of this page, you will have the opportunity to validate your work via auto-graded assessments that evaluate the functions covered in the code along.

Although we strongly suggest completing the code along on your own, you can find completed code from the code along in our course code repository.

Code along summary

Indexing substrings

We have already introduced a language-neutral notation for substrings, using text[i, i+k] to denote the substring of text starting at position i having length k. This substring starts at position i and goes up to but not including position i + k. (This notation appears in StartingIndices() above.)

In Go, the indexing is the same, but we use a colon instead of a comma to separate i and k. The following example prints the substring of s starting at position 1 having length 4.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")
    
    s := "Hi Lovers!"
    fmt.Println(s[1:5]) // prints "i Lo"
}

If we wanted to access the substring of s starting at position 0 having length 7, we could use the notation s[0:7], but Go includes a shorthand syntax for prefixes of a string in which we do not need to specify the starting position.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")
    
    s := "Hi Lovers!"
    fmt.Println(s[1:5]) // prints "i Lo"
    fmt.Println(s[:7]) // prints "Hi Love"
}

The indices of s range between 0 and len(s)-1. To access a suffix of s, because the second index of a substring does not include the symbol at that position, the second index should be len(s). For example, we could access the suffix of s starting at position 4 using s[4:len(s)]. In practice, just as we do not need to include 0 as the first index when accessing a prefix, we do not need to include the second index when accessing a suffix; that is, we can access the suffix of s starting at position 4 using s[4:].

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")
    
    s := "Hi Lovers!"
    fmt.Println(s[1:5]) // prints "i Lo"
    fmt.Println(s[:7]) // prints "Hi Love"
    fmt.Println(s[4:]) // prints "overs!"
}

Indexing subslices

The same notation can be used for accessing subslices (or subarrays). In the code below, we create a slice of length 6 holding the first six odd positive integers, and access subslices of it.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")
    
    //...

    a := make([]int, 6)
    for i := range a {
        a[i] = 2*i + 1
    }

    fmt.Println(a) // prints [1 3 5 7 9 11]
    fmt.Println(a[3:5]) // prints [7 9]
    fmt.Println(a[:3]) // prints [1 3 5]
    fmt.Println(a[4:]) // prints [9 11]
}

Exercise: You now know enough to implement PatternCount() and StartingIndices(). Give it a try on your own, and we will discuss their implementations in the coming sections.

A first try at implementing PatternCount(), and diagnosing a common bug

We will now turn to implementing PatternCount(); our first attempt is shown below.

//PatternCount takes as input two strings, a pattern and a text.
//It returns the number of (possibly overlapping) occurrences of pattern in text.
func PatternCount(pattern, text string) int {
    count := 0
    k := len(pattern)

    // range over all substrings of text, and increment count if we find a pattern match
    for i := range text {
        //does the substring of text starting at position i having length k match pattern?
        if text[i:i+k] == pattern {
            count++
        }
    }
    return count
}

STOP: Our implementation of PatternCount() has a bug. What is it?

Let’s test our function in func main() on a sample pattern and text.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")

    //...
    
    pattern := "ATA"
    text := "ATATA"

    fmt.Println(PatternCount(pattern, text))
}

When we compile and run this code, we get a common runtime error: slice bounds out of range. We added a bug to our code in order to feature this error, which is one of the most dreaded error messages for every programmer. It means that we have attempted to set an element of a slice or string that does not exist, and it occurs in one of two cases with respect to the index k when accessing the slice element a[k].

k is negative.
k is greater than or equal to the length of a. (Remember that the indices of a range between 0 and len(a)-1.)

Our current bug fell into trap #2. Using the loop for i := range text { to access the starting positions of all k-mer substrings of text encounters the issue that text will have are fewer than len(text) total k-mers. For example, in the above example, k is equal to 3, and the length of text is equal to 5. If we range i over the length of text, then i will take the values between 0 and 4, inclusively. When i is equal to 2, PatternCount() checks text[2:5], which is equal to "ATA". But when i is equal to 3, PatternCount() checks text[3:6], which does not exist because text has length 5.

For a more visual example, recall the figure from the main text below, which shows the window-sliding approach of PatternCount() when the text has length equal to 13, k is equal to 3, and the final k-mer considered starts at position 11.

In this case, we were able to reason through our bug, but what could we do in a more complicated case? One thing that we might do to diagnose an out-of-bounds issue is to print the index of the array or slice so that we know the index when the error is created. In this case, we will print i at the start of the for loop.

//PatternCount takes as input two strings, a pattern and a text.
//It returns the number of (possibly overlapping) occurrences of pattern in text.
func PatternCount(pattern, text string) int {
    count := 0
    k := len(pattern)

    // range over all substrings of text, and increment count if we find a pattern match
    for i := range text {
        fmt.Println(i)
        //does the substring of text starting at position i having length k match pattern?
        if text[i:i+k] == pattern {
            count++
        }
    }
    return count
}

Now when we run our code, we will see 0, 1, 2, and 3 printed to the console before the error message is reached. As a result, we have learned that the issue in our code arises for this text and pattern when i is equal to 3. Keep this trick in mind when you inevitably encounter your own out-of-bounds errors!

STOP: Where should the indexing of i stop for an arbitrary k and text?

Fixing our implementation of PatternCount()

To fix our implementation of PatternCount(), we rely on the results of an exercise from the main text, and a fact that many a computational biologist has committed to memory. This fact is that a string of length n possesses n – k + 1 substrings of length k, which range from starting position 0 to starting position n – k. As a result, the condition of our for loop needs to be either i <= n-k or i < n - k + 1. We prefer the latter, since it is consistent with our notation for indexing substrings and subslices.

//PatternCount takes as input two strings, a pattern and a text.
//It returns the number of (possibly overlapping) occurrences of pattern in text.
func PatternCount(pattern, text string) int {
    count := 0
    k := len(pattern)
    n := len(text)

    // range over all substrings of text, and increment count if we find a pattern match
    for i := 0; i < n-k+1; i++ {
        //does the substring of text starting at position i having length k match pattern?
        if text[i:i+k] == pattern {
            count++
        }
    }
    return count
}

We now return to func main(), where we can test our updated implementation of PatternCount() on the previous example.

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")

    //...
    
    pattern := "ATA"
    text := "ATATA"

    fmt.Println(PatternCount(pattern, text)) // prints 2
}

We contrast the outcome of PatternCount() with the result of calling the built-in function strings.Count() from the strings package, which returns 1 instead of 2 because this function does not account for overlapping occurrences of pattern, further justifying the need for us to write our own function. Note also that the order of the parameters is reversed in strings.Count().

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Strings!")

    //...
    
    pattern := "ATA"
    text := "ATATA"

    fmt.Println(PatternCount(pattern, text)) // prints 2
    fmt.Println(strings.Count(text, pattern)) // prints 1 (doesn't count overlapping matches)
}

Implementing StartingIndices() and using it to update PatternCount()

We now turn to implementing StartingIndices() to find not just the number of occurrences of pattern in text, but identifying their starting positions. As we pointed out previously, this problem is more general, and so if we implement StartingIndices() to return a slice of starting positions, we need only to return the length of this slice to implement PatternCount(). We call the resulting function PatternCount2() because Go will not allow us to assign the same name to two different functions in the same folder.

//PatternCount2 takes as input two strings, a pattern and a text.
//It returns the number of (possibly overlapping) occurrences of pattern in text.
//It relies on StartingIndices() as a subroutine.
func PatternCount2(pattern, text string) int {
    return len(StartingIndices(pattern, text))
}

We now present StartingIndices(), which has the same structure as PatternCount(). The difference is that instead of maintaining a counter variable, we maintain an initially empty slice positions, and we then append to positions each integer i where text[i:i+k] matches pattern.

//StartingIndices takes as input two strings pattern and text.
//It returns a slice containing all starting positions of pattern in text.
func StartingIndices(pattern, text string) []int {
    positions := make([]int, 0) // we don't know the number of hits in advance

    k := len(pattern)
    n := len(text)

    //range over all substrings of text, and append to positions if we find a match
    for i := 0; i < n-k+1; i++ {
        //does the substring of text starting at position i having length k match pattern?
        if text[i:i+k] == pattern {
            positions = append(positions, i)
        }
    }

    return positions
}

StartingIndices() offers an example of a more general principle in programming, which is that when solving a problem, it can sometimes be no more difficult to solve a more general version of the problem.

Now that we have started learning how to implement string algorithms, we turn in the next code along to one of the two main problems considered in the main text, that of finding the most frequent k-mers in a given string.

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

PatternCount()
StartingIndices()

Next lesson