Code Along - Substrings and Sublists in Python

Code along video

Although we strongly suggest coding along with us by following the video above, you can find completed code from the code along in our course code repository.

Learning objectives

In this lesson, we will return to the problem that we introduced in the main text of identifying the number of occurrences of one string as a substring of another, which we stated as the substring counting problem.

Substring Counting Problem

Input: A string pattern and a longer string text.

Output: The number of times that pattern occurs as a substring of text.

We then introduced one solution for this problem in the form of the following function PatternCount().

PatternCount(pattern, text)
    count ← 0
    n ← length(text)
    k ← length(pattern)
    for every integer i between 0 and n − k
        if text[i, i + k] = pattern
            count ← count + 1
    return count

In the following lesson, we encountered the more general problem of identifying the starting positions of all the occurrences of pattern in text, which we stated as the following computational problem.

Pattern Matching Problem

Input: Strings pattern and text.

Output: All starting positions in text where pattern appears as a substring.

We solved this problem using the following function StartingIndices().

StartingIndices(pattern, text)
    positions ← array of integers of length 0
    n ← length(text)
    k ← length(pattern)
    for every integer i between 0 and n − k
        if text[i, i + k] = pattern
            positions ← append(positions, i)
    return positions

We then noticed that the Pattern Matching Problem is essentially a more general version of the Substring Counting Problem, since if we can identify the starting positions of pattern in text, then the number of these positions solves the Substring Counting Problem. Accordingly, we wrote a much shorter version of PatternCount() that calls StartingIndices() as a subroutine, thus further demonstrating the power of modularity.

PatternCount(pattern, text)
    positions ← StartingIndices(pattern, text)
    return length(positions)

In this code along, we will implement all of these algorithms. To do so, we will need to learn more about how Python implements substrings.

Code along summary

Setup

Create a folder called substrings in your python/src directory and create a text file called main.py in the python/src/substrings folder. We will edit main.py, which should have the following starter code.

def main():
    print("Substrings (and sublists).")

if __name__ == "__main__":
    main()

Indexing substrings

In the core text, we introduced a language-neutral pseudocode notation for substrings, using text[i, i + k] to denote the substring of text starting at position i with length k. Another way of interpreting the notation is that the substring starts at position i and goes up to but not including position i + k. (Note that this notation appears in starting_indices() above.)

This indexing is common in modern programming languages and is adopted by Python, except that we use a colon instead of a comma to separate the indices i and k. The following example prints the substring of s starting at position 1 having length 4.

def main():
    print("Substrings (and sublists)")
    s = "Hi Lovers"

    print(s[1:5]) # prints "i Lo"

If we wanted to access the substring of s starting at position 0 having length 7, then we could use the notation s[0:7], but Python includes a shorthand syntax for prefixes of a string in which we do not need to specify the starting position when it is equal to zero.

def main():
    print("Substrings (and sublists)")
    s = "Hi Lovers"

    print(s[1:5]) # prints "i Lo"
    print(s[:7])  # prints "Hi Love"

The indices of s range between 0 and len(s) - 1. To access a suffix of s, because the second index of a substring does not include the symbol at that position, the second index should be len(s). For example, we could access the suffix of s starting at position 4 using s[4:len(s)]. In practice, just as we do not need to include 0 as the first index when accessing a prefix, we do not need to include the second index when accessing a suffix; that is, we can access the suffix of s starting at position 4 using s[4:].

def main():
    print("Substrings (and sublists)")
    s = "Hi Lovers"

    print(s[1:5]) # prints "i Lo"
    print(s[:7])  # prints "Hi Love"
    print(s[4:])  # prints "overs"

Click Run 👇 to try it!

STOP: Open a terminal window, and navigate into the folder containing main.py by executing cd python/src/substrings. Then run your code by executing python3 main.py (on macOS/Linux) or python main.py (on Windows). You should see i Lo, Hi Love, and overs printed to the console, each on its own line.

Indexing sublists

Analogous notation can be used for accessing sublists (or subarrays). In the code below, we create a list of length 6 holding the first six odd positive integers ([1, 3, 5, 7, 9, 11]), and then we print sublists of it.

def main():
    print("Substrings (and sublists)")
    # ...  

    a = [0] * 6

    for i in range(len(a)):
        a[i] = 2*i + 1

    print(a) 
    print(a[3:5]) # Prints [7, 9]
    print(a[:3])  # Prints [1, 3, 5]
    print(a[4:])  # Prints [9, 11]

Click Run 👇 to try it!

When we save and run our code, we see that a[3:5] is [7, 9], a[:3] is [1, 3, 5], and a[4:] is [9, 11].

Exercise: You now know enough to implement pattern_count() and starting_indices(). Give it a try on your own, and we will discuss their implementations in the coming sections.

A first try at counting patterns

We will now turn to implementing pattern_count(); we start by deciding what to do if the length of pattern is zero or if pattern is longer than text.

def pattern_count(pattern: str, text: str) -> int:
    """
    pattern_count finds the number of times the substring occurs in
    a given text string.

    Parameters:
    - pattern (str): The substring we are searching for in text.
    - text (str): The parent text string we are searching against.

    Returns:
    - int: The number of times that pattern occurs in text, including overlaps.
    """
    
    k = len(pattern)
    n = len(text)

    if k == 0:
        raise ValueError("pattern_count: empty pattern not allowed.")
    if k > n:
        return 0

We now will initialize a count variable equal to zero. We then range a variable i from 0 to len(text)-1, checking at each point in time if the substring of text starting at i (text[i: i+k]) is equal to pattern; if so, then we increment count. At the end of this process, we return count.

def pattern_count(pattern: str, text: str) -> int:
    """
    pattern_count finds the number of times the substring occurs in
    a given text string.

    Parameters:
    - pattern (str): The substring we are searching for in text.
    - text (str): The parent text string we are searching against.

    Returns:
    - int: The number of times that pattern occurs in text, including overlaps.
    """
    
    k = len(pattern)
    n = len(text)

    if k == 0:
        raise ValueError("pattern_count: empty pattern not allowed.")
    if k > n:
        return 0

    count = 0
    
    # range over text, incrementing count every time we find a pattern match
    for i in range(n):
        # Check to see if pattern matches substring of length k.
        if pattern == text[i:i + k]:
            count += 1
    return count

Let’s test our function in main() on a sample pattern and text.

def main():
    print("Substrings (and sublists)")

    # ...


    pattern = "ATA"
    text = "ATATA"

    print(pattern_count(pattern, text))

Click Run 👇 to try it!

When we run this code, we see that 2 is printed, as we might imagine. However, the function does have a slight issue. Recall from the previous code along that, like lists, the indices of a string text range from 0 to len(text)-1. However, using the loop

for i in range(len(text))

to access the starting positions of all k-mer substrings of text is not quite correct because text has fewer than len(text) total k-mers. For example, in the above example, k is equal to 3, and the length of text is equal to 5. If we range i over the length of text, then i will take the values between 0 and 4, inclusively. When i is equal to 2, pattern_count() checks text[2:5], which is equal to "ATA". But when i is equal to 3, pattern_count() checks text[3:6], which is "TA" because text has length 5.

For a more visual example, recall the figure from the core text below, which shows the window-sliding approach of pattern_count() when the text has length equal to 13, k is equal to 3, and the final k-mer considered starts at position 11.

**Figure:** Sliding a window to show how `PatternCount()`
computes that *pattern* = “`ATA`” occurs three times *text* = “`CGATATATCCATAG`“. We initialize *count* to zero and then increment it each time that *pattern* appears in *text* (shown in green).

We might catch this issue by printing the index of the list or print certain components of the variables so that we know the index when the error is created. In this case, we will print text[i:i + k] at the start of the for loop.

def pattern_count(pattern: str, text: str) -> int:
    """
    pattern_count finds the number of times the substring occurs in
    a given text string.

    Parameters:
    - pattern (str): The substring we are searching for in text.
    - text (str): The parent text string we are searching against.

    Returns:
    - int: The number of times that pattern occurs in text, including overlaps.
    """

    k = len(pattern)
    n = len(text)

    if k == 0:
        raise ValueError("pattern_count: empty pattern not allowed.")
    if k > n:
        return 0

    count = 0
    
    # range over text, incrementing count every time we find a pattern match
    for i in range(n):
        print(text[i:i + k])
        # Check to see if pattern matches substring of length k.
        if pattern == text[i:i + k]:
            count += 1
    return count

Click Run 👇 to try it!

This time, when we run our code, we will see ATA, TAT, ATA, TA, and A printed to the console. One might imagine that if text[i:i+k] does not fall completely within text, then Python would raise an error. Many languages would take this action, but Python does not, instead just truncating any substrings that fall off text.

Exercise: Where should the indexing of i stop for an arbitrary integer k and string text?

Fixing our implementation of pattern_count()

To fix our implementation of pattern_count(), we rely on the results of an exercise from the main text, and a fact that many a computational biologist has committed to memory. This fact is that a string of length n possesses n - k + 1 substrings of length k, which range from starting position 0 to starting position n - k. As a result, the condition of our for loop needs to be either i <= n-k or i < n - k + 1. We prefer the latter, since it is consistent with our notation for indexing substrings and sublists.

def pattern_count(pattern: str, text: str) -> int:
    """
    pattern_count finds the number of times the substring occurs in
    a given text string.

    Parameters:
    - pattern (str): The substring we are searching for in text.
    - text (str): The parent text string we are searching against.

    Returns:
    - int: The number of times that pattern occurs in text, including overlaps.
    """

    k = len(pattern)
    n = len(text)

    if k == 0:
        raise ValueError("pattern_count: empty pattern not allowed.")
    if k > n:
        return 0

    count = 0

    # range over text, incrementing count every time we find a pattern match
    for i in range(n - k + 1):
        # Check to see if pattern matches substring of length k.
        if pattern == text[i:i + k]:
            count += 1
    return count

We now return to main(), where we can test our updated implementation of pattern_count() on the previous example.

def main():
    print("Substrings (and sublists)")

    # ...


    pattern = "ATA"
    text = "ATATA"

    print(pattern_count(pattern, text)) # Prints 2

We contrast the outcome of pattern_count() with the result of calling the built-in Python function count() (which is called as text.count(pattern)), which returns 1 instead of 2 because this function does not account for overlapping occurrences of pattern, further justifying the need for us to write our own function.

def main():
    print("Substrings (and sublists)")

    # ...


    pattern = "ATA"
    text = "ATATA"

    print(pattern_count(pattern, text)) # Prints 2
    print(text.count(pattern)) # Prints 1

Click Run 👇 to try it!

Implementing starting_indices() and updating pattern_count()

We now turn to implementing a function starting_indices() to find not just the number of occurrences of pattern in text, but identifying their starting positions. As we pointed out previously, this problem is more general, and so if we implement starting_indices() to return a list of starting positions, we need only to return the length of this list to implement pattern_count().

def pattern_count(pattern: str, text: str) -> int:
    """
    Finds the number of times the substring occurs in a given text string.

    Parameters:
    - pattern (str): The substring we are searching for in text.
    - text (str): The parent text string we are searching against.

    Returns:
    - int: The number of times that pattern occurs in text, including overlaps.
    """

    if len(pattern) == 0:
        raise ValueError("pattern_count: empty pattern not allowed.")
    if len(pattern) > len(text):
        return 0
    
    return len(starting_indices(pattern, text))

We now present starting_indices(), which has the same structure as pattern_count(). The difference is that instead of maintaining a counter variable, we maintain an initially empty list positions, and we then append to positions each integer i where text[i:i + k] matches pattern.

def starting_indices(pattern: str, text: str) -> list[int]:
    """
    starting_indices returns the list containing all the starting positions of 
    pattern in text.

    Parameters:
    - pattern (str): A given substring.
    - text (str): A given superstring.

    Returns:
    - list[int]: A list containing the starting positions of pattern in text (indices).
    """

    k = len(pattern)
    n = len(text)

    if k == 0:
        raise ValueError("starting_indices: empty pattern not allowed.")

    if k > n:
        return []

    # Create an empty list of positions.
    positions = []
    
    # Range over all substrings of text, looking for matches
    for i in range(n - k + 1):
        if text[i:i+k] == pattern:
            positions.append(i)

    # Return the list of positions.
    return positions

Click Run 👇 to try it!

The starting_indices() function offers an example of a more general principle in programming, which is that when solving a problem, it can sometimes be no more difficult to solve a more general version of the problem.

Looking ahead

Now that we have started learning how to implement string algorithms, we turn to one of the two main problems considered in the main text, that of finding the most frequent k-mers in a given string. To do so, we will need to introduce a new type of data structure called a dictionary that will allow us to organize pairs of elements.

Next lesson

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

pattern_count()
starting_indices()

Substrings and Sublists in Python

Code along video

Learning objectives

Code along summary

Setup

Indexing substrings

Indexing sublists

A first try at counting patterns

Fixing our implementation of pattern_count()

Implementing starting_indices() and updating pattern_count()

Looking ahead

Check your work from the code along

Join our community!

Join our community!