Code Along - An introduction to Strings in Python

Code along video

Although we strongly suggest coding along with us by following the video above, you can find completed code from the code along in our course code repository.

Note: Each chapter of Programming for Lovers comprises two parts. First, the “core text” presents critical concepts at a high level, avoiding language-specific details. The core text is followed by “code alongs,” where you will apply what you have learned while learning the specifics of the language syntax.

Chapter 1 Core Text

Learning objectives

In this lesson, we will return to a computational problem that we introduced in the core text modeling the biological problem of finding the complementary strand of a given strand of DNA; that is, the reverse complement of a DNA string.

Reverse Complement Problem

Input: A DNA string pattern.

Output: The reverse complement of pattern.

We saw the power of modularity to solve this problem, since we can reduce finding a reverse complement to two problems: reversing a string, and taking the complementary nucleotide at each position. At the level of pseudocode, this corresponds to calling Reverse() and Complement() subroutines as follows.

ReverseComplement(pattern)
    pattern ← Reverse(pattern)
    pattern ← Complement(pattern)
    return pattern

We also saw that we could simplify this function by calling Reverse() directly on the output of Complement(), leading to the following one-line function.

ReverseComplement(pattern)
    return Reverse(Complement(pattern))

In this lesson, we will implement these functions, and along the way, we will explore the basics of working with strings in Python.

Code along summary

Setup

Create a folder called strings in your python/src directory and create a text file called main.py in the python/src/strings folder. We will edit main.py, which should have the following starter code.

def main():
    print("Strings.")

if __name__ == "__main__":
    main()

Declaring strings, and string operations

We begin by declaring two strings. In Python, the value of a string can be enclosed in either single or double quotes.

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers"

Note: We will predominantly use double quotation marks for strings, so that an apostrophe in a string isn’t read as the end of the string (e.g., "doesn't" will be parsed correctly). If we need a double quotation mark within a string, then we can escape it by placing a backslash before each quotation mark that should be preserved (e.g., "She said \"Hi\" to you.").
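
As a small illustration of the note above, the following sketch shows three ways of writing strings that contain quotation marks. The variable names a, b, and c are only for this example.

```python
# Three ways to write strings that contain quotation marks.
a = "doesn't"                  # an apostrophe inside double quotes
b = "She said \"Hi\" to you."  # escaped double quotes inside double quotes
c = 'She said "Hi" to you.'    # double quotes inside single quotes

print(a)
print(b)
print(b == c)  # True: both literals describe the same string
```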

Just as we perform operations on numeric variables, we can use operators to combine strings. In particular, we can concatenate two strings by using the + operator. That is, given two strings s and t, s + t is a new string comprising the symbols of s, immediately followed by the symbols of t. When we concatenate s and t from the above example, the resulting string u has the value "HiLovers".

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers"  

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)

Note: If we wanted to have a space in u between the two constituent words, we could concatenate a space symbol between the strings using u = s + " " + t.

Python also implements a multiplication operation on strings that allows us to repeat a string some given number of times. Chances are slim that we will use this operation very much in the course, but the operation s * 3 results in three consecutive copies of s. Just as multiplication is repeated addition in arithmetic, so it is the case with Python: s * 3 is equivalent to s + s + s.

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers"  

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)     

    print(s * 3) # prints "HiHiHi"
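
The two observations above (adding a space between words, and repetition as repeated concatenation) can be checked with a short sketch like the following, which reuses the variables s, t, and u from the code along.

```python
s = "Hi"
t = "Lovers"

# Concatenate a space between the two words.
u = s + " " + t
print(u)  # Hi Lovers

# Repetition agrees with repeated concatenation.
print(s * 3 == s + s + s)  # True

# Repeating a string zero times gives the empty string.
print(s * 0 == "")  # True
```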

STOP: After saving main.py, navigate into python/src/strings from the command line and run your code by executing python3 main.py (on macOS/Linux) or python main.py (on Windows).

Strings are (kinda) arrays of symbols

One way of thinking about a string is as a list of symbols. Accordingly, the symbols of a string use 0-based indexing, and we can access the first and last symbols of our string u using the notation u[0] and u[len(u) - 1], respectively. (As with lists, we can also access the final symbol of u using u[-1].)

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers"  

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)     

    print(s * 3) # prints "HiHiHi"

    print("The first symbol of u is " + u[0])
    print("The final symbol of u is "+ u[len(u)-1])

When we access an individual symbol of a string, we obtain a string containing that symbol. For example, the following code will check whether t[2] is equal to the string "v" (it is).

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers" 

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)

    print(s * 3) # prints "HiHiHi"  

    print("The first symbol of u is " + u[0])
    print("The final symbol of u is "+ u[-1])

    if t[2] == "v":
        print("The symbol at position 2 of t is v")

Note: Strings are “case sensitive”. As a result, if we were to change the condition of the if statement above to t[2] == "V", then it would evaluate to false and we would not enter the if block.

Furthermore, as with lists, we should not try to access an element of a string that is out of range. In particular, if we access the symbol of a string s with an index that is larger than len(s) - 1 or smaller than -len(s), then Python will raise an IndexError.
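
The following sketch illustrates both points for the string t from the code along: comparisons are case sensitive, and only indices between -len(t) and len(t) - 1 are valid. The try/except block is used here only so that the example keeps running after the error is raised.

```python
t = "Lovers"

print(t[2] == "v")  # True
print(t[2] == "V")  # False: the comparison is case sensitive

# Valid indices for t run from -len(t) up to len(t) - 1.
print(t[-len(t)])     # L
print(t[len(t) - 1])  # s

try:
    print(t[len(t)])  # one position past the final valid index
except IndexError as err:
    print("IndexError:", err)
```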

Strings are immutable

When we introduced tuples in the preceding chapter, we noted that they are immutable, meaning that after we create them, we cannot change individual elements.

Strings are also immutable. Although we can access a single symbol of a string to test its value, we cannot change it. For example, say that we wanted to change the symbol at index 4 of our string u from "v" to "s", thus changing u into "HiLosers". Attempting to assign u[4] = "s", as shown below, results in a TypeError telling us that strings do not support item assignment.

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers" 

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)

    print(s * 3) # prints "HiHiHi"  

    print("The first symbol of u is " + u[0])
    print("The final symbol of u is "+ u[-1])

    if t[2] == "v":
        print("The symbol at position 2 of t is v")

    u[4] = "s" # throws a TypeError
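
If we still want the string "HiLosers", one possible approach (a sketch, not the only option) is to build a new string symbol by symbol, substituting the symbol at index 4 along the way. The variable v below is introduced only for this example.

```python
u = "HiLovers"

# Item assignment on a string raises a TypeError.
try:
    u[4] = "s"
except TypeError as err:
    print("TypeError:", err)

# Build a new string instead, replacing only the symbol at index 4.
v = ""
for i, symbol in enumerate(u):
    if i == 4:
        v += "s"
    else:
        v += symbol

print(v)  # HiLosers
print(u)  # HiLovers (the original string is unchanged)
```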

However, we can assign an entirely new string to a variable; for example, we can change the value of s from 'Hi' to "Yo".

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers" 

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)

    print(s * 3) # prints "HiHiHi"  

    print("The first symbol of u is " + u[0])
    print("The final symbol of u is "+ u[-1])

    if t[2] == "v":
        print("The symbol at position 2 of t is v")

    # u[4] = "s" # throws a TypeError

    s = "Yo"

Furthermore, we can update a string variable using shortcut assignment operators like +=, which concatenate and then assign the resulting new string back to the variable. The code below will print "Yo-Yo Ma".

def main():
    print("Strings.")

    s = 'Hi'
    t = "Lovers" 

    u = s+t # Concatenating strings s and t together makes "HiLovers"
    print(u)

    print(s * 3) # prints "HiHiHi"  

    print("The first symbol of u is " + u[0])
    print("The final symbol of u is "+ u[-1])

    if t[2] == "v":
        print("The symbol at position 2 of t is v")

    # u[4] = "s" # throws a TypeError

    s = "Yo"
    s += "-Yo"
    s += " Ma"
    print(s) #Yo-Yo Ma
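
To see that += does not modify a string in place, the sketch below keeps a second variable (named alias here purely for illustration) that refers to the original string. After the updates, s holds a new string while alias still holds "Yo".

```python
s = "Yo"
alias = s   # a second name for the same string "Yo"

s += "-Yo"  # builds a new string and assigns it to s
s += " Ma"

print(s)      # Yo-Yo Ma
print(alias)  # Yo
```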

Reverse complementing a DNA string by passing the work to subroutines

We are now ready to return to implementing our reverse_complement() function. As mentioned previously, we can quickly subcontract the job to two subroutines: complement() and reverse(). (In the core text, we asked you to write pseudocode for each of these subroutines as an exercise.)

def reverse_complement(dna: str) -> str:
    """
    reverse_complement finds the reverse complement of the given string.
    
    Parameters:
    - dna (str): A given dna sequence string.
    
    Returns:
    - str: The reverse complement of the given dna string.
    """
    
    return reverse(complement(dna))

STOP: Say that we replaced the return statement of this function with return complement(reverse(dna)). Would the output of reverse_complement() be the same?

Complementing a DNA string, and match statements

We are now ready to implement a function complement() that takes a DNA string as input and returns the complementary string, formed by replacing the symbol at each position with its complementary nucleotide. Just as we can iterate over the indices and values of a list a by using enumerate(a), the same syntax applies when ranging over the indices and symbols of a string.

def complement(dna: str) -> str:
    """
    Finds the complementary strand of the given string.
    
    Parameters:
    - dna (str): A dna string.
    
    Returns:
    - dna2: the string whose i-th symbol is the complementary 
    nucleotide of the i-th symbol of the input string. (A-T, C-G, T-A, G-C)
    """

    # Range through the dna string, taking complements.
    for i, symbol in enumerate(dna):
        if symbol == "A":
            dna[i] = "T"   
        elif symbol == "C":
            dna[i] = "G"
        elif symbol == "G":
            dna[i] = "C"
        elif symbol == "T":
            dna[i] = "A"
        else:
            raise ValueError("Error: symbol in string is not a DNA string.")
    return dna

STOP: This function has a flaw; what is it?

Because strings are immutable, we know that we cannot assign individual symbols of dna. Instead, we will create an empty string "", and then add symbols to it one at a time.

def complement(dna: str) -> str:
    """
    Finds the complementary strand of the given string.
    
    Parameters:
    - dna (str): A dna string.
    
    Returns:
    - dna2: the string whose i-th symbol is the complementary 
    nucleotide of the i-th symbol of the input string. (A-T, C-G, T-A, G-C)
    """

    # Declare an empty string.
    dna2 = ""

    # Range through the dna string, taking complements.
    for symbol in dna:
        if symbol == 'A':
            dna2 += 'T'
        elif symbol == 'C':
            dna2 += 'G'
        elif symbol == 'G':
            dna2 += 'C'
        elif symbol == 'T':
            dna2 += 'A'
        else:
            raise ValueError("Invalid symbol in string given to complement().")
    return dna2

The above function is correct, but a long chain of elif statements can be tedious to read and write. Another way of writing complement() is to use a control flow construct called a match statement, which is available in Python 3.10 and later and resembles the switch statement found in many other languages. The statement match symbol indicates that we are testing the value of the symbol variable against a number of “cases”; the final case _ matches any value not matched by an earlier case. A version of complement() rewritten with a match statement is shown below.

def complement(dna: str) -> str:
    """
    Finds the complementary strand of the given string.
    
    Parameters:
    - dna (str): A dna string.
    
    Returns:
    - dna2: the string whose i-th symbol is the complementary 
    nucleotide of the i-th symbol of the input string. (A-T, C-G, T-A, G-C)
    """

    dna2 = ""

    for symbol in dna:
        match symbol:
            case "A":
                dna2 += "T"
            case "C":
                dna2 += "G"
            case "G":
                dna2 += "C"
            case "T":
                dna2 += "A"
            case _:
                raise ValueError("Invalid symbol in string given to complement().")

    return dna2
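
A third option, which is not part of the code along, replaces the branching with a lookup table. The sketch below assumes familiarity with dictionaries; the names COMPLEMENTS and complement_with_dict are hypothetical and chosen only for this example.

```python
# Hypothetical lookup table mapping each nucleotide to its complement.
COMPLEMENTS = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_with_dict(dna: str) -> str:
    """Return the complementary strand of dna using a dictionary lookup."""
    dna2 = ""
    for symbol in dna:
        # Reject any symbol that is not one of the four nucleotides.
        if symbol not in COMPLEMENTS:
            raise ValueError("Invalid symbol in string given to complement_with_dict().")
        dna2 += COMPLEMENTS[symbol]
    return dna2


print(complement_with_dict("ACCGAT"))  # TGGCTA
```

The behavior matches the elif and match versions; the trade-off is that the nucleotide pairs live in data rather than in control flow.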

We can now test the complement() function on a short input string by running our code with the following added to main(). (Python looks up function names only when a call is executed, so the program will run even though reverse_complement() refers to reverse(), which we have yet to implement; we just should not call reverse_complement() yet.)

def main():
    # Prints out strings.
    print("Strings.")

    # ...

  
    dna = "ACCGAT"
    print(complement(dna)) # Should print TGGCTA

Reversing a string

We are now ready to implement reverse(), a function that takes a string as input and that returns the result of reversing all of the input string’s symbols.

Exercise: Before we continue, practice what you have learned by attempting to implement reverse() yourself.

We could range a counter variable i starting at either the left or the right side of the input string s; we will choose the left side because it will allow us to use the built-in range() function in its simplest form, range(n). Also, unlike complement(), our reverse() function should work for an arbitrary input string, not just a string comprising DNA symbols.

Our implementation of reverse(), which builds a string rev through repeated concatenations, is shown below. We need to be careful with the indexing. Letting n denote the length of s, we want rev[0] to equal s[n - 1], rev[1] to equal s[n - 2], rev[2] to equal s[n - 3], and so on. For an arbitrary i, rev[i] should equal s[(n - 1) - i], or equivalently s[(n - i) - 1].

def reverse(s: str) -> str:
    """
    reverse returns the given string backwards.
    
    Parameters:
    - s (str): The given string to reverse.
    
    Returns:
    - str: The reverse of s.
    """
    
    rev = ""
    n = len(s)
    for i in range(n):
        rev += s[n - 1 - i]
    return rev
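
To check the index formula used above, the following sketch writes out, for a short example string, each value of i alongside the index n - 1 - i and the symbol found there. It is the electronic equivalent of working through the indices with pencil and paper.

```python
s = "ACCGAT"
n = len(s)

rev = ""
for i in range(n):
    # Position i of the reversed string takes the symbol at index n - 1 - i of s.
    print(i, n - 1 - i, s[n - 1 - i])
    rev += s[n - 1 - i]

print(rev)  # TAGCCA
```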

Note: This function offers a simple illustration of the need for strong programmers to have strong foundational quantitative skills. The control flow of reverse() is straightforward, but appreciating how to establish a formula for which index of s to consider requires a mathematics education that is based on noticing patterns and solving problems as opposed to rote memorization. We leave this topic of conversation to another time. For now, we will note that even though this is a course about programming computers, strong programmers use pencil and paper or the electronic equivalent (here, to write out the indices that we are considering at each point in time) to help themselves notice patterns and solve problems.

A note on efficiency of string concatenation

Because strings are immutable, every time we perform an operation like rev += s[n - 1 - i], Python may have to allocate a brand new string, copy the current contents of rev into it, and assign it to the rev variable. If the input string s is small, then this is no big deal, but if s is large, then each concatenation copies more symbols as rev grows, and our function will slow down.

However, lists are mutable, which means that they can grow without building a brand new object each time we append a value. As a result, we can write a more efficient version of reverse() that first generates a list characters containing the symbols of our desired string, and then converts this list to a string one time using the command "".join(characters). In general, the join() method concatenates all of the strings in its input parameter into a single string, separated by whatever string we call the method on; in this case, that string is the empty string "" that contains no symbols, and so the resulting string will simply concatenate together all the elements in characters.

def reverse(s: str) -> str:
    """
    reverse returns the given string backwards.
    
    Parameters:
    - s (str): The given string to reverse.
    
    Returns:
    - str: The reverse of s.
    """
    
    characters = []
    n = len(s)
    for i in range(n):
        characters.append(s[n - 1 - i])
    return "".join(characters)
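
As a brief illustration of how the separator affects join(), the sketch below joins the same small list with three different separator strings. The list contents are arbitrary and chosen only for this example.

```python
characters = ["T", "A", "G"]

# The string before .join() is placed between consecutive elements.
print("".join(characters))    # TAG
print("-".join(characters))   # T-A-G
print(", ".join(characters))  # T, A, G
```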

Putting it all together, and a final point about modularity

Since we have already written reverse_complement(), we are now ready to test it in addition to our function reverse(). We can do so by running our program after adding the following code in main().

def main():
    # Prints out strings.
    print("Strings.")

    # ...
    
    dna = "ACCGAT"
    print(complement(dna)) # Should print TGGCTA
    print(reverse(dna)) # Should print TAGCCA
    print(reverse_complement(dna)) # Should print ATCGGT

Testing our functions in this way illustrates one more benefit of writing modular code, which is that such code is easy to test. By passing the work of reverse complementing a string to two subroutines, we can test and debug our code by first testing each of these subroutines, so that once these functions have been tested, we can be nearly certain that reverse_complement() is correct.
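
One lightweight way to test each subroutine separately is with assert statements. The sketch below repeats compact versions of the three functions from this lesson so that it runs on its own; the helper name run_checks is hypothetical, and the expected values are the ones from the example above.

```python
def complement(dna: str) -> str:
    """Return the complementary strand of dna."""
    dna2 = ""
    for symbol in dna:
        if symbol == "A":
            dna2 += "T"
        elif symbol == "C":
            dna2 += "G"
        elif symbol == "G":
            dna2 += "C"
        elif symbol == "T":
            dna2 += "A"
        else:
            raise ValueError("Invalid symbol in string given to complement().")
    return dna2


def reverse(s: str) -> str:
    """Return s with its symbols in reverse order."""
    rev = ""
    n = len(s)
    for i in range(n):
        rev += s[n - 1 - i]
    return rev


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of dna."""
    return reverse(complement(dna))


def run_checks() -> None:
    # Test each subroutine on its own before testing their combination.
    assert complement("ACCGAT") == "TGGCTA"
    assert complement("") == ""
    assert reverse("ACCGAT") == "TAGCCA"
    assert reverse("Hi") == "iH"
    assert reverse_complement("ACCGAT") == "ATCGGT"

    # complement() should reject symbols that are not nucleotides.
    try:
        complement("ACXGT")
    except ValueError:
        pass
    else:
        raise AssertionError("Expected a ValueError for an invalid symbol.")

    print("All checks passed.")


run_checks()
```

If one of the subroutine checks fails, we know where to look before ever examining reverse_complement().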

Looking ahead

Now that we have introduced strings, we would like to move toward the algorithms that we introduced in this chapter for finding frequent words. To do so, we need to learn more about how to work with substrings, or contiguous patterns contained within strings.

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

complement()
reverse()
reverse_complement()
