Building a Skew Array in Python

Learning objectives

In the previous code along, we learned how to find “clumps” of short strings in a bacterial genome in the hopes that these clumps might lead us to the replication origin. Unfortunately, so many different biological processes involve binding to frequent substrings in a short region of the genome that we are left with nearly 2,000 different candidates to consider.

Instead, the core text introduced the concept of a skew array, whose i-th value is the difference between the total number of occurrences of G and the total number of occurrences of C in the first i nucleotides of a given genome. For example, the skew array of the string "CATGGGCATCGGCCATACGCC" is [0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2].

We can visualize a skew array with a chart called a skew diagram, a line graph that plots the number of nucleotides k encountered in genome on the x-axis, and the k-th value of the skew array on the y-axis. A skew diagram for the DNA string "CATGGGCATCGGCCATACGCC" is shown in the figure below.

The skew diagram for genome = "CATGGGCATCGGCCATACGCC".

We can generate a skew array using the following pseudocode function skew_array(), which calls a single subroutine skew() that takes as input a character symbol and returns 1, -1, or 0 based on whether symbol is equal to G, C, or another character, respectively.

SkewArray(genome)
    n ← length(genome)
    array ← array of length n + 1
    array[0] ← 0
    for every integer i between 1 and n
        array[i] = array[i-1] + Skew(genome[i-1])
    return array

Skew(symbol)
    if symbol = 'G'
        return 1
    else if symbol = 'C'
        return -1
    return 0

We also presented evidence that because the half of a bacterial genome preceding the replication origin has a surplus of C and a shortage of G, whereas the half of the genome following the replication origin has a surplus of G and a shortage of C. As a result, we hypothesized that that the position where the skew array achieves a minimum value (i.e., the low point of the skew diagram) would lead us to the replication origin. That reasoning led us to formulate the following computational problem.

Minimum Skew Problem

Input: A DNA string genome.

Output: All indices i of the skew array of genome achieving the minimum value in the skew array.

The following function MinimumSkew() solves this problem by calling SkewArray() and a function MinIntegerArray() finding the minimum value of an array as subroutines.

MinimumSkew(genome)
    indices ← array of integers of length 0
    n ← length(genome)
    array ← SkewArray(genome)
    m ← MinIntegerArray(array)
    for every integer i between 0 and n
        if array[i] = m
            indices = append(indices, i)
    return indices

In this lesson, we will implement these functions in Python and apply them to the E. coli genome to find the bacterium’s origin of replication without having to run a single experiment. Along the way, we will also discuss how to visualize the skew diagram using packages for plotting. We promise that rendering the skew diagram of E. coli is well worth the effort!

Code along summary

Setup

Create a folder called skew in your python/src directory and create a text file called main.py in the python/src/skew folder. We will edit main.py, which should have the following starter code. The code appearing in main(), introduced in the preceding code along, reads in a genome from a hyperlink.

import requests

def main():
    print("Building a skew array.")

    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    response = requests.get(url)
    response.raise_for_status()

    genome = response.text

    print("The number of nucleotides in E. coli genome is", len(genome))

Implementing skew array functions

First, we will implement skew_array(), which allows us to practice ranging and working with slices.

def skew_array(genome: str) -> list[int]:
    """
    skew_array returns the list that represents the skew at each position
    of the genome. That is, the i-th position in the list is the skew at 
    the i-th position of the genome.

    Parameters:
    - genome (str): A genome string.

    Returns:
    - list[int]: A list representing the skew of the genome string.
    """

    if len(genome) == 0:
        raise ValueError("genome must be a non-empty string")

    # Define the length of the genome.
    n = len(genome)

    # Define a list for the skew array.
    skew_array = [0] * (n + 1)

    # Range through the genome and add its skew at each position.
    for i in range(1, n + 1):
        skew_array[i] = skew_array[i - 1] + skew(genome[i - 1])

    # Return the skew array.
    return skew_array

Next, we implement skew(), which provides a refresher of if statements.

def skew(symbol: str) -> int:
    """
    skew returns 1 or -1 if the given single character string is either
    G or C respectively (case-insensitive), otherwise returns zero. 
    Throws an error if the given string does not have a length of one.

    Parameters:
    - symbol (str): A given one character string.

    Returns:
    - int: The symbol's respective skew score.
    """
    
    # Raise a ValueError if the length of the string is not 1.
    if len(symbol) != 1:
        raise ValueError("symbol must be a single character")

    # Return 1 if G or g.
    if symbol == "G" or symbol == "g":
        return 1
    # Return -1 if C or c.
    elif symbol == "C" or symbol == "c":
        return -1
    # Else return 0.
    else:
        return 0

Finally, we implement minimum_skew(), which relies on the min_integer_array() subroutine that we wrote in the previous chapter.

def minimum_skew(genome: str) -> list[int]:
    """
    minimum_skew finds the list of integers representing all integer indices
    that minimizes the skew of the genome text.

    Parameters:
    - genome (str): A genome string.

    Returns:
    - list[int]: A list of indices that minimize the skew value of the genome text.
    """

    if len(genome) == 0:
        raise ValueError("genome must be non-empty")

    # Create a list of indices.
    indices = []

    # Create the skew array.
    skew_arr = skew_array(genome)

    # Find the minimum in the skew array.
    min_val = min_integer_array(skew_arr)

    # range over the skew array, finding indices whose values are equal to the minimum skew value
    n = len(skew_arr)
    for i in range(n):
        if skew_arr[i] == min_val:
            indices.append(i)

    # Return the list of indices.
    return indices

As a refresher, here is the min_integer_array() function.

def min_integer_array(list_values: list[int]) -> int:
    """
    min_integer_array that finds the minimum value in the list.

    Parameters:
    - list_values (list[int]): A list of integer values.

    Returns:
    - int: The smallest value in list_values.
    """

    if len(list_values) == 0:
        raise ValueError("list_values must be non-empty")
    
    # Sets the minimum value as the first element of the array.
    min_val = list_values[0]

    # Ranges through the values in the list.
    for val in list_values:
        # If the value is less than the min, set the value equal to the min.
        if val < min_val:
            min_val = val

    # Return the minimum value.
    return min_val

Finding the minimum skew positions in E. coli

Now that we have written these functions, let’s call skew_array() and minimum_skew() on the E. coli genome and see where the skew is minimized. To do so, we will generate two slices, ecoli_skew_array and min_skew_positions, which respectively store the skew array for E. coli and the positions where the skew is minimized.

We know that all of the indices in min_skew_positions attain the minimum skew value in ecoli_skew_array. Accordingly, we can access the array value for min_skew_positions[0] to determine the minimum skew value.

import urllib.request

def main():
    print("Building a skew array.")
    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    # Urllib closes the connection as soon as you receive a response.
    contents = urllib.request.urlopen(url)
    
    # Check status code.
    if contents.getcode() != 200:
        raise urllib.error.URLError("Bad status code")
    
    # Read response object.
    contents = contents.read()

    # Do some processing.
    genome = contents.decode('utf-8')

    if len(genome) == 0:
        raise ValueError("Downloaded genome sequence is empty.")
    
    print("The number of nucleotides in E. coli genome is " + str(len(genome)))
    ecoli_skew_array = skew_array(genome)
    min_skew_positions = minimum_skew(genome)

    if len(min_skew_positions) == 0:
        raise RuntimeError("No minimum positions found (unexpected).")
	
    first_position = min_skew_positions[0]
	
    print("The minimum skew value of " + str(ecoli_skew_array[first_position]) + " occurs at positions " + str(min_skew_positions))
Click Run 👇 to try it!
STOP: Open a terminal, and navigate to the skew directory using the command cd python/src/skew. Then, run your code using the command python3 main.py (macOS/Linux) or python main.py (Windows). Where does the skew array attain a minimum in E. coli?

Visualizing the skew array using a plot

Although we have found the minimum skew position in E. coli (at positions [3923620, 3923621, 3923622, 3923623], our work is ultimately not satisfying because we lack compelling evidence that this is indeed the replication origin. To provide this evidence, we should instead visualize the skew array as a skew diagram, a task that requires us to learn how to make plots in Python.

We will first need to install code from a third party package to perform the plotting that is not included as part of the standard Python installation. This package, called matplotlib, is standard external package that make scientific visualization easier.

To install this package, we will once again use pip, the Python package installer. Open a terminal application and execute the command pip3 install matplotlib (on macOS/Linux) or pip install matplotlib (on Windows).

Once you have done so, let’s add matplotlib to our import statements. Specifically, we only want to import the pyplot submodule, which we can import as follows:

from matplotlib import pyplot

Next, we will start writing our plotting code. Let’s place all of the plotting code into a function draw_skew() that takes as input a (skew) array of integers and plots the skew diagram corresponding to this array.

We begin with using the standard pyplot.plot() function from the matplotlib package. We will then set the title, axis labels, and save the plot to our computer.

Note: Lines like pyplot.title = "Skew Diagram" are setting “fields”, or attributes, of the plot object. We will say much more about this type of notation and other syntax encountered in draw_skew() when we transition to more object-oriented programming in a later chapter.
from matplotlib import pyplot

def draw_skew(skew_list: list[int]) -> None:
    """
    draw_skew draws a skew diagram using a skew_array output.
	
    """
    if len(skew_list) == 0:
        raise ValueError("skew_list is empty")

    pyplot.plot(skew_list)
    pyplot.title("Skew Diagram")
    pyplot.xlabel("Genome Position")
    pyplot.ylabel("Skew Value")
    pyplot.savefig("skewDiagram.png")
    plt.show()

We are now ready to call draw_skew() from main() on the E. coli skew array that we generated previously.

import urllib.request

from matplotlib import pyplot

def main() -> None:
    print("Building a skew array.")
    url = "https://bioinformaticsalgorithms.com/data/realdatasets/Replication/E_coli.txt"

    # Urllib closes the connection as soon as you receive a response.
    contents = urllib.request.urlopen(url)
    
    # Check status code.
    print(contents.getcode())
    
    # Read response object.
    contents = contents.read()

    # Do some processing.
    genome = contents.decode('utf-8')
    if not genome:
        raise ValueError("Genome data is empty.")

    print("The number of nucleotides in E. coli genome is " + str(len(genome)))
    ecoli_skew_array = skew_array(genome)
    min_skew_positions = minimum_skew(genome)

    if len(min_skew_positions) == 0:
        raise RuntimeError("No minimum positions found (unexpected).")
	
    first_position = min_skew_positions[0]
	
    print("The minimum skew value of " + str(ecoli_skew_array[first_position]) + " occurs at positions " + str(min_skew_positions))

    draw_skew(ecoli_skew_array)
	
    print("Skew diagram drawn! Exiting normally.")
Click Run 👇 to try it!
STOP: Run your code. Isn’t the skew diagram of E. coli beautiful?

Time to practice!

After you have the chance to check your work below, you are ready to apply what you have learned to the chapter’s exercises. Or, you can carry on to the next chapter, where we will learn about how generating random numbers can help us simulate a presidential election.

Check your work from the code along

We provide autograders in the window below (or via a direct link) allowing you to check your work for the following functions:

  • skew()
  • skew_array()
  • minimum_skew()

powered by Advanced iFrame


Scroll to Top
Programming for Lovers banner no background
programming for lovers logo cropped

Join our community!

programming for lovers logo cropped
Programming for Lovers banner no background

Join our community!