Cython: Accelerate Python Execution

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • What is Cython and its relation with Python?

Objectives
  • Optimize a numerical intensive operation with Cython

Cython

Cython is a language the extends Python. With minimal changes to a python code, you can get a compiled code that runs faster. From another perspective, Cython is like a bridge that allows you to interface C and C++ routines inside python.

We will follow a very practical route to give you basic elements that will quickly improve the performance of your code with minimal effort. Scientific computing, performance is central, but also you want to optimize the time spent developing the code. The right balance between both is the goal. Using Python is an efficient way of getting a code that works, but still work must to be done to make it efficient.

Let us take the code for the Sieve of Eratosthenes, the python code was very poor in performance. We are using an interpreted language, but also we are using lists and loops. Those elements make the code run very slowly, that was done on purpose because allow us to explore how to improve from there.

We start splitting the original code in two files. The file that contains the function SieveOfEratosthenes will be separated on a file called sieve.py:

def SieveOfEratosthenes(n):

    if n%2 == 0:
        imax=int(n/2)
    else:
        imax=int((n+1)/2)

    neff= 2*(imax-1)+1
    prime = imax*[True]

    print(" Dimension of array: %d\n" % imax)
    print(" Array with indices between 0 to %d\n" % (imax-1))
    print(" Stores primality of odd numbers in range [1, %d]\n\n" % neff)

    # The loop runs over all odd numbers starting at 3
    p=3
    while (p<n/2+1):
        # If prime[p] is true, then it is a prime
        if(prime[int((p-1)/2)] is True):
            # Update all multiples of p
            i = 2*p
            while (i <= neff):
                if (i%2!=0):
                    prime[int((i-1)/2)] = False
                i += p
        p+=2

    counter = 1
    nprimes = 0
    # Print all prime numbers if the number is small otherwise just count the number of primes found
    if (imax < 10000):
        print("%10d " % 2, end='')
        # Starting with index 1 instead of zero because (2*0)+1 = 1 is not prime
        for i in range(1,imax):
            if(prime[i] == 1):
                print("%10d " % (2*i+1), end='')
                counter+=1

            if (counter == 15):
                print("")
                counter=0

        print("\n")

    else:
        for i in range(1,imax):
            if(prime[i] is True):
                nprimes+=1
        print(" Total number of primes found: %d" % nprimes)

    return nprimes

The other file will just contain the main execution:

import sys
import sieve

if __name__ == "__main__":

    if (len(sys.argv) > 1):
        n = int(float(sys.argv[1]))
    else:
        n = 100

    print("")
    print(" Prime numbers up to %d" % n)
    nprimes = sieve.SieveOfEratosthenes(n)
    print(" Function return: %d" % nprimes)

Let us compute how much time will take to find all prime numbers among the first 50 million integers. We do not want to do that on the head node. As several others will be doing the same, we will use a qsub on interactive mode, here I am using debug, but use any queue that allows you to enter quickly.

$ qsub -I -q debug

Once you get allocated a compute node use:

$ module load compilers/python/3.6.0
$ cd 2018-Data-HandsOn/2.Programming/5.Cython
$ time python3 main.py 5E7

 Prime numbers up to 50000000
 Dimension of array: 25000000

 Array with indices between 0 to 24999999

 Stores primality of odd numbers in range [1, 49999999]


 Total number of primes found: 3001133
 Function return: 3001133

real    0m35.606s
user    0m35.560s
sys    0m0.052s

From above, you see that it takes 35 seconds to compute the prime numbers up to 50 million, there are several reasons for that extreme long execution time, we are using an interpreted language, we are using python’s lists instead of a library like numpy and we are using python’s loops instead of vectorized operations again using numpy.

Let us explore what we can do easily with Cython to improve that figure. We start copying sieve.py as sieve_ct.pyx and adding this to the first line of the new file:

#cython: language_level=3, boundscheck=False

That line is a comment for python but will instruct cython that the code below is python 3. Otherwise, cython interprets print as python 2 and will fail.

Cython works by taking a code like sieve_ct.pyx that right now is exactly identical to the original and create a C code sieve_ct.c that can be compiled into a shared library sieve_ct.so that we can import as any other python module.

There are several ways of compiling Cython extensions. Three ways will demonstrate here: the manual way, using distutils and using pyximport for transparent compilation.

Compiling manually

To compile manually you need first to identify where the python headers are located. On Spruce using the Python 3.6.0 the following command will give you the answer

$ python3 -c "from distutils import sysconfig; print(sysconfig.get_python_inc())"
/shared/software/languages/python/3.6.0/include/python3.6m

With that knowledge we compile the code with this two steps:

$ cython -3 sieve_ct.pyx
$ gcc -shared -fPIC -O2 -Wall -lm -I/shared/software/languages/python/3.6.0/include/python3.6m sieve_ct.c -o sieve_ct.so

A copy of main.py with a couple of small modifications can help us to test the new module. We are renaming the module to avoid loading the original sieve.py instead. As we rename the code as sieve_ct.pyx and the final module will be sieve_ct we need to change the main like this:

import sys
import sieve_ct

if __name__ == "__main__":

    if (len(sys.argv) > 1):
        n = int(sys.argv[1])
    else:
        n = 100

    print("")
    print(" Prime numbers up to %d" % n)
    nprimes = sieve_ct.SieveOfEratosthenes(n)
    print(" Function return: %d" % nprimes)

Let us get some timing using the new cython modules (remember to use qsub -I)

$ module load compilers/python/3.6.0
$ cd 2018-Data-HandsOn/2.Programming/5.Cython
$ time python3 main_ct.py 5E7
 Prime numbers up to 50000000
 Dimension of array: 25000000

 Array with indices between 0 to 24999999

 Stores primality of odd numbers in range [1, 49999999]


 Total number of primes found: 3001133
 Function return: 3001133

real    0m17.602s
user    0m17.549s
sys    0m0.056s

We are basically cutting in a half the time needed. Not bad for a code that has not been changed at all.

Compiling with distutils

Another way of compiling the code is using distutils. Create a file setup.py and enter this lines:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name='sieve',
    ext_modules=cythonize('sieve_ct.pyx')
    )

We are using the cythonize module from Cython that allow us to compile the code without actually running cython and the gcc command.

$ python3 setup.py build_ext --inplace
running build_ext
building 'sieve_ct' extension
creating build
creating build/temp.linux-x86_64-3.6
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/shared/software/languages/python/3.6.0/include/python3.6m -c sieve_ct.c -o build/temp.linux-x86_64-3.6/sieve_ct.o
creating build/lib.linux-x86_64-3.6
gcc -pthread -shared -flto -ffat-lto-objects -flto-partition=none build/temp.linux-x86_64-3.6/sieve_ct.o -o build/lib.linux-x86_64-3.6/sieve_ct.cpython-36m-x86_64-linux-gnu.so

As you can see that save you from knowing the compilation line. The --inplace is just to get the resulting .so file in the same place as the main_ct.py. Otherwise, you will have to look inside the build folder for the .so file generated.

The new file is called sieve_ct.cpython-36m-x86_64-linux-gnu.so, you can get timings again and noticing that the new code is one or two seconds faster, just because the compilation line add a bit higher optimization level.

Using cython transparently with pyximport

The third option is to use pyximport, use interactive mode and not running on the head node.

$ module load compilers/python/3.6.0
$ cd 2018-Data-HandsOn/2.Programming/5.Cython
$ rm *.so

And run IPython:

$ ipython3
Python 3.6.0 (default, Jan 24 2017, 12:13:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyximport

In [2]: pyximport.install()
Out[2]: (None, <pyximport.pyximport.PyxImporter at 0x7fe26a7766a0>)

In [3]: import sieve_ct

In [4]:%timeit sieve_ct.SieveOfEratosthenes(1E7)
...
...
15.3 s ± 269 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The module pyximport will take care of compile the code transparently for you. On this example we are using the magic %timeit from IPython that help us to benchmark the function, revealing that it takes now 15 seconds to compute the primes.

Adding static types

Until now, we have not change the function SieveOfEratosthenes(n) from the original used on our first episode. We can get another boost in performance by using “typed variables”.

In python, you can use variables to store all sort of types and changing them during execution, for example, you can declare a=1 and later change it into a='one'. That kind of flexibility is very welcome on an interpreted language but it comes with high-performance overhead. Without knowing the type of variables in advance, the interpreter cannot produce on-the-fly the kind of optimized code that a programming language like C can produce.

In Cython, you can declare the type of variables with a minimal change on the code. Make a copy of sieve_ct.pyx as sieve_st.pyx and add this line inside the function definition:

#cython: language_level=3, boundscheck=False

import sys

def SieveOfEratosthenes(n):

    # Type declared variables for cython
    cdef int imax, neff, p, i, counter, nprimes

    if n%2 == 0:
        imax=int(n/2)
    else:
        imax=int((n+1)/2)

    neff= 2*(imax-1)+1
    prime = imax*[True]
...
...

The line cdef int imax, neff, p, i, counter, nprimes instruct cython that all those variables will only contain integers for the entire execution. Save the file and get new timings for the new code.

Here we are using the pyximport inside IPython and timing with the magic %timeit. Remember to use qsub -I

$ ipython3
Python 3.6.0 (default, Jan 24 2017, 12:13:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyximport

In [2]: pyximport.install()
Out[2]: (None, <pyximport.pyximport.PyxImporter at 0x7f573945def0>)

In [3]: import sieve_st

In [4]: %timeit sieve_st.SieveOfEratosthenes(5E7)
...
...
8.34 s ± 818 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Working with C arrays and pointers

One of the reasons why the Python code that we proposed initially runs so slow is that we are storing the primality condition for all odd numbers as a big list. Python list are very flexible, they are mutable, meaning that you can add and remove elements to the list and change the values on the fly. Those features are nice but also comes with a performance tag. This time we need to introduce more changes.

This is the final code and we will explain the changes:

#cython: language_level=3, boundscheck=False
from libc.stdio cimport printf
from libc.stdlib cimport malloc, free

import sys

def SieveOfEratosthenes(n):

    cdef int imax, neff, p, i, counter, nprimes

    if n%2 == 0:
        imax=int(n/2)
    else:
        imax=int((n+1)/2)

    neff= 2*(imax-1)+1

    cdef int *prime = <int *> malloc(imax * sizeof(int))
    for i in range(imax):
        prime[i]=1    

    print(" Dimension of array: %d\n" % imax)
    print(" Array with indices between 0 to %d\n" % (imax-1))
    print(" Stores primality of odd numbers in range [1, %d]\n\n" % neff)

    # The loop runs over all odd numbers starting at 3
    p=3
    while (p<n/2+1):
        # If prime[p] is true, then it is a prime
        if(prime[int((p-1)/2)] == 1):
            # Update all multiples of p
            i = 2*p
            while (i <= neff):
                if (i%2!=0):
                    prime[int((i-1)/2)] = 0
                i += p
        p+=2

    counter = 1
    nprimes = 0
    # Print all prime numbers if the number is small otherwise just count the number of primes found
    if (imax < 10000):
        print("%10d " % 2, end='')
        # Starting with index 1 instead of zero because (2*0)+1 = 1 is not prime
        for i in range(1,imax):
            if(prime[i] == 1):
                print("%10d " % (2*i+1), end='')
                counter+=1

            if (counter == 15):
                print("")
                counter=0

        print("\n")

    else:
        for i in range(1,imax):
            if(prime[i] == 1):
                nprimes+=1
        print(" Total number of primes found: %d" % nprimes)

    free(prime)
    return nprimes

First, we add some imports from libc, those imports are very similar to #include <stdlib> and #include <stdio>

#cython: language_level=3, boundscheck=False
from libc.stdio cimport printf
from libc.stdlib cimport malloc, free

The next step is to replace the Phyton prime list into a Cython C-like array (actually a pointer).

cdef int *prime = <int *> malloc(imax * sizeof(int))
for i in range(imax):
    prime[i]=1    

The other changes in the code consist from changing storing and comparing Booleans (True and False) for integers (1 or 0)

The final change is freeing the memory for prime array. Python offers a convenient garbage collector, but in C, you are responsible from freeing the memory that you are not using, for this particular case is not a big deal, as the program will finish after evaluating the primes, but in more complex situations freeing memory is an important issue for keeping the code efficient by avoiding memory leaks.

Now, it is time for benchmarking the new code.

$ ipython
Python 3.6.0 (default, Jan 24 2017, 12:13:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyximport

In [2]: pyximport.install()
Out[2]: (None, <pyximport.pyximport.PyxImporter at 0x7f783904c668>)

In [3]: import sieve_ar

In [4]: %timeit sieve_ar.SieveOfEratosthenes(5E7)
...
...
1.59 s ± 35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We have successfully lowered the time to compute the prime numbers up to 50 million from 39 seconds to 1.5 seconds.

Let us summarize our success in the next table:

Code Description Timing (Using %timeit)
sieve.py Original code 39 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
sieve_ct.pyx Original code with Cython compilation 16.4 s ± 1.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
sieve_st.pyx Using typed variables 7.18 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sieve_ar.pyx Using C arrays and pointers 1.62 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Key Points

  • Cython is a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python.