Python Scripting for HPC

Overview

Teaching: 45 min
Exercises: 15 min
Topics
  • Why learn the Python programming language?

  • How can I use Python to write small scripts?

Objectives
  • Learn about variables, loops, conditionals and functions

Python Programming for Scientists in a hurry

Chapter 2. Standard Library

Guillermo Avendaño Franco
Aldo Humberto Romero


Scientific Computing with Python

List of Notebooks

Python is a general-purpose programming language. It is interpreted and dynamically typed, which makes it well suited for interactive work and quick prototyping, while being powerful enough to write large applications in. This lesson is particularly oriented to Scientific Computing. Other episodes in the series include:

After completing all the episodes in this series you will realize that Python has become a powerful environment for scientific computing at several levels, from interactive computing to scripting to the development of large projects.

Setup

%load_ext watermark
%watermark
Last updated: 2024-07-25T19:10:56.785117-04:00

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.14.0

Compiler    : Clang 12.0.0 (clang-1200.0.32.29)
OS          : Darwin
Release     : 20.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit
import time
start = time.time()
chapter_number = 2
import matplotlib
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
%watermark -iv
matplotlib: 3.8.2
numpy     : 1.26.2

Python Standard Library

Table of Contents

  1. Introduction
  2. Module sys
  3. Modules math and cmath
  4. Modules os and os.path
  5. Module shutil
  6. Module itertools
  7. Module json
  8. Module subprocess
  9. Module multiprocessing

Introduction

The Python Standard Library (PSL) is a set of modules distributed with Python and included in most Python implementations. With some very specific exceptions, you can take for granted that every machine capable of running Python code will have those modules available too.

Python’s standard library is very extensive. It contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. The idea is that these modules keep programs simple and also make them portable between different systems, because they abstract platform-specific details into platform-neutral application programming interfaces (APIs).

Here we present a selection of a few modules that are commonly used in Scientific Computing. The selection is rather subjective, but from experience most people using Python for research, especially for numerically oriented calculations, will at some point use several of these modules.

The complete documentation for these modules can be found in the official Python Standard Library reference.

Module sys

This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available. More information can be found in the official documentation of the sys module.

import sys

There are a few reasons to include this module in the selection. Consider getting the version of Python that is in use:

sys.version
'3.11.7 (main, Dec 24 2023, 07:47:18) [Clang 12.0.0 (clang-1200.0.32.29)]'
sys.version_info
sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)

sys.float_info contains low-level information about the limits, precision, and internal representation of the float type. The values correspond to the various floating-point constants defined in the standard header file float.h for the C programming language; see section 5.2.4.2.2 of the 1999 ISO/IEC C standard [C99], ‘Characteristics of floating types’, for details.

sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Each value can be retrieved independently, for example:

sys.float_info.max
1.7976931348623157e+308

Similarly for integers:

sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4, default_max_str_digits=4300, str_digits_check_threshold=640)

To get the size of any object in bytes:

a=list(range(1000))
sys.getsizeof(a)
8056
b=range(1000)
sys.getsizeof(b)
48

By itself, the function sys.getsizeof() reports only the size of the object it is given, not the size of a container together with all of its contents. It can, however, be combined with a recursive recipe that walks the contents of a container and adds up their sizes.
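A minimal sketch of such a recursive recipe follows. The helper name total_size is hypothetical (it is not part of sys), it only descends into the built-in containers dict, list, tuple, set, and frozenset, and it counts each object once, so the result should be read as an approximation.

```python
import sys

def total_size(obj, seen=None):
    """Approximate size in bytes of obj plus everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:            # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

numbers = list(range(1000))
shallow = sys.getsizeof(numbers)   # the list object alone
deep = total_size(numbers)         # the list plus the integers it holds
print(shallow, deep)
```

The second number is larger because it also includes the integer objects that the list refers to.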

To see the paths that are searched for modules:

sys.path
['/Users/guilleaf/Documents/GitHub/Notebooks_4SIAH/Python',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python311.zip',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload',
 '',
 '/Users/guilleaf/Library/Python/3.11/lib/python/site-packages',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages']

Prefix path where the current version of Python is in use:

sys.prefix
'/opt/local/Library/Frameworks/Python.framework/Versions/3.11'

To collect arguments such as

myscript.py arg1 arg2 arg3

from the command line, sys.argv can be used, in particular for scripts.

sys.argv
['/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ipykernel_launcher.py',
 '-f',
 '/Users/guilleaf/Library/Jupyter/runtime/kernel-b2ee4d39-5d1d-4166-902d-90360006ed51.json']

However, sys.argv is very primitive for practical purposes. The argparse module is the recommended way to parse arguments from the command line.
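As an illustration, here is a small sketch of how argparse could handle a command line like the one above. The option name --repeat and the help texts are invented for this example, and the argument list is passed explicitly so that the code also runs inside a notebook.

```python
import argparse

# Hypothetical command-line interface, for illustration only.
parser = argparse.ArgumentParser(description="Example script")
parser.add_argument("inputs", nargs="+", help="one or more input names")
parser.add_argument("-n", "--repeat", type=int, default=1,
                    help="number of repetitions")

# In a real script, call parser.parse_args() with no arguments
# so that the values are read from sys.argv[1:].
args = parser.parse_args(["arg1", "arg2", "arg3", "--repeat", "2"])
print(args.inputs, args.repeat)
```

Besides parsing, argparse converts types (here --repeat becomes an int) and generates a -h/--help message automatically.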

Modules math and cmath

The math module provides access to the mathematical functions defined by the C standard. A similar module for complex numbers is cmath.

import math
import cmath

The arguments for the functions in math and cmath must be numbers. As we will see in the lesson on NumPy, when a function has to operate over many numbers, the NumPy functions are a far more efficient alternative and avoid expensive loops over lists or other low-performance containers.

A few functions are shown as examples:

math

math.ceil(2.5)
3
math.fabs(-3.7)
3.7

fabs only works for real numbers and always returns a floating-point number, even if the argument is an integer. For complex numbers, the built-in abs() returns the magnitude of the complex number:

abs(-1.7+4.5j)
4.810405388322278

math.gcd(a, b) returns the Greatest Common Divisor (GCD) of the integers a and b.

math.gcd(91, 133)
7
math.sqrt(256)
16.0
math.cos(math.pi/3)
0.5000000000000001
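
The last result differs from the exact value 0.5 in the final digit because of floating-point rounding. A short sketch of how such values can be compared with a tolerance, using math.isclose with its default relative tolerance:

```python
import math

value = math.cos(math.pi / 3)

# Compare with a tolerance instead of testing for exact equality.
print(math.isclose(value, 0.5))
```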

cmath

cmath.sqrt(-256)
16j
cmath.cos(1j*math.pi/3)
(1.600286857702386-0j)

Modules os and os.path

Sooner or later you will interact with files and folders. The module os not only provides basic operations on the filesystem but also allows us to gain information about the operating system that is executing Python.

os

import os

The module os provides operating system-dependent functionality. Some functions are not available on every operating system, and calls that the operating system cannot carry out raise OSError.

os.name
'posix'
os.environ
environ{'TERM_PROGRAM': 'Apple_Terminal',
        'SHELL': '/bin/zsh',
        'TERM': 'xterm-color',
        'TMPDIR': '/var/folders/1m/dc_l_kx53tv3qkygf1r7pnmc0000gn/T/',
        'TERM_PROGRAM_VERSION': '440',
        'TERM_SESSION_ID': 'FF04BCC3-7FD5-4F11-AD59-F0BB8196132C',
        'USER': 'guilleaf',
        'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.b46iLsbxXw/Listeners',
        'PATH': '/Users/guilleaf/miniconda3/bin:/opt/local/bin:/opt/local/sbin:/Users/guilleaf/Library/Python/3.11/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/zfs/bin:/opt/X11/bin:/Library/Apple/usr/bin:/Users/guilleaf/Library/Python/3.9/bin',
        '__CFBundleIdentifier': 'com.apple.Terminal',
        'PWD': '/Users/guilleaf',
        'XPC_FLAGS': '0x0',
        'XPC_SERVICE_NAME': '0',
        'SHLVL': '1',
        'HOME': '/Users/guilleaf',
        'LOGNAME': 'guilleaf',
        'DISPLAY': '/private/tmp/com.apple.launchd.KV7LFSdvGQ/org.xquartz:0',
        'OLDPWD': '/Volumes/SSD1TB',
        'LANG': 'en_US.UTF-8',
        '_': '/opt/local/bin/jupyter-lab-3.11',
        '__CF_USER_TEXT_ENCODING': '0x1F5:0x0:0x0',
        'GIT_PYTHON_REFRESH': 'quiet',
        'PYDEVD_USE_FRAME_EVAL': 'NO',
        'JPY_PARENT_PID': '60342',
        'CLICOLOR': '1',
        'FORCE_COLOR': '1',
        'CLICOLOR_FORCE': '1',
        'PAGER': 'cat',
        'GIT_PAGER': 'cat',
        'MPLBACKEND': 'module://matplotlib_inline.backend_inline'}

Individual environment variables can be retrieved:

os.getenv('USER')
'guilleaf'
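
When a variable might not be defined, os.getenv returns None unless a default value is given. A small sketch, in which DEMO_UNSET_VARIABLE is a hypothetical name assumed to be absent from the environment:

```python
import os

# DEMO_UNSET_VARIABLE is a made-up name, assumed not to be defined.
missing = os.getenv("DEMO_UNSET_VARIABLE")              # None when undefined
fallback = os.getenv("DEMO_UNSET_VARIABLE", "default")  # explicit default
print(missing, fallback)
```

Supplying a default avoids a separate check for None in scripts that must run on machines with different environments.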

A few functions reproduce the effect of Unix/Linux commands such as pwd, cd, mkdir, ls, and rmdir:

# Equivalent to pwd
os.getcwd()
'/Users/guilleaf/Documents/GitHub/Notebooks_4SIAH/Python'
# Equivalent to mkdir
if not os.path.exists('test_folder'):
    os.mkdir('test_folder')
# Equivalent to cd
os.chdir('test_folder')
os.chdir('..')
# Equivalent to ls
os.listdir("test_folder")
[]
# Equivalent to rmdir
os.rmdir('test_folder')

The function os.cpu_count() is useful in HPC to determine the number of cores on a machine:

os.cpu_count()
8

The os module is particularly large and the functions above are just a tiny fraction of all those available. It is always better to use functions like os.mkdir() than external calls to system commands. A bad programming habit is using, for example:

os.system("mkdir test_folder")
0

This command not only makes the code non-portable (it will not work on Windows), but on Unix systems it also creates a subshell for an operation that can be executed directly with os.mkdir().
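
As a portable alternative to the shell call, a minimal sketch using os.makedirs is given here. The exist_ok=True argument removes the need for a separate os.path.exists() check. The folder names are illustrative, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

# Work inside a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "test_folder", "results")
    os.makedirs(target, exist_ok=True)   # creates both directory levels
    os.makedirs(target, exist_ok=True)   # repeating the call raises no error
    created = os.path.isdir(target)
print(created)
```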

os.path

This module implements some useful functions on pathnames, such as checking the existence of a file or folder or splitting the filename from the full path.

import os.path

To know if a file or folder exists:

if not os.path.exists('test_folder'):
    os.mkdir('test_folder')
os.path.isfile('test_folder')
False
os.path.isdir('test_folder')
True
fullpath=os.path.abspath('test_folder')
print(fullpath)
/Users/guilleaf/Documents/GitHub/Notebooks_4SIAH/Python/test_folder
os.path.split(fullpath)
('/Users/guilleaf/Documents/GitHub/Notebooks_4SIAH/Python', 'test_folder')

This function splits a path into two components (head, tail) where the tail is the last pathname component and the head is everything leading up to that. The tail part will never contain a slash; if the path ends in a slash, the tail will be empty.

It is useful to separate the filename from the path to that file.
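
A short sketch of the same idea on a file path, also separating the extension with os.path.splitext. The path is hypothetical and the file does not need to exist, because these functions only manipulate strings.

```python
import os.path

# Hypothetical path, built with os.path.join so it is valid on any system.
path = os.path.join("data", "run01", "output.dat")

head, tail = os.path.split(path)      # folder part and file name
root, ext = os.path.splitext(tail)    # file name without and with extension
print(head, tail, root, ext)
```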

Module shutil

The shutil module offers high-level operations on one or more files. Most functions in shutil support copying and removing multiple files with a single call, which is simpler, and usually more efficient, than writing loops that operate on the files individually.

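import shutil
# Create an empty file
wf=open('newfile1','w')
wf.close()
if not os.path.exists('test_folder'):
    os.mkdir('test_folder')
# Copy the file into the folder, preserving its metadata when possible
shutil.copy2('newfile1', 'test_folder')
'test_folder/newfile1'
# Remove the folder together with all of its contents
shutil.rmtree('test_folder')
os.remove("newfile1")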
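
To illustrate operating on several files with a single call, here is a small sketch that copies a whole folder with shutil.copytree and then removes the original with shutil.rmtree. The folder and file names are made up, and everything happens inside a temporary directory.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a source folder holding two small files.
    src = os.path.join(tmp, "src")
    os.mkdir(src)
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(src, name), "w") as fh:
            fh.write(name)

    dst = os.path.join(tmp, "dst")
    shutil.copytree(src, dst)          # copy the folder and its files at once
    copied = sorted(os.listdir(dst))
    shutil.rmtree(src)                 # delete the folder and its contents
    src_exists = os.path.exists(src)
print(copied, src_exists)
```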

Module itertools

In mathematics, statistics, and machine learning, the solution to many problems can be naturally expressed in a functional programming style. We will discuss Python’s support for the functional programming paradigm, and itertools is one of the modules in the Standard Library for programming in this style.

Combinations and permutations are often found in scientific problems. The module itertools offers efficient functions that create iterators for those operations. Unlike actual lists, iterators can represent infinite sequences, producing new elements only as needed, so an iterator uses less memory than the equivalent list.

import itertools

Infinite iterators create sequences of infinite length, as shown below. Two of them are count() and cycle():

index=0
for i in itertools.count(13):
    print(i)
    index=index+1
    if index>9:
        break
13
14
15
16
17
18
19
20
21
22
index=0
for i in itertools.cycle('aeiou'):
    print(i)
    index=index+1
    if index>9:
        break
a
e
i
o
u
a
e
i
o
u
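
The manual counter used in the two loops above can be avoided. One possible sketch uses itertools.islice, which takes a fixed number of elements from any iterator, including an infinite one:

```python
import itertools

# Take the first 10 elements of each infinite iterator.
first_numbers = list(itertools.islice(itertools.count(13), 10))
first_letters = list(itertools.islice(itertools.cycle('aeiou'), 10))
print(first_numbers)
print(first_letters)
```

The resulting lists contain the same values printed by the loops above.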

Finite iterators create sequences of a predetermined length, as shown below. In the itertools module, repeat() called with a number of repetitions is one example:

for i in itertools.repeat('one',5):
    print(i)
one
one
one
one
one

For a large number of iterations, this is more memory efficient than the equivalent:

for i in 5*['one']:
    print(i)
one
one
one
one
one

Combinatoric iterators

One of the reasons for using iterators is to produce permutations and combinations without explicitly creating long lists for evaluation. New values are generated on the fly, which usually needs far less memory than building the full list first.

Iterators for Combinations and Permutations can be created as follows:

for i in itertools.permutations('ABCD',3):
    print(i)
('A', 'B', 'C')
('A', 'B', 'D')
('A', 'C', 'B')
('A', 'C', 'D')
('A', 'D', 'B')
('A', 'D', 'C')
('B', 'A', 'C')
('B', 'A', 'D')
('B', 'C', 'A')
('B', 'C', 'D')
('B', 'D', 'A')
('B', 'D', 'C')
('C', 'A', 'B')
('C', 'A', 'D')
('C', 'B', 'A')
('C', 'B', 'D')
('C', 'D', 'A')
('C', 'D', 'B')
('D', 'A', 'B')
('D', 'A', 'C')
('D', 'B', 'A')
('D', 'B', 'C')
('D', 'C', 'A')
('D', 'C', 'B')
for i in itertools.combinations('ABCD',3):
    print(i)
('A', 'B', 'C')
('A', 'B', 'D')
('A', 'C', 'D')
('B', 'C', 'D')
for i in itertools.product('ABCD',repeat=2):
    print(i)
('A', 'A')
('A', 'B')
('A', 'C')
('A', 'D')
('B', 'A')
('B', 'B')
('B', 'C')
('B', 'D')
('C', 'A')
('C', 'B')
('C', 'C')
('C', 'D')
('D', 'A')
('D', 'B')
('D', 'C')
('D', 'D')

Module json

JSON is a lightweight data interchange format inspired by JavaScript object literal syntax. It is an effective and standard way of storing structured data. JSON is just a format for serializing data, similar to XML but more compact and easier for humans to read.

import json

Consider serializing this dictionary:

polygons={'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}
js=json.dumps(polygons)
js
'{"triangle": 3, "square": 4, "pentagon": 5, "hexagon": 6}'

This is a string that can be easily read by humans and also easily converted back into a Python dictionary.

poly=json.loads(js)
poly
{'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}

There are extra arguments to beautify the string, for example:

print(json.dumps(polygons, sort_keys=True, indent=4))
{
    "hexagon": 6,
    "pentagon": 5,
    "square": 4,
    "triangle": 3
}

Similar to json.dumps and json.loads, there are functions to write and read JSON content directly to and from files. The functions json.dump(obj, fp, ...) and json.load(fp, ...) work on file-like objects, which have to support write() and read() respectively, like normal text file objects.
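
A minimal sketch of a round trip through a file with json.dump and json.load follows. The file name polygons.json is chosen for this example, and a temporary directory keeps the working folder clean.

```python
import json
import os
import tempfile

polygons = {'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "polygons.json")   # illustrative file name
    with open(filename, "w") as fp:
        json.dump(polygons, fp, indent=4)           # write JSON text to the file
    with open(filename) as fp:
        restored = json.load(fp)                    # read it back as a dict
print(restored == polygons)
```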

Module subprocess

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions like os.system.

The underlying process creation and management in this module is handled by the Popen class. It offers a lot of flexibility so that developers can handle the less common cases not covered by the convenience functions.

import subprocess
sp= subprocess.Popen(["ls","-lha","/"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)

sp.wait()
0

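The Popen.communicate() call sends optional input to the process and reads its output until the end of the stream is reached. stdout holds the normal output of the process, while stderr holds its error messages and is usually empty when no error occurs. If you want to wait for the program to finish you can call Popen.wait(). When the output is captured through pipes and may be large, call communicate() rather than wait(), because wait() can block once the pipe buffer fills.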

stout, sterr = sp.communicate()
print(stout)
total 10
drwxr-xr-x  20 root  wheel   640B Jan  1  2020 .
drwxr-xr-x  20 root  wheel   640B Jan  1  2020 ..
lrwxr-xr-x   1 root  admin    36B Jan  1  2020 .VolumeIcon.icns -> System/Volumes/Data/.VolumeIcon.icns
----------   1 root  admin     0B Jan  1  2020 .file
drwxr-xr-x   2 root  wheel    64B Jan  1  2020 .vol
drwxrwxr-x  68 root  admin   2.1K Jul 25 15:27 Applications
drwxr-xr-x  74 root  wheel   2.3K Jun  7 20:06 Library
drwxr-xr-x@  9 root  wheel   288B Jan  1  2020 System
drwxr-xr-x   7 root  admin   224B Jan  1  2020 Users
drwxr-xr-x+  5 root  wheel   160B Jul 25 19:00 Volumes
drwxr-xr-x@ 38 root  wheel   1.2K Jan  1  2020 bin
drwxrwxr-t   2 root  admin    64B Feb 25  2019 cores
dr-xr-xr-x   3 root  wheel   4.5K Jul 20 16:27 dev
lrwxr-xr-x@  1 root  wheel    11B Jan  1  2020 etc -> private/etc
lrwxr-xr-x   1 root  wheel    25B Jul 20 16:27 home -> /System/Volumes/Data/home
drwxr-xr-x   5 root  wheel   160B Dec 31  2023 opt
drwxr-xr-x   6 root  wheel   192B Jan  1  2020 private
drwxr-xr-x@ 65 root  wheel   2.0K Jan  1  2020 sbin
lrwxr-xr-x@  1 root  wheel    11B Jan  1  2020 tmp -> private/tmp
drwxr-xr-x@ 11 root  wheel   352B Jan  1  2020 usr
lrwxr-xr-x@  1 root  wheel    11B Jan  1  2020 var -> private/var

The subprocess module has received several important changes in recent versions of Python 3. Before version 3.5 the high-level functions were subprocess.call(), subprocess.check_call(), and subprocess.check_output(). Since version 3.5, subprocess.run() covers all of these use cases and is the recommended entry point; the older functions remain available.
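
A short sketch of subprocess.run() is shown here. To keep it independent of Unix-only commands such as ls, the child process is the current Python interpreter (sys.executable); this is a choice made for the example, not something taken from the lesson. The capture_output and text arguments require Python 3.7 or newer.

```python
import subprocess
import sys

# Run the current Python interpreter as the child process.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child')"],
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode the output from bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)
print(result.returncode)
print(result.stdout)
```

run() waits for the command to finish and returns a CompletedProcess object, so no separate calls to wait() or communicate() are needed.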

Module multiprocessing

Up to now, we have been dealing with serial processes, but most computers today have several cores that allow us to do multiprocessing. Multiprocessing refers to the ability of a system to use more than one processor at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently and, in most cases, communicate with each other very infrequently. A simple way to picture this is four different drivers who go from point A to point B: each driver can take their own path, but in the end they all meet at point B. Python offers several ways to do this, and the operating system allocates the resulting processes or threads to the processors, improving the performance of the system.

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package effectively side-steps the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

For intensive numerical calculations, multiprocessing should be preferred over threading, a similar module that spawns threads instead of processes.

The frequently used class Pool offers a simple way to spawn multiple workers and apply the same function to every element of an iterable, dividing the workload among the workers. The prototypical example looks like this:

import sys
sys.path.insert(0, './scripts')

from worker import funcx2
import multiprocessing
multiprocessing.cpu_count()
8
import multiprocessing as mp

from worker import funcx2

with mp.Pool() as pool:
    args = list(range(10))
    res = pool.map(funcx2, args)
    pool.close()
    pool.join()
res
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
import multiprocessing

from worker import funcx2

def driver_func():
    PROCESSES = 4
    with multiprocessing.Pool(PROCESSES) as pool:
        params = [(1, ), (2, ), (3, ), (4, )]
        results = [pool.apply_async(funcx2, p) for p in params]

        for r in results:
            print('\t', r.get())
driver_func()
	 1
	 4
	 9
	 16
from multiprocessing import Pool

from worker import funcx2

if __name__ == '__main__':
    with Pool(4) as p:
        print(p.map(funcx2, [1, 2, 3]))
[1, 4, 9]
import math
import matplotlib.pyplot as plt

ppn=multiprocessing.cpu_count()
ppn=1

from worker import funcx2

if __name__ == '__main__':
    with multiprocessing.Pool(ppn) as p:
        ret = p.map(funcx2, range(1,100))
        
plt.plot(ret);

[Figure: plot of ret, the values returned by funcx2 for the inputs 1 to 99]

multiprocessing.cpu_count()
8

multiprocessing.cpu_count() returns the number of cores on the system, which can differ from the number of cores available to the Python process. The recommended way to get the latter is len(os.sched_getaffinity(0)), but this function is absent on some platforms, in particular macOS, Windows, and some old Linux distributions.
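
One way to cope with this difference is sketched here. The helper name available_cores is hypothetical; it uses os.sched_getaffinity where the platform provides it and falls back to os.cpu_count() otherwise.

```python
import os

def available_cores():
    """Number of cores this process may use, with a portable fallback."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity limits, for example those set by a job scheduler.
        return len(os.sched_getaffinity(0))
    # os.cpu_count() can return None, so fall back to 1 in that case.
    return os.cpu_count() or 1

ncores = available_cores()
print(ncores)
```

On a shared HPC node the affinity-based value can be smaller than the total number of cores, which makes it a safer size for a Pool.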

Final Remarks

The Python Standard Library is extensive, and its API is more prone to change than the language itself. In real projects it is better to decide on the oldest version of Python that will be supported and keep compatibility with it until that baseline is moved to a more recent version. Most Linux distributions today include Python 3.9 or newer.

Acknowledgments and References

This Notebook has been adapted by Guillermo Avendaño (WVU), Jose Rogan (Universidad de Chile) and Aldo Humberto Romero (WVU) from the Tutorials for Stanford cs228 and cs231n. In turn, that material was adapted by Volodymyr Kuleshov and Isaac Caswell from the CS231n Python tutorial by Justin Johnson (http://cs231n.github.io/python-numpy-tutorial/). A large part of the content was also written from scratch. Another good resource, particularly for data science and when you just want an answer to a specific question, is planetpython.org.

Changes to the original tutorial include strict Python 3 formats and a split of the material to fit a series of lessons on Python Programming for WVU’s faculty and graduate students.

The support of the National Science Foundation and the US Department of Energy under projects DMREF-NSF 1434897, NSF OAC-1740111, and DOE DE-SC0016176 is acknowledged.

National Science Foundation

Back of the Book

plt.figure(figsize=(3,3))
n = chapter_number
maxt=(2*(n-1)+3)*np.pi/2
t = np.linspace(np.pi/2, maxt, 1000)
tt= 1.0/(t+0.01)
x = (maxt-t)*np.cos(t)**3
y = t*np.sqrt(np.abs(np.cos(t))) + np.sin(0.3*t)*np.cos(2*t)
plt.plot(x, y, c="green")
plt.axis('off');

[Figure: decorative green curve generated for chapter 2]

end = time.time()
print(f'Chapter {chapter_number} run in {int(end - start):d} seconds')
Chapter 2 run in 3 seconds

Key Points

  • Python is an easy-to-learn programming language.