
Paradigms of Parallel Computing

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What are the different paradigms of parallel computing?

  • Which of them can make use of GPUs?

Objectives
  • Review one example of a code written with several methods of parallelization

We will explore several paradigms of parallel computing, from pure CPU computing to GPU computing. The paradigms chosen are OpenMP, OpenACC, OpenCL, MPI, and CUDA. By using the same problem as a baseline, we hope to give a sense of how these different parallelization alternatives work in practice.

We will be solving a very simple problem. DAXPY is the name given in BLAS to the routine that implements the operation Y = A * X + Y, where X and Y are double-precision vectors and A is a scalar. We will compute a trigonometric identity using a DAXPY-like operation as the target of parallelization.

We will compute this formula:

\[\cos(2x) = \cos^2(x) - \sin^2(x)\]

For an array of values of \(x\) in the domain \([0, \pi]\).

All the codes in this lesson will be compiled with the NVIDIA HPC compilers. That gives us a uniform compiler choice, since this compiler suite supports several of the parallel models that will be demonstrated.

To access the compilers on Thorny you need to load the module:

$> module load lang/nvidia/nvhpc/23.3   

To check that the compiler is available execute:

$> nvc --version

You should get something like:

nvc 23.3-0 64-bit target on x86-64 Linux -tp skylake-avx512 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Serial Code

We will start without parallelization. The term given to codes that use a single sequence of instructions is serial. We will also use this example to explain a few of the basic elements of C programming for those not familiar with the language.

The C programming language uses plain text files to describe the program, and the code needs to be compiled. Compiling means that a software package called a compiler translates the text file into a new file that contains the instructions the computer can follow directly to run the program.

The program below populates two vectors and computes a new vector, the difference of the input vectors. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define VECTOR_SIZE 100

int main( int argc, char* argv[] )
{

    int i;
    double alpha=1.0;
    // Number of bytes to allocate for N doubles
    size_t bytes = VECTOR_SIZE*sizeof(double);

    // Allocate memory for arrays X, A, B, and C on host
    double *X = (double*)malloc(bytes);
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

    for(i=0;i<VECTOR_SIZE;i++){
        X[i]=M_PI*(double)(i+1)/VECTOR_SIZE;
        A[i]=B[i]=C[i]=0.0;
    }

    for(i=0;i<VECTOR_SIZE;i++){
        A[i]=cos(X[i])*cos(X[i]);
        B[i]=sin(X[i])*sin(X[i]);
        C[i]=alpha*A[i]-B[i];
    }

    if (VECTOR_SIZE<=100){
      for(i=0;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
    }
    else{
      for(i=VECTOR_SIZE-10;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
    }

    // Release memory
    free(X);
    free(A);
    free(B);
    free(C);

    return 0;
}

Most of this code will reappear in the other implementations, so for those unfamiliar with C programming the code will be reviewed in some detail. That should facilitate understanding when moving to the parallel versions of this code.

The first lines in the code are:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define VECTOR_SIZE 100

The first three lines indicate which header files must be processed to access functions that appear in the code below. In particular, we need stdio.h because we call the function printf, stdlib.h for the functions malloc and free that allocate and deallocate arrays in memory, and math.h because we call math functions such as sin and cos.

The next line is a preprocessor instruction. Any place in the code where the word VECTOR_SIZE appears, excluding string constants, will be replaced by the characters 100 before the code is sent to the actual compiler. That gives us a single place to change in case we want a larger or smaller array. As written, the length of the arrays is hardcoded, meaning that to change it the source must be edited and recompiled.
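
If you want to change the array length without editing the source, a common trick is to guard the macro and pass the value from the command line; this is a small optional variant, not part of the lesson code as written:

#ifndef VECTOR_SIZE
#define VECTOR_SIZE 100   // default length, used only if -DVECTOR_SIZE=... was not given
#endif

With that guard in place the program could be compiled, for example, as:

$> nvc -DVECTOR_SIZE=1000 daxpy_serial.c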

The next lines are:

int main( int argc, char* argv[] )
{

    int i;
    double alpha=1.0;
    // Number of bytes to allocate for N doubles
    size_t bytes = VECTOR_SIZE*sizeof(double);

    // Allocate memory for arrays X, A, B, and C on host
    double *X = (double*)malloc(bytes);
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

The first line in this chunk declares the main function that every program written in C must have; execution starts exactly at that line. A program written in C can read command line arguments from the shell: those arguments are stored in the array variable argv, and their number in the integer argc.

The next lines are declarations of the variables and the type of each one. Unlike higher-level languages such as Python or R, in C each variable must have a definite type: int for integers, double for double-precision floating-point numbers. size_t is a type provided by the stdlib.h header, used here to store the number of bytes to allocate for each array.
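
To get a feeling for these types, the small standalone program below (not part of the lesson code) prints the size in bytes of each of them; on a typical 64-bit Linux machine int is 4 bytes while double and size_t are 8 bytes, although the exact values depend on the platform:

#include <stdio.h>

int main(void)
{
    // %zu is the printf conversion for values of type size_t
    printf("int    : %zu bytes\n", sizeof(int));
    printf("double : %zu bytes\n", sizeof(double));
    printf("size_t : %zu bytes\n", sizeof(size_t));
    return 0;
}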

The final lines allocate the 4 arrays that we will use: X for the domain, A to store \(\cos^2(x)\), B to store \(\sin^2(x)\), and C for the difference of those two arrays.

Next lines are the core of the program:

for(i=0;i<VECTOR_SIZE;i++){
    X[i]=M_PI*(double)(i+1)/VECTOR_SIZE;
    A[i]=B[i]=C[i]=0.0;
}

for(i=0;i<VECTOR_SIZE;i++){
    A[i]=cos(X[i])*cos(X[i]);
    B[i]=sin(X[i])*sin(X[i]);
    C[i]=alpha*A[i]-B[i];
}

There are two loops in this chunk of code. The counter is the integer i, which starts at 0 and stops before reaching VECTOR_SIZE. In C, the indices of a vector of length N start at 0 and end at N-1. After each iteration the counter is increased by one with the instruction i++, shorthand for i=i+1.

The vector X is filled with evenly spaced numbers ending at \(\pi\). Notice the use of M_PI, a constant declared in math.h that provides a double-precision value of \(\pi\). The other arrays are zeroed; that is not strictly necessary, but it shows how several variables can be assigned the same value in a single statement.

The next loop is the main piece of code, and it is the part that will change the most when we explore the different models of parallel programming. In this case a single loop computes A, computes B, and subtracts them to compute C.

The final portion of the code shows the results of the calculations.

if (VECTOR_SIZE<=100){
  for(i=0;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
}
else{
  for(i=VECTOR_SIZE-10;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
}

// Release memory
free(X);
free(A);
free(B);
free(C);

return 0;
}

Here we have a conditional: for small arrays we print the entire set of arrays X, A, B, and C as 4 columns of text on the screen; for larger arrays only the last 10 elements are shown.

Notice how the format of the numbers shown on the screen is indicated. The specifier %9.5f means that each number is printed in a field 9 characters wide with 5 digits after the decimal point; the character f indicates a floating-point number. Other conversion characters are d for integers and s for strings.
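
As a quick illustration of these conversion specifiers, the following line could be added anywhere inside the program above (which already includes stdio.h and math.h); it prints a floating-point number, an integer, and a string:

printf("%9.5f %d %s\n", M_PI, 42, "elements");

It prints "  3.14159 42 elements": the value of M_PI appears in a field 9 characters wide with 5 decimals.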

The next lines deallocate the memory used for the 4 arrays. Every program written in C should return an integer at the end, with 0 meaning successful completion of the program. Any other return value can be interpreted as some sort of failure.

To compile the code, execute:

$> nvc daxpy_serial.c

Execute the code with:

$> ./a.out

That concludes the first program. No parallelism here. The code will use just one core, no matter how big the array is and how many cores you have available on your computer. This is a serial code and can only execute one instruction at a time.

Now we will use this first program to present a few popular models for parallel computing.

OpenMP: Shared-memory parallel programming

The OpenMP (Open Multi-Processing) API is a model for parallel computing that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, HP-UX, Linux, macOS, and Windows. Shared-memory multiprocessing means that it can parallelize operations on multiple cores that are able to address a single pool of memory.

Programming in OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. We will focus only on the C language but there are equivalent directives for C++ and Fortran.

In practice, programming with OpenMP means adding a few directives at critical places in the code we want to parallelize; it is very simple to use and requires minimal changes to the source code.

Consider the next code based on the serial code with OpenMP directives:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#define VECTOR_SIZE 100

int main( int argc, char* argv[] )
{
    int i,id;
    double alpha=1.0;

    // Number of bytes to allocate for N doubles
    size_t bytes = VECTOR_SIZE*sizeof(double);

    // Allocate memory for arrays X, A, B, and C on host
    double *X = (double*)malloc(bytes);
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

    #pragma omp parallel for shared(A,B) private(i,id)
    for(i=0;i<VECTOR_SIZE;i++){
        id = omp_get_thread_num();
        printf("Initializing vectors id=%d working on i=%d\n", id,i);
        X[i]=M_PI*(double)(i+1)/VECTOR_SIZE;
        A[i]=cos(X[i])*cos(X[i]);
        B[i]=sin(X[i])*sin(X[i]);
    }

    #pragma omp parallel for shared(A,B,C) private(i,id) schedule(static,10)
    for(i=0;i<VECTOR_SIZE;i++){
        id = omp_get_thread_num();
        printf("Computing C id=%d working on i=%d\n", id,i);
        C[i]=alpha*A[i]-B[i];
    }

    if (VECTOR_SIZE<=100){
      for(i=0;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
    }
    else{
      for(i=VECTOR_SIZE-10;i<VECTOR_SIZE;i++) printf("%9.5f %9.5f %9.5f %9.5f\n",X[i],A[i],B[i],C[i]);
    }

    // Release memory
    free(A);
    free(B);
    free(C);

    return 0;
}

The good thing about OpenMP is that not much of the code has to change in order to get a decent parallelization. As such, we will focus only on the lines that have changed here:

#include <omp.h>

The OpenMP directives themselves do not require any header; however, we call the runtime function omp_get_thread_num(), which is declared in the header above.

The next two OpenMP directives sit on top of the two loops. First, the initialization loop:

#pragma omp parallel for shared(A,B) private(i,id)
for(i=0;i<VECTOR_SIZE;i++){
    id = omp_get_thread_num();

And the evaluation of the vector:

#pragma omp parallel for shared(A,B,C) private(i,id) schedule(static,10)
for(i=0;i<VECTOR_SIZE;i++){
    id = omp_get_thread_num();

These two #pragma lines sit on top of the for loops. From the point of view of the C language they are just comments, so the language itself is not concerned with them, but a C compiler that supports OpenMP will interpret them and parallelize the for loops. The parallelization is possible because the evaluation of each element i is independent of every other element, and that independence allows different iterations to be assigned to different cores.

There are several directives in the OpenMP specification. The directive omp parallel for is specific to parallelizing for loops; other directives and clauses control how work is assigned to threads. One important pair of clauses for parallel for is shared and private, which declare which variables are shared by all the threads created by OpenMP and which variables get an independent copy on each thread. The loop index i is private by default, so declaring it is optional, but id must be declared private explicitly so that each thread keeps its own copy.

The final clause is schedule(static,10), which assigns loop iterations to threads in chunks of 10. For example, with 100 iterations and 4 threads, thread 0 handles iterations 0-9, 40-49, and 80-89.

OpenMP is a good option for easy parallelization of codes; before OpenMP 4.0 the model was restricted to parallelization on CPUs, while newer versions can also offload work to accelerators such as GPUs.
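
To compile and run the OpenMP version with the NVIDIA HPC compilers, something along these lines should work; the -mp flag enables OpenMP in nvc, the file name daxpy_openmp.c is just an assumed name for the listing above, and the OMP_NUM_THREADS environment variable controls how many threads are used:

$> nvc -mp daxpy_openmp.c
$> export OMP_NUM_THREADS=4
$> ./a.out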

MPI: Distributed Parallel Programming

Message Passing Interface (MPI) is a communication protocol for programming parallel computers. It is designed to allow the coordination of multiple processes running on cores that do not necessarily share memory, such as the cores of different nodes in an HPC cluster.

MPI supports both point-to-point and collective communication. Point-to-point communication is when one process sends a message to another specific process; processes are distinguished by their rank, an integer that uniquely identifies each process.

MPI’s goals are high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today.

The code below is a rewrite of the original serial program, using point-to-point functions to distribute the computation across several MPI processes.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MASTER 0
#define VECTOR_SIZE 100

int main (int argc, char *argv[])
{
    double alpha=1.0;

    int total_proc;            // total number of processes
    int rank;                  // rank of each process
    long long int n_per_proc;  // elements per process

    long long int i, j;
    MPI_Status status;

    // Initialization of MPI environment
    MPI_Init (&argc, &argv);

    MPI_Comm_size (MPI_COMM_WORLD, &total_proc); //The total number of processes running in parallel
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); //The rank of the current process

    double * x;
    double * ap;
    double * bp;
    double * cp;
    double * buf;

    n_per_proc = VECTOR_SIZE/total_proc;
    if(VECTOR_SIZE%total_proc != 0) n_per_proc+=1; // round up when VECTOR_SIZE is not evenly divisible

    x  = (double *) malloc(sizeof(double)*n_per_proc);
    ap = (double *) malloc(sizeof(double)*n_per_proc);
    bp = (double *) malloc(sizeof(double)*n_per_proc);
    cp = (double *) malloc(sizeof(double)*n_per_proc);
    buf = (double *) malloc(sizeof(double)*n_per_proc);

    for(i=0;i<n_per_proc;i++){
        //printf("i=%d\n",i);
        x[i]=M_PI*(rank*n_per_proc+i)/VECTOR_SIZE;
        ap[i]=cos(x[i])*cos(x[i]);
        bp[i]=sin(x[i])*sin(x[i]);
        cp[i]=alpha*ap[i]-bp[i];
    }

    if (rank == MASTER){

        for(i=0;i<total_proc;i++){

            if(i==MASTER){
                printf("Skip\n");
            }
            else{
                printf("Receiving from %d...\n",i);
                MPI_Recv(buf, n_per_proc, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
                for(j=0;j<n_per_proc;j++) printf("rank=%d i=%d x=%f c=%f\n",i,j,M_PI*(i*n_per_proc+j)/VECTOR_SIZE,buf[j]);
            }
        }
    }
    else{
        MPI_Send(cp, n_per_proc, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    //Terminate MPI Environment
    return 0;
}

There are many changes in the code here; writing MPI code means important changes in the overall structure of the program, because we have to organize the sending and receiving of data between the different processes.

The first change is the header:

#include "mpi.h"

All MPI programs must include that header to access the functions in the MPI API.

The next relevant chunk of code is:

MPI_Status status;

// Initialization of MPI environment
MPI_Init (&argc, &argv);

MPI_Comm_size (MPI_COMM_WORLD, &total_proc); //The total number of processes running in parallel
MPI_Comm_rank (MPI_COMM_WORLD, &rank); //The rank of the current process

Here we see a call to MPI_Init(), the very first MPI function, which must be called before any other. A variable of type MPI_Status is declared to be used later by the receiving function. The calls to MPI_Comm_size and MPI_Comm_rank retrieve the total number of processes involved in the calculation, a number that is only known at runtime, and the rank of the current process. Each individual process receives a number called its rank, and we store that integer in the variable with that name.
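
A minimal, self-contained example using just these calls (a hypothetical hello.c, not part of the lesson code) looks like this:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int total_proc, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &total_proc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every process runs the same code, each with its own rank
    printf("Hello from rank %d of %d\n", rank, total_proc);

    MPI_Finalize();
    return 0;
}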

In MPI we can avoid allocating the full-size arrays in memory: by allocating only the portion of each array that will be used on each rank, we decrease the overall memory usage. A poorly written code would allocate the entire arrays and use just a portion of the values. Here we allocate only the amount of data actually needed.

n_per_proc = VECTOR_SIZE/total_proc;
if(VECTOR_SIZE%total_proc != 0) n_per_proc+=1; // round up when VECTOR_SIZE is not evenly divisible

x  = (double *) malloc(sizeof(double)*n_per_proc);
ap = (double *) malloc(sizeof(double)*n_per_proc);
bp = (double *) malloc(sizeof(double)*n_per_proc);
cp = (double *) malloc(sizeof(double)*n_per_proc);
buf = (double *) malloc(sizeof(double)*n_per_proc);

for(i=0;i<n_per_proc;i++){
    //printf("i=%d\n",i);
    x[i]=M_PI*(rank*n_per_proc+i)/VECTOR_SIZE;
    ap[i]=cos(x[i])*cos(x[i]);
    bp[i]=sin(x[i])*sin(x[i]);
    cp[i]=alpha*ap[i]-bp[i];
}

The arrays ap, bp, cp and x are smaller, with size n_per_proc instead of VECTOR_SIZE. Notice that x must be initialized correctly with

    x[i]=M_PI*(rank*n_per_proc+i)/VECTOR_SIZE;

Each rank is just allocating and initializing the portion of data that it will manage.

The last chunk of source:

if (rank == MASTER){

    for(i=0;i<total_proc;i++){

        if(i==MASTER){
            printf("Skip\n");
        }
        else{
            printf("Receiving from %d...\n",i);
            MPI_Recv(buf, n_per_proc, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            for(j=0;j<n_per_proc;j++) printf("rank=%d i=%d x=%f c=%f\n",i,j,M_PI*(i*n_per_proc+j)/VECTOR_SIZE,buf[j]);
        }
    }
}
else{
    MPI_Send(cp, n_per_proc, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}

The MASTER rank (defined here as 0) receives the data from the other ranks; it could be used to assemble the complete array or simply to continue the execution based on the data processed by all the ranks. In this simple example we just print the data. The important lines here are:

MPI_Send(cp, n_per_proc, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);

and

MPI_Recv(buf, n_per_proc, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);

They send and receive the array given as the first argument; in our case each message contains n_per_proc numbers of kind MPI_DOUBLE (double in C). The receive names rank i as the source, both calls use 0 as the message tag, and the ranks involved belong to MPI_COMM_WORLD, the communicator that spans all the processes initialized by MPI.
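
For reference, these two functions have the following signatures (shown here as declared in recent versions of the MPI standard):

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);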

The final MPI call of any program is:

MPI_Finalize();
//Terminate MPI Environment
return 0;

The calls MPI_Init() and MPI_Finalize() are the first and last MPI functions called in any program using MPI.
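
To compile and run the MPI version you would normally use the MPI wrapper compiler and launcher that ship with the MPI installation (the NVIDIA HPC SDK bundles one); the file name daxpy_mpi.c and the number of processes below are just examples:

$> mpicc daxpy_mpi.c
$> mpirun -np 4 ./a.out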

OpenCL: A model for heterogeneous parallel computing

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, digital signal processors, field-programmable gate arrays, and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces to control the platform and execute programs on the computing devices.

Our program written in OpenCL looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif
#define VECTOR_SIZE 1024

//OpenCL kernel which is run for every work item created.
const char *daxpy_kernel =
"__kernel                                   \n"
"void daxpy_kernel(double alpha,     \n"
"                  __global double *A,       \n"
"                  __global double *B,       \n"
"                  __global double *C)       \n"
"{                                          \n"
"    //Get the index of the work-item       \n"
"    int index = get_global_id(0);          \n"
"    C[index] = alpha* A[index] - B[index]; \n"
"}                                          \n";

int main(void) {
  int i;
  // Allocate space for vectors A, B and C
  double alpha = 1.0;
  double *X = (double*)malloc(sizeof(double)*VECTOR_SIZE);
  double *A = (double*)malloc(sizeof(double)*VECTOR_SIZE);
  double *B = (double*)malloc(sizeof(double)*VECTOR_SIZE);
  double *C = (double*)malloc(sizeof(double)*VECTOR_SIZE);
  for(i = 0; i < VECTOR_SIZE; i++)
  {
    X[i] = M_PI*(double)(i+1)/VECTOR_SIZE;
    A[i] = cos(X[i])*cos(X[i]);
    B[i] = sin(X[i])*sin(X[i]);
    C[i] = 0;
  }

  // Get platform and device information
  cl_platform_id * platforms = NULL;
  cl_uint     num_platforms;
  //Set up the Platform
  cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
  platforms = (cl_platform_id *)
  malloc(sizeof(cl_platform_id)*num_platforms);
  clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

  //Get the devices list and choose the device you want to run on
  cl_device_id     *device_list = NULL;
  cl_uint           num_devices;

  clStatus = clGetDeviceIDs( platforms[0], CL_DEVICE_TYPE_GPU, 0,NULL, &num_devices);
  device_list = (cl_device_id *)
  malloc(sizeof(cl_device_id)*num_devices);
  clStatus = clGetDeviceIDs( platforms[0],CL_DEVICE_TYPE_GPU, num_devices, device_list, NULL);

  // Create one OpenCL context for each device in the platform
  cl_context context;
  context = clCreateContext( NULL, num_devices, device_list, NULL, NULL, &clStatus);

  // Create a command queue
  cl_command_queue command_queue = clCreateCommandQueue(context, device_list[0], 0, &clStatus);

  // Create memory buffers on the device for each vector
  cl_mem A_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);
  cl_mem B_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);
  cl_mem C_clmem = clCreateBuffer(context, CL_MEM_WRITE_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);

  // Copy the Buffer A and B to the device
  clStatus = clEnqueueWriteBuffer(command_queue, A_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), A, 0, NULL, NULL);
  clStatus = clEnqueueWriteBuffer(command_queue, B_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), B, 0, NULL, NULL);

  // Create a program from the kernel source
  cl_program program = clCreateProgramWithSource(context, 1,(const char **)&daxpy_kernel, NULL, &clStatus);

  // Build the program
  clStatus = clBuildProgram(program, 1, device_list, NULL, NULL, NULL);

  // Create the OpenCL kernel
  cl_kernel kernel = clCreateKernel(program, "daxpy_kernel", &clStatus);

  // Set the arguments of the kernel
  clStatus = clSetKernelArg(kernel, 0, sizeof(double), (void *)&alpha);
  clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
  clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
  clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

  // Execute the OpenCL kernel on the list
  size_t global_size = VECTOR_SIZE; // Total number of work-items (the entire array)
  size_t local_size = 64;           // Work-group size: 64 work-items per group
  clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

  // Read the cl memory C_clmem on device to the host variable C
  clStatus = clEnqueueReadBuffer(command_queue, C_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), C, 0, NULL, NULL);

  // Clean up and wait for all the commands to complete.
  clStatus = clFlush(command_queue);
  clStatus = clFinish(command_queue);

  // Display the result to the screen
  for(i = 0; i < VECTOR_SIZE; i++)
    printf("%f * %f - %f = %f\n", alpha, A[i], B[i], C[i]);

  // Finally release all OpenCL allocated objects and host buffers.
  clStatus = clReleaseKernel(kernel);
  clStatus = clReleaseProgram(program);
  clStatus = clReleaseMemObject(A_clmem);
  clStatus = clReleaseMemObject(B_clmem);
  clStatus = clReleaseMemObject(C_clmem);
  clStatus = clReleaseCommandQueue(command_queue);
  clStatus = clReleaseContext(context);
  free(A);
  free(B);
  free(C);
  free(platforms);
  free(device_list);
  return 0;
}

There are many new lines here, most of them boilerplate code, so we will digest the most relevant chunks.

#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

These lines include the OpenCL header; its location differs between macOS and Linux, so preprocessor conditionals are used.

//OpenCL kernel which is run for every work item created.
const char *daxpy_kernel =
"__kernel                                   \n"
"void daxpy_kernel(double alpha,     \n"
"                  __global double *A,       \n"
"                  __global double *B,       \n"
"                  __global double *C)       \n"
"{                                          \n"
"    //Get the index of the work-item       \n"
"    int index = get_global_id(0);          \n"
"    C[index] = alpha* A[index] - B[index]; \n"
"}                                          \n";

Central to OpenCL is the idea of a kernel function, the portion of code that will be offloaded to the GPU or any other accelerator. For teaching purposes the kernel is written here as a string constant, but it can also live in a separate file. In this function we just compute the difference alpha*A[index] - B[index].
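
One caveat: double precision is an optional feature in OpenCL, and some implementations require it to be enabled explicitly inside the kernel source. If the kernel fails to build on your device, a variant of the kernel string that enables the cl_khr_fp64 extension would look like this:

const char *daxpy_kernel =
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
"__kernel                                      \n"
"void daxpy_kernel(double alpha,               \n"
"                  __global double *A,         \n"
"                  __global double *B,         \n"
"                  __global double *C)         \n"
"{                                             \n"
"    int index = get_global_id(0);             \n"
"    C[index] = alpha * A[index] - B[index];   \n"
"}                                             \n";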

// Get platform and device information
cl_platform_id * platforms = NULL;
cl_uint     num_platforms;
//Set up the Platform
cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
platforms = (cl_platform_id *)
malloc(sizeof(cl_platform_id)*num_platforms);
clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

//Get the devices list and choose the device you want to run on
cl_device_id     *device_list = NULL;
cl_uint           num_devices;

clStatus = clGetDeviceIDs( platforms[0], CL_DEVICE_TYPE_GPU, 0,NULL, &num_devices);
device_list = (cl_device_id *)
malloc(sizeof(cl_device_id)*num_devices);
clStatus = clGetDeviceIDs( platforms[0],CL_DEVICE_TYPE_GPU, num_devices, device_list, NULL);

// Create one OpenCL context for each device in the platform
cl_context context;
context = clCreateContext( NULL, num_devices, device_list, NULL, NULL, &clStatus);

// Create a command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_list[0], 0, &clStatus);

// Create memory buffers on the device for each vector
cl_mem A_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);
cl_mem B_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);
cl_mem C_clmem = clCreateBuffer(context, CL_MEM_WRITE_ONLY,VECTOR_SIZE * sizeof(double), NULL, &clStatus);

Most of this is boilerplate code: it identifies the device, creates an OpenCL context and a command queue, and allocates the memory buffers on the device.

// Copy the Buffer A and B to the device
clStatus = clEnqueueWriteBuffer(command_queue, A_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), A, 0, NULL, NULL);
clStatus = clEnqueueWriteBuffer(command_queue, B_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), B, 0, NULL, NULL);

// Create a program from the kernel source
cl_program program = clCreateProgramWithSource(context, 1,(const char **)&daxpy_kernel, NULL, &clStatus);

// Build the program
clStatus = clBuildProgram(program, 1, device_list, NULL, NULL, NULL);

// Create the OpenCL kernel
cl_kernel kernel = clCreateKernel(program, "daxpy_kernel", &clStatus);

// Set the arguments of the kernel
clStatus = clSetKernelArg(kernel, 0, sizeof(double), (void *)&alpha);
clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

// Execute the OpenCL kernel on the list
size_t global_size = VECTOR_SIZE; // Total number of work-items (the entire array)
size_t local_size = 64;           // Work-group size: 64 work-items per group
clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

This section of the code will look very similar when we explore CUDA, the main model of the next lectures. Here we are copying data from the host to the device, creating the kernel program, and preparing the execution on the device.

// Read the cl memory C_clmem on device to the host variable C
clStatus = clEnqueueReadBuffer(command_queue, C_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(double), C, 0, NULL, NULL);

// Clean up and wait for all the commands to complete.
clStatus = clFlush(command_queue);
clStatus = clFinish(command_queue);

// Display the result to the screen
for(i = 0; i < VECTOR_SIZE; i++)
  printf("%f * %f - %f = %f\n", alpha, A[i], B[i], C[i]);

// Finally release all OpenCL allocated objects and host buffers.
clStatus = clReleaseKernel(kernel);
clStatus = clReleaseProgram(program);
clStatus = clReleaseMemObject(A_clmem);
clStatus = clReleaseMemObject(B_clmem);
clStatus = clReleaseMemObject(C_clmem);
clStatus = clReleaseCommandQueue(command_queue);
clStatus = clReleaseContext(context);
free(A);
free(B);
free(C);
free(platforms);
free(device_list);
return 0;
}

The final section reads the result back from the device, waits for all the commands to complete, prints the values of the arrays as we did in the serial code, and finally releases the OpenCL objects and frees the memory on the host.
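
Compiling the OpenCL version only requires a regular C compiler plus the OpenCL headers and library, which are shipped with the CUDA toolkit; the file name daxpy_opencl.c is an assumed name, and you may need to add the toolkit's include and library directories to the compiler flags:

$> nvc daxpy_opencl.c -lOpenCL
$> ./a.out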

OpenACC: Model for heterogeneous parallel programming

OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify the parallel programming of heterogeneous CPU/GPU systems.

OpenACC is similar to OpenMP: the programmer annotates C, C++, and Fortran source code with compiler directives and additional functions to identify the regions that should be accelerated. Like OpenMP 4.0 and newer, OpenACC can target both CPU and GPU architectures and launch computational code on them.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define VECTOR_SIZE 100

int main( int argc, char* argv[] )
{

    double alpha=1.0;
    double *restrict X;
    // Input vectors
    double *restrict A;
    double *restrict B;
    // Output vector
    double *restrict C;

    // Size, in bytes, of each vector
    size_t bytes = VECTOR_SIZE*sizeof(double);

    // Allocate memory for each vector
    X = (double*)malloc(bytes);
    A = (double*)malloc(bytes);
    B = (double*)malloc(bytes);
    C = (double*)malloc(bytes);

    // Initialize content of input vectors, vector A[i] = cos(i)^2 vector B[i] = sin(i)^2
    int i;
    for(i=0; i<VECTOR_SIZE; i++) {
        X[i]=M_PI*(double)(i+1)/VECTOR_SIZE;
        A[i]=cos(X[i])*cos(X[i]);
        B[i]=sin(X[i])*sin(X[i]);
    }

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(A[0:VECTOR_SIZE],B[0:VECTOR_SIZE]), copyout(C[0:VECTOR_SIZE])
    for(i=0; i<VECTOR_SIZE; i++) {
        C[i] = alpha*A[i]-B[i];
    }

    for(i=0; i<VECTOR_SIZE; i++) printf("X=%9.5f C=%9.5f\n",X[i],C[i]);

    // Release memory
    free(A);
    free(B);
    free(C);

    return 0;
}

OpenACC works very much like OpenMP: you add #pragma lines that are comments from the point of view of the C language but are interpreted by the compiler when it supports OpenACC.

In this chunk:

// sum component wise and save result into vector c
#pragma acc kernels copyin(A[0:VECTOR_SIZE],B[0:VECTOR_SIZE]), copyout(C[0:VECTOR_SIZE])
for(i=0; i<VECTOR_SIZE; i++) {
    C[i] = alpha*A[i]-B[i];
}

How OpenACC works is better understood with a deeper understanding of CUDA. The #pragma parallelizes the for loop, copying the A and B arrays to the memory on the GPU and bringing the C array back. The main difference between OpenMP and OpenACC is that, to operate on accelerators such as GPUs, the arrays and variables involved must be copied to the device and the results copied back to the host.
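
To compile the OpenACC version for the GPU with nvc, something like the following should work; the -acc flag enables OpenACC, -Minfo=accel asks the compiler to report what it accelerated, and daxpy_openacc.c is an assumed file name:

$> nvc -acc -Minfo=accel daxpy_openacc.c
$> ./a.out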

CUDA: Parallel Computing on NVIDIA GPUs

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA for its GPUs. The purpose of CUDA is to enable general-purpose computing on CUDA-enabled graphics processing units (GPUs), sometimes termed GPGPU (General-Purpose computing on Graphics Processing Units).

The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels. CUDA works as a superset of the C language and as such it cannot be compiled by a normal C compiler. NVIDIA provides the CUDA Toolkit, a set of tools to compile and run programs written in CUDA.

To access the CUDA compilers on Thorny you need to load the module:

$> module load lang/nvidia/nvhpc/23.3   

To check that the compiler is available execute:

$> nvcc --version

You should get something like:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

The CUDA version of our program looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Size of array
#define VECTOR_SIZE 100

// Kernel
__global__ void add_vectors(double alpha, double *a, double *b, double *c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if(id < VECTOR_SIZE) c[id] = alpha * a[id] - b[id];
}

// Main program
int main( int argc, char* argv[] )
{
    double alpha=1.0;
    // Number of bytes to allocate for N doubles
    size_t bytes = VECTOR_SIZE*sizeof(double);

    // Allocate memory for arrays A, B, and C on host
    double *X = (double*)malloc(bytes);
    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);

    // Allocate memory for arrays d_A, d_B, and d_C on device
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Fill host arrays A and B
    for(int i=0; i<VECTOR_SIZE; i++)
    {
        X[i]=M_PI*(double)(i+1)/VECTOR_SIZE;
        A[i]=cos(X[i])*cos(X[i]);
        B[i]=sin(X[i])*sin(X[i]);
    }

    // Copy data from host arrays A and B to device arrays d_A and d_B
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    // Set execution configuration parameters
    //      thr_per_blk: number of CUDA threads per grid block
    //      blk_in_grid: number of blocks in grid
    int thr_per_blk = 32;
    int blk_in_grid = ceil( double(VECTOR_SIZE) / thr_per_blk );

    // Launch kernel
    add_vectors<<< blk_in_grid, thr_per_blk >>>(alpha, d_A, d_B, d_C);

    // Copy data from device array d_C to host array C
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Verify results
    for(int i=0; i<VECTOR_SIZE; i++)
    {
        if(C[i] != alpha*A[i]-B[i])
        {
            printf("\nError: value of C[%d] = %f instead of %f\n\n", i, C[i], alpha*A[i]-B[i]);
            //exit(-1);
        }
        else
        {
            printf("X=%f C=%f\n",X[i],C[i]);
        }
    }

    // Free CPU memory
    free(A);
    free(B);
    free(C);

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    printf("\n---------------------------\n");
    printf("__SUCCESS__\n");
    printf("---------------------------\n");
    printf("VECTOR SIZE       = %d\n", VECTOR_SIZE);
    printf("Threads Per Block = %d\n", thr_per_blk);
    printf("Blocks In Grid    = %d\n", blk_in_grid);
    printf("---------------------------\n\n");

    return 0;
}

There are two main sections in this code that depart from the C language:

// Kernel
__global__ void add_vectors(double alpha, double *a, double *b, double *c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if(id < VECTOR_SIZE) c[id] = alpha * a[id] - b[id];
}

The function add_vectors runs on the GPU using memory allocated on the device. The integer id identifies the exact index in the array based on the indices of the thread and the block, the main components of the thread hierarchy of execution in CUDA. For example, with 32 threads per block, thread threadIdx.x = 5 in block blockIdx.x = 2 works on element id = 2*32 + 5 = 69.

// Allocate memory for arrays d_A, d_B, and d_C on device
double *d_A, *d_B, *d_C;
cudaMalloc(&d_A, bytes);
cudaMalloc(&d_B, bytes);
cudaMalloc(&d_C, bytes);

In this section the memory on the device is allocated. Those allocations are separate from the arrays on the host, and the next lines transfer data from the host to the device:

// Copy data from host arrays A and B to device arrays d_A and d_B
cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

The operation of transferring data from the host to the device is a blocking operation. The execution will not return to the CPU until the transfer is completed.

// Set execution configuration parameters
//      thr_per_blk: number of CUDA threads per grid block
//      blk_in_grid: number of blocks in grid
int thr_per_blk = 32;
int blk_in_grid = ceil( double(VECTOR_SIZE) / thr_per_blk );

// Launch kernel
add_vectors<<< blk_in_grid, thr_per_blk >>>(alpha, d_A, d_B, d_C);

// Copy data from device array d_C to host array C
cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

In this section we choose two numbers that are important for performance in CUDA: the number of threads per block and the number of blocks in the grid. CUDA organizes parallel execution in threads; threads are grouped into blocks (arrays of up to 3 dimensions), and blocks are arranged into a grid (also up to 3 dimensions). For the simple case above we have just one-dimensional blocks and a one-dimensional grid: with VECTOR_SIZE equal to 100 and 32 threads per block we get ceil(100/32) = 4 blocks, that is 128 threads, which is why the kernel checks id < VECTOR_SIZE before writing.

The call to add_vectors is a non-blocking operation. Execution returns to the host as soon as the function is called. The device will work on that function independently from the host.

Only the function cudaMemcpy() will impose a barrier, waiting for the device to complete execution before copying the data back from the device to the host.
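
To compile and run the CUDA version, assuming the listing above is saved as daxpy_cuda.cu (CUDA sources use the .cu extension so that nvcc processes the kernel syntax):

$> nvcc daxpy_cuda.cu
$> ./a.out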

Key Points

  • The serial code only allows one core to follow the instructions in the code

  • OpenMP uses several cores on the same machine as all the cores are able to address the same memory

  • OpenACC is capable of using both multiple cores and accelerators such as GPUs

  • MPI is used for distributed parallel computing, as it can use cores on different compute nodes

  • OpenCL is an effort for a platform-independent model for hybrid computing

  • CUDA is the model used by NVIDIA for accessing compute power from NVIDIA GPUs