Parallel Programming with OpenMP
Overview
Teaching: 45 min
Exercises: 15 minQuestions
When OpenMP can be the right choice for accelerating an algorithm
What is the difference between OpenMP and MPI
Objectives
Introduce OpenMP and API for parallel computing on SMP machines.
Parallel Programming with OpenMP
OpenMP is an Application Program Interface (API) that provides a portable, scalable model for developers of shared memory parallel applications. OpenMP works with Symmetric Multiprocessing (SMP) The API supports C/C++ and Fortran on a wide variety of architectures. This tutorial covers some of the major features of OpenMP 4.0, including its various constructs and directives for specifying parallel regions, work sharing, synchronization and data environment. Runtime library functions and environment variables are also covered. This tutorial includes both C and Fortran example codes and a lab exercise.
What is a Symmetric Multiprocessing (SMP)?
Most computers today have several cores, that is also true for the individual nodes on a cluster. The several cores in a node have access to the main memory of the machine. When several processing units see the same memory we say that that machine is a SMP system. Most multiprocessor systems today use an SMP architecture. Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all I/O devices, and are controlled by a single operating system instance that treats all processors equally.
What is OpenMP
OpenMP is an implementation of multithreading, a method of parallelizing whereby a master thread (a series of instructions executed consecutively) forks a specified number of slave threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data-environment management, thread synchronization, user-level runtime routines and environment variables.
In C/C++, OpenMP uses #pragmas. Those instructions are comments to the non openmp compiler but can be understood if the compiler support them and if the compilation line includes instructions to include openmp in the final object. In Fortran OpenMP uses comments like “!$omp”, they are also comments in Fortran 90+ but can be used by the compiler if the compilation arguments are included.
First example
Lets create a very simple example in C. Lets call it “omp_simple.c”
#include <stdio.h>
int main(int argc, char *argv[])
{
#pragma omp parallel
printf("This is a thread.\n");
return 0;
}
gcc -fopenmp omp_simple.c
Executing the executable “a.out” will produce:
This is a thread.
This is a thread.
This is a thread.
This is a thread.
...
The actual number of threads is based on the number of cores available on the machine. However, you can control that number changing the environmental variable “OMP_NUM_THREADS”
OMP_NUM_THREADS=3 ./a.out
Similarly, the fortran version of the same program is
program hello
!$omp parallel
print *, 'This is a thread'
!$omp end parallel
end program hello
gfortran -fopenmp omp_simple.f90
And executed like
OMP_NUM_THREADS=3 ./a.out
A parallel loop
Now its time to a more useful and complex example. One of the typical cases where OpenMP offers advantages is doing Single Instruction Multiple Data (SIMD) operations. For example, if you want to compute the dot product of two vectors, you are doing the same operation, multiplying two numbers, for all the indices of the vectors. As both vectors are on the same memory space that all cores can see, the operation can be easily compute concurrently, by giving each core a chunk of the vector and storing the result on another one also shared by all cores. Lets see the example of that. The C version looks like this:
#include <omp.h>
#include <stdio.h>
#define N 10000
#define CHUNKSIZE 100
int main(int argc, char *argv[]) {
int i, chunk;
float a[N], b[N], c[N];
/* Initializing vectors */
for (i=0; i < N; i++)
{
a[i] = i;
b[i] = i * N;
}
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk)
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
} /* end of parallel region */
for (i=0; i < 10; i++) printf("%17.1f %17.1f %17.1f\n", a[i], b[i], c[i]);
printf("...\n");
for (i=N-10; i < N; i++) printf("%17.1f %17.1f %17.1f\n", a[i], b[i], c[i]);
}
And the fortran version is:
program omp_vec_addition
integer n, chunksize, chunk, i
parameter (n=10000)
parameter (chunksize=100)
real a(n), b(n), c(n)
!initialization of vectors
do i = 1, n
a(i) = i
b(i) = a(i) * n
enddo
chunk = chunksize
!$omp parallel shared(a,b,c,chunk) private(i)
!$omp do schedule(dynamic,chunk)
do i = 1, n
c(i) = a(i) + b(i)
enddo
!$omp end do
!$omp end parallel
do i = 1, 10
write(*,'(3F17.1)') a(i), b(i), c(i)
end do
write(*,*) '...'
do i = 1, 10
write(*,'(3F17.1)') a(n-10+i), b(n-10+i), c(n-10+i)
end do
end program omp_vec_addition
Notice that we have declare that vectors a, b and c are shared; the different threads are actually reading and writing on the same memory. Each thread, however, keeps a private copy of i as each value will change and the loop will be run for different threads for different chunks.
The instruction “dynamic” indicates that each thread will receive a new chunk everytime it completes the previous one. The size of the chunk is determined by the variabl “chunk” and for this example is set to 100.
Concurrent sections
Spliting loops with SIMD operations is one of the most common ways. However, OpenMP offers several other functionalities for parallel programming. One of them is declaring sections of code that can be run concurrently and instruct the compiler to create separate threads for each of them.
The SECTIONS directive specifies that the enclosed sections of code are to be divided among the threads in the team. Independent SECTION directives are group within a SECTIONS directive. Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads. If there are more threads than sections, some threads will not execute a section and some will.
Lets see than with an example, the code in C.
#include <stdio.h>
#include <omp.h>
#define N 10000
int main(int argc, char *argv[]) {
int i, tid;
float a[N], b[N], c[N], d[N];
/* Some initializations */
for (i=0; i < N; i++) {
a[i] = i * i;
b[i] = i + i;
}
#pragma omp parallel shared(a,b,c,d) private(i, tid)
{
tid=omp_get_thread_num();
#pragma omp sections nowait
{
#pragma omp section
{
printf("Thread: %d working on + section\n", tid);
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
}
#pragma omp section
{
printf("Thread: %d working on * section\n", tid);
for (i=0; i < N; i++)
d[i] = a[i] * b[i];
}
} /* end of sections */
} /* end of parallel region */
for (i=0; i < 10; i++) printf("%17.1f %17.1f %17.1f %17.1f\n", a[i], b[i], c[i], d[i]);
printf("...\n");
for (i=N-10; i < N; i++) printf("%17.1f %17.1f %17.1f %17.1f\n", a[i], b[i], c[i], d[i]);
}
The almost equivalent version if fortran will be:
program vec_add_sections
integer n, i, tid, omp_get_thread_num
parameter (n=10000)
real a(n), b(n), c(n), d(n)
!some initializations
do i = 1, n
a(i) = i * i
b(i) = i + i
enddo
!$omp parallel shared(a,b,c,d), private(i, tid)
tid=omp_get_thread_num()
!$omp sections
!$omp section
write(*,*) 'Thread: ', tid, ' working on + section'
do i = 1, n
c(i) = a(i) + b(i)
enddo
!$omp section
write(*,*) 'Thread: ', tid, ' working on * section'
do i = 1, n
d(i) = a(i) * b(i)
enddo
!$omp end sections nowait
!$omp end parallel
do i = 1, 10
write(*,'(4f17.1)') a(i), b(i), c(i), d(i)
end do
write(*,*) '...'
do i = 1, 10
write(*,'(4f17.1)') a(n-10+i), b(n-10+i), c(n-10+i), d(n-10+i)
end do
end program vec_add_sections
In this example we have created two sections. If your OMP_NUM_THREADS is larger than 2, the other threads will still exists but no work will be allocated to them.
Collecting partial results.
Sometimes the result of the operation needs to be accumulated on a single variable. You could eventually create and array to store partial results, but OpenMP offers also the possibility of a shared variable that is not rewritten but accumulated by several threads working on it. Let see than on a basic example of the dot product. The C version:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
# define N 10
int main(int argc, char *argv[]) {
int i, chunk;
float a[N], b[N], result;
/* Some initializations */
chunk = 10;
result = 0.0;
for (i=0; i < N; i++) {
a[i] = drand48();
b[i] = drand48();
}
for (i=0; i < N; i++)
printf("%17.3f %17.3f\n", a[i], b[i]);
#pragma omp parallel for \
default(shared) private(i) \
schedule(static,chunk) \
reduction(+:result)
for (i=0; i < N; i++)
result = result + (a[i] * b[i]);
printf("Dot Product dot(a,b) = %f\n",result);
}
The Fortran version:
program dot_product
integer n, chunksize, chunk, i, nn
parameter (n=10)
parameter (chunksize=10)
real a(n), b(n), result
integer, allocatable :: seed(:)
call random_seed(size = nn)
allocate(seed(nn))
seed(:)=0.0
call random_seed(put=seed)
!some initializations
!$omp parallel do &
!$omp default(shared) private(i)
do i = 1, n
call random_number(a(i))
call random_number(b(i))
enddo
!$omp end parallel do
do i = 1, n
write(*,'(2F17.3)') a(i),b(i)
end do
result= 0.0
chunk = chunksize
!$omp parallel do &
!$omp default(shared) private(i) &
!$omp schedule(static,chunk) &
!$omp reduction(+:result)
do i = 1, n
result = result + (a(i) * b(i))
enddo
!$omp end parallel do
print *, 'Dot product of dot(a,b) = ', result
end program dot_product
Runtime Library functions
There are a number of functions that can be called from the openmp library. This is summary of some of them:
routine | purpose |
---|---|
omp_set_num_threads | sets the number of threads that will be used in the next parallel region |
omp_get_num_threads | returns the number of threads that are currently in the team executing the parallel region from which it is called |
omp_get_max_threads | returns the maximum value that can be returned by a call to the omp_get_num_threads function |
omp_get_thread_num | returns the thread number of the thread, within the team, making this call |
omp_get_thread_limit | returns the maximum number of openmp threads available to a program |
omp_get_num_procs | returns the number of processors that are available to the program |
omp_in_parallel | used to determine if the section of code which is executing is parallel or not |
omp_set_dynamic | enables or disables dynamic adjustment (by the run time system) of the number of threads available for execution of parallel regions |
omp_get_dynamic | used to determine if dynamic thread adjustment is enabled or not |
omp_set_nested | used to enable or disable nested parallelism |
omp_get_nested | used to determine if nested parallelism is enabled or not |
omp_set_schedule | sets the loop scheduling policy when “runtime” is used as the schedule kind in the openmp directive |
omp_get_schedule | returns the loop scheduling policy when “runtime” is used as the schedule kind in the openmp directive |
omp_set_max_active_levels | sets the maximum number of nested parallel regions |
omp_get_max_active_levels | returns the maximum number of nested parallel regions |
omp_get_level | returns the current level of nested parallel regions |
omp_get_ancestor_thread_num | returns, for a given nested level of the current thread, the thread number of ancestor thread |
omp_get_team_size | returns, for a given nested level of the current thread, the size of the thread team |
omp_get_active_level | returns the number of nested, active parallel regions enclosing the task that contains the call |
omp_in_final | returns true if the routine is executed in the final task region; otherwise it returns false |
omp_init_lock | initializes a lock associated with the lock variable |
omp_destroy_lock | disassociates the given lock variable from any locks |
omp_set_lock | acquires ownership of a lock |
omp_unset_lock | releases a lock |
omp_test_lock | attempts to set a lock, but does not block if the lock is unavailable |
omp_init_nest_lock | initializes a nested lock associated with the lock variable |
omp_destroy_nest_lock | disassociates the given nested lock variable from any locks |
omp_set_nest_lock | acquires ownership of a nested lock |
omp_unset_nest_lock | releases a nested lock |
omp_test_nest_lock | attempts to set a nested lock, but does not block if the lock is unavailable |
omp_get_wtime | provides a portable wall clock timing routine |
omp_get_wtick | returns a double-precision floating point value equal to the number of seconds between successive clock ticks |
When using those runtime calls in C/C++ you need to import “omp.h”.
Challenge
Create a program that get the maximum number of threads and use “omp_set_num_threads” to create loops with increasing number of threads up to the limit imposed by omp_get_max_threads.
Where to go from here
The OpenMP webpage is OpenMP offers several resources. Several examples are available at OpenMP Github page.
This lesson is loosely based on a similar one from Lawrence Livermore National Laboratory
Key Points
First key point.