OpenMP and C++

Reap the Benefits of Multithreading without All the Work

Kang Su Gatlin and Pete Isensee

This article is based on a prerelease version of Visual C++ 2005. All information contained herein is subject to change.

This article discusses:
  • OpenMP features for multithreaded applications
  • OpenMP pragmas
  • OpenMP runtime routines
This article uses the following technologies:
Visual C++ 2005, OpenMP

Contents

Enabling OpenMP in Visual C++
Parallelism in OpenMP
OpenMP Constructs
Specifying Parallelism
Comparing OpenMP and Win32 Threads
Shared Data and Private Data
Non-Loop Parallelism
Synchronization Pragmas
Execution Environment Routines
Synchronization/Lock Routines
Parallel Execution of Data Structure Traversal
Advanced Scheduling Algorithms
When to Use OpenMP
Common Pitfalls When Using OpenMP in Apps
OpenMP and .NET

Among those in the parallel computation field, a common joke is "Parallel computing is the wave of the future...and always will be." This little joke has held true for decades. A similar sentiment was felt in the computer architecture community, where an end to processor clock speed growth seemed always on the horizon, but the clocks kept getting faster. The multicore revolution is the collision of this parallel community optimism and architecture community pessimism.

The major CPU vendors have shifted gears away from ramping up clock speeds to adding parallelism support on-chip with multicore processors. The idea is simple: put more than one CPU core on a single chip. This effectively makes a system with a processor with two cores operate just like a dual-processor computer, and a system with a processor with four cores operate like a quad-processor computer. This practice avoids many of the technological hurdles CPU vendors are encountering in boosting speeds, while still offering a better-performing processor.

So far this seems like pretty good news, but if your application does not take advantage of these multiple cores, it may not be able to operate any faster. This is where OpenMP comes into the picture. OpenMP helps C++ developers create multithreaded applications more quickly.

Tackling the topic of OpenMP in a single article is a daunting task, as it is a large and powerful API. Therefore, this article will just serve as an initial introduction, demonstrating how you can use the various features of OpenMP to write multithreaded applications quickly. If you'd like to take a look at further information on this topic, we recommend the surprisingly readable specification available from the OpenMP Web site.

Enabling OpenMP in Visual C++

The OpenMP standard was formulated in 1997 as an API for writing portable, multithreaded applications. It started as a Fortran-based standard, but later grew to include C and C++. The current version is OpenMP 2.0, and Visual C++® 2005 supports the full standard. OpenMP is also supported by the Xbox 360™ platform.

Before we go into the fun of playing with code, you'll need to know how to turn on the OpenMP capabilities of the compiler. Visual C++ 2005 provides a new /openmp compiler switch that enables the compiler to understand OpenMP directives. (You can also enable OpenMP directives in the property pages for your project. Click Configuration Properties, then C/C++, then Language, and modify the OpenMP Support property.) When the /openmp switch is invoked, the compiler defines the symbol _OPENMP, which may be used to detect when OpenMP is enabled using #ifdef _OPENMP.
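
If you want code that builds with or without OpenMP, a minimal sketch of compile-time detection might look like this (the message text is our own; _OPENMP expands to the yyyymm date of the supported spec):

#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>   // only required when calling OpenMP runtime routines
#endif

int main()
{
#ifdef _OPENMP
    printf("OpenMP enabled, spec date %d\n", _OPENMP);
#else
    printf("Compiled without /openmp, running serially\n");
#endif
    return 0;
}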

OpenMP is linked to applications through the import lib vcomp.lib. The corresponding runtime library is vcomp.dll. The debug versions of the import lib and runtime (vcompd.lib and vcompd.dll, respectively) have additional error messages that are emitted during certain illegal operations. Note that Visual C++ does not support static linking of the OpenMP runtime, although static linking is supported in the version for Xbox 360.

Parallelism in OpenMP

An OpenMP application begins with a single thread, the master thread. As the program executes, the application may encounter parallel regions in which the master thread creates thread teams (which include the master thread). At the end of a parallel region, the thread teams are parked and the master thread continues execution. From within a parallel region there can be nested parallel regions where each thread of the original parallel region becomes the master of its own thread team. Nested parallelism can continue to further nest other parallel regions.

Figure 1 OpenMP Parallel Sections

Figure 1 illustrates how OpenMP parallelism works. The left-most, yellow line is the master thread. This thread runs as a single-threaded application until it hits its first parallel region at point 1. At the parallel region the master thread creates a thread team (consisting of the yellow and orange lines), and these threads now all run concurrently in the parallel region.

At point 2, three of the four threads running in the parallel region create new thread teams (pink, green, and blue) in this nested parallel region. The yellow and orange threads that created the teams are the masters of their respective teams. Note that each thread could create a new team at a different point in time, or it may not encounter a nested parallel region at all.

At point 3, the nested parallel regions are finished. Each nested parallel region synchronizes with the other threads in the region, but observe that they do not synchronize across regions. Point 4 ends the first parallel region, and point 5 begins a new parallel region. In the new parallel region at point 5, the thread-local data for each thread is persisted from the previous parallel region.

Now that you have a basic understanding of the execution model, let's examine how you can actually begin making a parallel application.

OpenMP Constructs

OpenMP is easy to use and consists of only two basic constructs: pragmas and runtime routines. The OpenMP pragmas typically direct the compiler to parallelize sections of code. All OpenMP pragmas begin with #pragma omp. As with any pragma, these directives are ignored by compilers that do not support the feature (in this case, OpenMP).

OpenMP runtime routines are used primarily to set and retrieve information about the environment. There are also APIs for certain types of synchronization. In order to use the functions from the OpenMP runtime, the program must include the OpenMP header file omp.h. If the application is only using the pragmas, you can omit omp.h.

You add parallelism to an app with OpenMP by simply adding pragmas and, if needed, using a set of OpenMP function calls from the OpenMP runtime. These pragmas use the following form:

              #pragma omp <directive> [clause[ [,] clause]...]                          

The directives include the following: parallel, for, parallel for, section, sections, single, master, critical, flush, ordered, and atomic. These directives specify either work-sharing or synchronization constructs. We will cover the majority of these directives in this article.

The clauses are optional modifiers of the directives and affect the behavior of the directives. Each directive has a different set of clauses available to it, and five of the directives (master, critical, flush, ordered, and atomic) do not accept clauses at all.

Specifying Parallelism

Although there are many directives, it's easy to get started knowing just a few. The most common and important directive is parallel. This directive creates a parallel region for the dynamic extent of the structured block that follows the directive. For example:

#pragma omp parallel [clause[ [, ]clause] ...]
    structured-block

This directive tells the compiler that the structured block of code should be executed in parallel on multiple threads. Each thread will execute the same instruction stream, however not necessarily the same set of instructions. This will depend on control-flow statements such as if-else.

Here is an example using the obligatory "Hello World" program:

#pragma omp parallel
{
    printf("Hello World\n");
}

On a two-processor system you may expect output like this:

Hello World
Hello World

However, you may also get something like this:

              HellHell oo WorWlodrl d                          

This second output occurs because the two threads are running in parallel and are both attempting to print at the same time. Whenever more than one thread attempts to read or modify a shared resource (in this case the shared resource is the console window), there is potential for a race condition. These are nondeterministic bugs in your app and are often hard to find. It's the programmer's responsibility to make sure that these don't happen, usually by using locks or avoiding shared resources as much as possible.
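
As a preview of the synchronization pragmas covered later in this article, one possible way to keep each line intact is a minimal sketch like the following:

#pragma omp parallel
{
    // Only one thread at a time may execute this block, so each
    // "Hello World" prints without interleaving (at the cost of
    // serializing the printing itself).
    #pragma omp critical
    {
        printf("Hello World\n");
    }
}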

Let's look at an example that does something a bit more useful: it computes the average of two values from one array and puts the value into another. Here we introduce a new OpenMP construct, #pragma omp for. This is a work-sharing directive. Work-sharing directives do not create parallelism, but rather distribute the thread team in a logical way to implement the following control-flow construct. The #pragma omp for work-sharing directive tells OpenMP that the for loop shown in the following code, when called from a parallel region, should have its iterations divided among the thread team:

#pragma omp parallel
{
    #pragma omp for
    for(int i = 1; i < size; ++i)
        x[i] = (y[i-1] + y[i+1])/2;
}

In this case, if size had the value 100 and this were run on a computer with four processors, the iterations of the loop might get allocated such that processor 1 gets iterations 1-25, processor 2 gets iterations 26-50, processor 3 gets iterations 51-75, and processor 4 gets iterations 76-99. This assumes a scheduling policy called static scheduling. We'll discuss scheduling policies in more depth later in this article.

One thing that is not explicit in this program is a barrier synchronization at the end of the parallel region. All threads will block there until the last thread completes.

If the code shown previously did not use #pragma omp for, then each thread would execute the complete for loop, and as a result each thread would do redundant work:

#pragma omp parallel // probably not what was intended
{
    for(int i = 1; i < size; ++i)
        x[i] = (y[i-1] + y[i+1])/2;
}

Since parallel loops are the most common parallelizable work-sharing construct, OpenMP provides a shorthand way to write #pragma omp parallel followed by #pragma omp for:

#pragma omp parallel for
for(int i = 1; i < size; ++i)
    x[i] = (y[i-1] + y[i+1])/2;

Notice that there are no loop-carried dependencies. This means one iteration of the loop does not depend upon the results of another iteration of the loop. For example, the following two loops have two different loop-carried dependency properties:

for(int i = 1; i <= n; ++i)    // Loop (1)
    a[i] = a[i-1] + b[i];

for(int i = 0; i < n; ++i)     // Loop (2)
    x[i] = x[i+1] + b[i];

Parallelizing loop 1 is problematic because, in order to execute iteration i of the loop, you need to have the result of iteration i-1. There is a dependency from iteration i to i-1. Parallelizing loop 2 is also problematic, although for a different reason. In this loop you can compute the value of x[i] before computing the value of x[i-1], but in doing so you can no longer compute the value of x[i-1]. There is a dependency from iteration i-1 to i.

When parallelizing loops, you must make sure that your loop iterations do not have dependencies. When there are no loop-carried dependencies, the compiler can execute the loop in any order, even in parallel. This is an important requirement that the compiler does not check. You are effectively asserting to the compiler that the loop being parallelized does not contain loop-carried dependencies. If it turns out that the loop does have dependencies and you told the compiler to parallelize it, the compiler will do as it was told and the end result will be a bug.

Additionally, OpenMP does place restrictions on the form of for loops that are allowed inside of a #pragma omp for or #pragma omp parallel for block. The loops must have the following form:

for([integer type] i = loop invariant value;
    i {<,>,<=,>=} loop invariant value;
    i {+,-}= loop invariant value)

This is required so that OpenMP can determine, at entry to the loop, how many iterations the loop will perform.

Comparing OpenMP and Win32 Threads

It's instructive to compare the #pragma omp parallel for example shown earlier with what would be necessary when using the Windows® API to do threading. As you can see in Figure 2, there is a lot more code required to accomplish the same result, and there's even some magic taking place behind the scenes here. For example, the constructor for ThreadData is figuring out what start and stop should be for each thread invocation. OpenMP automatically handles all of these details, and additionally gives the programmer some more knobs to turn to configure their parallel regions and code.

Figure 2 Multithreading in Win32

class ThreadData
{
public:
    ThreadData(int threadNum); // initializes start and stop
    int start;
    int stop;
};

DWORD ThreadFn(void* passedInData)
{
    ThreadData *threadData = (ThreadData *)passedInData;
    for(int i = threadData->start; i < threadData->stop; ++i)
        x[i] = (y[i-1] + y[i+1]) / 2;
    return 0;
}

void ParallelFor()
{
    // Start thread teams
    for(int i=0; i < nTeams; ++i)
        ResumeThread(hTeams[i]);

    // ThreadFn implicitly called here on each thread

    // Wait for completion
    WaitForMultipleObjects(nTeams, hTeams, TRUE, INFINITE);
}

int main(int argc, char* argv[])
{
    // Create thread teams
    for(int i=0; i < nTeams; ++i)
    {
        ThreadData *threadData = new ThreadData(i);
        hTeams[i] = CreateThread(NULL, 0, ThreadFn, threadData,
              CREATE_SUSPENDED, NULL);
    }

    ParallelFor(); // simulate OpenMP parallel for

    // Clean up
    for(int i=0; i < nTeams; ++i)
        CloseHandle(hTeams[i]);
}

Shared Data and Private Data

In writing parallel programs, understanding which data is shared and which is private becomes very important, not only to performance, but also for correct operation. OpenMP makes this distinction apparent to the programmer, and it is something that you can tweak manually.

Shared variables are shared by all the threads in the thread team. Thus a change of the shared variable in one thread may become visible to another thread in the parallel region. Private variables, on the other hand, have private copies made for each thread in the thread team, so changes made in one thread are not visible in the private variables in the other threads.

By default, all the variables in a parallel region are shared, with three exceptions. First, in parallel for loops, the loop index is private. In the example in Figure 3, the i variable is private. The j variable is not private by default, but it's made private explicitly with the use of the firstprivate clause.

Figure 3 OpenMP Clauses and a Nested For Loop

float sum = 10.0f;
MatrixClass myMatrix;
int j = myMatrix.RowStart();
int i;
#pragma omp parallel
{
    #pragma omp for firstprivate(j) lastprivate(i) reduction(+: sum)
    for(i = 0; i < count; ++i)
    {
        int doubleI = 2 * i;
        for(; j < doubleI; ++j)
        {
            sum += myMatrix.GetElement(i, j);
        }
    }
}

Second, variables that are local to the block of the parallel region are private. In Figure 3, the variable doubleI is private because it is declared in the parallel region. Any non-static and non-member variables declared in myMatrix::GetElement will be private.

And third, any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private. In Figure 3, the variables i, j, and sum are made private to each thread in the thread team. This is done by making a distinct copy of each of these variables for each thread.

Each of the four clauses takes a list of variables, but their semantics are all different. The private clause says that each variable in the list should have a private copy made for each thread. This private copy will be initialized with its default value (using its default constructor where appropriate). For example, the default value for variables of type int is 0.

The firstprivate clause has the same semantics as private, except it copies the value of the variable before the parallel region into each thread, using the copy constructor where appropriate.

The lastprivate clause has the same semantics as private, except the last iteration or section of the work-sharing construct causes the values of the variables listed in the lastprivate clause to be assigned to the variable of the master thread. When appropriate, the copy assignment operator is used to copy objects.

The reduction clause has similar semantics to private, but it takes both a variable and an operator. The set of operators is limited to the operators listed in Figure 4, and the reduction variable must be a scalar variable (such as float, int, or long, but not std::vector, int [], and so on). The reduction variable is initialized to the value listed in the table for each thread. At the end of the code block, the reduction operator is applied to each of the private copies of the variable, as well as to the original value of the variable.

Figure 4 Reduction Operators

Reduction operator Initialized value (canonical value)
+ 0
* 1
- 0
& ~0 (all bits set)
| 0
^ 0
&& 1
|| 0

In the example in Figure 3, the sum value is implicitly initialized in each thread with the value 0.0f. (Note that the canonical value in the table is 0, which becomes 0.0f since the type of sum is float.) After the #pragma omp for block is completed, the threads apply the + operation to all the private sum values and the original value (the original value of sum in this example is 10.0f). The result is assigned to the original shared sum variable.
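
As a simpler illustration of the same mechanics, here is a minimal sketch that sums an array with a reduction clause (the data array and size variable are assumptions for the example):

float total = 0.0f;
#pragma omp parallel for reduction(+: total)
for(int i = 0; i < size; ++i)
{
    // Each thread accumulates into its own private copy of total,
    // initialized to 0.0f; the copies and the original value are
    // combined with + when the loop completes.
    total += data[i];
}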

Non-Loop Parallelism

OpenMP is typically used for loop-level parallelism, but it also supports function-level parallelism. This mechanism is called OpenMP sections. The structure of sections is straightforward and can be useful in many instances.

Consider one of the most important algorithms in computer science, the quicksort. The example used here is a simple recursive quicksort algorithm for a list of integers. For simplicity, we have not used a generic templatized version, but the same OpenMP ideas would apply. The code in Figure 5 shows the main quicksort function using sections (we've omitted the Partition function for the purpose of simplicity).

Figure 5 Quicksort with Parallel Sections

void QuickSort(int numList[], int nLower, int nUpper)
{
    if (nLower < nUpper)
    {
        // create partitions
        int nSplit = Partition(numList, nLower, nUpper);
        #pragma omp parallel sections
        {
            #pragma omp section
            QuickSort(numList, nLower, nSplit - 1);

            #pragma omp section
            QuickSort(numList, nSplit + 1, nUpper);
        }
    }
}

In this example, the first #pragma creates a parallel region of sections. Each section is preceded by the directive #pragma omp section. Each section in the parallel region is given to a single thread in the thread team, and all of the sections can be handled concurrently. Each parallel section calls QuickSort recursively.

As in the #pragma omp parallel for construct, you are responsible for ensuring that each section is independent of the other sections so they can be executed in parallel. If the sections update shared resources without synchronizing access to those resources, the results are undefined.

Note that this example uses the shorthand #pragma omp parallel sections in a manner analogous to #pragma omp parallel for. You can also use #pragma omp sections within a parallel region as a standalone directive, just as you can with #pragma omp for.

There are a few things to note about the implementation in Figure 5. First, the parallel sections are called recursively. Recursive calls are supported with parallel regions and, in this case, specifically with parallel sections. Also, with nesting enabled, new threads will be spawned as the program recursively calls QuickSort. This may or may not be what the application programmer desires, as it can result in quite a lot of threads. The program can disable nesting to bound the number of threads. With nesting disabled, this application will recursively call QuickSort on two threads, and will not spawn more than two threads, even recursively.

In addition, compiling this application without the /openmp switch will generate a perfectly correct serial implementation. One of the benefits of OpenMP is that it coexists with compilers that don't understand OpenMP.

Synchronization Pragmas

With multiple threads running concurrently, there are often times when it's necessary to have one thread synchronize with another thread. OpenMP supplies multiple types of synchronization to help in many different situations.

One type is implied barrier synchronization. At the end of each parallel region is an implied barrier synchronization of all threads within the parallel region. A barrier synchronization requires all threads to reach that point before any can proceed beyond it.

There is also an implied barrier synchronization at the end of each #pragma omp for, #pragma omp single, and #pragma omp sections block. To remove the implied barrier synchronization from these three types of work-sharing blocks, add the nowait clause:

#pragma omp parallel
{
    #pragma omp for nowait
    for(int i = 1; i < size; ++i)
        x[i] = (y[i-1] + y[i+1])/2;
}

As you can see, the nowait clause on the work-sharing directive indicates that the threads do not have to synchronize at the end of the for loop, although the threads will have to synchronize at the end of the parallel region.

Another type is explicit barrier synchronization. In some situations you'd like to place a barrier synchronization in a place besides the exit of a parallel region. This is placed in the code with #pragma omp barrier.

Critical sections provide another form of synchronization. In the Win32® API, a critical section is entered and exited through EnterCriticalSection and LeaveCriticalSection. OpenMP gives the same capability with #pragma omp critical [name]. This has the same semantics as the Win32 critical section, and EnterCriticalSection is used under the covers. You can use a named critical section, in which case the block of code is only mutually exclusive with respect to other critical sections with the same name (this applies across the whole process). If no name is specified, then the implementation maps the section to some unspecified internal name. These unnamed critical sections are all mutually exclusive regions with respect to each other.
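
For instance, here is a minimal sketch of two named critical sections (the counters are assumptions for the example); threads updating hits do not block threads updating misses because the names differ:

int hits = 0, misses = 0;   // shared counters, assumed for this example
#pragma omp parallel
{
    #pragma omp critical (UpdateHits)
    {
        ++hits;
    }

    #pragma omp critical (UpdateMisses)
    {
        ++misses;
    }
}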

In a parallel region, there are often sections of code where limiting access to a single thread is desired, such as when writing to a file in the middle of a parallel region. In many of these cases it does not matter which thread executes this code, as long as it is just one thread. OpenMP has #pragma omp single to do this.

Sometimes specifying single-threaded execution for a section of code in a parallel region with the single directive is not sufficient. On occasion you may want to ensure that the master thread is the thread that executes the section of code, for example, if the master thread is the GUI thread and you need the GUI thread to do some work. #pragma omp master does this. Unlike single, there is no implied barrier at entry or exit of a master block.
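
Here is a minimal sketch contrasting the two directives (the printf messages are our own):

#pragma omp parallel
{
    // Exactly one thread (whichever gets here first) executes this
    // block; the others wait at the implied barrier that follows it.
    #pragma omp single
    {
        printf("One-time setup done by a single thread\n");
    }

    // Only the master thread executes this block; the other threads
    // skip past it immediately since master has no implied barrier.
    #pragma omp master
    {
        printf("Status reported by the master thread\n");
    }
}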

Memory fences are implemented with #pragma omp flush. This directive creates a memory fence in the program, which is effectively equivalent to the _ReadWriteBarrier intrinsic.

Note that the OpenMP pragmas must be encountered by all threads in the thread team in the same order (or none at all). Thus, the following code snippet is illegal and has undefined runtime behavior (crashing or hanging would not be out of the ordinary in this particular situation):

#pragma omp parallel
{
    if(omp_get_thread_num() > 3)
    {
        #pragma omp single    // May not be encountered by all threads
        x++;
    }
}

Execution Environment Routines

Along with the pragmas discussed earlier, OpenMP also has a set of runtime routines that are useful for writing OpenMP applications. There are three broad classes of routines available: execution environment routines, lock/synchronization routines, and timing routines. (The timing routines are not discussed in this article.) All of the OpenMP runtime routines begin with omp_ and are defined in the header file omp.h.

Functionality provided by the execution environment routines allows you to query or set various aspects of the operating environment for OpenMP. Functions that begin with omp_set_ may only be called outside of parallel regions. All other functions can be used in both parallel and non-parallel regions.

In order to get or set the number of threads in the thread team, use the routines omp_get_num_threads and omp_set_num_threads. omp_get_num_threads will return the number of threads in the current thread team. If the calling thread is not in a parallel region, it will return 1. omp_set_num_threads will set the number of threads that should be used for the next parallel region that the thread will encounter.

But that's not the whole story when it comes to setting the number of threads. The number of threads used for parallel regions also depends on two other aspects of the OpenMP environment: dynamic threads and nesting.

Dynamic threading is a Boolean property that is disabled by default. If this property is disabled when a thread encounters a parallel region, the OpenMP runtime creates the thread team with the number of threads specified by omp_get_max_threads. omp_get_max_threads by default is set to the number of hardware threads on the computer or the value of the OMP_NUM_THREADS environment variable. If dynamic threading is enabled, the OpenMP runtime will create the thread team with a variable number of threads that does not exceed omp_get_max_threads.

Nesting is another Boolean property that is disabled by default. Nesting of parallel regions occurs when a thread that is already in a parallel region encounters another parallel region. If nesting is enabled, then a new thread team will be formed with the rules specified in the previous dynamic threads discussion. If nesting is not enabled, then the thread team is formed with just that single thread.

The status of dynamic threads and nesting is set and queried with the runtime routines omp_set_dynamic, omp_get_dynamic, omp_set_nested, and omp_get_nested. Each thread can also query its environment. A thread can find out what its thread number is from within the thread team with the omp_get_thread_num call. This is not the Windows thread ID, but rather a value from 0 to one less than omp_get_num_threads.

A thread can find out if it is currently executing in a parallel region with omp_in_parallel. And with omp_get_num_procs, a thread can find out how many processors are in the computer.
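
To illustrate, here is a minimal sketch that prints what these query routines report from outside and inside a parallel region:

#include <stdio.h>
#include <omp.h>

int main()
{
    printf("Processors: %d, in parallel region? %d\n",
           omp_get_num_procs(), omp_in_parallel());

    #pragma omp parallel
    {
        // Each thread reports its team-relative number, 0 through
        // omp_get_num_threads() - 1.
        printf("Thread %d of %d, in parallel region? %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_in_parallel());
    }
    return 0;
}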

To help make some of the interactions between the various environment routines more clear, you should take a look at the code in Figure 6. In this example, there are four distinct parallel regions, with two nested parallel regions.

Figure 6 Using OpenMP Environment Routines

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_dynamic(1);
    omp_set_num_threads(10);
    #pragma omp parallel        // parallel region 1
    {
        #pragma omp single
        printf("Num threads in dynamic region is = %d\n",
         omp_get_num_threads());
    }
    printf("\n");

    omp_set_dynamic(0);
    omp_set_num_threads(10);
    #pragma omp parallel        // parallel region 2
    {
        #pragma omp single
        printf("Num threads in non-dynamic region is = %d\n",
         omp_get_num_threads());
    }
    printf("\n");

    omp_set_dynamic(1);
    omp_set_num_threads(10);
    #pragma omp parallel        // parallel region 3
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("Num threads in nesting disabled region is = %d\n",
              omp_get_num_threads());
        }
    }
    printf("\n");

    omp_set_nested(1);
    #pragma omp parallel        // parallel region 4
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("Num threads in nested region is = %d\n",
              omp_get_num_threads());
        }
    }
}

Running on a typical dual-processor computer, a program compiled using Visual Studio 2005 prints the following from this code:

Num threads in dynamic region is = 2

Num threads in non-dynamic region is = 10

Num threads in nesting disabled region is = 1
Num threads in nesting disabled region is = 1

Num threads in nested region is = 2
Num threads in nested region is = 2

The first parallel region enabled dynamic threads and set the number of threads to 10. Looking at the output of the program, you can see that with dynamic threads enabled the OpenMP runtime decided to allocate just two threads to the thread team, since the computer has two processors. In the second parallel region, the OpenMP runtime allocated 10 threads to the thread team because dynamic threads were not enabled.

In the third and fourth parallel regions, you see the effect of having nesting enabled or disabled. In parallel region 3, we disabled nesting, and in this case no new threads were allocated for the nested parallel regions. Thus there was a total of two threads for both the outer and nested parallel regions. In parallel region 4, where we enabled nesting, the nested parallel region created a thread team with two threads (a total of four threads in the nested parallel region). This process of doubling threads for each nested parallel region can continue until you run out of stack space. In practice, you can generate several hundred threads, though the overhead can easily outweigh any performance benefits.

As you may have noticed, for parallel regions 3 and 4, dynamic threading was enabled. What if the code were executed with dynamic threads disabled, as shown here:

omp_set_dynamic(0);
omp_set_nested(1);
omp_set_num_threads(10);
#pragma omp parallel
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("Num threads in nested region is = %d\n",
         omp_get_num_threads());
    }
}

Here you can see the expected behavior. The first parallel region started with a thread team of 10 threads. The subsequent nested parallel region started with 10 threads for each of the 10 threads in the enclosing parallel region. Thus a total of 100 threads executed within the nested parallel region:

Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10
Num threads in nested region is = 10

Synchronization/Lock Routines

OpenMP has runtime routines to assist in synchronization of code. There are two lock types, simple and nestable, each of which can be in one of three states: uninitialized, locked, and unlocked.

Simple locks (omp_lock_t) cannot be locked more than once, even by the same thread. Nestable locks (omp_nest_lock_t) are identical to simple locks, except when a thread attempts to set a lock that it already holds, it will not block. In addition, nestable locks are reference counted and keep track of how many times they have been set.

There are synchronization routines that act on these locks. For each routine there is a simple lock and a nestable lock variant. There are five actions that can be performed on the lock: initialize, set (acquire lock), unset (release lock), test, and destroy. These all map very closely to Win32 critical section routines; in fact OpenMP is implemented as a thin wrapper on top of these routines. Figure 7 shows the mapping from the OpenMP routines to the Win32 routines.

Figure 7 OpenMP Lock Routines and Win32

OpenMP Simple Lock OpenMP Nested Lock Win32 Routine
omp_lock_t omp_nest_lock_t CRITICAL_SECTION
omp_init_lock omp_init_nest_lock InitializeCriticalSection
omp_destroy_lock omp_destroy_nest_lock DeleteCriticalSection
omp_set_lock omp_set_nest_lock EnterCriticalSection
omp_unset_lock omp_unset_nest_lock LeaveCriticalSection
omp_test_lock omp_test_nest_lock TryEnterCriticalSection

Developers can choose to use either the synchronization runtime routines or the synchronization pragmas. The advantage of the pragmas is that they are extremely structured. This can make them easier to understand, and it makes it easier to look at the program to determine the location of the entrance and exit of your synchronized regions.

The advantage of the runtime routines is their flexibility. You can pass a lock to a different function and inside of this other function the lock can be set or unset. This is not possible with the pragmas. In general it makes sense to use synchronization pragmas unless you need a level of flexibility that is only available with the runtime routines.
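
For example, here is a minimal sketch of passing a simple lock into a helper function (the function and variable names are our own):

#include <omp.h>

void AddSample(omp_lock_t* lock, int* total, int value)
{
    omp_set_lock(lock);      // acquire the simple lock
    *total += value;         // update the shared total
    omp_unset_lock(lock);    // release the lock
}

int main()
{
    omp_lock_t lock;
    int total = 0;
    omp_init_lock(&lock);

    #pragma omp parallel for
    for(int i = 0; i < 1000; ++i)
        AddSample(&lock, &total, i);

    omp_destroy_lock(&lock);
    return 0;
}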

Parallel Execution of Data Structure Traversal

The code in Figure 8 shows two examples of executing a loop in parallel where the runtime does not know the number of iterations it will perform when it first encounters the loop. The first example is the traversal of an STL std::vector container, and the second is a standard linked list.

Figure 8 Handling Variable Iterations at Run Time

#pragma omp parallel
{
    // Traversing an STL vector in parallel
    std::vector<int>::iterator iter;
    for(iter = xVect.begin(); iter != xVect.end(); ++iter)
    {
        #pragma omp single nowait
        {
            process1(*iter);
        }
    }

    // Walking a standard linked-list in parallel
    for(LList *listWalk = listHead; listWalk != NULL;
            listWalk = listWalk->next)
    {
        #pragma omp single nowait
        {
            process2(listWalk);
        }
    }
}

For the STL container portion of the example, every thread in the thread team executes the for loop and each thread has its own copy of the iterator. But only one thread per iteration enters the single block (those are the semantics of single). The OpenMP runtime handles all of the magic that ensures that the single block is executed once and only once for each iteration. There is considerable runtime overhead to this method of iteration, so it's only useful when there's a lot of work happening in the process function. The linked list example follows the same logic.

It's worth pointing out that for the STL vector example we can rewrite the code to use std::vector::size to determine the number of iterations needed prior to entering the loop, which then puts the loop in the canonical OpenMP for loop form. This is demonstrated in the following code:

#pragma omp parallel for
for(int i = 0; i < xVect.size(); ++i)
    process(xVect[i]);

Because this method has much lower OpenMP runtime overhead, we recommend it for arrays, vectors, and any other container that can be walked in canonical OpenMP for loop form.

Advanced Scheduling Algorithms

By default, OpenMP parallelized for loops are scheduled with an algorithm called static scheduling. This means that each thread in the thread team is given an equal number of iterations. If there are n iterations and T threads in the thread team, each thread will get n/T iterations (and yes, OpenMP correctly handles the case when n is not evenly divisible by T). But OpenMP also offers a few other scheduling mechanisms which are appropriate for many situations: dynamic, runtime, and guided.

To specify one of these other scheduling mechanisms use the schedule clause with the #pragma omp for or #pragma omp parallel for directives. The schedule clause has the form:

              schedule(schedule-algorithm[, chunk-size])                          

Here are some examples of the pragma in use:

#pragma omp parallel for schedule(dynamic, 15)
for(int i = 0; i < 100; ++i)
{
...

#pragma omp parallel
    #pragma omp for schedule(guided)

Dynamic scheduling schedules each thread to execute the number of iterations specified by chunk-size (if chunk-size is not given it defaults to 1). After a thread has finished executing the iterations given to it, it requests another set of chunk-size iterations. This continues until all of the iterations are completed. The last set of iterations may be less than chunk-size.

Guided scheduling schedules each thread to execute a number of iterations given by this formula:

iterations_to_do = max(iterations_not_assigned/omp_get_num_threads(),
                       chunk-size)

After a thread has finished executing the iterations given to it, the thread requests another set of iterations based on the iterations_to_do formula. Thus the number of iterations assigned to each thread decreases over time. The last set of iterations scheduled may be less than the value defined by the iterations_to_do formula.

Here you can see the process by which four threads might get scheduled for a for loop with 100 iterations using #pragma omp for schedule(dynamic, 15):

Thread 0 gets iterations 1-15
Thread 1 gets iterations 16-30
Thread 2 gets iterations 31-45
Thread 3 gets iterations 46-60
Thread 2 finishes
Thread 2 gets iterations 61-75
Thread 3 finishes
Thread 3 gets iterations 76-90
Thread 0 finishes
Thread 0 gets iterations 91-100

Here is how four threads might get scheduled for a for loop with 100 iterations using #pragma omp for schedule(guided, 15):

Thread 0 gets iterations 1-25
Thread 1 gets iterations 26-44
Thread 2 gets iterations 45-59
Thread 3 gets iterations 60-64
Thread 2 finishes
Thread 2 gets iterations 65-79
Thread 3 finishes
Thread 3 gets iterations 80-94
Thread 2 finishes
Thread 2 gets iterations 95-100

Dynamic scheduling and guided scheduling are both perfect mechanisms for those situations where each iteration has variable amounts of work or where some processors are faster than others. With static scheduling there is no way to load-balance the iterations. Dynamic and guided scheduling automatically load-balance the iterations, by the very nature of how they work. Typically, guided scheduling performs better than dynamic scheduling due to less overhead associated with scheduling.

The final scheduling algorithm, runtime scheduling, really isn't a scheduling algorithm at all, but rather a way to dynamically select among the three previously mentioned scheduling algorithms. When runtime is specified in the schedule clause the OpenMP runtime uses the scheduling algorithm specified in the OMP_SCHEDULE environment variable for this particular for loop. The format for the OMP_SCHEDULE environment variable is type[,chunk size]. For example:

set OMP_SCHEDULE=dynamic,8

Using runtime scheduling gives the end user some flexibility in determining the type of scheduling used, with the default being static scheduling.
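
For instance, here is a minimal sketch of a loop whose schedule is deferred to the environment (the work function is a placeholder of our own):

// Schedule chosen at run time from OMP_SCHEDULE,
// for example: set OMP_SCHEDULE=guided,4
#pragma omp parallel for schedule(runtime)
for(int i = 0; i < 1000; ++i)
    DoVariableAmountOfWork(i);   // hypothetical work function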

When to Use OpenMP

Knowing when to use OpenMP is just as important as knowing how to use it. In general we find the following guidelines quite useful in making the determination.

Target platform is multicore or multiprocessor In these types of cases, if the application is saturating a core or processor, making it into a multithreaded application with OpenMP will almost certainly increase the application's performance.

Application is cross-platform OpenMP is a cross-platform and widely supported API. And since the API is implemented through pragmas, the application can still compile even on a compiler that has no knowledge of the OpenMP standard.

Parallelizable loops OpenMP is at its best parallelizing loops. If the application has loops which have no loop-carried dependencies, using OpenMP is an ideal choice.

Last-minute optimizations needed Because OpenMP does not require re-architecting the application, it is the perfect tool for making small surgical changes to get incremental performance improvements.

With all that said, OpenMP isn't designed to handle every multithreading problem. The roots of OpenMP show some of its biases. It was developed for use by the high-performance computing community and it works best in programming styles that have loop-heavy code working on shared arrays of data.

Just as creating a native thread has a cost, creating OpenMP parallel regions has a cost. In order for OpenMP to give a performance gain, the speedup given by the parallelized region must outweigh the cost of thread team startup. The Visual C++ implementation creates the thread team when encountering the first parallel region. After that, the thread team is parked until it's needed again. Under the covers OpenMP uses the Windows thread pool. Figure 9 shows the speedup using OpenMP on a two-processor box for various numbers of iterations, using the simple example program shown at the beginning of this article. Performance peaked at almost 1.7x, which is typical on a two-processor system. (To better indicate the performance difference, the y axis shows a ratio scale of serial performance to parallel performance.) Notice that it takes about 5,000 iterations before the parallel version is running as fast as the sequential version, but this is just about the worst-case scenario. Most parallelized loops will be faster than the sequential version in many fewer iterations. It simply depends on the amount of work that each iteration does. However, this graph shows why it's important to measure performance. Using OpenMP doesn't guarantee that performance will improve.

Figure 9 Serial vs. Parallelized Performance on Two Processors

The OpenMP pragmas, while easy to use, don't provide great feedback when an error occurs. If you're writing a mission-critical application that needs to be able to detect faults and gracefully recover from them, then OpenMP is probably not the right tool for you (at least not the current incarnation of OpenMP). For example, if OpenMP cannot create threads for parallel regions or cannot create a critical section, then the behavior is undefined. With Visual C++ 2005 the OpenMP runtime continues trying for a period of time before eventually bailing out. One of the things we plan to pursue for future versions of OpenMP is a standard error-reporting mechanism.

Another place where one should tread carefully is when using Windows threads along with OpenMP threads. OpenMP threads are built on top of Windows threads, so they execute just fine in the same process. The problem is that OpenMP has no knowledge of Windows threads that it does not create. This creates two problems: the OpenMP runtime does not keep the other Windows threads as part of its "count," and the OpenMP synchronization routines do not synchronize the Windows threads because they're not part of thread teams.

Common Pitfalls When Using OpenMP in Apps

While OpenMP is about as easy as it gets for adding parallelism to your application, there are still a few things to be aware of. The index variable in the outermost parallel for loop is privatized, but index variables for nested for loops are shared by default. When you have nested loops, you usually want the inner-loop index variables to be private. Use the private clause to specify these variables.
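
Here is a minimal sketch of that fix (the matrix, rows, and cols names are assumptions for the example):

int i, j;   // declared outside the loops, so j would be shared by default
// Without private(j), every thread would share one copy of the inner
// index j and the threads would interfere with one another.
#pragma omp parallel for private(j)
for(i = 0; i < rows; ++i)
    for(j = 0; j < cols; ++j)
        matrix[i][j] = 0;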

OpenMP applications should use care when throwing C++ exceptions. Specifically, when an application throws an exception in a parallel region, it must be handled in the same parallel region by the same thread. In other words, the exception should not escape the region. As a general rule, if there can be an exception in a parallel region, there should be a catch for it. If the exception isn't caught in the parallel region where it was thrown, the application will typically crash.
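
A minimal sketch of the pattern, assuming a DoWork function that may throw:

#pragma omp parallel for
for(int i = 0; i < size; ++i)
{
    try
    {
        DoWork(i);   // hypothetical function that may throw
    }
    catch(const std::exception& e)
    {
        // Handle or at least record the failure here; the exception
        // must not propagate out of the parallel region.
        printf("Iteration %d failed: %s\n", i, e.what());
    }
}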

The #pragma omp <directive> [clause] statement must end with a new-line, not a brace that opens a structured block. Directives that end with a brace will result in a compiler error:

// Bad Bracket
#pragma omp parallel {
    // won't compile
    Code
}

// Good Bracket
#pragma omp parallel
{
    // Code
}

Debugging OpenMP apps with Visual Studio 2005 can sometimes be complicated. In particular, stepping into and/or out of a parallel region with F10/F11 is a less than stellar experience. This is because the compiler creates some additional code to call into the runtime and to invoke the thread teams. The debugger has no special knowledge about this, so the programmer sees what looks like odd behavior. We recommend putting a breakpoint in the parallel region and using F5 to reach it. To exit the parallel region, set a breakpoint outside of the parallel region, and then hit F5.

From inside a parallel region the debugger "Threads Window" will show the various threads that are running in the thread team. The thread IDs will not correlate with the OpenMP thread IDs; rather, they will reflect the Windows thread IDs that OpenMP is built on top of.

OpenMP is currently not supported with Profile Guided Optimization. Fortunately, since OpenMP is pragma-based, you can try compiling your application with /openmp or with PGO and decide which method gives you the best performance improvement.

OpenMP and .NET

High-performance computing and .NET may not be two words that immediately come to mind together, but Visual C++ 2005 makes some strides in this area. One of the things we did was to make OpenMP work with managed C++ code. To do this we made the /openmp switch compatible with both /clr and /clr:oldSyntax. This means you can use OpenMP to get parallelism from methods in .NET types, which are garbage collected. Please note that /openmp is not currently compatible with /clr:safe or /clr:pure; however, we plan to make them compatible in the future.

There is one noteworthy restriction associated with using OpenMP and managed code. An application that uses OpenMP must only be used in a single application domain program. If another application domain is loaded into a process with the OpenMP runtime already loaded, the application may abort.

OpenMP is a simple yet powerful technology for parallelizing applications. It provides ways to parallelize data processing loops as well as functional blocks of code. It can be integrated easily into existing applications and turned on or off simply by throwing a compiler switch. OpenMP is an easy way to tap into the processing power of multicore CPUs. Again, we strongly suggest reading the OpenMP spec for more details. Have fun multithreading.

Kang Su Gatlin is a Program Manager on the Visual C++ team at Microsoft where he spends most of his day trying to figure out systematic ways to make programs run faster. Prior to his life at Microsoft, he worked on high-performance and grid computing.

Pete Isensee is the Development Director for the Xbox Advanced Technology Group at Microsoft. He has worked in the game industry for 12 years and speaks regularly at conferences about optimization and performance.