http://pdplab.trc.rwcp.or.jp/pdperf/Omni/ for details.
Currently, Omni/ST is available on SPARC, MIPS, and i386 architecture machines. Omni/ST assumes that the StackThreads compiler `stgcc' is already available on the platform. For the platforms supported by StackThreads, check http://www.yl.is.s.u-tokyo.ac.jp/sthreads/.
There are two steps. First, install the StackThreads/MP library separately. Second, enable Omni/ST when you build Omni.
Download the StackThreads/MP library from
http://www.yl.is.s.u-tokyo.ac.jp/sthreads/ and install it. See the documentation that comes with the software for details.
Before you proceed, make sure the command `stgcc' is in your path.
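For example, on a Unix shell you can check this with `which' (the installation path shown is only an illustration):

% which stgcc
/usr/local/bin/stgcc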
Add the option `--enable-stackThreads' when you run ./configure for Omni.
% configure --enable-stackThreads other_options ...
When Omni/ST is enabled, you can compile your programs both with and without Omni/ST. To compile a program with Omni/ST, add the option `-omniconfig=st' to the compiler command line.
% omcc -omniconfig=st your_program.c
This links your program with StackThreads/MP and the runtime library that calls StackThreads/MP.
Without "-omniconfig=st" option, the default Omni runtime library is linked and the executable is identical to the case where Omni/ST is disabled (i.e., without --enable-stackThreads).
In this way, a single source can be compiled in two ways.
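For example, the same source file can be built in both configurations; the first command links the Omni/ST runtime, the second the default runtime:

% omcc -omniconfig=st your_program.c
% omcc your_program.c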
Omni/ST creates a fixed number of underlying threads (LWPs). OpenMP-level threads are dynamically mapped onto this fixed number of LWPs. For example, when you set OMPC_NUM_PROCS=10 and your program creates 100,000 threads, only 10 LWPs are created (using the underlying thread package, such as Pthreads), and the 100,000 threads are `dynamically' mapped onto the 10 LWPs. Therefore, when you observe the number of threads used by your OpenMP program with the `top' command, you will see that the number of threads is (close to) 10, no matter how many logical threads are created.
To maximize CPU utilization, OpenMP-level threads migrate between LWPs when an LWP runs out of threads. This way, Omni/ST tries to fill LWPs with work (i.e., threads) as much as possible.
Omni/ST is primarily useful for programs with nested parallelism, such as those using parallel recursion. For such programs, Omni/ST generally exhibits better speedup than the default Omni. For programs that make no use of nested parallelism, the penalty is generally less than 10%.
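As an illustration, here is a minimal sketch of such a parallel recursion. The fib example, the file name, and the run line (which assumes an sh-compatible shell) are only for illustration; the -omniconfig=st option and OMPC_NUM_PROCS follow the description above.

/* fib.c: a parallel recursion that creates many logical threads */
#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int a, b;
    if (n < 2)
        return n;
    /* each recursive call opens a nested parallel region with two sections */
#pragma omp parallel sections
    {
#pragma omp section
        a = fib(n - 1);
#pragma omp section
        b = fib(n - 2);
    }
    return a + b;
}

int main()
{
    omp_set_nested(1);      /* make sure nested parallel regions are not serialized */
    omp_set_num_threads(2); /* two logical threads per parallel region */
    printf("fib(20) = %d\n", fib(20));
    return 0;
}

% omcc -omniconfig=st fib.c
% OMPC_NUM_PROCS=4 ./a.out

Although the recursion creates thousands of logical threads, only four LWPs are created, and the logical threads are dynamically mapped onto them.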
We try to make Omni/ST as `transparent' as possible, in the sense that programs written with the default Omni execution model in mind simply run as fast as, or faster than, with the default Omni. There are, however, some circumstances where Omni/ST-specific tips are necessary to use it effectively. Below, we describe these situations and the suggested programming style for each.
Problem description: Speedup may be worse than it should be when a program does not call the Omni runtime library for a long time.
Consider the following code, for example.
int main()
{
    omp_set_num_threads(4);
#pragma omp parallel sections
    {
#pragma omp section
        some_work();
#pragma omp section
        small_work();
#pragma omp section
        small_work();
#pragma omp section
        small_work();
    }
}

void some_work()
{
#pragma omp parallel sections
    {
#pragma omp section
        long_code();
#pragma omp section
        something();
#pragma omp section
        something();
    }
}
Suppose there are four LWPs (OMPC_NUM_PROCS=4). The main LWP creates four threads, one of which executes "some_work()" and each of the other three executes "small_work()", which we assume finishes soon.
"some_work()" then creates three threads, one of which executes "long_code()", which we assume contains a long library-free code. The other two threads perform "something()", and they should migrate to one of the other LWPs.
The problem is that if "long_code()" does not perform any polling, the other LWPs cannot obtain the "something()" threads. Currently, polling is done only inside the runtime library, so this situation arises whenever "long_code()" contains no runtime library calls.
The runtime library is called both explicitly and implicitly; explicit calls are OpenMP library functions such as omp_in_parallel and omp_get_dynamic, while implicit calls are made at important events, such as the beginning and end of a parallel section. Therefore, it is often unnecessary to call the runtime library explicitly. A rule of thumb is: do not write a long computation kernel inside a parallel section; if you must have one, insert an otherwise useless library call (e.g., omp_in_parallel), as in the sketch below.
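For example, a long, library-free kernel can be given occasional polling points. In this sketch, N, heavy_iteration(), and the interval of 1000 iterations are made up for illustration:

void long_code()
{
    int i;
    for (i = 0; i < N; i++) {
        heavy_iteration(i);    /* long computation with no runtime library calls */
        if (i % 1000 == 0)
            omp_in_parallel(); /* otherwise useless call; gives the runtime a chance to poll */
    }
}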
This limitation will be eliminated in a future release by automatically inserting polling code into the compiled code.
Consider the following simple program:
int main()
{
    omp_set_num_threads(4);
#pragma omp parallel
    {
        ...;
#pragma omp for
        for (i = 0; i < n; i++) {
            work(i);
        }
    }
}
At the beginning of the above "parallel" directive, four threads are created by the main LWP, and they dynamically migrate to other LWPs, if there are such LWPs. Note that the runtime system does not know whether there are such LWPs. Therefore, the following adaptive algorithm is used to distribute threads.
The procedure distribute_threads_to_LWPs(x, y), which tries to distribute threads x, ..., y - 1 to LWPs, is the following.
/* try to distribute threads x ... y - 1 to LWPs */
distribute_threads_to_LWPs(x, y)
{
    if (y - x == 1) {
        /* only one thread left: start working */
        do work for `x';
    } else {
        c = (x + y) / 2;
        wait a while for a thread migration request to come;
        if (a request appears) {
            let the requesting LWP do distribute_threads_to_LWPs(c, y);
            distribute_threads_to_LWPs(x, c);
        } else {
            /* no request: start working */
            do work for x, ..., y - 1;
        }
    }
}
This algorithm tends to assign a thread promptly to each LWP when there are (at least) OMP_NUM_THREADS LWPs at the beginning. Otherwise, (at least) one LWP fails to find a request from other LWPs and simply goes ahead, scheduling multiple threads by itself. Such threads may be migrated later.
The problem is that even when each thread is assigned to an LWP, the LWPs may start working at different times. This implies that the LWPs reach the beginning of the "for" construct at different times. As a consequence, even if the amount of work is perfectly balanced by `static' scheduling, the LWPs finish working at different times.
The workaround is to use some form of dynamic scheduling; guided scheduling is especially recommended.
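For example, the loop from the previous sketch can be given a guided schedule:

#pragma omp parallel
{
    ...;
#pragma omp for schedule(guided)
    for (i = 0; i < n; i++) {
        work(i);
    }
}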
Also, when a parallel region contains only a single work-sharing loop, such as

#pragma omp parallel
{
#pragma omp for
    for (i = 0; i < n; i++) {
        work(i);
    }
}

write it as follows instead.

#pragma omp parallel for
for (i = 0; i < n; i++) {
    work(i);
}
The first form performs two barriers, whereas the second performs only one.
When using the default Omni runtime library, setting OMP_NUM_THREADS smaller than OMPC_NUM_PROCS supports a limited form of nested parallelism. That is, in the outermost parallel construct, some LWPs are `reserved' for future parallel constructs.
This is unnecessary in Omni/ST because of dynamic thread migration. Simply set OMP_NUM_THREADS equal to OMPC_NUM_PROCS.
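For example (assuming an sh-compatible shell; the value 8 is arbitrary):

% OMPC_NUM_PROCS=8 OMP_NUM_THREADS=8 ./a.out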