http://pdplab.trc.rwcp.or.jp/pdperf/Omni/ for details.
Currently, Omni/ST is available on SPARC, MIPS, and i386 architecture machines. Omni/ST assumes that the StackThreads compiler `stgcc' is already available on the platform. For the platforms supported by StackThreads, check http://www.yl.is.s.u-tokyo.ac.jp/sthreads/.
There are two steps. First, install the StackThreads/MP library separately. Second, enable Omni/ST when you build Omni.
Download the StackThreads/MP library from
http://www.yl.is.s.u-tokyo.ac.jp/sthreads/ and install it. See the documentation that comes with the software for details.
Before you proceed, make sure the command `stgcc' is in your path.
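For example, on a Unix shell you can check this with `which' (the installation path shown is only an illustration):

% which stgcc
/usr/local/bin/stgcc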
Add the option `--enable-stackThreads' when you run ./configure for Omni.
% configure --enable-stackThreads other_options ...
When Omni/ST is enabled, you can compile your programs both with and without Omni/ST. To compile a program with Omni/ST, add the option `-omniconfig=st' to the compiler command line.
% omcc -omniconfig=st your_program.c
This links your program with StackThreads/MP and the runtime library that calls StackThreads/MP.
Without "-omniconfig=st" option, the default Omni runtime library is linked and the executable is identical to the case where Omni/ST is disabled (i.e., without --enable-stackThreads).
In this way, a single source can be compiled in two ways.
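For example, the same source file can be built in both configurations; the first command links the Omni/ST runtime, the second the default runtime:

% omcc -omniconfig=st your_program.c
% omcc your_program.c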
Omni/ST creates a fixed number of underlying threads (LWPs). OpenMP-level threads are dynamically mapped onto this fixed number of LWPs. For example, when you set OMPC_NUM_PROCS=10 and your program creates 100,000 threads, only 10 LWPs are created (using the underlying thread package, such as Pthreads), and the 100,000 threads are `dynamically' mapped onto the 10 LWPs. Therefore, when you observe the number of threads used by your OpenMP program with the `top' command, you will see that the number of threads is (close to) 10, no matter how many logical threads are created.
To maximize CPU utilization, OpenMP-level threads migrate between LWPs when an LWP runs out of threads. This way, Omni/ST tries to fill LWPs with work (i.e., threads) as much as possible.
Omni/ST is primarily useful for programs with nested parallelism, such as those using parallel recursion. For such programs, Omni/ST generally exhibits better speedup than the default Omni. For programs that make no use of nested parallelism, the penalty is generally less than 10%.
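As an illustration, here is a minimal sketch of such a parallel recursion. The fib example, the file name, and the run line (which assumes an sh-compatible shell) are only for illustration; the -omniconfig=st option and OMPC_NUM_PROCS follow the description above.

/* fib.c: a parallel recursion that creates many logical threads */
#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int a, b;
    if (n < 2)
        return n;
    /* each recursive call opens a nested parallel region with two sections */
#pragma omp parallel sections
    {
#pragma omp section
        a = fib(n - 1);
#pragma omp section
        b = fib(n - 2);
    }
    return a + b;
}

int main()
{
    omp_set_nested(1);      /* make sure nested parallel regions are not serialized */
    omp_set_num_threads(2); /* two logical threads per parallel region */
    printf("fib(20) = %d\n", fib(20));
    return 0;
}

% omcc -omniconfig=st fib.c
% OMPC_NUM_PROCS=4 ./a.out

Although the recursion creates thousands of logical threads, only four LWPs are created, and the logical threads are dynamically mapped onto them.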
We try to make Omni/ST as `transparent' as possible, in the sense that programs written with the default Omni execution model in mind simply run as fast as, or faster than, with the default Omni. There are, however, some circumstances where Omni/ST-specific tips are necessary to use it effectively. Below, we describe these situations and the suggested programming style for each.
Problem description: Speedup may be worse than it should be when a program does not call the Omni runtime library for a long time.
Consider the following code, for example.
int main()
{
    omp_set_num_threads(4);
#pragma omp parallel sections
    {
#pragma omp section
        some_work();
#pragma omp section
        small_work();
#pragma omp section
        small_work();
#pragma omp section
        small_work();
    }
}

void some_work()
{
#pragma omp parallel sections
    {
#pragma omp section
        long_code();
#pragma omp section
        something();
#pragma omp section
        something();
    }
}
Suppose there are four LWPs (OMPC_NUM_PROCS=4). The main LWP creates four threads, one of which executes "some_work()" and each of the other three executes "small_work()", which we assume finishes soon.
"some_work()" then creates three threads, one of which executes "long_code()", which we assume contains a long library-free code. The other two threads perform "something()", and they should migrate to one of the other LWPs.
The problem is that if "long_code()" does not perform any polling, the other LWPs cannot obtain the "something()" threads. Currently, polling is done only inside the runtime library, so this situation arises whenever "long_code()" contains no runtime library calls.
The runtime library is called both explicitly and implicitly; explicit calls are OpenMP library functions such as omp_in_parallel and omp_get_dynamic, while implicit calls are made at important events, such as the beginning and end of a parallel section. Therefore, it is often unnecessary to call the runtime library explicitly. A rule of thumb is: do not write a long computation kernel inside a parallel section; if you must have one, insert an otherwise useless library call (e.g., omp_in_parallel), as in the sketch below.
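For example, a long, library-free kernel can be given occasional polling points. In this sketch, N, heavy_iteration(), and the interval of 1000 iterations are made up for illustration:

void long_code()
{
    int i;
    for (i = 0; i < N; i++) {
        heavy_iteration(i);    /* long computation with no runtime library calls */
        if (i % 1000 == 0)
            omp_in_parallel(); /* otherwise useless call; gives the runtime a chance to poll */
    }
}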
This limitation will be eliminated in a future release by automatically inserting polling code into the compiled code.
Consider the following simple program:
int main()
{
    omp_set_num_threads(4);
#pragma omp parallel
    {
        ...;
#pragma omp for
        for (i = 0; i < n; i++) {
            work(i);
        }
    }
}
At the beginning of the above "parallel" directive, four threads are created by the main LWP, and they dynamically migrate to other LWPs, if there are such LWPs. Note that the runtime system does not know whether there are such LWPs. Therefore, the following adaptive algorithm is used to distribute threads.
The procedure distribute_threads_to_LWPs(x, y), which tries to distribute threads x, ..., y - 1 to LWPs, is the following.
/* try to distribute threads x ... y - 1 to LWPs */
distribute_threads_to_LWPs(x, y)
{
    if (y - x == 1) {
        /* only one thread left: start working */
        do work for `x';
    } else {
        c = (x + y) / 2;
        wait a while for a thread migration request to come;
        if (a request appears) {
            let the requesting LWP do distribute_threads_to_LWPs(c, y);
            distribute_threads_to_LWPs(x, c);
        } else {
            /* no request: start working */
            do work for x, ..., y - 1;
        }
    }
}
This algorithm tends to assign a thread promptly to each LWP when there are (at least) OMP_NUM_THREADS LWPs at the beginning. Otherwise, (at least) one LWP fails to find a request from other LWPs and simply goes ahead, scheduling multiple threads by itself. Such threads may be migrated later.
The problem is that even when each thread is assigned to an LWP, the LWPs may start working at different times. This implies that the LWPs reach the beginning of the "for" construct at different times. As a consequence, even if the amount of work is perfectly balanced by `static' scheduling, the LWPs finish working at different times.
The workaround is to use some form of dynamic scheduling; guided scheduling is especially recommended.
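For example, the loop from the previous sketch can be given a guided schedule:

#pragma omp parallel
{
    ...;
#pragma omp for schedule(guided)
    for (i = 0; i < n; i++) {
        work(i);
    }
}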
Also, when a parallel region contains only a single work-sharing loop, such as

#pragma omp parallel
{
#pragma omp for
    for (i = 0; i < n; i++) {
        work(i);
    }
}

write it as follows instead.

#pragma omp parallel for
for (i = 0; i < n; i++) {
    work(i);
}
The first form performs two barriers, whereas the second performs only one.
When using the default Omni runtime library, setting OMP_NUM_THREADS smaller than OMPC_NUM_PROCS supports a limited form of nested parallelism. That is, in the outermost parallel construct, some LWPs are `reserved' for future parallel constructs.
This is unnecessary in Omni/ST because of dynamic thread migration. Simply set OMP_NUM_THREADS equal to OMPC_NUM_PROCS.
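For example (assuming an sh-compatible shell; the value 8 is arbitrary):

% OMPC_NUM_PROCS=8 OMP_NUM_THREADS=8 ./a.out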