Cluster-enabled Omni OpenMP on the software distributed shared
memory system SCASH.
Current status, restrictions, known problems, and bugs
Current Status
As of Omni 1.3, this implementation is out of beta.
The home nodes of large array objects are allocated in "block"
distribution. We have extended the directives to specify data mapping
and loop scheduling. For these extensions, see
"Omni extensions for Cluster-enabled OpenMP/SCASH".
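For illustration, assuming 4 nodes and a hypothetical array "a", the
"block" distribution homes equal contiguous blocks of the array on
each node:

/* double a[1000] homed in "block" distribution over 4 nodes:
 *   node 0: a[0..249]     node 1: a[250..499]
 *   node 2: a[500..749]   node 3: a[750..999]
 */
double a[1000];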
The SCASH system in SCore 3.3.2 supports SMP clusters, which consist
of shared-memory multiprocessor nodes.
Multiple processors in an SMP node can share data through the shared
memory hardware.
For example, suppose the host group "smpc" has 4 dual-processor
nodes:
% scout -g smpc
% scrun -nodes=4x2 a.out
"-nodes=4x2" specifies using 2 processors on each of the 4 nodes.
Omni/SCASH can correctly execute OpenMP programs that use barrier
synchronization, as is typical of data-parallel programs.
The following problems still remain:
- On SMP nodes, a flush directive in a thread running independently
may fail to preserve the last value written by that processor,
because other processors in the same SMP node may update the page
with values fetched from different nodes.
This problem is due to the fact that Linux provides no way to
write-protect pages against other processors in the same SMP node.
This may cause serious problems for OpenMP programs running in the
SPMD model.
We are currently considering a new synchronization primitive specific
to Omni/SCASH to solve the above problem and to provide fast
synchronization in the SCASH environment.
Restrictions and Known Problems in the SCASH Implementation
Although Omni OpenMP for SCASH is almost compatible with OpenMP for
SMPs, there are a few restrictions and incompatibilities.
- Variables declared in external libraries must be declared as
"threadprivate". For example in C, "stderr" in the C standard I/O
library must be declared as "threadprivate".
Global variables without "threadprivate" are re-allocated in the
shared address space in SCASH.
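For example, a minimal sketch in C (the print statement is only for
illustration):

#include <stdio.h>
#include <omp.h>

/* Keep the C library's stderr handle node-local instead of letting
 * SCASH re-allocate it in the shared address space. */
#pragma omp threadprivate(stderr)

main()
{
#pragma omp parallel
    fprintf(stderr, "hello from thread %d\n", omp_get_thread_num());
}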
- System calls (such as read/write) that directly access the SCASH
shared address space may cause segmentation faults. To prevent
this problem, touch the uninitialized shared memory space before
calling the system call (for example, use "bzero" to initialize it).
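A minimal sketch of this workaround (the buffer name and its size are
hypothetical; the buffer is assumed to lie in the SCASH shared
address space):

#include <unistd.h>     /* read */
#include <strings.h>    /* bzero */

char buf[65536];        /* global variable, allocated in shared space */

ssize_t read_into_shared(int fd)
{
    /* Touch (initialize) every page of the buffer before the system
     * call, so the pages are mapped when the kernel writes into them. */
    bzero(buf, sizeof(buf));
    return read(fd, buf, sizeof(buf));
}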
- In SCASH, modified data is updated explicitly at a memory
synchronization point such as a barrier.
For example, the following code may cause deadlock.
#include <omp.h>

int counter = 0;

main()
{
#pragma omp parallel
    {
        int node_id = omp_get_thread_num();
        if (node_id != 0) {
            while (counter == 0) {
#pragma omp flush
            }
        } else {
            counter = 1;
        }
        /* "counter" is not updated until the barrier below.
         * The threads waiting for the update of "counter" wait
         * forever, resulting in deadlock.
         */
#pragma omp barrier
    }
}
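Since modified data becomes consistent only at the barrier, a working
version lets every thread pass a barrier before reading the flag. A
minimal sketch:

int counter = 0;

main()
{
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            counter = 1;
#pragma omp barrier   /* modified data is made consistent here */
        /* every thread now observes counter == 1 */
    }
}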
- Only local variables declared in the function in which a
parallel region is defined can be shared.
For example in C, the following code works correctly:
foo(){
    int x;   /* local variable, which can be shared by the parallel region */
    ...
#pragma omp parallel
    {
        .. = x;   /* OK */
    }
}
But the following code may cause an error:
foo(){
    int x;   /* local variable, passed to another function by address */
    ...
    goo(&x);
}

goo(int *p){
#pragma omp parallel
    {
        .. = *p;   /* NG: the parallel region is not in the function
                    * where 'x' is declared */
    }
}
The variable 'x' may be declared as "static" to be shared.
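A sketch of this workaround applied to the example above:

foo(){
    static int x;   /* statically allocated, so it is placed in the
                     * shared address space */
    ...
    goo(&x);        /* *p in goo() now refers to shared memory: OK */
}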
- If a local variable is shared, it is copied into the
shared memory space at the beginning of the parallel region.
When the variable is large, this copy may cause a large overhead.
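For example, in the following sketch (the array size is
hypothetical), the whole array is copied at every entry to the
parallel region:

foo(){
    double big[100000];   /* copied into the shared memory space at
                           * the beginning of the parallel region */
#pragma omp parallel
    {
        .. = big[0];
    }
}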
- Heap memory dynamically allocated by "malloc" is not
shared. Use "ompsm_galloc" to allocate memory in the shared memory
space. For the detailed interface, see
Omni/SCASH shared memory allocator "ompsm_galloc".
- Large local variables may cause an error because of the
limited stack size in SCASH.
- In a cluster environment, I/O occurs independently on each node.
File pointers are not shared, unlike in an SMP environment.
- The number of processors is specified via "scrun", not by
"OMP_NUM_THREADS".
For further details, refer also to
"Implementation Notes".