Related pages: SPMD introduction and SPMD library.
Because of its extreme simplicity, SGI's C$DOACROSS offers very limited functionality that can easily be implemented by means of a small library on top of MPI, together with a parser or preprocessor that translates the directives in a source program into library calls. This library can also be called directly by the user; it facilitates MPI programming and eliminates the need to debug MPI calls. OpenMP has more features but, like DOACROSS, was not designed for high-level communications. There is a need to define an SPMD directive standard that allows a transparent implementation with communication optimizations for different platforms, using either a high-level library and parser on top of MPI or a lower-level system-specific library in combination with a compiler. The latter could resemble OpenMP, but explicit communications must be included to obtain the best performance on systems with distributed memory.
DOACROSS's functionality allows loop threading with hidden communications, the definition of locals/globals, plus reductions. The only luxury beyond static loop scheduling is a dynamic one in which the user can define a chunk size. In SPMD applications every CPU activated in a cluster (e.g. mpirun -np 4) already has a copy of all variables and arrays; the user only needs to take care of data initializations and communications. The latter must be implemented such that all possibilities are covered. The only requirements are:
The last point is already far from trivial, because using all CPUs in all loops can make a program less efficient (granularity, communication costs). The user should be aware of communication costs; in the future we can develop an intelligent parser that analyzes loops and the necessary communications in order to optimize the number of CPUs to be used.
This document is a first version that targets basic functionality which, with the existing SPMDlib and a few trivial extensions, allows us to test a first parser implementation and to gain practical experience. To simplify the parser implementation it would be preferable to include "implicit none" in Fortran code. Feedback from users is required to fine-tune the functionality and/or to include other features that most users would appreciate or consider absolutely necessary. However, it should be stressed that data-parallel SPMD applications with a large granularity will be targeted in the future. The obvious reason is that, with increasing CPU performance but lagging interconnects, this is where the best efficiency can be obtained.
Below we describe the directives that can be used with the SPMDlib. All directives are typed in capitals and, where necessary, library routines are referred to by the spmd_ prefix. The text is written for novice users, hence it could serve as the very small manual referred to above, although it contains explanations which are not strictly necessary.
MPI is an implementation of the SPMD paradigm in which each CPU in a
cluster starts executing the same program, but each CPU has a unique
identifier (here called spmd_myid) that can be tested:
if (spmd_myid.eq.3) then
...
cpu 3 does something
...
else if ...
such that each CPU can do a part of a job (a DO loop) or even completely
different things. When the program is started (mpirun), each CPU starts with
the same variables and arrays. If different CPUs are doing different things,
sooner or later data must be exchanged. Data communications are rather
expensive: every millisecond spent on communications on a 4-CPU PIII at 450
MHz implies the loss of 1.2 MFLOP. Hence, the user must structure the program
such that the granularity (grain size) is large: many and heavy computations,
with communications only when absolutely necessary, while also maximizing the
parallelism of the application (Amdahl's law).
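The id-based branching above can be emulated serially, which helps to reason about SPMD control flow before running anything under mpirun. The following Python sketch is purely illustrative; the function name and messages are invented and nothing in it is part of SPMDlib or MPI:

```python
# A serial emulation of the SPMD model: every "CPU" runs the same
# function and branches on its unique id (spmd_myid in the text).
# mpirun -np 4 would start four real copies of the same program.

def spmd_program(myid, nprocs):
    """The same program for every CPU; behaviour depends only on myid."""
    if myid == 3:
        return "cpu 3 does something special"
    return "cpu %d does its share of the common job" % myid

# Emulate mpirun -np 4 with a simple loop over the ids 0..3.
results = [spmd_program(myid, 4) for myid in range(4)]
```
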
The C$SPMD directive has been designed such that the user can parallelize code with a minimum of effort (user productivity). The main focus is on (a) DO loops, (b) independent serial code blocks that can also contain parallel DO loops, (c) scalar and array communications and (d) reductions, also both scalar and array. All CPUs contribute equally to the work to be done (well, there is a way to specify the number of CPUs to be used), but one CPU called "root" will play a special role in cases where data need to be distributed or assembled. Communications include root-to-all, all-to-root and all-to-all, plus two reductions: reduction-on-root and reduction-on-all.

Because in most cases all CPUs will take an equal share of a loop (over arrays, i.e. will process equal parts of arrays) and the parser knows about array types as well as dimensions, it is not necessary to include these in the directive clauses. The only "luxury" is the possibility to use different numbers of CPUs for different loops and arrays (but this is already an advanced feature that should not be used in normal applications).

A future version based on an intelligent parser will assist the user in optimizing the number of CPUs to be used (workload and communication overheads). Another future "luxury" might be dynamic scheduling of DO loops by specifying a "chunksize" (this is not trivial without using a compiler and CPU interrupts; however, it could be done at the SPMDlib level by appointing one CPU as "master" to distribute chunks; this CPU would not contribute to the work, and this solution might only be efficient when a sufficient number of CPUs is available).
Limitations:
At the moment (for a first parser implementation) we apply the following
restrictions:
(1) Arrays in subroutines that are going to be processed in parallel should be
passed with all complete dimensions or be declared with explicit dimensions.
(2) Such subroutines must be called by all CPUs.
(3) Parallel DO loops must be of the type: do [label] (integer scalar)=1,nloop
where nloop is a multiple of the number of CPUs to be used, and the loop start
and stride are always one. In addition, nloop must be specified explicitly as
an integer number, or it can be an array dimension given in a type (real etc.)
or dimension declaration, or in a parameter statement.
(4) Parallel DO loops over 2D/3D arrays should always be over the last dimension,
i.e. be the outer loop of the nested ones.
(5) At the moment only 1D, 2D and 3D array communications are supported, and
array indices must start at one - example arr(1,1). Scalar and array types:
integer, real, double and complex (no reductions for double and complex; this
can change!).
(6) Only two independent serial code blocks (that can each contain one or more
parallel DO loops) can be parallelized, but this can be done several times.
(7) Because the library uses subroutine and scalar names prefixed with
spmd_, the user is strongly advised not to use the same prefix.
(8) All directives and clauses should be given in uppercase and no blanks are
allowed (only scalars and arrays can be specified in lowercase).
(9) Although it is possible to mix SPMDlib with normal MPI calls, the user should
not do this unless he understands the library.
Directives start in the first column and have the form
C$SPMD clause[, clause] ... in which the clauses are executed from left to right (depending on the context). Clauses can be continued over multiple lines, but each line must start with C$SPMD and clauses are separated by a comma and a blank. No blanks are allowed inside clauses. Comments can be put after an !, in which case no further clauses after the ! will be interpreted. C$SPMD lines should not be "mixed" with C$DOACROSS or C$OMP directives; the latter are allowed in the program but will be ignored. Directives can be "commented out" by putting an extra C: CC$SPMD lines will be ignored.
INIT                  initialize SPMDlib and MPI
END                   end SPMDlib and MPI
BARRIER               explicit CPU synchronization
RONLY                 root only
TIME and TTIME        time and total time
* PARBLOCK            parallel code blocks
DOPAR                 parallel DO loop
R2A[variable list]    root-to-all
A2R[variable list]    all-to-root
A2A[variable list]    all-to-all
RoR[variable list]    reduction-on-root
RoA[variable list]    reduction-on-all

The following abbreviations and combinations are used in some clauses:
* TCPUS=n the total number of CPUs to be used
* NCPUS=n the number of CPUs to be used in DO loops etc
E:arrayname entire array
P:arrayname parts of array
P(X=n):arrayname exchange n elements/lines/planes of 1/2/3D arrays
P(XC=n):arrayname cyclic exchange
SUM:variable type of scalar or array reduction
also: MAX, MIN, etc
IN=variable input scalar or array in reduction
OUT=variable output scalar or array in reduction
General comments:
Please keep in mind that a parser/preprocessor does not execute the code, i.e. it does not have access to runtime values. The user is responsible for the parallelization, although in some cases the parser may detect an inconsistency and print a warning. The code should be as straightforward as possible and advanced features like NCPUS should be used only when absolutely necessary.
Below the basic clauses are explained in more detail, together with the SPMDlib routines invoked (this is not necessary for the final manual!).
INIT
Calls spmd_init(ierr) that provides predefined spmd_myid, spmd_nprocs,
spmd_mystart and spmd_myend (all integers), as well as prefixed scalars for
timing purposes (normal reals). Determines spmd_nprocs (4 with mpirun -np 4) and
spmd_myid (0 to nprocs-1). Root gets spmd_myid 0 (this is not necessary but
based on common sense). Must be included right at the start or where the
CPUs need to start executing different things.
END
Calls spmd_end(ierr). Must be included before stop/end or where all CPUs cease
executing different things.
BARRIER
Calls spmd_barrier(ierr) and must be used where an explicit CPU synchronization
is required, for example when root only (see RONLY below) needs a significant CPU
time. Note that all communication and reduction clauses cause an implicit CPU
synchronization because of the synchronous mpi_send and mpi_recv used in the
library. If BARRIER is used together with a closing RONLY, the program will not
be deadlocked because it will be executed after the RONLY code (the parser will
detect a BARRIER inside two RONLYs and will print a warning).
RONLY
To define regions in which Root-ONLY is active, for example to prepare or output
data (read/write a file, print); the other CPUs will skip this region and
continue the execution of the subsequent code until they encounter a barrier or
a communication instruction. RONLY must be inserted before and after the code
lines (or only once for timing, see TIME! below), and the pair translates into
only an if-endif:
C$SPMD RONLY
call readfile(...,array,...)
call normalize(...,array,...)
C$SPMD RONLY, R2A[P:array] !no barrier necessary: R2A does an implicit synchronization
becomes
if (spmd_myid.eq.0) then
call readfile(...,array,...)
call normalize(...,array,...)
end if
call spmd_r1DPtoall(array,dim)
If the CPU time is large and the region not immediately followed by e.g. an R2A
communication directive, the second RONLY can be combined with BARRIER.
TIME
TTIME
To be inserted pairwise: calls spmd_time() before and after a code block to be
timed. TTIME (Total-TIME) is for timing a larger code block whose subblocks are
timed with TIME. Both should be combined with RONLY and result in the
predefined normal reals spmd_rtime and spmd_rttime. The timing starts and stops
on the line with RONLY. Both TIME and TTIME can be used several times, but the
sequence will always imply start/stop/start/stop, although TIME and TTIME are
independent. A directive with RONLY, TIME! or RONLY, TTIME! does not start or
end a normal RONLY region; it is only used for timing purposes (and the ! does
not start a comment). Examples:
C$SPMD RONLY, TIME! !does not start an RONLY region
...
code block to be timed
...
C$SPMD RONLY, TIME
print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY
-------------------------------------------------------
C$SPMD RONLY
root does some work
C$SPMD RONLY, TIME !timer starts here
...
code block to be timed
...
C$SPMD RONLY, TIME !timer stops here
root does something
print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY
-------------------------------------------------------
C$SPMD RONLY, TIME!, TTIME!
...
C$SPMD RONLY, TIME
print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY
...
...
C$SPMD RONLY, TTIME
print*,'Large block Y took ',spmd_rttime,' seconds'
C$SPMD RONLY
DOPAR
C$SPMD DOPAR
do i=1,nloop
The parser will only change the code to
call spmd_split(nloop)
do i=spmd_mystart,spmd_myend
and the spmd_split routine does, apart from checking that nloop is a multiple
of spmd_nprocs, only
spmd_mypart = nloop/spmd_nprocs
spmd_mystart = spmd_myid*spmd_mypart +1
spmd_myend = spmd_mystart+spmd_mypart -1
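The same splitting logic can be transcribed directly, here as a Python sketch for experimentation. The function mirrors the three Fortran lines above; how the real spmd_split reports a violated multiple-of-CPUs restriction is not specified here, so the ValueError is an assumption:

```python
# Python transcription of the spmd_split logic shown above:
# 1-based, contiguous, equal-sized chunks per CPU.

def spmd_split(nloop, myid, nprocs):
    """Return the (mystart, myend) range of CPU myid for a loop do i=1,nloop."""
    if nloop % nprocs != 0:
        # The library checks this too; raising here is our own choice.
        raise ValueError("nloop must be a multiple of the number of CPUs")
    mypart = nloop // nprocs
    mystart = myid * mypart + 1
    myend = mystart + mypart - 1
    return mystart, myend

# With 4 CPUs and nloop = 200 the ranges tile 1..200 without overlap.
ranges = [spmd_split(200, myid, 4) for myid in range(4)]
```
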
The communication routines are a bit more complicated but still quite trivial.
Communications
All communications (not reductions) are done by R2A (root-to-all), A2R (all-to-root)
and A2A (all-to-all), and must be done explicitly indicating scalars, entire arrays
or parts of arrays. Below P means parts and E means entire. Remember that all CPUs
have the same scalars and the same entire arrays.
Assuming 4 CPUs and a DO loop over a 2D array (in the case of 2D and 3D
arrays always the outer loop over the last dimension in Fortran):
real arr(100,200)
C$SPMD DOPAR
do iy=1,200
do ix=1,100
arr(ix,iy)=...
end do
end do
C$SPMD A2R[P:arr]
In this case the 4 CPUs with ids 0 to 3, 0 being root, have a copy of the entire
array but will do each only 1/4th:
1 ix 100
|------------| |------------| |------------| |------------| 1
| | | | | | | |
| XX | | | | | | |
| | | | | | | |
|------------| |------------| |------------| |------------|
| | | | | | | |
| | | XX | | | | |
| | | | | | | |
|------------| |------------| |------------| |------------| iy
| | | | | | | |
| | | | | XX | | |
| | | | | | | |
|------------| |------------| |------------| |------------|
| | | | | | | |
| | | | | | | XX |
| | | | | | | |
|------------| |------------| |------------| |------------| 200
CPU 0 CPU 1 CPU 2 CPU 3
after which communications can be done: not necessarily directly after the last
enddo; this can be done later depending on the requirements, e.g. all CPUs could
continue using only their 1/4th of the array. In the example above (A2R) CPUs
1, 2 and 3 send their part to root (0). If all CPUs need a copy of the entire
array, A2R[P:arr] needs to be substituted by A2A[P:arr], which is for large
arrays more efficient than doing: A2R[P:arr], R2A[E:arr].
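A serial sketch may clarify what A2R[P:arr] does with these quarters. Here each "CPU" is just a full-length Python list in which only that CPU's contiguous part has been computed; the function name and data layout are illustrative, not the library's implementation:

```python
# Serial sketch of A2R[P:arr]: every CPU holds a full-length copy but
# has computed only its own contiguous part; root keeps its own part
# and overwrites the rest with the parts from CPUs 1..nprocs-1.

def a2r_parts(copies, nprocs):
    """copies[myid] is CPU myid's full-length array; return root's assembly."""
    part = len(copies[0]) // nprocs
    root = list(copies[0])
    for myid in range(1, nprocs):
        start = myid * part          # 0-based offset of that CPU's part
        root[start:start + part] = copies[myid][start:start + part]
    return root

# 4 CPUs, a length-8 array: CPU i has filled only its quarter (with i).
copies = [[i if start <= k < start + 2 else None for k in range(8)]
          for i, start in enumerate(range(0, 8, 2))]
assembled = a2r_parts(copies, 4)
```
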
R2A[varlist;P:list;E:list]
Means Root-To-All; varlist is a list of scalars and "list" a list of arrays,
separated by commas. The clause may also contain only scalars, only one P list
or only one E list. Varlist, P and E can be given in any order, but the scalars
must be grouped together (the clause should contain two ;s). Example:
C$SPMD R2A[E:array3;val1,val2,val3;P:array1,array2]
Notes: val1/2/3 can be of different types, and neither the types nor the
dimensions of the arrays are specified. If there are 4 CPUs, root will send
three quarters of array1 and array2 to the other 3 CPUs (each will get only its
1/4th). All CPUs will get an entire copy of array3 as well as the scalars
val1/2/3.
Note: for neighbourhood operations in which each CPU is processing e.g. one quarter
of an array, but each needs to have access to e.g. 2 more elements/lines/planes of
1/2/3D arrays that are processed by its neighbours, the P can be extended by an
X (for eXchange) and a C (cyclic). Cyclic means that CPU 0 gets a neighbourhood from
CPU 1 but also from CPU 3. This is visualized in a special Section below.
For preparing data before cyclic neighbourhood processing use
C$SPMD R2A[P(XC=2):arrayname]
in which instead of "2" it is possible to put an integer scalar.
Root will send the normal array parts plus the additional neighborhoods.
If the same array partitioning is used repeatedly, for example in iterative
spatial filtering in which each CPU will always keep its quarter of an array but
before starting a new iteration it needs again 2 elements/lines/planes from its
neighbours, you must use A2A[P(X=2):arr] or A2A[P(XC=2):arr]. A2A is explained below.
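The halo bookkeeping behind P(X=n) and P(XC=n) can be sketched as follows, assuming the equal contiguous partitioning used throughout this document. The function name and return form are illustrative, not the library's API:

```python
# Which extra 1-based lines a CPU receives for P(X=n) / P(XC=n),
# given the equal contiguous partitioning over the last dimension.

def halo_lines(myid, nprocs, nlines, n, cyclic):
    """Return (lines before my part, lines after my part) for an n-line halo."""
    part = nlines // nprocs
    mystart = myid * part + 1
    myend = mystart + part - 1
    before = list(range(mystart - n, mystart))
    after = list(range(myend + 1, myend + n + 1))
    if cyclic:  # XC: wrap around, so CPU 0 and the last CPU are neighbours
        wrap = lambda i: (i - 1) % nlines + 1
        return [wrap(i) for i in before], [wrap(i) for i in after]
    # X: the end CPUs get a halo on one side only
    return [i for i in before if i >= 1], [i for i in after if i <= nlines]
```

For the 200-line example with 4 CPUs and n=2, CPU 0 receives lines 51-52 in the non-cyclic case, plus lines 199-200 from CPU 3 in the cyclic case.
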
A2R[P:list]
Means All-To-Root; only P is allowed, and the list can only be a list of
arrays: all CPUs have processed a part of an array and root assembles these
parts after its own part. All other all-to-root operations are reductions, see
RoR and RoA below.
A2A[P:...]
Means All-To-All. There are only two uses involving only arrays: (1) the one above
for only eXchanging elem/lines/planes for repeated neighbourhood operations with
P(X=2):array or P(XC=2):array, in which case no normal parts will be sent, and (2)
each CPU has e.g. one quarter of an array but each needs the entire array
assembled from all parts (P without X or XC). The latter is more efficient than
A2R[P:array] followed by R2A[E:array], but this depends on the implementation.
Reductions can be done with only two clauses:
RoR[MAX:ival,rval;SUM:sum1,sum2,array]
Means Reduction-on-Root (only root gets a global maximum etc, the other CPUs will
continue with their local maxima). The SUM of array is done elementwise involving
the entire arrays on all CPUs, i.e. the result is an entire array in which an
element is the sum of all elements at the same position. In the same way you can use:
RoA[MAX:ival,rval;SUM:sum1,sum2,array]
Means Reduction-on-All (all CPUs will continue with the global maxima and sums etc).
The only complication occurs when all CPUs are using the same working arrays (or scalars) and the result should be stored in another array (or scalar). This can be done using for example RoR[SUM:(IN=arr1,OUT=arr2)], in which arr1 and arr2 have the same type and dimensions (they could also be scalars of the same type). In the case of a SUM reduction the user should initialize arr2 to zero on root (using RONLY), unless the array needs other values. Similarly, RoA using IN and OUT requires an initialization to zero of the output array on all CPUs (not using RONLY). In the case of MAX, MIN etc. reductions with IN and OUT scalars or arrays, the OUT should be properly initialized. All this depends on the implementation!
Reduction types: SUM, MAX, MIN, AND, OR ... (to be defined, also data types). Perhaps the only ones really necessary are scalar MAX and MIN, as well as scalar and array SUM. Data types: at least normal integers and reals.
Reductions can be combined with other communication clauses.
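The two reduction clauses can be sketched serially as follows. The elementwise treatment of arrays matches the SUM description above, while the function itself is only an illustration (RoR and RoA differ only in which CPUs receive the result):

```python
# Serial sketch of RoR/RoA: combine each CPU's local value into one
# global result. Scalars reduce directly; lists reduce elementwise,
# as described for the SUM of arrays.

def reduce_on_root(op, locals_):
    """locals_[myid] is CPU myid's local scalar or list; return the reduction."""
    ops = {"SUM": lambda a, b: a + b, "MAX": max, "MIN": min}
    f = ops[op]
    result = locals_[0]
    for v in locals_[1:]:
        if isinstance(result, list):
            result = [f(x, y) for x, y in zip(result, v)]
        else:
            result = f(result, v)
    return result
```
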
It is rather straightforward to program more complex reductions. For example, if
we want to determine the maximum of a real 1D array as well as its position, we can
apply, assuming 4 CPUs:
real tmp(2,4),maxpos(2)
C$SPMD DOPAR
do i=1,200
if (...) then
tmp(1,spmd_myid+1)=rmax !use the predefined myid of SPMDlib
tmp(2,spmd_myid+1)=real(i)
end if
end do
C$SPMD A2R[P:tmp]
C$SPMD RONLY
do j=1,4
test tmp array to determine rmax and pos
end do
maxpos(1)=rmax
maxpos(2)=pos
C$SPMD RONLY
C$SPMD R2A[E:maxpos] !depends whether all CPUs need this
imax=nint(maxpos(2))
This works but is extremely expensive in terms of communication overhead. It
should be done only when absolutely necessary and when a lot of serious
computation can be included in the DOPAR loop; but this applies to all
communications. The user should, once again, be aware of communication times.
There are two options: (1) assemble the entire array on root (expensive), after
which root alone does the reduction (cheap), or (2) do not assemble the entire
array and let each CPU reduce locally (cheap), after which the communications
as given in the example above must be done (expensive). The best solution
depends on the array size, and it is best to experiment a bit.
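Option (2) of the max-with-position example can be sketched serially: each CPU reduces only its own 1-based loop range, and root then tests the assembled pairs, just as the tmp(2,4) array is tested above. The function names here are invented for illustration:

```python
# Sketch of the "reduce locally, combine on root" option for finding
# the maximum of an array together with its (1-based) position.

def local_max_with_pos(values, mystart, myend):
    """Local maximum and its global 1-based position within [mystart, myend]."""
    best, pos = values[mystart - 1], mystart
    for i in range(mystart, myend + 1):
        if values[i - 1] > best:
            best, pos = values[i - 1], i
    return best, pos

def root_combine(pairs):
    """Root's test of the assembled pairs: keep the pair with the largest value."""
    return max(pairs, key=lambda p: p[0])

# 4 CPUs, a 200-element array whose maximum sits at position 123 (CPU 2's part).
values = [0.0] * 200
values[122] = 99.0
pairs = [local_max_with_pos(values, 50 * m + 1, 50 * m + 50) for m in range(4)]
```
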
We will assume 4 CPUs and a DO loop over a 2D array (the example given above).
In normal cases each CPU needs either an entire array or only its part (uses only
its part but has the entire array!). There are situations in which each CPU needs
its own part but also a small neighborhood of its neighbors, for example in CFD
with finite elements on a regular grid or signal/image processing. Suppose that
root reads an image file into a 2D array (C$SPMD RONLY) and sends three quarters
to CPUs 1, 2 and 3 with R2A[P:arr]. If it is necessary to run a filter with a
size of 5x5 over the image, each CPU needs also two lines before and two lines
after its own part (CPUs 0 and 3 need only one side if the processing is
non-cyclic, see below for the cyclic case). If there are more filter iterations,
only these lines need to be exchanged between all iterations. Suppose we have done
already one iteration and must do another one. If we label the neighborhood lines
with a, b, c etc, we have after the first iteration
1 ix 100
|------------| |------------| |------------| |------------| 1
|aaaaaaaaaaaa| | | | | | |
| XX | | | | | | |
|bbbbbbbbbbbb| | | | | | |
|------------| |------------| |------------| |------------|
| | |cccccccccccc| | | | |
| | | XX | | | | |
| | |dddddddddddd| | | | |
|------------| |------------| |------------| |------------| iy
| | | | |eeeeeeeeeeee| | |
| | | | | XX | | |
| | | | |ffffffffffff| | |
|------------| |------------| |------------| |------------|
| | | | | | |gggggggggggg|
| | | | | | | XX |
| | | | | | |hhhhhhhhhhhh|
|------------| |------------| |------------| |------------| 200
CPU 0 CPU 1 CPU 2 CPU 3
In this example the aaaaa and bbbbb etc. lines each represent two (X=2) array
lines. With the special clause A2A[P(X=2):arr] only these line pairs will be eXchanged,
after which we have
1 ix 100
|------------| |------------| |------------| |------------| 1
|aaaaaaaaaaaa| | | | | | |
| XX | | | | | | |
|bbbbbbbbbbbb| |bbbbbbbbbbbb| | | | |
|------------| |------------| |------------| |------------|
|cccccccccccc| |cccccccccccc| | | | |
| | | XX | | | | |
| | |dddddddddddd| |dddddddddddd| | |
|------------| |------------| |------------| |------------| iy
| | |eeeeeeeeeeee| |eeeeeeeeeeee| | |
| | | | | XX | | |
| | | | |ffffffffffff| |ffffffffffff|
|------------| |------------| |------------| |------------|
| | | | |gggggggggggg| |gggggggggggg|
| | | | | | | XX |
| | | | | | |hhhhhhhhhhhh|
|------------| |------------| |------------| |------------| 200
CPU 0 CPU 1 CPU 2 CPU 3
In the case of cyclic (C) processing use A2A[P(XC=2):arr] and the result will
be
1 ix 100
|------------| |------------| |------------| |------------| 1
|aaaaaaaaaaaa| | | | | |aaaaaaaaaaaa|
| XX | | | | | | |
|bbbbbbbbbbbb| |bbbbbbbbbbbb| | | | |
|------------| |------------| |------------| |------------|
|cccccccccccc| |cccccccccccc| | | | |
| | | XX | | | | |
| | |dddddddddddd| |dddddddddddd| | |
|------------| |------------| |------------| |------------| iy
| | |eeeeeeeeeeee| |eeeeeeeeeeee| | |
| | | | | XX | | |
| | | | |ffffffffffff| |ffffffffffff|
|------------| |------------| |------------| |------------|
| | | | |gggggggggggg| |gggggggggggg|
| | | | | | | XX |
|hhhhhhhhhhhh| | | | | |hhhhhhhhhhhh|
|------------| |------------| |------------| |------------| 200
CPU 0 CPU 1 CPU 2 CPU 3
If, as in the example above, root reads an image file into an array that is
going to be filtered, the array parts plus the neighboring lines need to
be sent. Instead of using R2A[P:arr] immediately followed by A2A[P(X=2):arr],
it is more efficient to use R2A[P(X=2):arr] or R2A[P(XC=2):arr].
It depends on the application and the neighborhood size, but also on the implementation (the library), whether it is more efficient to use simply A2A[P:arr] or A2A[P(X=n):arr], or in the distribution before the first iteration R2A[E:arr] or R2A[P(X=n):arr], where X could be XC.
Note that ONLY the data eXchange is done; the user is responsible for the filtering implementation and for the cyclic indexing, i.e. iy = 0 becomes 200 and iy = 201 becomes 1.
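The cyclic indexing can be done with the standard 1-based modular wrap, sketched here as a hypothetical Python helper (wrap_iy and the line count are illustrative, not part of SPMDlib):

```python
def wrap_iy(iy, nlines=200):
    """1-based cyclic line index: 0 -> nlines, nlines+1 -> 1."""
    return (iy - 1) % nlines + 1

# Line indices touched by a 5x5 filter centred on the first line.
neighbours = [wrap_iy(1 + d) for d in (-2, -1, 0, 1, 2)]
```
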
If entire arrays are not required on all CPUs but only on root, the communications are faster and it might be better not to use all CPUs as specified with mpirun for some processing. But this becomes more tricky because a parser will not execute the code and some features of a high-level library need a few extensions. The text below details a possible scenario that is open for discussion. The first problem is that the parser does not know the number of CPUs specified (mpirun -np 8). Hence, for taking proper decisions it is necessary to specify the number of CPUs:
INIT[TCPUS=n]
and the parser will put code lines to test during the execution that TCPUS
equals spmd_nprocs and to print a warning message if they are not equal.
Now other clauses can include a specifier of the number of CPUs to use:
NCPUS[n] - as a clause
(NCPUS=n) - as an argument of a clause
This is a switch. The number n (always smaller than TCPUS) will be used until a
new number is specified or the
number is reset to all CPUs with NCPUS[TCPUS] or (NCPUS=TCPUS). It affects DOPAR,
PARBLOCK as well as all communications and even reductions. Normally (by default)
all CPUs (spmd_nprocs) will be used, which means that arrays are partitioned using
all. In the case of a DOPAR, NCPUS[n] combination the array partitioning will be
different, as will be the communications.
Only in cases where communications, for whatever reason, cannot be done
directly after a DO loop over an array or after parallel code blocks, i.e. must
be done after executing other DO loops, does the number of CPUs used in the
specific DO loop need to be specified. Example: A2R[P(NCPUS=n):arrayname]
DOPAR[NCPUS=n]
To be inserted directly before a parallel DO loop, which will be executed by n CPUs,
starting with root (0 to n-1); the other CPUs will be idle. If there are
more DO loops to be executed by the n CPUs, only a DOPAR is necessary (NCPUS is a
switch). The user must ensure that ALL CPUs are activated again by NCPUS[TCPUS]
and ALL should reach a BARRIER. Before this BARRIER, all communications and reductions
will only involve the n CPUs (use R2A[P/E] to distribute, or A2R[P] to assemble on
root, after the BARRIER use R2A[P/E] etc).
PARBLOCK[NCPUS=n]
Three PARBLOCK directives define two code blocks that can be parallelized. The blocks
can contain parallel DO loops. By default the number of CPUs is divided by two and the
DO loops will be executed by TCPUS/2. Suppose TCPUS=8. If the first or second PARBLOCK
has [NCPUS=2], the first block will be executed by 2 CPUs and the second by 8-2=6 or
vice versa. If both have [NCPUS=3], both blocks and their DO loops will be executed
by 3 CPUs and 2 CPUs will be idle. In principle it is also possible to change NCPUS for
the DO loops, but only to smaller numbers!
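The CPU assignment rules described above can be sketched as follows. We assume the first block takes the lowest ids starting with root, analogous to the DOPAR[NCPUS=n] rule; the function itself is illustrative:

```python
def parblock_split(tcpus, ncpus1=None, ncpus2=None):
    """CPU ids executing the two PARBLOCK code blocks, plus the idle ids.
    Default: tcpus/2 each; if one block specifies NCPUS=n the other gets
    tcpus-n; if both specify NCPUS, the remaining CPUs are idle."""
    if ncpus1 is None and ncpus2 is None:
        ncpus1 = ncpus2 = tcpus // 2
    elif ncpus1 is None:
        ncpus1 = tcpus - ncpus2
    elif ncpus2 is None:
        ncpus2 = tcpus - ncpus1
    block1 = list(range(ncpus1))
    block2 = list(range(ncpus1, ncpus1 + ncpus2))
    idle = list(range(ncpus1 + ncpus2, tcpus))
    return block1, block2, idle
```

With TCPUS=8 this reproduces the three cases in the text: the default 4+4 split, the 2+6 split when one block has [NCPUS=2], and the 3+3 split with 2 idle CPUs when both have [NCPUS=3].
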
Communications become trickier, and we need to think about reasonably simple
solutions that can be implemented easily. One solution is to assemble arrays
always on root by reductions of type SUM and all CPUs must have initialized the
reduction scalars and/or arrays to zero. Examples:
C$SPMD INIT[TCPUS=8]
...
all CPUs initialize arr1 and arr2 to zero
...
C$SPMD PARBLOCK[NCPUS=2] !using 2
...
C$SPMD DOPAR
do i=1,100
arr1(i)=...
end do
...
C$SPMD PARBLOCK[NCPUS=4] !using 4, 2 will be idle
...
C$SPMD DOPAR
do j=1,600
arr2(j)=...
end do
C$SPMD A2R[P:arr2]
...
C$SPMD PARBLOCK, NCPUS[TCPUS], BARRIER, A2R[P(NCPUS=2):arr1],
C$SPMD R2A[P:arr1,arr2]
If there were no A2R in the second PARBLOCK, the last two lines could involve
RoR[SUM:arr1,arr2], R2A[P:arr1,arr2], where the RoR is much more expensive.
Here CPUs 0 and 1 have each done 1/2 of arr1 and CPUs 2...7 have each done
1/6th of arr2. The only way to assemble them on root is by summing, using
Reduction-on-Root (this is an elementwise array reduction; that is why the
initialization to zero on all CPUs is needed), after which root can send the
entire arrays to all CPUs with R2A[E:...] if necessary. But first NCPUS must be
reset.
Note that there is no BARRIER because RoR will cause an implicit synchronization
of ALL CPUs! An Appendix of the SPMDlib manual describes the implementation.
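Why the SUM reduction assembles the arrays correctly can be shown with a small serial sketch: every CPU zero-initialized its full copy, so an elementwise sum over all copies recovers the complete array. The function and data here are illustrative only:

```python
# Sketch of RoR[SUM:...] assembling an array computed by a subset of
# CPUs: the untouched copies contribute zeros to the elementwise sum.

def ror_sum(copies):
    """Elementwise SUM reduction over all CPUs' full-length copies."""
    return [sum(col) for col in zip(*copies)]

# 8 CPUs; only CPUs 0 and 1 computed arr1 (half each), the rest kept zeros.
copies = [[0.0] * 6 for _ in range(8)]
copies[0][0:3] = [1.0, 2.0, 3.0]   # CPU 0's half of a 6-element array
copies[1][3:6] = [4.0, 5.0, 6.0]   # CPU 1's half
arr1_on_root = ror_sum(copies)
```
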
Simple dependencies can be allowed within the context of the directive design
and implementation. For example, since only DO loops from 1 to N are allowed,
the following code is possible:
real arr(100)
...
C$SPMD DOPAR
do 50 i=1,100
arr(i)=...
50 continue
...
C$SPMD A2A[P(XC=1):arr], DOPAR
do i=1,100
if (i.ne.100) then
j=i+1
else
j=1
end if
arr(i)=func(arr(j)) !j=i+1
end do
More complex dependencies need to be resolved by reformulating the problem,
see for example SGI's Fortran User Manual on C$DOACROSS.
For all questions etc please contact Hans du Buf
Last update: August 2001, HdB.