The SPMD initiative


SPMDdir: directive design
THE ZEN OF DATA-PARALLEL MPI PROGRAMMING

Related pages: SPMD introduction and SPMD library.




Towards data-parallel directives

Because of its extreme simplicity, SGI's C$DOACROSS offers very limited functionality that can easily be implemented by means of a small library on top of MPI, plus a parser or preprocessor that translates the directives in a source program into library calls. This library can also be called directly by the user; it simplifies MPI programming and removes the need for MPI-level debugging. OpenMP has more features but, like DOACROSS, was not designed for high-level communications. There is a need to define an SPMD directive standard that allows a transparent implementation with communication optimizations for different platforms, using either a high-level library and parser on top of MPI, or a lower-level system-specific library in combination with a compiler. The latter could resemble OpenMP, but explicit communications must be included to obtain the best performance on systems with distributed memory.

DOACROSS's functionality allows loop threading with hidden communications, the definition of locals/globals, and reductions. The only luxury beyond static loop scheduling is a dynamic one in which the user can define a chunk size. In SPMD applications every CPU activated in a cluster (e.g. mpirun -np 4) already has a copy of all variables and arrays, so the user only needs to take care of data initialization and communications. The latter must be implemented such that all possibilities are covered. The only requirements are:

  1. parallel DO loops and independent code blocks, scalar and array communications, reductions
  2. hide all communication programming from the user
  3. portable yet optimizable for specific architectures
  4. extreme simplicity - it can be learned in less than one hour
  5. it must be possible to produce a manual of only a few pages
  6. versatile enough to tackle most problems
  7. the user must not be burdened with bookkeeping (data types and array dimensions); array parts can easily be determined from the number of CPUs used

The last point is already far from trivial because using all CPUs in all loops can make a program less efficient (granularity, communication costs). The user should be aware of communication costs; in the future an intelligent parser could be developed that analyzes loops and the necessary communications in order to optimize the number of CPUs to be used.

This document is a first version that targets basic functionality which, with the existing SPMDlib and a few trivial extensions, allows a first parser implementation to be tested and practical experience to be gained. To simplify the parser implementation it would be preferable to include "implicit none" in Fortran code. Feedback from users is required to fine-tune the functionality and/or to include other features that most users would appreciate or consider absolutely necessary. However, it should be stressed that the future target remains data-parallel SPMD applications with a large granularity: with CPU performance increasing faster than interconnect speed, only coarse-grained applications can reach the best efficiency.

Below we describe the directives that can be used with the SPMDlib. All directives are typed in capitals and, where necessary, library routines are referred to by the spmd_ prefix. The description is written for novice users, so it could serve as the very small manual referred to above, although it contains explanations that are not strictly necessary.


The C$SPMD directive manual (V 1.0)

MPI is an implementation of the SPMD paradigm in which each CPU in a cluster starts executing the same program, but each CPU has a unique identifier (here called spmd_myid) that can be tested:


      if (spmd_myid.eq.3) then
      ...
      cpu 3 does something
      ...
      else if ...

such that each CPU can do a part of a job (a DO loop) or even completely different things. When the program is started (mpirun), each CPU starts with the same variables and arrays. If different CPUs are doing different things, sooner or later data must be exchanged. Data communications are rather expensive: every millisecond spent on communications on a 4-CPU PIII at 450 MHz implies the loss of about 1.2 MFLOP of computation. Hence, the user must structure the program such that the granularity (grain size) is large: many heavy computations with communications only when absolutely necessary, while also maximizing the parallelism of the application (Amdahl's law).

The C$SPMD directive has been designed such that the user can parallelize code with a minimum of effort (user productivity). The main focus is on (a) DO loops, (b) independent serial code blocks that can also contain parallel DO loops, (c) scalar and array communications and (d) reductions, both scalar and array. All CPUs contribute equally to the work to be done (although there is a way to specify the number of CPUs to be used), but one CPU called "root" plays a special role in cases where data need to be distributed or assembled. Communications include root-to-all, all-to-root and all-to-all, plus two reductions: reduction-on-root and reduction-on-all.

Because in most cases all CPUs take an equal share of a loop over arrays, i.e. process equal parts of the arrays, and the parser knows the array types as well as the dimensions, it is not necessary to include these in the directive clauses. The only "luxury" is the possibility to use different numbers of CPUs for different loops and arrays, but this is an advanced feature that should not be used in normal applications.

A future version based on an intelligent parser will assist the user in optimizing the number of CPUs to be used (workload and communication overheads). Another future "luxury" might be dynamic scheduling of DO loops by specifying a "chunksize". This is not trivial without a compiler and CPU interrupts; it could be done at the SPMDlib level by appointing one CPU as "master" to distribute chunks, but that CPU would not contribute to the work, so this solution might only be efficient when a sufficient number of CPUs is available.
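
As an illustration, a minimal data-parallel program could look as follows. This is only a sketch under the rules given below: readfile is a placeholder routine, the array dimension 400 must be a multiple of the number of CPUs, and the clauses used are explained in the remainder of this manual.

      program sketch
      real arr(100,400)
      real s
C$SPMD INIT
C$SPMD RONLY
      call readfile(arr)                 !root prepares the data
C$SPMD RONLY, R2A[P:arr]                 !root sends a part to every CPU
      s=0.0
C$SPMD DOPAR
      do iy=1,400
      do ix=1,100
      arr(ix,iy)=2.0*arr(ix,iy)
      s=s+arr(ix,iy)
      end do
      end do
C$SPMD A2R[P:arr], RoR[SUM:s]            !root assembles the array and the sum
C$SPMD RONLY
      print*,'total = ',s
C$SPMD RONLY
C$SPMD END
      end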

Limitations:

At the moment (for a first parser implementation) we apply the following restrictions:
(1) Arrays in subroutines that are going to be processed in parallel should be passed with all complete dimensions or be declared with explicit dimensions.
(2) Such subroutines must be called by all CPUs.
(3) Parallel DO loops must be of the type: do [label] (integer scalar)=1,nloop where nloop is a multiple of the number of CPUs to be used and the loop start and stride are always one. In addition, nloop must be specified explicitly as an integer constant, or it can be an array dimension given in a type declaration (real etc.), a dimension declaration or a parameter statement.
(4) Parallel DO loops over 2D/3D arrays should always be over the last dimension, i.e. be the outer loop of the nested ones.
(5) At the moment only 1D, 2D and 3D array communications are supported, and array indices must start at one - example arr(1,1). Scalar and array types: integer, real, double and complex (no reductions for double and complex; this can change!).
(6) Only two independent serial code blocks (that can each contain one or more parallel DO loops) can be parallelized, but this can be done several times.
(7) Because the library uses subroutines and scalars prefixed with spmd_, the user is strongly advised not to use that prefix.
(8) All directives and clauses should be given in uppercase and no blanks are allowed (only scalars and arrays can be specified in lowercase).
(9) Although it is possible to mix SPMDlib with normal MPI calls, the user should not do this unless he understands the library.

Directives

Directives start in the first column and have the form


C$SPMD clause[, clause] ...

in which the clauses are executed from left to right (depending on the context). Clauses can be continued over multiple lines, but each line must start with C$SPMD and clauses are separated by a comma plus blank. No blanks are allowed within clauses. A comment can be added after an !, in which case no clauses after the ! will be interpreted. C$SPMD lines should not be "mixed" with C$DOACROSS or C$OMP directives; the latter may be present in the code but will be ignored. Directives can be "out-commented" by putting an extra C in front: CC$SPMD lines will be ignored.
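
For example (arr1, arr2, s1 and s2 are only placeholder names):

C$SPMD A2R[P:arr1], RoR[SUM:s1,s2], 
C$SPMD R2A[E:arr2]                   !continuation: both lines form one directive
CC$SPMD BARRIER                      !out-commented: this directive is ignored
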
The clauses and their meaning are (advanced features are marked with *; these will be explained in a later section):

   INIT                     initialize SPMDlib and MPI
   END                      end SPMDlib and MPI
   BARRIER                  explicit CPU synchronization
   RONLY                    root only
   TIME and TTIME           time and total time
 * PARBLOCK                 parallel code blocks
   DOPAR                    parallel DO loop
   R2A[variable list]       root-to-all
   A2R[variable list]       all-to-root
   A2A[variable list]       all-to-all
   RoR[variable list]       reduction-on-root
   RoA[variable list]       reduction-on-all

The following abbreviations and combinations are used in some clauses:

 * TCPUS=n                  the total number of CPUs to be used
 * NCPUS=n                  the number of CPUs to be used in DO loops etc
   E:arrayname              entire array
   P:arrayname              parts of array
   P(X=n):arrayname         exchange n elements/lines/planes of 1/2/3D arrays
   P(XC=n):arrayname        cyclic exchange
   SUM:variable             type of scalar or array reduction
                            also: MAX, MIN, etc
   IN=variable              input scalar or array in reduction
   OUT=variable             output scalar or array in reduction

General comments:
1. By default the number of CPUs to be used in splitting DO loops and the implicit array partitioning equals the number specified on the command line (for example: mpirun -np 8).
2. In the case of PARBLOCK, each of the two parallel code blocks will be executed by half the number of CPUs, and these CPUs will execute parallel DO loops inside the blocks when marked by DOPAR.
3. Array names in R2A, A2R and A2A must be preceded by E: (Entire) or P: (Parts), like P:arrayname. This is not necessary in the reductions RoR and RoA.
4. The variable list is a list of scalars and arrays separated by commas and semicolons like: scalar1,scalar2;P:arr1,arr2;E:arr3,arr4.
5. In the case of R2A and A2A, P can be combined with (X=...), X meaning eXchange, or (XC=...), XC meaning eXchange Cyclic. Example: R2A[P(XC=3):arr1], where instead of "3" we can use an integer scalar. See the Neighborhood Processing section below.
6. Normally reductions are done using the same variable(s) as used in the code, for example intmax and intmin with the clause RoR[MAX:intmax;MIN:intmin]. In cases where each CPU needs to keep its local max and/or min, different variables can be used for local and global values. Example: RoA[MAX:IN=locmax,OUT=globmax].
7. INIT and END should be used only in the main program. All other clauses can also be used in subroutines provided that all CPUs call these routines.

Please keep in mind that a parser/preprocessor does not execute the code, i.e. it does not have access to runtime values. The user is responsible for the parallelization, although in some cases the parser may detect an inconsistency and print a warning. The code should be as straightforward as possible and advanced features like NCPUS should be used only when absolutely necessary.

Below the basic clauses are explained in more detail, together with the SPMDlib routines invoked (this is not necessary for the final manual!).

Basic clauses

INIT
Calls spmd_init(ierr), which provides the predefined integers spmd_myid, spmd_nprocs, spmd_mystart and spmd_myend, as well as prefixed scalars for timing purposes (normal reals). It determines spmd_nprocs (4 with mpirun -np 4) and spmd_myid (0 to spmd_nprocs-1). Root gets spmd_myid 0 (this is not required, just common sense). Must be included right at the start, or where the CPUs need to start executing different things.

END
Calls spmd_end(ierr). Must be included before stop/end or where all CPUs cease executing different things.
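
In the simplest case the main program is only bracketed by these two clauses; following the descriptions above, the parser replaces them by the spmd_init and spmd_end calls (a sketch):

      program main
C$SPMD INIT                        !parser inserts: call spmd_init(ierr)
      ...
      all CPUs execute the program
      ...
C$SPMD END                         !parser inserts: call spmd_end(ierr)
      stop
      end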

BARRIER
Calls spmd_barrier(ierr) and must be used where explicit CPU synchronization is required, for example when root alone (see RONLY below) needs significant CPU time. Note that all communication and reduction clauses cause an implicit CPU synchronization because of the synchronous mpi_send and mpi_recv used in the library. If BARRIER is combined with a closing RONLY, the program will not deadlock because the barrier is executed after the RONLY code (the parser will detect a BARRIER placed between two RONLYs and print a warning).

RONLY
To define regions in which only Root is active (Root-ONLY), for example to prepare or output data (read/write a file, print); the other CPUs skip this region and continue executing the subsequent code until they encounter a barrier or a communication instruction. The directive must be inserted before and after the code lines (or only once for timing, see TIME! below), and generates only an if-endif:


C$SPMD RONLY
      call readfile(...,array,...)
      call normalize(...,array,...)
C$SPMD RONLY, R2A[P:array]                  !no barrier necessary: R2A synchronizes

becomes

      if (spmd_myid.eq.0) then 
      call readfile(...,array,...)
      call normalize(...,array,...)
      end if
      call spmd_r1DPtoall(array,dim)

If root's CPU time is large and the region is not immediately followed by e.g. an R2A communication directive, the second RONLY can be combined with BARRIER.
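
For example (prepare is a placeholder for work that only root performs):

C$SPMD RONLY
      call prepare(...)                !root needs significant CPU time
C$SPMD RONLY, BARRIER                  !the other CPUs wait here for root
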
Note: If all CPUs need a table (for example sine and cosine tables for FFTs) it depends on the size of the table, the number of CPUs as well as the system what to do - root computes and sends to all, or all CPUs compute in parallel without any communications.

TIME
TTIME
To be inserted pairwise: a call to spmd_time() is generated before and after the code block to be timed. TTIME (Total-TIME) is for timing a larger code block whose subblocks are timed with TIME. Both should be combined with RONLY and result in the predefined normal reals spmd_rtime and spmd_rttime. The timing starts and stops on the line with RONLY. Both TIME and TTIME can be used several times, but the sequence always implies start/stop/start/stop; TIME and TTIME are independent of each other. A directive with RONLY, TIME! or RONLY, TTIME! does not start or end a normal RONLY region; it is only used for timing purposes (and here the ! does not start a comment). Examples:


C$SPMD RONLY, TIME!            !does not start an RONLY region
      ...
      code block to be timed
      ...
C$SPMD RONLY, TIME
      print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY

-------------------------------------------------------

C$SPMD RONLY
      root does some work
C$SPMD RONLY, TIME                        !timer starts here
      ...
      code block to be timed
      ...
C$SPMD RONLY, TIME                        !timer stops here
      root does something
      print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY

-------------------------------------------------------

C$SPMD RONLY, TIME!, TTIME!
      ...
C$SPMD RONLY, TIME
      print*,'Block X took ',spmd_rtime,' seconds'
C$SPMD RONLY
      ...
      ...
C$SPMD RONLY, TTIME
      print*,'Large block Y took ',spmd_rttime,' seconds'
C$SPMD RONLY

DOPAR
To be inserted directly before a parallel DO loop, not counting empty or comment lines (e.g. C$OMP). The loop must always be the outer loop over 2D/3D arrays because the directive invokes spmd_split(dim), in which dim is the outer loop dimension and the loop goes from 1 to dim (dim should be a normal integer and a multiple of the number of CPUs). Note that the root CPU (spmd_myid = 0) contributes equally and also processes array parts, namely the first parts. Since only static and equal scheduling is applied, the user can put a BARRIER after the DO loop if the loads are not balanced (this is not necessary if the loop is immediately followed by a communication or reduction).
This is the main parallelization clause, but its implementation is very simple:

C$SPMD DOPAR
      do i=1,nloop

The parser will only change the code to

      call spmd_split(nloop)
      do i=spmd_mystart,spmd_myend

and the spmd_split routine does, apart from checking that nloop is a multiple
of spmd_nprocs, only

      spmd_mypart  = nloop/spmd_nprocs
      spmd_mystart = spmd_myid*spmd_mypart +1
      spmd_myend   = spmd_mystart+spmd_mypart -1
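
A complete spmd_split could therefore be as small as the following sketch; the common block name spmdcom and the error handling are assumptions and the real SPMDlib may organize its predefined scalars differently.

      subroutine spmd_split(nloop)
      integer nloop
      integer spmd_myid,spmd_nprocs,spmd_mystart,spmd_myend,spmd_mypart
      common /spmdcom/ spmd_myid,spmd_nprocs,spmd_mystart,spmd_myend,
     &                 spmd_mypart
c     nloop must be a multiple of the number of CPUs used
      if (mod(nloop,spmd_nprocs).ne.0) then
        print*,'spmd_split: nloop is not a multiple of spmd_nprocs'
        stop
      end if
      spmd_mypart  = nloop/spmd_nprocs
      spmd_mystart = spmd_myid*spmd_mypart +1
      spmd_myend   = spmd_mystart+spmd_mypart -1
      return
      end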

The communication routines are a bit more complicated but still quite trivial.

Communications

All communications (as opposed to reductions) are done by R2A (root-to-all), A2R (all-to-root) and A2A (all-to-all), and must explicitly indicate scalars, entire arrays or parts of arrays. Below, P means parts and E means entire. Remember that all CPUs have the same scalars and the same entire arrays.
Assuming 4 CPUs and a DO loop over a 2D array (in the case of 2D and 3D arrays always the outer loop over the last dimension in Fortran):


      real arr(100,200)

C$SPMD DOPAR

      do iy=1,200
      do ix=1,100
      arr(ix,iy)=...
      end do
      end do

C$SPMD A2R[P:arr]

In this case the 4 CPUs with ids 0 to 3, 0 being root, each have a copy of the entire array but will each compute only one quarter:

                                                 1    ix    100
 |------------|  |------------|  |------------|  |------------| 1
 |            |  |            |  |            |  |            |
 |     XX     |  |            |  |            |  |            |
 |            |  |            |  |            |  |            |
 |------------|  |------------|  |------------|  |------------|
 |            |  |            |  |            |  |            |
 |            |  |     XX     |  |            |  |            |
 |            |  |            |  |            |  |            |
 |------------|  |------------|  |------------|  |------------| iy
 |            |  |            |  |            |  |            |
 |            |  |            |  |     XX     |  |            |
 |            |  |            |  |            |  |            |
 |------------|  |------------|  |------------|  |------------|
 |            |  |            |  |            |  |            |
 |            |  |            |  |            |  |     XX     |
 |            |  |            |  |            |  |            |
 |------------|  |------------|  |------------|  |------------| 200

      CPU 0           CPU 1           CPU 2           CPU 3

after which communications can be done: not necessarily directly after the last enddo; this can be postponed depending on the requirements, e.g. all CPUs could continue using only their quarter of the array. In the example above (A2R), CPUs 1, 2 and 3 send their parts to root (0). If all CPUs need a copy of the entire array, A2R[P:arr] must be replaced by A2A[P:arr], which for large arrays is more efficient than doing A2R[P:arr] followed by R2A[E:arr].

R2A[varlist;P:list;E:list]
Means Root-To-All; varlist is a list of scalars and "list" is a list of arrays, separated by commas. The clause may also contain just a single scalar, only a P list or only an E list. Varlist, P and E can be given in any order, but the scalars must be grouped together (when all three are present the clause contains two semicolons). Example:
C$SPMD R2A[E:array3;val1,val2,val3;P:array1,array2]
Notes: val1/2/3 can be of different types, and neither the types nor the dimensions of the arrays are specified. If there are 4 CPUs, root will send three quarters of array1 and array2 to the other 3 CPUs (each gets only its own quarter). All CPUs will get an entire copy of array3 as well as the scalars val1/2/3.

Note: for neighbourhood operations in which each CPU is processing e.g. one quarter of an array, but each needs to have access to e.g. 2 more elements/lines/planes of 1/2/3D arrays that are processed by its neighbours, the P can be extended by an X (for eXchange) and a C (cyclic). Cyclic means that CPU 0 gets a neighbourhood from CPU 1 but also from CPU 3. This is visualized in a special Section below. For preparing data before cyclic neighbourhood processing use
C$SPMD R2A[P(XC=2):arrayname]
in which instead of "2" it is possible to put an integer scalar. Root will send the normal array parts plus the additional neighborhoods.
If the same array partitioning is used repeatedly, for example in iterative spatial filtering in which each CPU will always keep its quarter of an array but before starting a new iteration it needs again 2 elements/lines/planes from its neighbours, you must use A2A[P(X=2):arr] or A2A[P(XC=2):arr]. A2A is explained below.
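
A sketch of such an iterative filter over a 100x200 image; niter is a placeholder, the filtering itself is only indicated, and the exchange after the last iteration is redundant but kept for simplicity:

      real img(100,200)
C$SPMD RONLY
      root reads the image into img
C$SPMD RONLY, R2A[P(XC=2):img]          !parts plus 2 extra lines for each CPU
      do iter=1,niter
C$SPMD DOPAR
      do iy=1,200
      filter line iy of img (5x5 filter, needs lines iy-2 ... iy+2)
      end do
C$SPMD A2A[P(XC=2):img]                 !refresh the 2 boundary lines
      end do
C$SPMD A2R[P:img]                       !root assembles the filtered image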

A2R[P:list]
Means All-To-Root; only P is allowed and the list can only contain arrays: all CPUs have processed a part of an array and root assembles these parts after its own part. All other all-to-root operations are reductions, see RoR and RoA below.

A2A[P:...]
Means All-To-All. There are only two uses involving only arrays: (1) the one above for only eXchanging elem/lines/planes for repeated neighbourhood operations with P(X=2):array or P(XC=2):array, in which case no normal parts will be sent, and (2) each CPU has e.g. one quarter of an array but each needs the entire array assembled from all parts (P without X or XC). The latter is more efficient than A2R[P:array] followed by R2A[E:array], but this depends on the implementation.

Reductions can be done with only two clauses:

RoR[MAX:ival,rval;SUM:sum1,sum2,array]
Means Reduction-on-Root (only root gets a global maximum etc, the other CPUs will continue with their local maxima). The SUM of array is done elementwise involving the entire arrays on all CPUs, i.e. the result is an entire array in which an element is the sum of all elements at the same position. In the same way you can use:

RoA[MAX:ival,rval;SUM:sum1,sum2,array]
Means Reduction-on-All (all CPUs will continue with the global maxima and sums etc).

The only complication occurs when all CPUs are using the same working arrays (or scalars) and the result should be stored in another array (or scalar). This can be done using for example RoR[SUM:(IN=arr1,OUT=arr2)], in which arr1 and arr2 have the same type and dimensions (they could also be scalars of the same type). In the case of a SUM reduction the user should initialize arr2 to zero on root (using RONLY), unless the array needs other values. Similarly, RoA using IN and OUT requires an initialization to zero of the output array on all CPUs (not using RONLY). In the case of MAX, MIN etc reductions with IN and OUT scalars or arrays, the OUT variable should be properly initialized. All this depends on the implementation!
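
For example, summing the working arrays arr1 of all CPUs elementwise into arr2 on root (a sketch; the dimension 100 is a placeholder):

      real arr1(100),arr2(100)
C$SPMD RONLY
      do i=1,100
      arr2(i)=0.0                       !initialize the OUT array on root only
      end do
C$SPMD RONLY
      ...
      each CPU computes its own arr1
      ...
C$SPMD RoR[SUM:(IN=arr1,OUT=arr2)]      !root gets the elementwise sum in arr2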

Reduction types: SUM, MAX, MIN, AND, OR ... (to be defined, also data types). Perhaps the only ones really necessary are scalar MAX and MIN, as well as scalar and array SUM. Data types: at least normal integers and reals.

Reductions can be combined with other communication clauses.

It is rather straightforward to program more complex reductions. For example, if we want to determine the maximum of a real 1D array as well as its position, we can apply, assuming 4 CPUs:


      real tmp(2,4),maxpos(2)
      
C$SPMD DOPAR
      do i=1,200
        if (...) then
        tmp(1,spmd_myid+1)=rmax        !use the predefined myid of SPMDlib
        tmp(2,spmd_myid+1)=real(i)
        end if
      end do
C$SPMD A2R[P:tmp]

C$SPMD RONLY
      do j=1,4
      test tmp array to determine rmax and pos
      end do
      maxpos(1)=rmax
      maxpos(2)=pos
C$SPMD RONLY

C$SPMD R2A[E:maxpos]               !depends whether all CPUs need this
      imax=nint(maxpos(2))

This works but is extremely expensive in terms of communication overhead. It should be done only when absolutely necessary and when a lot of serious computation can be included in the DOPAR loop, but this applies to all communications. The user should, once again, be aware of communication times. There are two options: (1) assemble the entire array on root (expensive), after which root alone does the reduction (cheap), or (2) do not assemble the entire array and let each CPU reduce locally (cheap), after which the communications given in the example above must be done (expensive). The best solution depends on the array size, and it is best to experiment a bit.

Neighborhood operations with P, X and XC

We will assume 4 CPUs and a DO loop over a 2D array (the example given above). In normal cases each CPU needs either the entire array or only its part (it uses only its part but has the entire array allocated). There are situations in which each CPU needs its own part plus a small neighborhood belonging to its neighbors, for example in CFD with finite elements on a regular grid, or in signal/image processing. Suppose that root reads an image file into a 2D array (C$SPMD RONLY) and sends three quarters to CPUs 1, 2 and 3 with R2A[P:arr]. If a filter of size 5x5 must be run over the image, each CPU also needs two lines before and two lines after its own part (CPUs 0 and 3 need only one side if the processing is non-cyclic; see below for the cyclic case). If there are more filter iterations, only these lines need to be exchanged between iterations. Suppose we have already done one iteration and must do another one. If we label the neighborhood lines with a, b, c etc, we have after the first iteration


                                                 1    ix    100
 |------------|  |------------|  |------------|  |------------| 1
 |aaaaaaaaaaaa|  |            |  |            |  |            |
 |     XX     |  |            |  |            |  |            |
 |bbbbbbbbbbbb|  |            |  |            |  |            |
 |------------|  |------------|  |------------|  |------------|
 |            |  |cccccccccccc|  |            |  |            |
 |            |  |     XX     |  |            |  |            |
 |            |  |dddddddddddd|  |            |  |            |
 |------------|  |------------|  |------------|  |------------| iy
 |            |  |            |  |eeeeeeeeeeee|  |            |
 |            |  |            |  |     XX     |  |            |
 |            |  |            |  |ffffffffffff|  |            |
 |------------|  |------------|  |------------|  |------------|
 |            |  |            |  |            |  |gggggggggggg|
 |            |  |            |  |            |  |     XX     |
 |            |  |            |  |            |  |hhhhhhhhhhhh|
 |------------|  |------------|  |------------|  |------------| 200

      CPU 0           CPU 1           CPU 2           CPU 3

In this example the aaaaa and bbbbb etc lines each represent two (N=2) array lines. With the special clause A2A[P(X=2):arr] only these line pairs will be eXchanged, after which we have

                                                 1    ix    100
 |------------|  |------------|  |------------|  |------------| 1
 |aaaaaaaaaaaa|  |            |  |            |  |            |
 |     XX     |  |            |  |            |  |            |
 |bbbbbbbbbbbb|  |bbbbbbbbbbbb|  |            |  |            |
 |------------|  |------------|  |------------|  |------------|
 |cccccccccccc|  |cccccccccccc|  |            |  |            |
 |            |  |     XX     |  |            |  |            |
 |            |  |dddddddddddd|  |dddddddddddd|  |            |
 |------------|  |------------|  |------------|  |------------| iy
 |            |  |eeeeeeeeeeee|  |eeeeeeeeeeee|  |            |
 |            |  |            |  |     XX     |  |            |
 |            |  |            |  |ffffffffffff|  |ffffffffffff|
 |------------|  |------------|  |------------|  |------------|
 |            |  |            |  |gggggggggggg|  |gggggggggggg|
 |            |  |            |  |            |  |     XX     |
 |            |  |            |  |            |  |hhhhhhhhhhhh|
 |------------|  |------------|  |------------|  |------------| 200

      CPU 0           CPU 1           CPU 2           CPU 3

In the case of cyclic (C) processing use A2A[P(XC=2):arr] and the result will be

                                                 1    ix    100
 |------------|  |------------|  |------------|  |------------| 1
 |aaaaaaaaaaaa|  |            |  |            |  |aaaaaaaaaaaa|
 |     XX     |  |            |  |            |  |            |
 |bbbbbbbbbbbb|  |bbbbbbbbbbbb|  |            |  |            |
 |------------|  |------------|  |------------|  |------------|
 |cccccccccccc|  |cccccccccccc|  |            |  |            |
 |            |  |     XX     |  |            |  |            |
 |            |  |dddddddddddd|  |dddddddddddd|  |            |
 |------------|  |------------|  |------------|  |------------| iy
 |            |  |eeeeeeeeeeee|  |eeeeeeeeeeee|  |            |
 |            |  |            |  |     XX     |  |            |
 |            |  |            |  |ffffffffffff|  |ffffffffffff|
 |------------|  |------------|  |------------|  |------------|
 |            |  |            |  |gggggggggggg|  |gggggggggggg|
 |            |  |            |  |            |  |     XX     |
 |hhhhhhhhhhhh|  |            |  |            |  |hhhhhhhhhhhh|
 |------------|  |------------|  |------------|  |------------| 200

      CPU 0           CPU 1           CPU 2           CPU 3

If, as in the example above, root reads an image file into an array that is going to be filtered, the array parts plus the neighboring lines need to be sent. Instead of using R2A[P:arr] immediately followed by A2A[P(X=2):arr], it is more efficient to use R2A[P(X=2):arr] or R2A[P(XC=2):arr].

It depends on the application and the neighborhood size, but also on the implementation (the library), whether it is more efficient to use simply A2A[P:arr] or A2A[P(X=n):arr], or in the distribution before the first iteration R2A[E:arr] or R2A[P(X=n):arr], where X could be XC.

Note that ONLY the data eXchange is done; the user is responsible for the filtering implementation and for the cyclic index handling, i.e. iy = 0 maps to line 200 and iy = 201 maps to line 1.
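
A sketch of the cyclic index handling inside the filter loop, for the 200-line example and a 5x5 filter:

      do k=-2,2
      jy=iy+k
      if (jy.lt.1)   jy=jy+200          !lines 0 and -1 map to 200 and 199
      if (jy.gt.200) jy=jy-200          !lines 201 and 202 map to 1 and 2
      use line jy of the array
      end do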

Advanced clauses (under construction!!!)

If entire arrays are not required on all CPUs but only on root, the communications are faster, and it might be better not to use all of the CPUs specified with mpirun for some of the processing. But this becomes trickier because a parser does not execute the code, and some features of a high-level library need a few extensions. The text below details a possible scenario that is open for discussion. The first problem is that the parser does not know the number of CPUs specified (mpirun -np 8). Hence, in order to take proper decisions it is necessary to specify the number of CPUs:

INIT[TCPUS=n]
and the parser will insert code that tests at run time whether TCPUS equals spmd_nprocs and prints a warning message if they are not equal.
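
The generated check could be as simple as the following sketch (here for TCPUS=8):

      call spmd_init(ierr)
      if (spmd_nprocs.ne.8) then
        print*,'Warning: TCPUS=8 but mpirun started ',spmd_nprocs,' CPUs'
      end if
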
Now other clauses can include a specifier of the number of CPUs to use:

NCPUS[n] - as a clause
(NCPUS=n) - as an argument of a clause
This is a switch. The number n (always smaller than TCPUS) will be used until a new number is specified or the number is reset to all CPUs with NCPUS[TCPUS] or (NCPUS=TCPUS). It affects DOPAR and PARBLOCK as well as all communications and even reductions. Normally (by default) all CPUs (spmd_nprocs) are used, which means that arrays are partitioned over all of them. In the case of a DOPAR, NCPUS[n] combination the array partitioning will be different, as will the communications. Only when communications, for whatever reason, cannot be done directly after a DO loop over an array or after parallel code blocks, i.e. must be done after executing other DO loops, must the number of CPUs used in that specific DO loop be specified. Example: A2R[P(NCPUS=n):arrayname]

DOPAR[NCPUS=n]
To be inserted directly before a parallel DO loop, which will be executed by n CPUs, starting with root (0 to n-1); the other CPUs will be idle. If there are more DO loops to be executed by the n CPUs, only a DOPAR is necessary (NCPUS is a switch). The user must ensure that ALL CPUs are activated again by NCPUS[TCPUS] and ALL should reach a BARRIER. Before this BARRIER, all communications and reductions will only involve the n CPUs (use R2A[P/E] to distribute, or A2R[P] to assemble on root, after the BARRIER use R2A[P/E] etc).
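
A sketch of the intended use, assuming TCPUS=8 and a loop length that is a multiple of 4 (arr is a placeholder):

C$SPMD DOPAR[NCPUS=4]              !CPUs 0-3 execute the loop, CPUs 4-7 are idle
      do i=1,400
      arr(i)=...
      end do
C$SPMD A2R[P:arr]                  !assembles on root from the 4 active CPUs
C$SPMD NCPUS[TCPUS], BARRIER       !activate all CPUs again and synchronize
C$SPMD R2A[E:arr]                  !if all 8 CPUs need the entire array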

PARBLOCK[NCPUS=n]
Three PARBLOCK directives define two code blocks that can be parallelized. The blocks can contain parallel DO loops. By default the number of CPUs is divided by two and the DO loops will be executed by TCPUS/2. Suppose TCPUS=8. If the first or second PARBLOCK has [NCPUS=2], the first block will be executed by 2 CPUs and the second by 8-2=6 or vice versa. If both have [NCPUS=3], both blocks and their DO loops will be executed by 3 CPUs and 2 CPUs will be idle. In principle it is also possible to change NCPUS for the DO loops, but only to smaller numbers!
Communications become trickier, and we need to think about reasonably simple solutions that can be implemented easily. One solution is to always assemble arrays on root by reductions of type SUM, in which case all CPUs must have initialized the reduction scalars and/or arrays to zero. Example:


C$SPMD INIT[TCPUS=8]
      ...
      all CPUs initialize arr1 and arr2 to zero
      ...
C$SPMD PARBLOCK[NCPUS=2]            !using 2
      ...
C$SPMD DOPAR
      do i=1,100
      arr1(i)=...
      end do
      ...
C$SPMD PARBLOCK[NCPUS=4]            !using 4, 2 will be idle
      ...
C$SPMD DOPAR
      do j=1,600
      arr2(j)=...
      end do
C$SPMD A2R[P:arr2]
      ...
C$SPMD PARBLOCK, NCPUS[TCPUS], BARRIER, A2R[P(NCPUS=2):arr1], 
C$SPMD R2A[P:arr1,arr2]

If there were no A2R in the second PARBLOCK, the last two lines could involve RoR[SUM:arr1,arr2], R2A[P:arr1,arr2], where the RoR is much more expensive than the A2R used above. Here CPUs 0 and 1 have each done half of arr1 and the CPUs of the second block have each done their part of arr2. The only way to assemble these on root is by summing with Reduction-on-Root (this is an elementwise array reduction, which is why everything is initialized to zero on all CPUs), after which root can send the entire arrays to all with R2A[E:...] if necessary. But first NCPUS must be reset. Note that there is no BARRIER because RoR causes an implicit synchronization of ALL CPUs! An Appendix of the SPMDlib manual describes the implementation.

Data dependencies

Simple dependencies can be allowed within the context of the directive design and implementation. For example, since only DO loops from 1 to N are allowed, the following code is possible:


      real arr(100)
      ...
C$SPMD DOPAR
      do 50 i=1,100
      arr(i)=...
50    continue
      ...
C$SPMD A2A[P(XC=1):arr], DOPAR
      do i=1,100
      if (i.ne.100) then
        j=i+1
      else
        j=1
      end if
      arr(i)=func(arr(j))       !j=i+1, or j=1 for i=100 (cyclic)
      end do

More complex dependencies need to be resolved by reformulating the problem, see for example SGI's Fortran User Manual on C$DOACROSS.

Feedback etc

For all questions etc please contact Hans du Buf




Last update: August 2001, HdB.