SPMDlib

MPI for the uninterested

SPMDlib: library design
not only for image processing...

Other pages: Introduction, SPMD initiative and SPMD directive design

IF (you have a dual-CPU box .AND. you do not intend to use a bigger system) THEN
   better use a compiler with OpenMP support
ELSE
   read on, you may save a lot of time
END IF

You can download this mini manual in ASCII; read it carefully, together with the examples provided at the end of "Getting started".

* an SPMD programming tool (parallelization library) at no cost
* can be used on any system that is equipped with MPI
* can be optimized for specific interconnects
* extremely simple to use - it takes only minutes to parallelize code
* much easier to learn than MPI and even BSP (bulk-synchronous parallel)
* almost completely tested; no more MPI debugging
* for splitting loops, array communications and reductions
* only 12 basic routines, some of which you won't need
* for integer, real, real*8 and complex data types
* for 1D and 2D arrays (3D easy to make)
* has special communications for iterative 1D and 2D filtering
* only 2 files and almost no installation required

Brief: I have been looking for higher-level MPI tools that would make my life easier, but they are either too complicated or simply do not meet my needs: to be able to parallelize code with minimum effort. Therefore I developed a few higher-level routines that evolved into a set of 12, and with some copy-and-paste I created versions for integer, real, real*8 and complex data types, for 1D and 2D arrays, and for parts or entire arrays. The routines are:
init, end, barrier, time, split, toroot, toall, sumonroot, toallX, filtX, syncarray and two reductions.
With these I can parallelize all my applications (image and data processing) in a few minutes, without any MPI debugging. This library may be useful to others. I developed the Fortran version; a student is preparing a C version.
The general library (spmd1.f) may not be the fastest, but it works for an arbitrary number of CPUs and on any system with MPI. The other (spmd2.f) will be optimized for small Beowulfs of two dual-CPU systems (later four, i.e. 8 CPUs). Lower-level optimization (MPI-over-SCSI etc.) is in preparation. Have fun,
Hans du Buf

1. Introduction

Installing Linux and MPI on a Beowulf is rather trivial, but using MPI is not. Well, for those used to SMP systems with parallelization directives (SGI's C$DOACROSS or OpenMP on many platforms) it is a pain in the ass. All those MPI routines, a lot of programming overhead, plus a lot of debugging: instead of solving your problem you are fighting MPI...
There exist commercial tools for assisting a programmer with parallelization and for semi-automatic parallelization - we are still dreaming of automatic parallelization that optimizes performance. Hence, the Linux Beowulf community will have to live with MPI. True? NO!

In data-parallel programming, also known as SPMD (single program, multiple data), most programmers use a very limited set of MPI routines and often the same communication structures. Most are used to copy-and-paste MPI code blocks and to make changes according to the context, which still requires testing and debugging. This can be avoided if we have a tested set of high-level communication routines which do most standard tasks.
Nevertheless, programmers should know a bit about things like the cost of communications, grain size and data dependencies! (1) A quad-Pentium can deliver more than a gigaflop: each ms (millisecond) spent on communication means the loss of more than one million flops. (2) A sequence of many small loops can be parallelized but will not execute faster: it will be slower - see the previous point. The loops need to be really big (array sizes) and heavy in computation, or must be organized in one huge outer loop. (3) If a loop over an array depends on the exact sequence, e.g. from 1 to NLOOP, it cannot be parallelized unless the code is changed to remove the dependency. You might read some texts about parallel programming!
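As an illustration of point (3), a minimal sketch in plain Fortran (no spmd_ calls; the loop bodies are just examples): the first loop carries a dependency from iteration to iteration and cannot be split over CPUs, while the second does independent work per element and can.

```fortran
      integer dim
      parameter (dim=100000)
      real rarr(dim)
c     this recurrence reads the result of the previous iteration
c     and can therefore NOT be parallelized as it stands:
      do 100 i=2,dim
         rarr(i) = rarr(i) + rarr(i-1)
100   continue
c     this element-wise loop has no dependency between iterations
c     and CAN be split with spmd_split:
      do 200 i=1,dim
         rarr(i) = 2.0*rarr(i) + 1.0
200   continue
```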

2. The spmd_ library: MPI for the uninterested

We are dealing with SPMD: each processor is executing the same program with the same arrays and the same loops over the arrays. However, each CPU has an identifier (a unique number) that can be tested such that each CPU does a part of the work: if (myid.eq.0) then ... else ... end if.
LOOPS AND ARRAYS: this is the basic approach and nothing new, OpenMP has similar features. The only thing that the spmd_ library provides is a very small set of routines that (1) split a loop over arrays - or define a parallel region, and (2) copy (parts of) arrays before or after parallel loops over the arrays.
The spmd_ API hides all MPI stuff from the user. The user does not even need to know how MPI works. Hence: MPI for the uninterested.

3. How it works

Suppose we have one loop over an entire array:

      integer dim
      parameter (dim=100000)
      real rarr(dim)
      do 100 i=1,dim
c        ...work on rarr(i)...
100   continue
What we need is something like:

      integer dim
      parameter (dim=100000)
      real rarr(dim)
      include '/home/....../spmd.h'
      call spmd_init()
      call spmd_split(dim)
      do 100 i=spmd_mystart,spmd_myend     !parallel loop
c        ...work on rarr(i)...
100   continue
      call spmd_r1DPtoroot(rarr,dim)       !slaves send parts to root
      if (spmd_myid.eq.0) then
c        root does something
      end if
      call spmd_end()
This example may look stupid, but you can already guess what happens: spmd.h has definitions like spmd_mystart and puts these in a common block that is also used in spmd_split and spmd_toroot. In spmd_init MPI is initialized. Some parameters may not be directly visible (spmd_myid, spmd_nprocs) but can be used to perform specific tasks (root does something). In spmd_split, dim is divided by spmd_nprocs to determine spmd_mystart, _mypart and _myend on all CPUs; these remain available and unchanged until a new _split is done. If there are more loops of the same size, only one _split is necessary and each CPU "remembers" its own _mystart and _myend. Hence, in spmd_r1DPtoroot all processors with spmd_myid.ne.0 send their parts to root, which always has _myid.eq.0.
The number of required spmd routines is remarkably small, and you will need only two files: spmd.h and one of the spmd.f source files. At the moment there are only two spmd.f files: one that works on all Beowulfs with arbitrary numbers of processors, and another that has been optimized for a system of two dual-CPU boxes (4 CPUs), because communications inside the two boxes can be done in parallel (future: 8 CPUs). NOTE: THERE IS ABSOLUTELY NO TESTING IN THE ROUTINES - no ierrs from MPI. Either MPI works or something is physically wrong, like a cable or connector. The only test done in spmd_split is to check whether dim (example above) is a multiple of spmd_nprocs (set nprocs with mpirun -np 4 or whatever).

4. Subroutines

Described here are the Fortran routines; the C version is in preparation!

spmd_init(ierr) : initializes spmd and mpi
spmd_end(ierr) : quits mpi
spmd_time(rtime) : for timing (rtime is a normal real); call it twice and take the difference
spmd_split(nloop) : for splitting a loop over arrays (nloop is integer)

spmd_barrier(ierr) : to synchronize all CPUs. Note: normally all CPUs are synchronized in the communication routines. Only in cases where one or more CPUs risk entering a region before the proper data have been prepared must you call _barrier.
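A minimal sketch of _time and _barrier together (a hypothetical situation; the file name is made up, and the unit of rtime is whatever spmd_time returns): root prepares something the slaves need, all CPUs wait at the barrier, and root prints the elapsed time.

```fortran
      real t0,t1
      integer ierr
      call spmd_time(t0)
      if (spmd_myid.eq.0) then
c        root prepares data the slaves will read later (hypothetical)
         open(10,file='params.dat')
         write(10,*) 42
         close(10)
      end if
      call spmd_barrier(ierr)        !slaves must not proceed too early
      call spmd_time(t1)
      if (spmd_myid.eq.0) write(*,*) 'elapsed: ',t1-t0
```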

All routines are prefixed with spmd_ so it is possible to mix spmd routines with normal mpi ones. The other routines described below use the following naming convention included after the prefix:
i/r/d/c mean integer, real, real*8 and complex*8 (see Note below)
1D/2D mean arrays with one or two dimensions. I guess 3D must also be done
P/E mean only Parts of an array or an Entire array

Naming convention: spmd_[i/r/d/c][1D/2D][P/E][function](variables)
spmd_r2DPtoall(rarray,dimx,dimy) : root sends equal parts of a real 2D array to all slaves, except its own part
spmd_i1DEtoroot(iarray,dim) : root receives from all slaves their parts of an integer 1D array and assembles these after its own part.
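Putting the naming convention together, a sketch of a complete scatter-compute-gather cycle on a 1D real array (the routine names follow the convention above; the loop body is just an example):

```fortran
      integer dim
      parameter (dim=100000)
      real rarr(dim)
      include '/home/....../spmd.h'
      call spmd_init()
      call spmd_split(dim)
c     assume root has filled rarr before this point
      call spmd_r1DPtoall(rarr,dim)        !root sends parts to slaves
      do 100 i=spmd_mystart,spmd_myend     !each CPU does its own part
         rarr(i) = sqrt(abs(rarr(i)))
100   continue
      call spmd_r1DPtoroot(rarr,dim)       !slaves send parts back to root
      call spmd_end()
```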
Note: in the case of 2D arrays, Fortran and the cache prefer the outer loop over dimy and the inner loop over dimx. Only the outer loop must be split:

      integer    dimx,     dimy
      parameter (dimx=1000,dimy=100)
      real       rarr(dimx,dimy)
      call spmd_split(dimy)
      do 101 iy=spmd_mystart,spmd_myend     !parallel loop
      do 100 ix=1,dimx
c        ...work on rarr(ix,iy)...
100   continue
101   continue
      call spmd_r2DPtoroot(rarr,dimx,dimy)  !slaves send parts to root

Note on i/r/d/c: in most routines this is bogus, because MPI only needs to know the number of bytes per element. Implemented are r and d, for 4 and 8 bytes: i calls the r routine and c calls the d routine. The only exceptions are routines which compute something: sumonroot and the reductions (see below).

The other routines are:

toall : root (spmd_myid=0) sends to slaves, i/r/d/c and P/E
toroot : slaves send to root, i/r/d/c but P ONLY! Is there a need for E?
sumonroot : slaves send entire array to root which sums, i/r and E only
syncarray : all parts on all are sent to all, i/r/d/c but P only

All these routines have versions for 1D/2D. The sumonroot (normal integers or reals) can be used as a template to create other functions. This is the only function in which slaves send an entire array to root that I have ever needed. Actually, Esumonroot is an entire-array reduction, and we could develop routines Eredonroot and Eredonall that take a TYPE argument like SUM or MAX etc. Because arrays can be used in many iterations, also on root, I decided to make only one basic version that sums everything into another array called output:
spmd_i2DEsumonroot(output,input,dimx,dimy) : root sums all input arrays, its own and all those from the slaves. Appendix 2 of the mini manual illustrates another use of Esumonroot.
Another function that could be made is a "stackonroot" in which e.g. 2D arrays are put into a 3D one on root, but this could be done with a spmd_r3DPtoroot routine.
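A sketch of a typical Esumonroot use, a grey-value histogram built in parallel (the 1D integer variant follows the naming convention above; the histogramming itself is only indicated):

```fortran
      integer nbins,i
      parameter (nbins=256)
      integer hloc(nbins),hglob(nbins)
      do 100 i=1,nbins
         hloc(i) = 0
100   continue
c     ...each CPU histograms only its own part of the image
c     into hloc (split loop, omitted here)...
      call spmd_i1DEsumonroot(hglob,hloc,nbins)  !root sums all local
c                                                !histograms into hglob
```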

Sometimes each CPU needs access to neighbouring data that its "neighbours" have prepared. Instead of updating all data on all CPUs, which costs a lot of communication, there is one routine to exchange only these data (note: arrays are not considered cyclic, e.g. the first CPU (id=0) will only get data from the second, not from the last CPU; for cyclic processing a special routine can easily be created). One application is image processing and filtering in the spatial domain with a certain mask size: suppose 4 CPUs, each of which holds one quarter of an image. If we use a filter with mask size 5x5, each CPU needs 2 more lines beyond its _mystart and _myend (2x2=4 lines), except the first and last CPUs (1x2=2 lines). This is necessary for multiple filter iterations, but also for the first iteration. These are:

filtX : eXchange elements or lines, example spmd_r2DPfiltX(rarr,dimx,dimy,2)
toallX : root sends parts plus additional elements or lines (combines toall and filtX for a first filter iteration): spmd_r2DPtoallX(rarr,dimx,dimy,2).
These are available in i/r/d/c and 1D/2D, but P only.
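A sketch of iterative filtering with toallX and filtX for a 5x5 mask (2 extra lines on each side), following the calls shown above; the filter itself is only indicated, and in practice you would filter into a work array and copy back:

```fortran
      integer dimx,dimy,iter,niter
      parameter (dimx=512,dimy=512,niter=10)
      real image(dimx,dimy)
      call spmd_split(dimy)
      call spmd_r2DPtoallX(image,dimx,dimy,2)    !root sends parts plus
c                                                !2 extra lines each side
      do 200 iter=1,niter
c        ...filter lines spmd_mystart..spmd_myend of image with the
c        5x5 mask, reading up to 2 lines beyond the local part...
         call spmd_r2DPfiltX(image,dimx,dimy,2)  !refresh the border lines
200   continue
      call spmd_r2DPtoroot(image,dimx,dimy)      !root assembles the result
```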


These routines do only the communications, i.e. a maximum or sum must be computed in a split loop, after which the globals are determined from the locals with one (or more) reduction calls.

RR : reduction on Root (only root gets global maximum (example))
RA : reduction on All (global maximum goes to all)
m : max and min
s : sum and sum of squares
These come in three flavors: i/r/d for integer, real and real*8
Examples: spmd_iRAm(max,min), spmd_dRRs(sum1,sum2)
Note: sending one or two parameters takes the same time. Hence, if you only want a maximum (minimum), put in a dummy min (max) with an arbitrary value. Above I wrote "sum of squares" for loops that compute a variance. You can reduce two arbitrary sums or only one (using one dummy). The routines can be used as templates to develop other reductions. As in the case of Esumonroot, we should develop general RRg and RAg routines that take a TYPE argument like SUM or MAX etc...
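A sketch of a reduction: each CPU scans its own part for the local extremes, then one _RAm call makes them global on all CPUs (variable names are my own; assume iarr has been filled on all CPUs and spmd_split(dim) has been called):

```fortran
      integer dim,i,imax,imin
      parameter (dim=100000)
      integer iarr(dim)
c     local extremes over this CPU's part of the split loop
      imax = -2147483647
      imin =  2147483647
      do 100 i=spmd_mystart,spmd_myend
         if (iarr(i).gt.imax) imax = iarr(i)
         if (iarr(i).lt.imin) imin = iarr(i)
100   continue
      call spmd_iRAm(imax,imin)      !now imax/imin are global on all CPUs
```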

Question: do we really need more reductions? I never use more than this...
Please keep in mind that it is very easy to do other reductions, see Appendix 3 of the mini manual.

5. Getting started

1. MPI working and tested (mpif77 and mpirun)
2. Copy spmd.h in your ~mydir/spmd/
3. Copy either spmd1.f or spmd2.f (in preparation...) in your working dir or in /spmd/
Note 1: spmd1.f (version being tested; you can take a look at it) works for arbitrary numbers of processors but is slower than spmd2.f. spmd2.f will be optimized for a system of two dual-CPU boxes (4 CPUs) to take advantage of parallel communications within each box. Check that the MPI config file has 4 lines: two with the name of the first box and two with the name of the second box. This way root has id 0 and its "neighbour" id 1, whereas the CPUs in the other box have ids 2 and 3.
Note 2: UALG users on the "ping-pong" beowulf do not need spmd.h, do not need to change spmd1.f, and can compile linking the file ~dubuf/mpi/spmd/spmd1.o
4. Optional: mv spmd1.f spmd.f
5. Optional: take a look at the spmd.h and spmd.f files
6. Edit the spmd.f file (only the paths in the include statements; takes 10 minutes...)
7. Compile: mpif77 -c -O3 spmd.f (forget about all stupid warnings)
8. Optional: create a lib file and put it in your libpath
9. Edit one of your sources, compile and run (mpirun -np 4 ./name)
10. Don't forget to compare results with those from the serial code.

To illustrate the application to small-grain image processing, you can take a look at the serial code spmd_segm1.f and the parallel version spmd_segm2.f.
A large-grain example is spmd_vis.f, in which a loop over 96 FFTs etc. has been parallelized (this is actually my benchmarking program; see the bench file).

You are strongly advised to make a few small programs and to play a bit with the different routines before parallelizing one of your complicated sources!

6. What's next

TCP/IP-over-SCSI is being debugged and then we need a much faster MPI-over-SCSI with routing tables such that communications inside two duals will be done by memcopy.
In parallel we can work on a very small set of directives and a parser that translates these into spmd_ calls. This would be the highest level that we can achieve on the way to an intelligent preprocessor that also optimizes the number of CPUs to be used for different loops.

7. Feedback etc

For all questions etc please contact Hans du Buf



Last update: July 2001, HdB.