IF (you have a dual-CPU box .AND. you do not intend to use a bigger system) THEN
better use a compiler with OpenMP support
ELSE
read on, you risk saving a lot of time
END IF
You can download this mini manual in ASCII; read it carefully, including the examples provided at the end of "getting started".
Features:
* an SPMD programming tool (parallelization library) at no cost
* can be used on any system that is equipped with MPI
* can be optimized for specific interconnects
* extremely simple to use - it takes only minutes to parallelize code
* much easier to learn than MPI and even BSP (bulk-synchronous parallel)
* almost completely tested; no more MPI debugging
* for splitting loops, array communications and reductions
* only 12 basic routines, some of which you won't need
* for integer, real, real*8 and complex data types
* for 1D and 2D arrays (3D easy to make)
* has special communications for iterative 1D and 2D filtering
* only 2 files and almost no installation required
Brief: I have been looking for higher-level MPI tools that would make my
life easier, but they are either too complicated or simply don't meet my needs: to parallelize code with minimum effort. I therefore developed a few higher-level routines that evolved into a set of 12, and with
some copy-and-paste I created versions for integer, real, real*8 and complex
data types, 1D and 2D arrays, plus parts or entire arrays. The routines
are:
init, end, barrier, time, split, toroot, toall, sumonroot, toallX, filtX,
syncarray and two reductions.
With these I can parallelize all my applications (image and data processing)
in a few minutes, without any MPI debugging. This library may be useful to
others. I developed the Fortran version; a student is preparing a C version.
The general library (spmd1.f) may not be the fastest but works for an
arbitrary number of CPUs and on any system with MPI. The other (spmd2.f) will be optimized for small Beowulfs made of two dual-CPU boxes (later four boxes, i.e. 8 CPUs). Lower-level optimization (MPI-over-SCSI etc.) is in preparation. Have fun,
fun,
Hans du Buf
Installing Linux and MPI on a Beowulf is rather trivial, but using MPI is
not. Well, for those used to SMP systems with parallelization directives
(SGI's C$DOACROSS or OpenMP on many platforms) it is a pain in the ass. All
those MPI routines, a lot of programming overhead, plus a lot of debugging:
instead of solving your problem you're fighting MPI...
There are commercial tools that assist the programmer in parallelization or do it semi-automatically - we are still dreaming of automatic parallelization that also optimizes performance. Hence, the Linux Beowulf
community will have to live with MPI. True? NO!
In data-parallel programming, also known as SPMD (same program, multiple data), most programmers use a very limited set of MPI routines and often the same communication structures. Most are used to copying and pasting MPI code blocks and making changes according to the context, which still requires testing and
debugging. This can be avoided if we have a tested set of high-level
communication routines which do most standard tasks.
HENCE OUR AIM IS TO PROVIDE AN API WHICH IS EASY TO USE AND WHICH REQUIRES A
MINIMUM PROGRAMMING OVERHEAD TO PARALLELIZE EXISTING OR NEW CODE: THE SPMD_
LIBRARY.
Nevertheless, programmers should know a bit about things like the cost of
communications, grain size and data dependencies! (1) A quad-Pentium can
deliver more than a gigaflop: each ms (millisecond) spent on communications means the loss of more than one million flops. (2) A sequence of many small loops can be parallelized but will not execute faster: it will be slower - see the previous point. The loops need to be really big (array sizes) and heavy in computation, or must be organized in one huge outer loop. (3) If a loop over an
array depends on the exact sequence, e.g. from 1 to NLOOP, it cannot be
parallelized unless the code is changed to remove the dependency. You might
read some texts about parallel programming!
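For example (a minimal illustration, not taken from the library; a, b and nloop are arbitrary names):

c     this loop can be split: every iteration is independent
      do 100 i=1,nloop
        a(i)=2.0*b(i)
 100  continue
c     this loop cannot be split as it stands: iteration i needs a(i-1),
c     so the exact sequence from 1 to nloop matters
      do 200 i=2,nloop
        a(i)=a(i-1)+b(i)
 200  continue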
We are dealing with SPMD: each processor is executing the same program with
the same arrays and the same loops over the arrays. However, each CPU has an
identifier (a unique number) that can be tested such that each CPU does a part
of the work: if (myid.eq.0) then ... else ... end if.
LOOPS AND ARRAYS: this is the basic approach and nothing new, OpenMP has
similar features. The only thing that spmd_ lib provides is a very small set
of routines that (1) split a loop over arrays - or define a parallel region,
and (2) copy (parts of) arrays before or after parallel loops over the
arrays.
The spmd_ API hides all MPI stuff from the user. The user does not even need
to know how MPI works. Hence: MPI for the uninterested.
Suppose we have one loop over an entire array:
      integer dim
      parameter (dim=100000)
      real rarr(dim)
      do 100 i=1,dim
        rarr(i)=...
 100  continue

What we need is something like:
      integer dim
      parameter (dim=100000)
      real rarr(dim)
      include '/home/....../spmd.h'
      ...
      call spmd_init()
      ...
      call spmd_split(dim)
      do 100 i=spmd_mystart,spmd_myend    !parallel loop
        rarr(i)=...
 100  continue
      call spmd_r1DPtoroot(rarr,dim)      !slaves send parts to root
      if (spmd_myid.eq.0) then
        root does something
      end if
      ...
      call spmd_end()

This example may look stupid, but you may already guess what happens: spmd.h has definitions like spmd_mystart and puts these in a common block that is also used in spmd_split and spmd_toroot. In spmd_init MPI is initialized. Some parameters may not be directly visible (spmd_myid, spmd_nprocs) but can be used to perform specific tasks (root does something). In spmd_split, dim is divided by spmd_nprocs to determine spmd_mystart, _mypart and _myend on all CPUs, and these remain available and unchanged until a new _split is done. If there are more loops with the same size, only one _split is necessary and each CPU "remembers" its own _mystart and _myend. Hence, in spmd_r1DPtoroot all processors with spmd_myid.ne.0 send their parts to root, which always has _myid.eq.0.
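To make the effect of spmd_split concrete: with dim=100000 and 4 CPUs, each CPU presumably gets a block of 25000 elements (how a remainder is treated is an implementation detail of spmd1.f):

CPU 0 (root): spmd_mystart=1      spmd_myend=25000
CPU 1:        spmd_mystart=25001  spmd_myend=50000
CPU 2:        spmd_mystart=50001  spmd_myend=75000
CPU 3:        spmd_mystart=75001  spmd_myend=100000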
Described here are the Fortran routines; the C version is in preparation!
spmd_init(ierr) : initializes spmd and mpi
spmd_end(ierr) : quits mpi
spmd_time(rtime) : for timing (rtime is a normal real) to be called twice
spmd_split(nloop) : for splitting a loop over arrays (nloop is integer)
spmd_barrier(ierr) : to synchronize all CPUs. Note: normally all CPUs are synchronized in the communication routines. Call _barrier only when one or more CPUs risk entering a region before the proper data have been prepared.
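For example, a parallel loop can be timed by calling spmd_time twice; a minimal sketch (it assumes that the difference of the two readings is the elapsed time, and t0, t1 and ierr are just local variables):

      integer dim, i, ierr
      parameter (dim=100000)
      real rarr(dim), t0, t1
      ...
      call spmd_barrier(ierr)        !all CPUs start together
      call spmd_time(t0)             !first reading
      call spmd_split(dim)
      do 100 i=spmd_mystart,spmd_myend  !parallel loop
        rarr(i)=sqrt(real(i))
 100  continue
      call spmd_r1DPtoroot(rarr,dim) !slaves send parts to root
      call spmd_time(t1)             !second reading
      if (spmd_myid.eq.0) write(*,*) 'elapsed time:', t1-t0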
All routines are prefixed with spmd_ so it is possible to mix spmd routines
with normal mpi ones. The other routines described below use the following
naming convention included after the prefix:
i/r/d/c mean integer, real, real*8 and complex*8 (see Note below)
1D/2D mean arrays with one or two dimensions. I guess 3D must also be done
P/E mean only Parts of an array or an Entire array
Naming convention: spmd_[i/r/d/c][1D/2D][P/E][function](variables)
Examples:
spmd_r2DPtoall(rarray,dimx,dimy) root sends equal parts of a real 2D array to
all slaves, except his own part
spmd_i1DEtoroot(iarray,dim) root receives from all slaves their parts of an
integer 1D array and assembles these after his own part.
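Put together, a typical round trip looks like this (a minimal sketch; spmd_r1DEtoall and spmd_r1DPtoroot are simply the names the convention above produces for a real 1D entire-array broadcast and a parts-to-root gather):

      integer dim, i
      parameter (dim=100000)
      real rarr(dim)
      ...
      if (spmd_myid.eq.0) then       !root prepares the input data
        do 100 i=1,dim
          rarr(i)=real(i)
 100    continue
      end if
c     root sends the entire array to all slaves
      call spmd_r1DEtoall(rarr,dim)
      call spmd_split(dim)
      do 200 i=spmd_mystart,spmd_myend  !parallel loop
        rarr(i)=2.0*rarr(i)
 200  continue
      call spmd_r1DPtoroot(rarr,dim) !slaves send parts to root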
Note: in the case of 2D arrays, Fortran and the cache prefer an outer loop over dimy and an inner loop over dimx. Only the outer loop must be split:
      integer dimx, dimy
      parameter (dimx=1000,dimy=100)
      real rarr(dimx,dimy)
      ...
      call spmd_split(dimy)
      do 101 iy=spmd_mystart,spmd_myend   !parallel loop
        do 100 ix=1,dimx
          rarr(ix,iy)=...
 100    continue
 101  continue
      call spmd_r2DPtoroot(rarr,dimx,dimy)  !slaves send parts to root
Note on i/r/d/c: in most routines this distinction is bogus because MPI only needs to know the number of bytes per element. Implemented are r and d, for 4 and 8 bytes: i calls the r routine and c calls the d one. The only exceptions are routines which compute something: sumonroot and the reductions (see below).
The other routines are:
toall : root (spmd_myid=0) sends to slaves, i/r/d/c and P/E
toroot : slaves send to root, i/r/d/c but P ONLY! Is there a need for E?
sumonroot : slaves send entire array to root which sums, i/r and E only
syncarray : all parts on all are sent to all, i/r/d/c but P only
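For example with syncarray (a sketch, using the name spmd_r1DPsyncarray that the naming convention above produces): after a split loop each CPU only has its own part up to date; one call makes the whole array consistent on all CPUs, so that the next loop may read any element.

      call spmd_split(dim)
      do 100 i=spmd_mystart,spmd_myend  !update only own part
        rarr(i)=rarr(i)+1.0
 100  continue
      call spmd_r1DPsyncarray(rarr,dim) !all parts are sent to all CPUs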
All these routines have versions for 1D/2D. The sumonroot (normal integers
or reals) can be used as a template to create other functions. This is the only function in which slaves send an entire array to root that I have ever needed. Actually, Esumonroot is an entire-array reduction, and we could develop
routines Eredonroot and Eredonall that have a TYPE argument like SUM or MAX
etc. Because arrays can be used in many iterations, also on root, I decided to
make only one basic version that sums all into another array called
output:
spmd_i2DEsumonroot(output,input,dimx,dimy) root sums all input arrays, his own and all those from the slaves. Appendix 2 of the mini manual illustrates another use of Esumonroot. We could actually extend Esumonroot into such a general Eredonroot/Eredonall reduction with a TYPE argument.
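A typical use (a minimal sketch; the values are only an illustration): each CPU fills its own input array, and root obtains the element-wise total.

      integer dimx, dimy, ix, iy
      parameter (dimx=256,dimy=256)
      integer input(dimx,dimy), output(dimx,dimy)
      ...
      do 101 iy=1,dimy
        do 100 ix=1,dimx
          input(ix,iy)=spmd_myid+1   !each CPU's own partial result
 100    continue
 101  continue
      call spmd_i2DEsumonroot(output,input,dimx,dimy)
c     on root, every element of output now equals 1+2+...+spmd_nprocs
      if (spmd_myid.eq.0) write(*,*) output(1,1)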
Another function that could be made is a "stackonroot" in which e.g. 2D arrays
are put into a 3D one on root, but this could be done with a spmd_r3DPtoroot
routine.
Sometimes each CPU needs access to neighbouring data that its "neighbours" have prepared. Instead of updating all data on all CPUs, which costs a lot of communication, there is one routine to exchange only these data (note: arrays are not considered cyclic, e.g. the first CPU (id=0) will only get data from the second, not from the last CPU; for cyclic processing a special routine can easily be created). One application is image processing and filtering in the spatial domain with a certain mask size: suppose 4 CPUs which each have 1/4 of an image. If we use a filter with a 5x5 mask, each CPU needs 2 more lines beyond its _mystart and _myend (2x2=4 lines), apart from the first and last CPUs (1x2=2 lines). This is necessary for multiple filter iterations but also for the first one. The routines are:
filtX : eXchange elements or lines, example
spmd_r2DPfiltX(rarr,dimx,dimy,2)
toallX : root sends parts plus additional elements or lines (combines toall
and filtX for a first filter iteration):
spmd_r2DPtoallX(rarr,dimx,dimy,2).
These are available in i/r/d/c and 1D/2D, but P only.
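An iterative 2D filtering loop could then look like this (a sketch only; it assumes a 5x5 mask, so 2 extra lines are exchanged, and filter5x5 is a hypothetical user routine that filters the lines spmd_mystart..spmd_myend in place):

      integer dimx, dimy, niter, iter
      parameter (dimx=1000,dimy=100,niter=10)
      real rarr(dimx,dimy)
      ...
      call spmd_split(dimy)
c     root sends parts plus 2 extra lines (combines toall and filtX)
      call spmd_r2DPtoallX(rarr,dimx,dimy,2)
      do 200 iter=1,niter
c       exchange the 2 border lines before every iteration but the first
        if (iter.gt.1) call spmd_r2DPfiltX(rarr,dimx,dimy,2)
c       filter5x5 is a hypothetical user routine
        call filter5x5(rarr,dimx,dimy,spmd_mystart,spmd_myend)
 200  continue
      call spmd_r2DPtoroot(rarr,dimx,dimy)  !slaves send parts to root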
Reductions:
These routines only do the communication, i.e. a maximum or sum must be computed in a split loop, after which the globals are determined from the locals with one (or more) reduction calls.
RR : reduction on Root (only root gets global maximum (example))
RA : reduction on All (global maximum goes to all)
m : max and min
s : sum and sum of squares
These come in three flavors: i/r/d for integer, real and real*8
Examples: spmd_iRAm(max,min), spmd_dRRs(sum1,sum2)
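For example, a global maximum and minimum over a split loop (a minimal sketch; it assumes that the arguments hold the local extrema on entry and the global ones on return):

      integer dim, i, gmax, gmin
      parameter (dim=100000)
      integer iarr(dim)
      ...
      call spmd_split(dim)
      gmax=-2000000000      !local extrema of this CPU's part
      gmin= 2000000000
      do 100 i=spmd_mystart,spmd_myend
        if (iarr(i).gt.gmax) gmax=iarr(i)
        if (iarr(i).lt.gmin) gmin=iarr(i)
 100  continue
      call spmd_iRAm(gmax,gmin)   !globals now on all CPUs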
Note: sending one or two parameters takes the same time. Hence, if you want
only a maximum (minimum), put a dummy min (max) with an arbitrary value. Above I wrote "sum of squares" because such loops are used to compute a variance. You can reduce two arbitrary sums, or only one (using one dummy). The routines can be used as
templates to develop other reductions. As for the case of Esumonroot, we
should develop general RRg and RAg routines that take a TYPE argument like SUM
or MAX etc...
Question: do we really need more reductions? I never use more than
this...
Please keep in mind that it is very easy to do other reductions, see Appendix
3 of the mini manual.
1. MPI working and tested (mpif77 and mpirun)
2. Copy spmd.h into your ~mydir/spmd/
3. Copy either spmd1.f or spmd2.f (in preparation...) into your working dir or into /spmd/
Note 1: spmd1.f (version being tested; you can take a look at it) works for arbitrary numbers of processors but is slower than spmd2.f. spmd2.f will be optimized for a system of two dual-CPU boxes (4 CPUs) to take advantage of the parallelism of communications within a box. Check that the MPI config file has
4 lines: two with the name of the first box and two with the name of the
second box. This way root has id 0 and its "neighbour" 1, whereas the CPUs in
the other box have 2 and 3.
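For example, a machines file for two dual-CPU boxes called node1 and node2 (hypothetical host names) would simply contain:

node1
node1
node2
node2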
Note 2: UALG users on the "ping-pong" beowulf do not need spmd.h, do not need
to change spmd1.f, and can compile by linking the file ~dubuf/mpi/spmd/spmd1.o
4. Optional: mv spmd1.f spmd.f
5. Optional: take a look at the spmd.h and spmd.f files
6. Edit the spmd.f file (only the paths in the include statements; takes 10 minutes...)
7. Compile: mpif77 -c -O3 spmd.f (forget about all stupid warnings)
8. Optional: create a lib file and put it in your libpath
9. Edit one of your sources, compile and run (mpirun -np 4 ./name)
10. Don't forget to compare results with those from the serial code.
To illustrate the application to small-grain image processing, you can take
a look at the serial code spmd_segm1.f and the
parallel version spmd_segm2.f.
A large-grain example is spmd_vis.f, in which a loop over 96 FFTs etc. has been parallelized (this is actually my benchmarking program; see the bench file).
You are strongly advised to make a few small programs and to play a bit with the different routines before parallelizing one of your complicated sources!
TCP/IP-over-SCSI is being debugged; after that we need a much faster MPI-over-SCSI with routing tables, such that communications inside the two dual-CPU boxes are done by memcopy.
In parallel we can work on a very small set of directives and a parser that
translates these into spmd_ calls. This would be the highest level that we can
achieve on the way to an intelligent preprocessor that also optimizes the
number of CPUs to be used for different loops.
For all questions etc please contact Hans du Buf
Last update: July 2001, HdB.