SPMDlib: THE ZEN OF DATA-PARALLEL MPI PROGRAMMING (in Fortran) Or: MPI for the uninterested Very brief manual ================= SPMD (single program multiple data) programming involves attributing specific tasks to all processors in a cluster or farm. The most common application is to parallelise loops over arrays and occasionally to send arrays or parts of arrays from one processor to another, although the work to be done by each CPU can be completely different if there are independent serial code blocks. SPMDlib uses the master-slave model in which the master (also called root) distributes and collects data. The master also contributes equally to the work to be done. The goal of the SPMDlib is to hide all MPI communication calls from the user and to be able to parallelise code with a minimum effort and as fast as possible. In other words, the user can concentrate completely on solving his problems without any debugging of MPI communication structures. The user does not even need to know how MPI works... In the future we may develop a small set of directives together with a parser or preprocessor that translates directives into SPMD calls. A related project concerns the reduction of communication times by using cheap but fast SCSI boards (TCP/IP-over-SCSI) and to eliminate most communication overheads (MPI-over-SCSI). The latter means that MPI calls are almost directly layered above the SCSI services in order to short- cut all TCP/IP stacks. Example: communications between two CPUs in a dual-CPU system is only a factor 2.6 faster than using a 100Mb/s ethernet cable... Nevertheless, it is better to use a Beowulf consisting of dual-CPU systems because communications can be partly done simultane- ously. But all this will be hidden from the user; he or she will only notice faster communication times. Design philosophy ================= There is not really a philosophy because the library evolved following my needs. Being used to SGI's C$DOACROSS directive (also OpenMP) and the need to structure code such that most work is done in huge or heavy loops, new hardware consisting of off-the-shelf Linux boxes confronted me with MPI. But previous experience with a small Parsytec system and routines that are similar to MPI led me think about ways to parallelize code more easily, as if we had the compiler -mp processor with communication routines but not a DOACROSS directive (well, still thinking about a parser that automatically translates directives into SPMDlib calls, but I need one or two students for this). But before reaching this highest possible level, the target is to hide all MPI communications and to have a library that is even more easy to use than Oxford's BSP (bulk-synchronous parallel), without fancy tools like profilers and debuggers. The reason is very simple: once you need such tools you are already heading in the wrong direction. Grain size should be large and you should think twice before doing communications (every MILLIsecond spent with communications you loose more than one Mflop on a two dual-PIII 450MHz Beowulf...). In normal applications you should not paral- lelize the transposition or an FFT of a 2D array, let alone parallelizing a simple reduction, unless you are sure that you gain time. In summary, the goals are: * extreme simplicity, after directives (OpenMP) the fastest way to //ize * easy to memorize routine names, minimum programming involved * concentrate on your processing instead of waisting your time with all fancy MPI features which might reduce CPU times a few percent * functionality at the directive level: splitting loops, array communications (entire or parts, do not be concerned about addressing) and reductions * data-parallel applications, no math library (ScalaPack) * to be used on any system with MPI - specific versions will be optimized for small dual-CPU Beowulfs * specific features for 1D and 2D (iterative) filtering or other neighborhood operations * in short: get your Beowulf to work for you instead of being its slave 1. How it works =============== SPMD implies that each CPU in the cluster is executing exactly the same program, although each CPU has a unique identifier that can be tested in order to determine it's part of the work. Suppose we have four CPUs avail- able. By including the file spmd.h, which contains a few declarations and a common block, after calling spmd_init() the following variables have been initialised: spmd_nprocs - the number of CPUs (4 in this case) spmd_myid - the id of each CPU (from 0 to 3) if the program is run with "mpirun -np 4". You can run the same program with "mpirun -np 8", in which case spmd_nprocs will be 8 and the id's will be 0 to 7, but the program will not run faster because each CPU will simulate 2 CPUs. There are more variables (see spmd.h) but in priciple the user does not need to know these. If there are 4 independent serial code blocks, these could be executed in parallel by using if statements: if (spmd_myid.eq.0) then ... cpu 0 does something ... else if (spmd_myid.eq.1) then ... cpu 1 does something ... etc ... end if after which some communications should be done because each CPU has done something but the others don't know about results (unless each CPU prints something or opens a file). It is in principle possible to execute simultaneously two independent serial code blocks using two CPUs for each, and to parallelise loops inside the blocks using the two CPUs, but it may be better to wait until we have devel- oped spmd tools for this (in preparation). See also Appendix 2 if your com- piler has a case statement. The most common application is to parallelise loops, i.e. each CPU will do part of the loop simultaneously. Example: do 100 i=1,1000 becomes call spmd_split(1000) do 100 i=spmd_mystart,spmd_myend In the _split routine three variables are set: spmd_mypart (=1000/4) and each CPU will get its own spmd_mystart and _myend, i.e. the first gets 1 and 250, the second 251 and 500, etc. It is important that 1000 is a multiple of the number of CPUs and that, if a loop variable is used, this is an integer (integer nloop; nloop=1000; do i=1,nloop). (TO BE DONE: developing a _split2(start,end,stride) for general loops over arrays, e.g. do i=-1000,1000,10 using _mystart, _myend and _stride, but is there really a need for this?) Normally a loop will involve processing one or more arrays such that all array elements are done, before or after which arrays or parts of arrays must be sent to one or all CPUs. The SPMD API looks as if it uses a virtual star topology with one root processor (id=0) and slaves (id.ne.0). Well, it does normally because there are routines like "toroot" and "toall" for "all send to root" and "root sends to all", but other routines like "syncarray" and "reduce on all" may not follow this strategy because of a lower performance (this is hidden from the user). All SPMD routines are described below. Examples and "getting started" are given and described in the SPMD webpages. Note: some routines look like returning an error number (ierr) but this will not be done because there is absolutely no testing in the routines. Either MPI (and SPMD) works or there is something wrong with a cable or connector. After calling spmd_init() you may include a line like print*,spmd_myid,' is working in the set of ',spmd_nprocs or you can have a very small test program that only does this (see Appen- dix 1). 2. Fortran routines =================== -> spmd_init(ierr) initializes spmd and mpi; must be called before calling any other spmd routine; must include integer ierr -> spmd_end(ierr) quits mpi; must be called after the last communications and before stop/end; must include integer ierr -> spmd_time(rtime) for timing parts of programs; rtime is a normal real; to be called twice (time1, time2) after which time2-time1 is the number of seconds -> spmd_split(nloop) for splitting a loop (over arrays); nloop is a normal integer; NOTE: needs to be called only once if all loops have the same size and all arrays have the same size (of the last dimension). Needs to be called every time that a differently-sized loop or array is processed -> spmd_barrier(ierr) to synchronize all CPUs. Note: normally all CPUs are synchronized in the communication routines. Only in cases where one or more CPUs risk to enter a region without proper data having been prepared, you must call _barrier. Must include integer ierr All routines are prefixed with spmd_ so it is possible to mix spmd calls with normal mpi ones. The other routines described below use the following naming convention after the prefix: -> spmd_[i/r/d/c][1D/2D][P/E][function](variables) -> i/r/d/c mean integer, real, real*8 and complex*8 (see Note below) -> 1D/2D mean arrays with one or two dimensions. (3D must still be done) -> P/E mean only Parts of an array or an Entire array In the case of calling a P(arts) routine, spmd_split must have been called before in order to determine the array parts (of the last array dimension). If all loops have the same number of iterations and the arrays the same size, _split needs only be called once. If the number of iterations and the array size changes, you need to call _split again! Examples: spmd_r2DPtoall(rarray,dimx,dimy) root sends equal parts of a real 2D array to all slaves, except his own part spmd_i1DEtoroot(iarray,dim) root receives from all slaves their parts of an integer 1D array and assembles these after his own part. Note: in the case of 2D arrays Fortran and the cache prefer an outer loop to do dimy and an inner to do dimx. Only the outer must be split: integer dimx, dimy, ierr parameter (dimx=1000,dimy=100) real rarr(dimx,dimy) include '/.../spmd.h' ... call spmd_init(ierr) ... call spmd_split(dimy) do 101 iy=spmd_mystart,spmd_myend !parallel loop do 100 ix=1,dimx rarr(ix,iy)=... 100 continue 101 continue call spmd_r2DPtoroot(rarr,dimx,dimy) !slaves send parts to root ... Note on i/r/d/c: in most routines this is bogus because MPI only needs to know the number of bytes per element. Implemented are r and d for 4 and 8 bytes: i calls the r routine and c calls the d. The only exceptions are routines which compute something: sumonroot and the reductions (see below). All the functions are: -> toall root (spmd_myid=0) sends to slaves; i/r/d/c, 1D/2D and P/E -> toroot slaves send to root; i/r/d/c, 1D/2D but P ONLY! Is there a need for E? -> sumonroot slaves send entire array to root which sums; i/r, 1D/2D and E only -> syncarray all parts on all are sent to all; i/r/d/c, 1D/2D but P only The sumonroot (normal integers or reals) can be used as a template to create other functions. This is the only function, in which slaves send an entire array to root, that I ever needed for histograms, Hough transforms etc. Because arrays can be used in many loops, also on root, I decided to make only one basic version that sums all into another array called output: spmd_i2DEsumonroot(output,input,dimx,dimy) root sums all input arrays, his own and all from the slaves, into output. Don't forget to (a) declare an extra array or use one "spare" with the same dimensions and (b) to initialize output with zero. Appendix 2 illustrates another use of Esumonroot. To be done: create entire-array reductions with a TYPE argument (MAX, SUM, AND etc). Another function that could be made is a "stackonroot" in which e.g. 2D arrays are put into a 3D one on root, but this could be done with a spmd_r3DPtoroot routine. Sometimes each CPU needs to have access to neighbouring data that its "neighbours" have prepared. Instead of updating all data on all CPUs, which costs a lot of communications, there is one routine to exchange only these data (note: arrays are not considered to be cyclic, e.g. the first CPU (id=0) will only get data from the second, not from the last CPU; for cyclic processing a special routine can be easily created). One application is image processing and filtering in the spatial domain with a certain mask size: suppose 4 CPUs which have each 1/4th of an image. If we use a filter with mask size 5x5, each CPU needs 2 more lines beyond its _mystart and _myend (2x2=4 lines), apart from the first and last CPU (1x2=2 lines). This is necessary for multiple filter iterations but also for the first iteration. These functions are: -> filtX eXchange array elements (1D) or complete lines (2D), for example spmd_r2DPfiltX(rarr,dimx,dimy,2) exchanges two lines between all "neighbouring" CPUs. The function has a P in its name to remember that CPUs are dealing with parts. -> toallX root sends the normal array parts plus additional elements or lines (combines toall and filtX for a first filter iteration), for example: spmd_r2DPtoallX(rarr,dimx,dimy,2) These are available in i/r/d/c and 1D/2D, but P only. Obviously, spmd_split must have been called before calling these. 3. Reductions ============= These routines do only the communications, i.e. a maximum or sum needs to be done with a splitted loop after which globals are determined from the locals with one (or more) reduction call. Naming convention: -> RR reduction on Root (only root gets global maximum (for example)) -> RA reduction on All (global maximum goes to all) -> m max and min -> s sum and sum of squared These come in three flavors: i/r/d for integer, real and real*8 Examples: spmd_iRAm(max,min), spmd_dRRs(sum1,sum2) Note: sending one or two parameters takes the same time. Hence, if you want only a maximum (minimum), put a dummy min (max) with an arbitrary value. Do not try spmd_rRAm(rmax,rmax) Above I wrote "sum of squared" for loops to compute a variance. You can reduce two arbitrary sums or only one (using one dummy). The routines can be used as templates to develop other reductions. However, see Appendix 3 for using existing routines in other reduction types. As for the sumonroot case, we can develop general RRg and RAg routines that take a TYPE argument (SUM, MAX, AND etc). 4. Getting started ================== 1. MPI working and tested (mpif77 and mpirun; see below for test program) 2. Copy spmd.h in your ~mydir/spmd/ 3. Copy either spmd1.f or spmd2.f (in preparation...) in your working dir or in /spmd/ Note 1: spmd1.f (version being tested and further optimized; you can take a look at it) works for arbitrary numbers of processors but is slower than spmd2.f. The spmd2.f will be optimized for a two dual-CPU system with 4 CPUs to take advantage of parallelism of communications within a box. Check that the MPI config file has 4 lines: two with the name of the first box and two with the name of the second box. This way root has id 0 and its "neighbour" 1, whereas the CPUs in the other box have 2 and 3. Note 2: UALG users on the "ping-pong" Beowulf do not need spmd.h, do not need to change spmd1.f, and can compile linking the file ~dubuf/mpi/spmd/spmd1.o 4. Facultative: mv spmd1.f spmd.f 5. Facultative: take a look at the spmd.h and spmd.f files 6. Edit the spmd.f file (only all paths in include statements; takes about 15 minutes...) 7. Compile: mpif77 -c -O3 spmd.f (forget about all stupid warnings) 8. Facultative: create a lib file and put in libpath 9. Edit one of your sources, compile and run (mpirun -np 4 ./name) 10. Don't forget to compare results with those from the serial code. If you use gnu's f77, also experiment with -O6 and other compiler options. To illustrate the application to small-grain image processing, you can take a look at the serial code spmd_segm1.f and the parallel version spmd_segm2.f A large-grain example is spmd_vis.f in which a loop over 96 FFTs etc has been parallelised (this is actually my benchmarking program). You are strongly advised to make a few small programs and to play a bit with the different routines before parallelizing one of your complicated sources! Hans du Buf (dubuf@ualg.pt) Univ. of Algarve, June 2001 ************************************************************************ Disclaimer: this is free code, hence absolutely no warranty etc. If you find bugs or if you have specific wishes or ideas: please email. ************************************************************************ Appendix 1 Test program, only to test if MPI and hardware works (do not call "test" because there is a Unix/Linux test command) program test integer ierr include '/...path.../spmd.h' call spmd_init(ierr) print*,spmd_myid,' is working in the set of ',spmd_nprocs call spmd_end(ierr) stop end 1) save into a file like test.f 2) mpif77 -o runtest test.f /.../spmd1.o 3) mpirun -np 4 ./runtest Should print four lines, not necessarily in the correct CPU order. ************************************************************************ Appendix 2 illustrating case and another use of Esumonroot If you compiler supports the case statement (F90) you can easily paralel- lize independent serial code blocks: select case (spmd_myid) case (0) ... case (1) ... etc ... end select It is trickier to parallelize further: if you have 4 CPUs, two independent serial code blocks and within these parallel loops (using case or if-elseif): spmd_nprocs=2 !was 4 select case (spmd_myid) case (0:1) !both 0 and 1 do this ... call spmd_split(1000) do i=spmd_mystart,spmd_myend !parallel loop a(i)=... end do ... case (2:3) !both 2 and 3 do this spmd_myid=spmd_myid-2 !but pretend they are 0 and 1 ... call spmd_split(2000) do i=spmd_mystart,spmd_myend !parallel loop b(i)=... end do ... spmd_myid=spmd_myid+2 !restore id's end select spmd_nprocs=4 !restore nprocs Now, 0 and 1 have part of array a and 2 and 3 have part of array b. To get the entire arrays on root (id=0) you can, if arrays a2 and b2 have been initialized to zero, call spmd_r1DEsumonroot(a2,a,1000) call spmd_r1DEsumonroot(b2,b,2000) and then, assuming that the 4 CPUs will work in parallel on arrays a and b, root must first copy a2 to a and b2 to b before distributing them: if (spmd_myid.eq.0) then do i=1,1000 a(i)=a2(i) end do do i=1,2000 b(i)=b2(i) end do end if call spmd_split(1000) call spmd_r1DPtoall(a,1000) call spmd_split(2000) call spmd_r1DPtoall(b,2000) If the 4 CPUs could continue with arrays a2 and b2 it is only necessary to do call spmd_split(1000) call spmd_r1DPtoall(a2,1000) call spmd_split(2000) call spmd_r1DPtoall(b2,2000) ************************************************************************ Appendix 3 concerning other reductions The max/min and sum1/sum2 reductions are provided because they are so common. For other cases special spmd_ routines may be developed, but with a little bit of programming this can be done with existing routines Ptoroot and Etoall, or even the sumonroot. For example, suppose we want the position of a maximum in a 2D array and there is only one maximum. One solution is to first determine the global maximum using spmd_rRAm(max,min) and then each CPU will check his array part to see where it is. Then, use Esumonroot of an array with only two elements ix and iy (the CPU which has the maximum puts its values, the others put zero). This works but is not efficient because of the double communications. More efficient is to use a 2D real array, say rmaxpos(3,4) in the case of 4 CPUs, in which each CPU puts his max and the x and y positions (converted to reals). Then you can use Ptoroot and root can compare, and if necessary can send the global max and its position to all using the same array or another with only three elements using Etoall. ************************************************************************