The SPMD initiative

SPMDlib and SPMDdir:

Citation: Thomas Sterling, "How to build a hypercomputer," Scientific American, July 2001, on page 30:
Despite their impressive processing speeds, high-end systems do not make good use of the computing resources they have, resulting in surprisingly low efficiency levels. Twenty-five percent efficiency is not uncommon, and efficiences have dropped as low as 1 percent when addressing certain applications.

Mind you, this is one of the experts who can play with the ASCI toys that head the Linpack Top500 list. We ordinary researchers can only dream of such systems with teraflops performances, and now they even start talking about petaflops...

On the other hand, every lab can afford a small Beowulf: a cluster of normal PCs with at least a dedicated 100 Mb/s ethernet link. We measured a performance of 1.2 GFLOP/s on a system with four Pentium III at 450 MHz. But if 75% is waisted because of badly structured code and communication overheads, 1.2 GFLOP reduces to 300 MFLOP and it's easier to use only one CPU ... unless we can structure code better and have tools to parallelize the code easily and efficiently.

Our main aim is to develop three tools: a directive set (SPMDdir) and a library (SPMDlib) on top of MPI, together with a parser that translates the directives into library calls. The idea is based on the extreme simplicity of C$DOACROSS: a novice user should be able to learn the tools in one hour and to apply them in minutes. The ZEN of data-parallel programming is based on KISS: keep it simple, stupid!

The ideas are described below, also MPI-over-SCSI and MPPP, where you find links to the specific webpages.

The SPMDtools initiative

Beowulfs are becoming very popular because they are affordable and can bring a supercomputer performance to all laboratories. There are still two problems that need attention in order to get the best performance:

  • At the lowest level there is the INTERCONNECT: Gigabit Ethernet and switches can easily double the cost of a cluster. Normal 100 Mb/s ethernet with a flat neighborhood network is a good alternative. In the near future we'll get new hardware (see PCI-X, InfiniBand and RapidIO), some of which will be much faster, but probably as expensive as Gigabit now. In any case, much time is lost by the TCP/IP stack and for the moment we are working on MPI-over-SCSI: almost normal and cheap SCSI PCI boards with an interface directly under MPI.

  • At the highest level there is a need for parallelization tools: MPI programming is a pain (you know where :-) and HPF-based tools are for the rich. For most data-parallel applications we need a simple library that is much easier to learn and to use. But thinking twice: once we have a simple library, it would be even much easier if we could include in a source code a few directives to split loops and to indicate what communications should be done. Once we have designed a directive set, we can develop a simple parser or preprocessor that translates directives into library calls. We have already a basic version of the Fortran library SPMDlib and a first version of the directive SPMDdir design.

  • Directives and library

    SPMDdir and SPMDlib were in fact inspired by the extreme simplicity of SGI's C$DOACROSS directive as well as the new OpenMP standard. The only requirements are: (1) to parallelize loops over arrays and independent serial code blocks, (2) to do communications, scalar and array, and (3) to do reductions. The question was whether we could develop something similar for the rapidly growing Beowulf users group. This requires a small communication library for transferring scalars and entire arrays or parts of arrays between the CPUs in the cluster. Data-parallel SPMD programming can be much easier and faster than MPI, or even Oxford's BSP, programming. Most programmers simply don't know how easy C$DOACROSS is because they never could and never can afford an SMP machine. The library is extremely simple to use and hides all MPI communications from the user. A first version is already available for Fortran and C is in preparation. But this is one possibility: the user himself can include SPMDlib calls in his program. It would be even more efficient to use a very limited directive set with about 10 clauses and a parser that translates these into SPMDlib calls. It looks like this (directives start with C$SPMD, hence the code can also be compiled in the normal way for a serial execution; comment starts with an !):

          integer    dim
          parameter (dim=1000)
          real       arr(dim)
    C$SPMD DOPAR             !the DO loop is parallel, each CPU does a part
          do i=1,dim
          end do
    C$SPMD A2R[P:arr]        !All-To-Root, root assembles array parts
    The parser would translate the directives into library calls:
          integer    dim
          parameter (dim=1000)
          real       arr(dim)
          include /.../spmd.h               !definitions and a common block
          call spmd_init()
          call spmd_split(dim)              !split the loop over CPUs
          do i=spmd_mystart,spmd_myend      !loop has been changed
          end do
          call spmd_r1DPtoroot(arr,dim)     !real, 1D array, Parts, to root
          call spmd_end()
    As you can see, the directives are easiest to use because you don't need to specify array types and dimensions (this is trivial for a parser). But the subroutines themselves are also easy to use because there are specific ones for integer, real, double and complex data types, 1D/2D/3D arrays, as well as P(arts) of arrays or E(ntire) arrays. The right function is called by selecting a few characters:
    spmd_ i/r/d/c 1/2/3D P/E toall/toroot/syncarray (arrayname,dimensions)
    But the above example shows only a very few features. It would be possible to parallelize independent code blocks that have parallel DO loops that can be executed with different numbers of CPUs. For example, using 6 CPUs (C means communications):
                    |-------|    |----|---------|C--|
                    |-------|    |    |---------|   |
                    |-------|    |                  |
             -------|-------|C---|  |----------|    |C------
                    |-------|    |--|----------|C---|
                    |-------|       |----------|
                      all 6      two blocks with 2 and 3 CPUs
    Here the program starts with a serial part, then a DO loop using all 6 CPUs, followed by two parallelized code blocks that have both a DO loop. Because of the work in the DO loops and the communication overheads it may be best to use 2 and 3 CPUs (the 6th will be idle). This is far from trivial because the user must know about the balance between the work and the communications. Simply using always all CPUs is the best way to get inefficient code! Hence, in the future we want to develop an intelligent parser that analyses DO loops, that knows about communication overheads, and that helps the user in taking the best decisions.

    Implementation details

    The parser replaces directives by SPMDlib calls, e.g. the clause R2A[E:arr], in which arr is a real 2D array with dimensions (100,200), is going to be replaced by "call spmd_r2DEtoall(arr,100,200)". If there are more communication clauses or more arrays etc to be sent, this will be done by a sequence of calls. At the moment there is no packing of multiple scalars or arrays into one big message. The reason is the availability of SPMDlib with many routines for arrays with different dimensions and data types (see above, the lib has been designed to be user-friendly!). But once we realize a parser it would be easier to have e.g. three generic routines spmd_r2a, spmd_a2r and spmd_a2a that the parser can modify on the basis of the data type, array dimensions and the specific function to be done (part/entire, neighborhood, cyclic etc). This is, the parser can create different versions for specific communications in a program, and these routines must then be compiled together with the program. This would make the maintenance and optimization of the SPMDlib much easier, because there are only a very few basic routines. In a next step the parser could combine different arrays and scalars into single messages, i.e. using packing and unpacking routines (using equivalences and perhaps a common block).

    MPI-over-SCSI and UALG's MPPP

    For developing the tools we use a small Beowulf of two dual-CPU Linux boxes, i.e. four PIII at 450 MHz, the two boxes having a dedicated 100 Mb/s ethernet link. This allows to measure the communication speeds (bandwidth) between the boxes and between the two CPUs in a box, using normal MPI send's and recv's. Fasten your seatbelts (b=bit and B=byte):

    10.7 MB/s between boxes (is 85.8 Mb/s of the 100 Mb/s, quite good)
    28   MB/s between CPUs in a box (only a factor of 2.66 faster)
    If we use a faster interconnect (SCSI) in combination with the TCP/IP protocol, the 10.7 can go up to 28 MB/s, and that's it! Copying data in memory can be done with 80 MB/s, hence the TCP/IP stacks cause a drop to 28 MB/s. In other words, if we are going to use a cheap and fast alternative like SCSI boards, we need also to circumvent the TCP/IP stacks. TCP/IP-over-SCSI is now being tested, then we need to put the SCSI directly under MPI: MPI-over-SCSI.

    Another point is a bigger cluster. We have already some small systems and we'll get a new one with two or four dual-Thunderbirds. But students need more CPUs in order to experiment with parallel algorithms and speedups. There are already many SBCs (single board computers) and SOCs (system on a chip) with the size of a credit card - a little bigger for connectors - that have flash disks etc and that are Linux-ready. See the MPPP initiative.

    Let's make this OPEN, but we need HELP!

    Some bits and pieces are already available. But a lot of work still needs to be done and we need help, in the form of e.g. student projects, to turn this into a state-of-the-art parallel-programming tool. The SPMD initiative is open, like Linux and many of its applications: everybody interested can contribute and software is publicly available! A top-down overview of the elements of the initiative is (between parentheses the active people are listed as well as an estimate of the status where 100% means ready and completely tested):

    Directive design: a simple directive with about 10 clauses for splitting loops, communications and reductions (Hans du Buf, 90%). See SPMDdir

    Intelligent parser: a parser/preprocessor that not only translates directives into library calls, but also analyses code in order to guide the user in optimizing the number of CPUs to use for different loops etc (none, 0%).

    Normal parser: a passive one that only translates directives into library calls, should be a piece of cake (none, 0%).

    SPMDlib generic: a complete but minimum library for splitting loops and doing communications, on top of MPI, that works for an arbitrary number of CPUs and on any system with MPI/MPICH (Hans du Buf: Fortran, 90%; Ricardo Oliveira: C, 50%). See SPMDlib

    SPMDlib on small dual-CPU systems: on systems with dual-CPU nodes the communications can be done such that all communications between the CPUs of the nodes are parallelized, saving time (Hans du Buf, 10%).

    MPI-over-SCSI: using SCSI instead of 100 Mb/s ethernet and bypassing completely the TCP/IP stack, also using routing tables to detect communications inside dual-CPU nodes (Ricardo, 30%).

    TCP/IP-over-SCSI: using SCSI instead of 100 Mb/s ethernet, but still based on TCP/IP services (Pedro Semeano and Luis Sismeiro, 100%; see

    MPPP: cluster hardware - SBCs with a reasonable performance and cost (starts October 2001, 0%)

    If you have time, are interested and could contribute, please send an email to Hans du Buf.

    Feedback etc

    For all questions etc please contact Hans du Buf

    Back to the Vision Laboratory

    This page has been visited ****" times since July, 2001.

    Last update: August 2001, HdB.