Citation: Thomas Sterling, "How to build a hypercomputer," Scientific American, July 2001, p. 30:
Despite their impressive processing speeds, high-end systems do not make good use of the computing resources they have, resulting in surprisingly low efficiency levels. Twenty-five percent efficiency is not uncommon, and efficiencies have dropped as low as 1 percent when addressing certain applications.
Mind you, this is one of the experts who can play with the ASCI toys that head the Linpack Top500 list. We ordinary researchers can only dream of such systems with teraflops performance, and now there is even talk of petaflops...
On the other hand, every lab can afford a small Beowulf: a cluster of normal PCs with at least a dedicated 100 Mb/s ethernet link. We measured a performance of 1.2 GFLOP/s on a system with four Pentium III CPUs at 450 MHz. But if 75% is wasted because of badly structured code and communication overheads, 1.2 GFLOP/s reduces to 300 MFLOP/s and it's easier to use only one CPU ... unless we can structure code better and have tools to parallelize the code easily and efficiently.
Our main aim is to develop three tools: a directive set (SPMDdir) and a library (SPMDlib) on top of MPI, together with a parser that translates the directives into library calls. The idea is based on the extreme simplicity of C$DOACROSS: a novice user should be able to learn the tools in one hour and to apply them in minutes. The ZEN of data-parallel programming is based on KISS: keep it simple, stupid!
The ideas are described below, together with MPI-over-SCSI and MPPP; links to the specific webpages are given there.
Beowulfs are becoming very popular because they are affordable and can bring supercomputer performance to all laboratories. There are still two problems that need attention in order to get the best performance: easy and efficient parallelization of the code, and fast communications between the nodes.
SPMDdir and SPMDlib were in fact inspired by the extreme simplicity of
SGI's C$DOACROSS directive as well as the new OpenMP standard. The only
requirements are: (1) to parallelize loops over arrays and independent serial
code blocks, (2) to do communications, scalar and array,
and (3) to do reductions. The question was whether we could develop something
similar for the rapidly growing Beowulf users group. This requires a small
communication library for transferring scalars and entire arrays or parts of
arrays between the CPUs in the cluster.
Data-parallel SPMD programming can be much easier and faster than MPI, or even
Oxford's BSP, programming. Most programmers simply don't know how easy C$DOACROSS is because they could never afford an SMP machine. The library is extremely simple to use and hides all MPI communications from the user. A first version is already available for Fortran; a C version is in preparation. But this is only one possibility: users can include SPMDlib calls directly in their programs. It would be even more efficient to use a very limited directive set with about 10 clauses and a parser that translates these into SPMDlib calls. It looks like this (directives start with C$SPMD, hence the code can also be compiled in the normal way for serial execution; comments start with !):
      integer dim
      parameter (dim=1000)
      real arr(dim)
C$SPMD INIT
      ...
C$SPMD DOPAR              !the DO loop is parallel, each CPU does a part
      do i=1,dim
        arr(i)=...
      end do
C$SPMD A2R[P:arr]         !All-To-Root, root assembles array parts
      ...
C$SPMD END

The parser would translate the directives into library calls:
      integer dim
      parameter (dim=1000)
      real arr(dim)
      include '/.../spmd.h'            !definitions and a common block
      call spmd_init()
      ...
      call spmd_split(dim)             !split the loop over CPUs
      do i=spmd_mystart,spmd_myend     !loop has been changed
        arr(i)=...
      end do
      call spmd_r1DPtoroot(arr,dim)    !real, 1D array, Parts, to root
      ...
      call spmd_end()

As you can see, the directives are easiest to use because you don't need to specify array types and dimensions (this is trivial for a parser). But the subroutines themselves are also easy to use because there are specific ones for integer, real, double and complex data types, 1D/2D/3D arrays, as well as P(arts) of arrays or E(ntire) arrays. The right function is called by selecting a few characters:
      spmd_ + i/r/d/c + 1/2/3D + P/E + toall/toroot/syncarray (arrayname, dimensions)

But the above example shows only a very few features. It would also be possible to parallelize independent code blocks that contain parallel DO loops, executing them with different numbers of CPUs. For example, using 6 CPUs (C means communications):
[Timeline diagram: all 6 CPUs run a serial part and a parallel DO loop, followed by two parallelized code blocks using 2 and 3 CPUs, with communication (C) phases in between.]

Here the program starts with a serial part, then a DO loop using all 6 CPUs, followed by two parallelized code blocks that each contain a DO loop. Because of the work in the DO loops and the communication overheads it may be best to use 2 and 3 CPUs (the 6th will be idle). This is far from trivial because the user must know about the balance between the work and the communications. Simply always using all CPUs is the best way to get inefficient code! Hence, in the future we want to develop an intelligent parser that analyses DO loops, knows about communication overheads, and helps the user take the best decisions.
The parser replaces directives by SPMDlib calls, e.g. the clause R2A[E:arr], in which arr is a real 2D array with dimensions (100,200), is going to be replaced by "call spmd_r2DEtoall(arr,100,200)". If there are more communication clauses or more arrays etc to be sent, this will be done by a sequence of calls. At the moment there is no packing of multiple scalars or arrays into one big message. The reason is the availability of SPMDlib with many routines for arrays with different dimensions and data types (see above; the lib has been designed to be user-friendly!). But once we realize a parser it would be easier to have e.g. three generic routines spmd_r2a, spmd_a2r and spmd_a2a that the parser can modify on the basis of the data type, array dimensions and the specific function to be done (part/entire, neighborhood, cyclic etc). That is, the parser can create different versions for specific communications in a program, and these routines must then be compiled together with the program. This would make the maintenance and optimization of SPMDlib much easier, because there are only a very few basic routines. In a next step the parser could combine different arrays and scalars into single messages, i.e. using packing and unpacking routines (using equivalences and perhaps a common block).
For developing the tools we use a small Beowulf of two dual-CPU Linux boxes, i.e.
four PIII at 450 MHz, the two boxes having a dedicated 100 Mb/s ethernet link.
This allows us to measure the communication speeds (bandwidth) between the boxes
and between the two CPUs in a box, using normal MPI sends and recvs. Fasten
your seatbelts (b = bit and B = byte):
   10.7 MB/s between boxes (85.6 Mb/s of the 100 Mb/s, quite good)
   28   MB/s between CPUs in a box (only a factor of 2.6 faster)

If we use a faster interconnect (SCSI) in combination with the TCP/IP protocol, the 10.7 MB/s can go up to 28 MB/s, and that's it! Copying data in memory can be done at 80 MB/s, hence the TCP/IP stacks cause the drop to 28 MB/s. In other words, if we are going to use a cheap and fast alternative like SCSI boards, we also need to circumvent the TCP/IP stacks. TCP/IP-over-SCSI is now being tested; the next step is to put SCSI directly under MPI: MPI-over-SCSI.
Another point is a bigger cluster. We already have some small systems and we'll get a new one with two or four dual-Thunderbirds. But students need more CPUs in order to experiment with parallel algorithms and speedups. There are already many SBCs (single board computers) and SOCs (systems on a chip) the size of a credit card - a little bigger for connectors - that have flash disks etc and that are Linux-ready. See the MPPP initiative.
Some bits and pieces are already available. But a lot of work still needs to be done and we need help, in the form of e.g. student projects, to turn this into a state-of-the-art parallel-programming tool. The SPMD initiative is open, like Linux and many of its applications: everybody interested can contribute and the software is publicly available! A top-down overview of the elements of the initiative follows (in parentheses: the active people, and an estimate of the status, where 100% means ready and completely tested):
Directive design: a simple directive with about 10 clauses for splitting loops, communications and reductions (Hans du Buf, 90%). See SPMDdir
Intelligent parser: a parser/preprocessor that not only translates directives into library calls, but also analyses code in order to guide the user in optimizing the number of CPUs to use for different loops etc (none, 0%).
Normal parser: a passive one that only translates directives into library calls, should be a piece of cake (none, 0%).
SPMDlib generic: a complete but minimum library for splitting loops and doing communications, on top of MPI, that works for an arbitrary number of CPUs and on any system with MPI/MPICH (Hans du Buf: Fortran, 90%; Ricardo Oliveira: C, 50%). See SPMDlib
SPMDlib on small dual-CPU systems: on systems with dual-CPU nodes the communications can be done such that all communications between the CPUs of the nodes are parallelized, saving time (Hans du Buf, 10%).
MPI-over-SCSI: using SCSI instead of 100 Mb/s ethernet and completely bypassing the TCP/IP stack, also using routing tables to detect communications inside dual-CPU nodes (Ricardo, 30%).
TCP/IP-over-SCSI: using SCSI instead of 100 Mb/s ethernet, but still based on TCP/IP services (Pedro Semeano and Luis Sismeiro, 100%; see http://ipoverscsi.sourceforge.net).
MPPP: cluster hardware - SBCs with a reasonable performance and cost (starts October 2001, 0%)
If you have time, are interested and could contribute, please send an email to Hans du Buf.
For all other questions, please contact Hans du Buf.
Last update: August 2001, HdB.