CALHAU : HOWTO
==============

The Calhau has been installed in the context of a collaboration between
three groups. I already measured 5.8 GFLOPS (5,800,000,000 floating point
operations per second). CPUs and system resources need to be shared
somehow. Here I put some ideas and guidelines.

The Calhau compute nodes are connected to a frontend. This system, a
dual-PIII, is only used to edit and compile; parallel jobs are only
executed on the 8 compute nodes. ONLY IN EXCEPTIONAL CASES, after my
approval and some experiments, *ONE* CPU of the frontend can be used as a
compute node, but only as a master to distribute/collect data. One example
is graphics, in which the frontend runs e.g. pgplot and is the X Window
server.

The frontend has a disk of 60 GBytes and all user directories are exported
to the compute nodes. All compute nodes have a smaller disk for the Linux
installation, but also a /tmp directory that can be used as scratch space
by the nodes: one scratch space per node. Since users are not allowed to
log in or ftp to the compute nodes, users who want to use scratch space
must write small MPI programs to maintain the space: create subdirectories
and delete files. The best solution might be to open files with the
"scratch" option (in Fortran).

The mounted file server /void of the Vision Laboratory may not (yet) be
used by Calhau users who are not part of the Vision Laboratory.

I think the cluster should be available for code development during
working days. See below for how to use different CPU partitions. Users
should come to an agreement. New users cannot be created without my (HdB)
approval.

One of the first priorities will be to install tools for having a "single
system image", i.e. to be able to fore/background serial production jobs
on all nodes (e.g. gnuqueue).

Right now the software consists of MPI (Fortran and C bindings) with the
normal GNU compilers. In the future we can install additional software
packages. At the end of this file I give a few hints.

==============================================================================

SECURITY ISSUES

Accounts are private and may NOT be shared by different people. NEVER
write down passwords where other people could read them. Because students
are taught network protocols, and how to detect the unencrypted passwords
sent by telnet, rlogin and ftp, remote users must use ssh, scp and sftp.
You can install this software from www.openssh.org

==============================================================================

Four things to keep in mind:

1: Calhau is an experimental machine, mainly to get experience with a
   bigger cluster (8 CPUs instead of 4), that needs to be shared by users.
   It is not really a production machine. Production jobs should be run
   during the night (20h00 to 08h00) and weekends.

2: A PC is not a workstation. Caches are smaller, busses are narrower, and
   disk access is slower. In addition, all file access goes by
   NFS/ethernet (each CPU _could_ use its local disk (/tmp) for data
   storage).

3: Parallelization implies writing parallel programs (Amdahl's Law) and
   inter-processor communications. The latter are very expensive, and you
   should consider coarse granularity. Do not try to parallelize all small
   loops, unless you can save communications (see SPMDlib).

4: AMD XP has an L1 cache of 64 kB and an L2 cache of 256 kB. Linpack
   performance is about 750 MFLOPS for an L1-size problem, 575 MFLOPS for
   an L2-size problem, but only 160 MFLOPS for a problem size of 1 MB. You
   can expect a performance of 8x575 MFLOPS (about 4.6 GFLOPS) when you
   let each CPU do 1/8th of a large array: 8 x 256 kB / 4 bytes per
   element = 512 K reals or integers.

==============================================================================

Compiling and running:

MPI uses scripts like mpif77 to set compiler flags and libraries, but
these invoke the normal GNU compilers (for example f77).
You should use the options that you always use to optimize
(-O6 -mpentiumpro) and link, for example

    -lpgplot -L/usr/X11R6/lib -lX11

Note that pgplot also needs environment variables in csh or bash. These
can be put in your .cshrc or .bashrc file, depending on your default
shell:

    setenv DISPLAY frontend:0.0        (csh)
    export DISPLAY=frontend:0.0        (bash)

Once compiled, you can use mpirun to execute the program:

    mpirun -np 4 program

This will start "program" on 4 CPUs, and always the same CPUs, because MPI
uses a default system configuration file that contains only the names of
the nodes:

    calhau1
    calhau2
    calhau3
    calhau4
    calhau5
    calhau6
    calhau7
    calhau8

Note: by default mpirun also uses the frontend, as the first node. You
must exclude this by using

    mpirun -nolocal -machinefile ...

Instead of always using all 8 nodes, and to share some CPUs with others,
you can use different "partitions". If you do a "man mpirun", you can see
that mpirun can use the option "-machinefile filename". Hence, you can
create an ASCII file named e.g. "half2" that contains four lines:

    calhau5
    calhau6
    calhau7
    calhau8

To start your program "on the second half" use

    mpirun -nolocal -machinefile half2 -np 4 jobname

If you create another file with all 8 names, called "all8" for example,
you can use "the first half" by issuing

    mpirun -nolocal -machinefile all8 -np 4 jobname

If you are using one node to distribute work (dynamic load balancing),
this node normally does not contribute to the work. Wasting an AMD XP is a
pity. For such applications you can use the same machinefile (all8 or
half2), without the -nolocal option. See my comments above concerning this
option!

In summary, you need only two machinefiles: all8 (calhau1-calhau8) or
half2 (calhau5-calhau8), with or without the -nolocal option.

The use of -nolocal can be a bit messy when running X11 applications.
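
As a minimal sketch of the partition scheme described above, the two
machinefiles can be generated with a few shell commands. The node names and
the file names all8 and half2 come from the text; the MF_DIR variable is an
assumption added here for illustration, and the mpirun commands are only
printed, not executed.

```shell
#!/bin/sh
# Sketch: generate the two machinefiles described in the text.
# MF_DIR (output directory, default: current directory) is hypothetical.
MF_DIR="${MF_DIR:-.}"

# all8: one node name per line, calhau1 .. calhau8
: > "$MF_DIR/all8"
i=1
while [ "$i" -le 8 ]; do
    echo "calhau$i" >> "$MF_DIR/all8"
    i=$((i + 1))
done

# half2: the last four lines of all8 (calhau5 .. calhau8)
tail -n 4 "$MF_DIR/all8" > "$MF_DIR/half2"

# Print the commands from the text instead of running them, so the
# script is harmless on a machine where mpirun is not installed.
echo "mpirun -nolocal -machinefile $MF_DIR/half2 -np 4 jobname"
echo "mpirun -nolocal -machinefile $MF_DIR/all8 -np 4 jobname"
```

With -np 4, mpirun takes the first four names in the file, which is why
all8 gives "the first half" and half2 gives "the second half".
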
In the case of pgplot and /XWIN put the following lines in your .bashrc
file

    export PGPLOT_DEV=/XWIN
    export PGPLOT_DIR=/home/dubuf/pgplot

and specify an empty string with pgbeg or pgopen and call pgask(.false.)
If you need input (batch mode) use a file and

    mpirun -stdin filename

The use of shared object libraries (.so) can also be messy. Experiment
with

    export LD_LIBRARY_PATH=...

or compile with -static if you want to use libname.a

BATCH JOBS:

Submit with "at mpirun ..." or "mpirun ... &" (do "man at" to see options)

    If the program needs input:                   mpirun ... -stdin filename
    If the program prints output:                 mpirun ... > filename
    If you want to logout after starting a job:   nohup mpirun ... &

==============================================================================

We could do some experiments with other parallelization tools:

ADAPTOR is a source-to-source translator; it can convert HPF (High
Performance Fortran) into code that calls DALib (Distributed Array
Library), implemented on top of MPI. DALib is much more complex than
SPMDlib; do not try to use it directly. But if you have HPF code, or if
you can write such code (see http://dacnet.rice.edu/Depts/CRPC/HPFF/ for
different versions; basic features are not difficult to learn), you can
use ADAPTOR to translate the code and execute it on Calhau. Note that
ADAPTOR can also create code with OpenMP directives (www.openmp.org).

Omni OpenMP uses standard OpenMP parallelization directives together with
a few extensions for array mapping and mapping-CPU affinity. It was
developed by the Japanese Real World Computing Partnership.
Parallelization with directives is the easiest way to go... (I'm not yet
talking about program efficiency! I really need to do a few experiments).
Also developed and available is SCore, software to schedule jobs on
clusters. Under SCore (scored, scout, scrun) users CANNOT specify where
their jobs are executed...
See www.hpcc.jp/Omni

==============================================================================

HdB, May 2002