Benchmarking; updated with Calhau cluster results
=================================================

My program reads 6 images of 256x256 pixels and applies a simple
brightness model to each. The model has one lowpass filter plus a bank
of Gabor filters with 12 frequencies and 8 orientations. In the program
an image is read and FFTed, the lowpass filter is applied and then, in
a parallel loop, the Gabor filters are calculated, multiplied with the
image spectrum, and inverse FFTed. The responses of all filters are
summed after some normalisation etc. A few values of each output image
are printed to test the model's behaviour.

Note: the main part is parallel, but not 100%!

Exactly THE SAME program was run on all systems, using either
parallelization directives on SMP systems or MPI on clusters.

Results (wall-clock seconds):

Origin200 (SMP 4x R10k @ 180MHz) compiled with -n32 -Ofast=ip27 -mp:
  1 cpu: 102
  2 cpu:  53
  4 cpu:  29

Compaq Alpha (SMP 4x ev5, ticheli server) f90 -fast -omp:
  1 cpu:  71
  2 cpu:  38
  4 cpu:  18

Ping (SMP 2x PIII @ 450 MHz) compiled with pgf77 -fast -mp:
  1 cpu:  62
  2 cpu:  35

King (SMP 2x PIIIE @ 733 MHz coppermine)
pgf77 -fast -tp p6 -Mvect=prefetch -mp:
  1 cpu:  40
  2 cpu:  25

Note the difference with an Alpha ev6 system (Brandao):
  1 cpu:  30 (-fast)
  1 cpu:  22 (-O5)

-----------------------------------------------------------------------

Using Ping AND Pong (2x dual-PIII) with MPI (dubuf's famous spmd_
routines that took only 10 minutes to re-program the code :-)
Compiled with mpif77 -O3 that uses the normal gnu f77...
  2 cpu:  34
  4 cpu:  19 (wow, compare with O200 and ticheli...)
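To compare the SMP systems on more than raw seconds, the timings above
can be turned into speedup and parallel efficiency. The sketch below
does that in Python; it is only an illustration, not part of the
benchmark program. The function names are made up for this example,
and the serial-fraction estimate assumes Amdahl's law holds exactly,
so any other overhead (memory contention, thread start-up) would also
show up in that number.

```python
# Wall-clock seconds copied from the results above: {system: {cpus: seconds}}.
TIMINGS = {
    "Origin200": {1: 102, 2: 53, 4: 29},
    "Compaq Alpha": {1: 71, 2: 38, 4: 18},
    "Ping": {1: 62, 2: 35},
    "King": {1: 40, 2: 25},
}


def speedup(t1, tn):
    """Speedup of an n-CPU run relative to the 1-CPU run."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of CPUs."""
    return t1 / (tn * n)


def serial_fraction(t1, tn, n):
    """Apparent serial fraction s, ASSUMING Amdahl's law:
    tn = t1 * (s + (1 - s) / n)  =>  s = (tn/t1 - 1/n) / (1 - 1/n).
    """
    return (tn / t1 - 1.0 / n) / (1.0 - 1.0 / n)


def summarize(timings):
    """Return (system, cpus, speedup, efficiency, serial fraction) rows."""
    rows = []
    for system, runs in timings.items():
        t1 = runs[1]
        for n in sorted(runs):
            if n == 1:
                continue  # the 1-CPU run is the baseline
            tn = runs[n]
            rows.append((system, n, speedup(t1, tn),
                         efficiency(t1, tn, n),
                         serial_fraction(t1, tn, n)))
    return rows


if __name__ == "__main__":
    for system, n, s, e, f in summarize(TIMINGS):
        print(f"{system:13s} {n} cpu: speedup {s:4.2f}, "
              f"efficiency {e:4.2f}, serial fraction {f:4.2f}")
```

The Alpha ev6 (Brandao) and the Ping+Pong MPI runs are left out
because no 1-CPU baseline under the same compiler settings is listed
for them.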
=======================================================================

The new cluster "Calhau" with 8x AMD XP 1.5 GHz (measured 5.8 GFLOPS)
compiled with: mpif77 -O6 -mpentiumpro
run with:      mpirun -nolocal -machinefile all8 -np N   (N=2/4/8)

Because the test program has a serial part and a parallel part, I
measured total wall-clock and the parallel filter loop (the latter
timings include (+) or exclude (-) MPI communications of type
"all-to-root"). All times are in seconds.

           WALL    LOOPS   LOOPS
           CLOCK   +COMM   -COMM
  2 cpu:   16.0    15.1    14.3
  4 cpu:    9.4     8.3     7.1
  8 cpu:    6.6     5.1     3.6

As can be seen from these numbers, the serial part takes about one
second, and the loops do not scale completely because of the
communication times.

As expected, the filter loops without communications scale perfectly.
Explanation: 3.6 seconds on one CPU which does 6 images and 12
frequencies: 3.6/6=0.6; 0.6/12=50 ms; one FFT takes 30 ms; the other
20 ms are necessary to compute the MTF of a filter and to multiply the
spectrum.

=======================================================================
Hans, 3/5/02
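
The arithmetic behind the Calhau table can be redone mechanically. The
sketch below is an illustration only (the function names are made up
for this example): it derives the serial part as WALL CLOCK minus
LOOPS+COMM, the communication time as LOOPS+COMM minus LOOPS-COMM, and
the per-filter time from the 8-CPU loop time, assuming as in the text
that one CPU handles 6 images x 12 frequencies.

```python
# Seconds copied from the Calhau table:
# {cpus: (wall clock, loops with comm, loops without comm)}.
CALHAU = {
    2: (16.0, 15.1, 14.3),
    4: (9.4, 8.3, 7.1),
    8: (6.6, 5.1, 3.6),
}


def breakdown(table):
    """Split each run into serial part, communication time and the
    total CPU-seconds spent in the filter loops (loop time x CPUs)."""
    result = {}
    for n, (wall, with_comm, without_comm) in table.items():
        result[n] = {
            "serial": round(wall - with_comm, 1),
            "comm": round(with_comm - without_comm, 1),
            "cpu_seconds": round(without_comm * n, 1),
        }
    return result


def per_filter_ms(loop_seconds, images=6, frequencies=12):
    """Time per filter in ms, ASSUMING one CPU processes
    images x frequencies filters in loop_seconds."""
    return loop_seconds / images / frequencies * 1000.0


if __name__ == "__main__":
    for n, parts in breakdown(CALHAU).items():
        print(f"{n} cpu: serial {parts['serial']} s, "
              f"comm {parts['comm']} s, "
              f"loop CPU-seconds {parts['cpu_seconds']}")
    print(f"per filter: {per_filter_ms(3.6):.0f} ms")
```

If the loops scale perfectly, the loop CPU-seconds should stay roughly
constant across the 2, 4 and 8 CPU runs, while the communication time
is what grows with the number of CPUs.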