Benchmarking; evolution from 2000 to 2011
=========================================

My program reads 6 images of 256x256 pixels and applies a simple
brightness model to each. The model has one lowpass filter plus a bank
of Gabor filters with 12 frequencies and 8 orientations.

In the program an image is read and FFTed, the lowpass filter is
applied and then, in a parallel loop, the Gabor filters are calculated,
multiplied with the image spectrum, and inverse FFTed. The responses of
all filters are summed after some normalisation etc. A few values of
each output image are printed to test the model's behaviour.

The parallel loop is over the 8 orientations with an inner loop over
the 12 frequencies, hence in total (6 images x 8 x 12) there are 576
FFTs in the loops plus 2 (one forward of the image and one after the
lowpass filter). In addition, 577 filter functions (Gaussians) must be
computed.

Exactly THE SAME program was run on all systems, using either
parallelization directives on SMP systems or MPI on clusters.
Unless mentioned, elapsed times include file IO.

Early results:

Origin200 (SMP 4x R10k @ 180MHz) compiled with -n32 -Ofast=ip27 -mp:
1 cpu: 102 seconds
2 cpu:  53
4 cpu:  29

Compaq Alpha (SMP 4x ev5, ticheli server) f90 -fast -omp
1 cpu: 71
2 cpu: 38
4 cpu: 18

Ping (SMP 2x PIII @ 450 MHz) compiled with pgf77 -fast -mp:
1 cpu: 62
2 cpu: 35

King (SMP 2x PIIIE @ 733 MHz coppermine)
pgf77 -fast -tp p6 -Mvect=prefetch -mp:
1 cpu: 40
2 cpu: 25

Note the difference with an Alpha ev6 system (Brandao):
1 cpu: 30 (-fast)
1 cpu: 22 (-O5)

-----------------------------------------------------------------------

Using Ping AND Pong (2x dual-PIII) with MPI (dubuf's famous spmd_
routines that took only 10 minutes to re-program the code :-)
Compiled with mpif77 -O3 that uses the normal gnu f77...
2 cpu: 34
4 cpu: 19 (wow, compare with O200 and ticheli...)
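
A small sketch (not part of the benchmark program) that turns the
elapsed times above into speedup and parallel efficiency. The
dictionary labels and the function name scaling are made up for
illustration; the MPI run on Ping and Pong is left out because there is
no 1-cpu time to compare against.

```python
# Elapsed times in seconds, copied from the results above:
# {system label: {number of cpus: seconds}}
timings = {
    "Origin200 (4x R10k)": {1: 102, 2: 53, 4: 29},
    "Compaq Alpha (4x ev5)": {1: 71, 2: 38, 4: 18},
    "Ping (2x PIII 450)": {1: 62, 2: 35},
    "King (2x PIIIE 733)": {1: 40, 2: 25},
}


def scaling(times):
    """Return {cpus: (speedup, efficiency)} relative to the 1-cpu run."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in sorted(times.items())}


for system, times in timings.items():
    for n, (speedup, eff) in scaling(times).items():
        print(f"{system:24s} {n} cpu: speedup {speedup:4.2f}, efficiency {eff:4.0%}")
```

With these numbers the Origin200 reaches a speedup of about 3.5 on
4 cpus (102/29), the Alpha about 3.9 (71/18), and King 1.6 on 2 cpus
(40/25).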
=======================================================================

On the cluster "Calhau" with 8x AMD XP 1.5 GHz (measured 5.8 GFLOPS)
compiled with: mpif77 -O6 -mpentiumpro
run with: mpirun -nolocal -machinefile all8 -np N    (N=2/4/8)

Because the test program has a serial part and a parallel part, I
measured total wall-clock and the parallel filter loop (the latter
timings include (+) or exclude (-) MPI communications of type
"all-to-root").

        WALL   LOOPS  LOOPS
        CLOCK  +COMM  -COMM
2 cpu:  16.0   15.1   14.3
4 cpu:   9.4    8.3    7.1
8 cpu:   6.6    5.1    3.6

As can be seen from these numbers, the serial part takes about one
second, and the loops do not scale completely because of the
communication times. As expected, the filter loops without
communications scale perfectly.

=======================================================================

On a 2011 Debian server with two quad-core Intel Xeons using OpenMP:

1 core:  14.7 seconds
2 cores:  7.5   yields an almost perfect speedup
4 cores:  4.0   but now we get into Amdahl
8 cores:  3.6   and gain almost nothing using the two Xeons

About 600 filter calculations plus 2D matrix multiplications, FFTs and
matrix additions in 3.6 seconds means 6 ms per filter. If Moore's Law
can be maintained in the future, I'll soon need another benchmark...

/proc/cpuinfo gives 16 cpus because of Intel's hyperthreading, but it
does not make any sense here. Using 8 cores I assume that the perfectly
fair scheduler takes care of not selecting any hypercore. I did not
experiment with disabling hyperthreading in the BIOS or with setting
libgomp's GOMP_CPU_AFFINITY="0-7".

=======================================================================

Last updated June 2011    Hans dB
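
=======================================================================

As a footnote to the 2011 Xeon timings above: a rough sketch that
solves Amdahl's law for the serial fraction implied by each
measurement. This is only a back-of-the-envelope check under the
assumption that the simple Amdahl model T(n) = T1 * (s + (1 - s)/n)
applies; the names xeon and serial_fraction are made up for
illustration.

```python
# OpenMP elapsed times on the 2011 Xeon server, in seconds: {cores: elapsed}
xeon = {1: 14.7, 2: 7.5, 4: 4.0, 8: 3.6}


def serial_fraction(t1, tn, n):
    """Solve Amdahl's law T(n) = T1 * (s + (1 - s) / n) for the serial fraction s."""
    return (tn / t1 - 1.0 / n) / (1.0 - 1.0 / n)


t1 = xeon[1]
for n in (2, 4, 8):
    s = serial_fraction(t1, xeon[n], n)
    print(f"{n} cores: speedup {t1 / xeon[n]:.2f}, implied serial fraction {s:.3f}")

# Average time per filter at 8 cores, using the round figure of 600 filters
print(f"{xeon[8] / 600 * 1000:.1f} ms per filter")
```

The implied serial fraction comes out at roughly 0.02 for 2 cores, 0.03
for 4 cores and 0.14 for 8 cores. A single constant serial fraction
does not fit all three points, so something beyond the plain Amdahl
model may be contributing at 8 cores; the timings alone do not say
what.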