Machine clock nanosecs/ele Cycles/ele Compiler
VAX 25? 2448 122? (GCC) MIPS 25 365 18 (GCC) SPARC 50 388 39 (GCC) PPC 601 62 63 8 (IBM) PPC 603 75 125 19 (CodeWarrior) Pentium 90 78 14 (GCC) PPC 603e 91 60 11 (CodeWarrior) HP Precision 100 50 10 (GCC) UltraSparc 160 38 12.5 (GCC) MIPS R5000 180 37 13.5 (SGI) MIPS R10000 185 13.4 5.0 (SGI) PPC 604e 200 14 5.6 (CodeWarrior) Pentium II 265 17 9.0 (GCC) Alpha 21064 267 26 13.7 (DEC)
The entry "604e" is interesting. It's my desktop MAC. The code generated by the compiler is nothing special. It makes little use of the fancy PPC instructions (just FP MULT-ADD.) There are 9 instructions in the inner loop, and they aren't ordered very well.
while (...) temp += *(valp++) * x[*cip++];
Using the following register assignment:
r4 x r3 valp r8 cip fp2 temp
The loop code is:
lwz r0,0(r8) # r0 = *cip addi r8,r8,4 # cip++ lfd fp1,0(r3) # fp1 = *valp addi r3,r3,8 # valp++ slwi r0,r0,3 # r0 *= 8 lfdx fp0,r4,r0 # fp0 = x[] fmadd fp2,fp1,fp0,fp2 # temp += () * () cmplw r8,r10 # Compare r8 : r0? blt *-32 # Loop if <
But, the processor's got lots of horsepower.
If you have an interesting processor that's not on this list, I'd like to hear what kind of performance you get. Here's how to run the benchmark:
tar xf benchmark.tar makeThis should cause the code to be compiled and run. It will print out lots of numbers for different implementations of matrix multiplication.