15-740 Fall '97
Sparse Matrix Multiplication

When I gave the lecture in class Monday 9/22, showing the code generated for different machines to do sparse matrix * dense vector, I realized that I only had relatively old processors on my list. So, I went and ran the code on some higher performance machines. Here's the complete list:

Machine         clock nanosecs/ele     Cycles/ele      Compiler


VAX             25?    2448             122?            (GCC)
MIPS            25      365              18             (GCC)
SPARC           50      388              39             (GCC)
PPC 601         62       63               8             (IBM)
PPC 603         75      125              19             (CodeWarrior)
Pentium         90       78              14             (GCC)
PPC 603e        91       60              11             (CodeWarrior)
HP Precision   100       50              10             (GCC)
UltraSparc     160       38              12.5           (GCC)
MIPS R5000     180       37              13.5           (SGI)
MIPS R10000    185       13.4             5.0           (SGI)
PPC 604e       200       14               5.6           (CodeWarrior)
Pentium II     265       17               9.0           (GCC)
Alpha 21064    267       26              13.7           (DEC)

The entry "604e" is interesting. It's my desktop MAC. The code generated by the compiler is nothing special. It makes little use of the fancy PPC instructions (just FP MULT-ADD.) There are 9 instructions in the inner loop, and they aren't ordered very well.

  while (...)
        temp += *(valp++) * x[*cip++];

Using the following register assignment:

        r4      x
        r3      valp
        r8      cip
        fp2     temp

The loop code is:

        lwz     r0,0(r8)        # r0 = *cip
        addi    r8,r8,4         # cip++
        lfd     fp1,0(r3)       # fp1 = *valp
        addi    r3,r3,8         # valp++
        slwi    r0,r0,3         # r0 *= 8
        lfdx    fp0,r4,r0       # fp0 = x[]
        fmadd   fp2,fp1,fp0,fp2 # temp += () * ()
        cmplw   r8,r10          # Compare r8 : r0?
        blt     *-32            # Loop if <

But, the processor's got lots of horsepower.

If you have an interesting processor that's not on this list, I'd like to hear what kind of performance you get. Here's how to run the benchmark:

Retrieve the file benchmark.tar.
Give the commands
```
tar xf benchmark.tar
make
```
This should cause the code to be compiled and run. It will print out lots of numbers for different implementations of matrix multiplication.
The numbers for the table above are extracted from the final line.
Please Email me a copy of the complete results.

15-740 Fall '97 Sparse Matrix Multiplication

15-740 Fall '97
Sparse Matrix Multiplication