Optimizing Compilers, Project Milestone

Recollection

Recall that our project is to formalize and implement optimizations that parallelize repetitive computations found in loops. This is a brief update on what we have found so far and what we have not.

Problems Encountered

One of the first problems we encountered came from our focus on loop induction variables. We wanted to parallelize their updates by packing several of them into one long register. The trouble with focusing on these variables specifically is that they are often used in loop tests. In the most general case, every loop test would first have to extract the variable from the packed register. This proved to be more computationally expensive than we wanted, so we rethought the kinds of optimizations we wished to achieve. Our webpage gives a report on this.
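To make the extraction cost concrete, here is a minimal sketch in portable C, assuming two 32-bit induction variables packed into one 64-bit word. The names pack2, lane_lo, lane_hi, and packed_walk are hypothetical and chosen only for illustration; this is not the code from our GCC work. A single 64-bit add updates both variables, but the loop test still has to pull one of them back out on every iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 32-bit counters into one 64-bit word (hypothetical helper):
   hi goes in the upper half, lo in the lower half. */
static uint64_t pack2(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

/* Extract the individual lanes again. */
static uint32_t lane_lo(uint64_t p) { return (uint32_t)p; }
static uint32_t lane_hi(uint64_t p) { return (uint32_t)(p >> 32); }

/* Sum a[i] * b[j], where i steps by 1 and j steps by 2.
   Both induction variables live in one packed word.
   Assumes a has at least n elements and b has at least 2n - 1. */
uint32_t packed_walk(const uint32_t *a, const uint32_t *b, uint32_t n) {
    assert(n <= UINT32_MAX / 2);        /* keeps j = 2i within its lane */
    uint64_t iv = pack2(0, 0);          /* hi lane: j, lo lane: i */
    const uint64_t step = pack2(2, 1);  /* j += 2 and i += 1 together */
    uint32_t sum = 0;

    /* The loop test needs i by itself, so it is extracted every time. */
    while (lane_lo(iv) < n) {
        sum += a[lane_lo(iv)] * b[lane_hi(iv)];
        iv += step;                     /* one add updates both lanes */
    }
    return sum;
}
```

In this sketch the extraction is only a cast or a shift. The cost grows when the packed value lives in a SIMD register that cannot be compared against a scalar directly, which is the situation described above.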

The biggest problem we encountered was the difficulty of populating SIMD registers with multiple data elements. The processors were designed to load vector data directly from memory, not to move values easily between scalar (single-data) and vector registers.

We also considered using SUIF and targeting the IA-32 platform, but investigation showed that IA-32 suffers from this problem more severely than the PlayStation 2 Emotion Engine. On IA-32 the SIMD registers are completely separate from the general-purpose registers. In addition, the MMX instruction set does not offer many parallel instructions, and its 64-bit registers hold only two 32-bit elements at a time. SSE focuses largely on floating-point values, and SSE2 requires a Pentium 4.

Of course, we have been overloaded with classes and research like everyone else, but March and April have been particularly bad for us. We expect the next couple of weeks to be much freer, which should let us complete the project in reasonable shape.

Where we have gone, and the next steps

Given all these caveats, we began looking again at the PlayStation 2. One approach we are now seriously considering is parallelizing instructions by unrolling loops. GCC 3.0.3, which we are working with, has a well-documented loop unrolling function that we are trying to build on for this project. We have several ideas in mind. To see what we have accomplished so far, you are welcome to read an overview of some of our prescribed optimizations, found on our project webpage. It is difficult to say precisely who has done what, since nearly everything we have done has come out of discussions between the two of us.
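As an illustration of why unrolling helps, here is a minimal hand-written sketch in C, assuming a simple element-wise add over 32-bit values. The function name add_arrays_unrolled and the unroll factor of four are assumptions made for this example; it does not show what GCC's unroller emits or how our pass is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise dst[i] = a[i] + b[i], unrolled by a factor of four
   (hypothetical example). All three arrays must hold n elements. */
void add_arrays_unrolled(uint32_t *dst, const uint32_t *a,
                         const uint32_t *b, size_t n) {
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t i = 0;

    /* Main body: four independent adds per iteration. None of them
       depends on another, so they are candidates for one parallel
       instruction operating on four 32-bit lanes. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

After unrolling, the operands of the four adds sit in adjacent memory locations, so they could in principle be loaded as a unit straight from memory. That sidesteps the problem of moving individual scalars into a vector register.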

Changes made to schedule

Both of us will continue working with GCC, in particular on detecting candidate instructions and on generalizing our current optimizations so that they apply in more situations. We plan to continue discussions and coding sessions as a team, with Ryan leaning towards specifying and discovering good optimizations and Steven leaning more towards particular coding aspects.

All in all, our schedule won't change much. We may be a few days behind, which we hope to make up.