15-745 Assignment #2

CMU 15-745 Spring 2007, Assignment 2

Do this first

This assignment needs to be done in groups of 2. Send mail to Mahim asap with a "group name" (this can be the login ID of either of you) and the names of both group members.

Submission

Assignment 2 submission page

Assignment 2 updated results

Your Mission

Make a cluster-sensitive acyclic scheduling pass. As a starting point, we are providing you with a list scheduler which you have to improve upon. You can implement the Leupers algorithm as described in the paper (using our list scheduler in the inner-loop), or something else.

Your code's ``deliverable'' is to record the schedule by putting each operation in a 2-dimensional table indexed by cycle and function unit index. All necessary access can be done through Scheduler private methods

     Operation *getOp(unsigned cycle, unsigned ifu);  //-- returns null if slot is unassigned
     void setOp(unsigned cycle, unsigned ifu, Operation *op);

where ifu indicates the function unit index. The table is resized cycle-wise automatically as necessary.

The schedule will be passed to the default register allocator and assembly emitter to generate assembly that you can test yourself by linking to create an executable that the TI simulator can run.

Updating the compiler and getting the new test case

Start by doing a cvs update in c2dil.

To get the new test case, do

$ cd <WORK>
$ cvs -d/afs/cs/academic/class/15745-s07/repository co suif/passes/c2dil/Sample/3

This test case compiles in the same way as previous ones (one difference is that the dot files are now all dumped in a dotfiles/ subdirecory). To simulate the output on the TI, copy over rawdaudio.c, asm.out, adpcm.h, test.adpcm, and test_out.pcm to your Windows working dir. Rename asm.out to asm.out.s. If you are using the IDE, add rawdaudio.c and asm.out.s to your project. You will also need to do the following:

If you are using the command line tools, add -DindexTable=indexTable -DstepsizeTable=stepsizeTable to the cl6x command line, as follows:
```
cl6x -DindexTable=indexTable -DstepsizeTable=stepsizeTable asm.out.s rawdaudio.c -z ...
```
If you are using the IDE, go to Project->Build Options->Compiler, and add the above options to the front of the line in the top textbox. You may also need to edit rawdaudio.c, and provide the full path on your machine for the out.pcm and test.adpcm files.

Yes, we know the above steps are odd. No, we don't know why you need to do this. The TI compiler has its own whimsies.

After simulation, you will have a file called out.pcm in your working directory. diff it with test_out.pcm to see if you got the correct output.

Submission, how do you know you are done?

See the "Assignment 2 submission page" link at the top for submission instructions.

Our scripts will compile, simulate and record the performance of your code. These results for all groups will be posted to the "Assignment 2 updated results" page. Your target is to beat the entry called "target" (coming shortly) in schedule length, taking at most 5x the execution time (the "default" entry is default list scheduler that has been given to you). We will also have a moving target (called "staff"), that will be our own scheduler as we improve it. Extra credit to whoever beats it.

Coding and other important hints

Start from Procedure::schedAsm() (and where it is called from).
You should only need to modify c2dil/schedasm.cc. It is strongly recommeneded that you do not modify c2dil/schedasm.h, since it is included by interference.cc which you have not been provided (that is the next assignment). Contact us if you absolutely must modify schedasm.h.
Beating target should not need more than about a 100 lines of code from you.
You will get an updated MakeProg and Makefile that include some new options for c2dil. Change these options at your own risk. The one that you can change with interesting results is "basic_blocks"; without this, hyperblocks get formed.
If you are using it, the Leupers algorithm has a lot of magic constants. Tuning these can give you a better tradeoff between schedule length and execution time. We strongly recommend reading up a little bit on the simulated annealing algorithm, and on methods to tune annealing parameters.
Our list scheduler is by no means complete, or even very good. Understanding its innards and improving its heuristics can be very worthwhile.

Basic list scheduling algorithm

List scheduling classifies the nodes into 3 categories: scheduled, ready, and notReady. A "ready" node has no unscheduled predecessors. In general there may be more than one ready node, so choose wisely. When a node is scheduled, some of its successors may transition from notReady to ready.

It is useful to maintain ASAP and ALAP values for the nodes. ASAP - as soon as possible - is the earliest cycle that operation could be scheduled given existing dependency constraints (but ignoring resource constraints).

For example, if Y depends on the output of X, and X has been scheduled, then the ASAP value of Y is (at least) the scheduled cycle of X plus the latency of X. If X has not been scheduled, then the ASAP value of Y is the ASAP value of X plus the latency of X. In fact, it is simplest if when a node is scheduled, you set its ASAP value to its scheduled cycle. Then the ASAP calculation routine doesn't need to worry if a predecessor is scheduled or not, it just looks at its ASAP value.

After calculating ASAP for all of the nodes, you have a lower bound on the schedule length: the earliest cycle at which all operations could complete (again, considering only dependency constraints of unscheduled nodes). In other words,

     max         [n.ASAP + n.latency]
  (n in nodes)

Once you have the schedule length, you can work your way backwards to calculate the ALAP "as late as possible" value for each node. For nodes on the critical path, the ASAP and ALAP values will be the same. A measure of criticality is the difference between the ASAP and ALAP values -- also known as the node's slack. For nodes on the critical path, slack = ALAP - ASAP = 0, indicating high criticality. For nodes off of the critical path, the larger the slack, the less the criticality. Thus slack can be a useful factor in your heuristic for prioritizing nodes.

We are providing routines to tell you whether a certain operation type can execute on a particular function unit. That is just the simplest resource consideration though; you will eventually need to consider the other resource constraints listed in the architecture manual.

Resource restrictions

There are restrictions regarding transfers between the clusters. At most one register value can be read by the opposite cluster (which can be used by at most one unit on the opposite side). Other constraints are detailed in the section 3.7 in the TMS320C6000 manuals. However, this section does not describe operand restrictions for cross-path values: S and M units can use a cross-path value only as src2, not src1; L units can use a cross path value for either src1 or src2 (this restriction is described in Section 2.3).

The other major restriction is on paths for data for memory accesses.

There is a restriction that a given register cannot appear in more than four different operand locations. This seems very unlikely to happen, so you can ignore this.

Finally, there can be just one branch to a label each cycle, even if they never both have true predicates at the same time.

Overview of the default greedy list scheduler

Our current parition/scheduler performs the following steps:

Assign each parameter variable to a side (this is actually determined by the c6x calling convention).
Assign each local variable to a default side (filling in the var2reg hash).
- Even though an actual register is assigned to the variable, it is only the side that matters
Perform a fixed partioning of all nodes, assigning them to side 1 or 2:
- based on the side of its variable (for etas and mus)
- based on required output side (e.g. args (parameters), ret)
- finally, the remaining nodes are assigned randomly to one side or the other
- be careful! - our partitioning code uses 1 and 2 for the sides, with 0 meaning not assigned, whereas the Leupers code uses 0 and 1 for the two sides.
Schedule each circuit:
- Initialize readyList to args, mus, and constants
- Initialize nonReadyList to everything else
- Remove and schedule nodes on the readyList iteratively, checking successors to see if they should then be moved to the readyList. If a side has not been assigned for a node, it could get scheduled on either side.
- When checking if a node can be assigned to a particular function unit for a particular cycle, and it uses a value produced by an node on the other side, it is checked whether either the cross path is available, or if a copy can be scheduled earlier. If neither, it is not a valid slot.

Traversing the circuit graph

The sorted node list is very handy. A node's ASAP cannot be updated accurately until all of its predecessors' ASAPs have been updated. If the nodes are visited in the order of the sorted list, you are ensured that all predecessors have already been visited so you don't have to check. Traverse the list backwards for ALAP calculation.

You usually don't want to cross backedges since we are doing acyclic scheduling. You can use the Operator::inIsBackedge(inputNum) method, and/or ignore eta to mu/switch/hold edges. (See Scheduler::showMyNodes() for an example of using inIsBackedge()).

You don't want to cross to another circuit: A node in one circuit can connect directly to a node in another circuit. Use code like this to skip edges to another circuit, where c is your current circuit and dest is the other node:

	if (!c->getOperations()->member(dest)) continue;

However, with the flags given, inter-circuit edges always originate at etas. So you can avoid both back edges and inter-circuit edges by avoiding any edge originating at an eta.

Notes on handling different node types

Some nodes require no hardware resources but cannot be ignored since they are important for mainaining dependency relations between other nodes. Specifically, a tkand node combines two token edges and outputs one; these edges just keep memory accesses in program order when necessary. So, your scheduler needs to be able to assign a tkand to a cycle without assigning it to hardware. A tkand should have 0 latency.

There will be a number of nop operations in your graph. These are good in that they give the register allocator more flexibility, but bad in that they take up an instruction slot. Hopefully the register allocator can elimnate them later.

Constants that will fit into a 5-bit immediate field are marked as "initially scheduled, no function unit" -- i.e. they are free -- assuming that they will fold into the using instruction. This may be optimistic in some cases. Constants that cannot be handled as immediates must be explicitly loaded into a register for subsequent use. In that case, the one or two instructions to do so must be assigned to actual slots in the schedule.

There is much more information about modifications to the Pegasus graph here.

Hints for improving the list scheduler

If you decide to look into the heuristics that the list scheduler uses, and make improvements, keep the following in mind:

While most of your changes will be in schedasm.{h,cc}, you might want to add some options -- it is easist with the -option xxx / if (OPTION("xxx")) mechanism. You just need to add a line to c2dil.cc to add a new option.
The Leupers paper contains a lot of useful information that you might not have caught the first time you read it. You will gain more appreciation for it after you start trying to implement the cluster-sensitive scheduler! Not that you would want to try to follow that algorithm precisely -- that would be boring.
Note that there is extra information now needed for loads: whether the data is written back to the regfile in the same cluster or in the other cluster.
For each initial value (mu), is it in the A regfile or B regfile? You can assume one side or the other, or define which side you want each value to start on based on your partitioning. However, you cannot assume that there are copies of each variable on both sides.
There is a case you have to watch out for. Say you're evaluating whether you can schedule OP at cycle X on side A. Both of its operands come from side B, and the crosspath is already being used that cycle. You check its first operand and see that you can add a copy earlier in the schedule. You check its second operand and see that you can add a copy earlier in the schedule. Have you made sure you're not counting on the same slot for both copies?
Inserting a copy: look at Scheduler::insertCopy(...)

You probably want to keep track of whether a particular operation's output has already been copied over. I used "Hash<Operation*,Operation*> existingCopy". And here's the code fragment to transfer an input to the copy's output:

            ...Operation *op ...
            ...unsigned i is the input of interest...

            Wire* w = op->input(i);
            if (w == 0) continue;
            Operation* osrc = w->source();
            ....
            ..stuff..
            ....
            ..if osrc is on opposite side..
            ....
            if (existingCopy.member(osrc)) {
                w->disconnectFromOperatorInput(op, i);
                Operation *copy = existingCopy.lookUp(osrc);
                copy->connect(0,op,i);
                continue;
            }

Now we are considering actual data transfer, so you need to differentiate between data edges and token edges. Here is a code fragment:

        for (unsigned i=0; i<op->noInputs(); i++) {
            Wire* w = op->input(i);
            if (w == 0) continue;
            Operation* osrc = w->source();
            ....
            //--- ask osrc if the port driving this wire is a token output
            if (osrc->emitsToken(w->sourceI().getIO())) continue;
            ....
        }