Some mumblage about implementation issues for the RT implementation:

RUNTIME ERROR SIGNALLING

From the user's viewpoint, a nice way to signal run-time errors in open-coded
stuff is to build a call to the safe interpreter definition of the function.
This function then signals the error.  This is more user-friendly, since it
shows the function and arguments on the stack.  Furthermore, if the interpreter
definition has whizzy error correction, then it will also apply to compiled
code.

Unfortunately, the type-checking strategy we have chosen is seriously
incompatible with this, since type checking is separated from the operation.
Although we could perhaps to some degree keep track of which operation is
responsible for the check, this isn't good enough, since the type check may be
done on one argument before the other arguments have been evaluated.  Since the
arguments don't exist yet, we can't pass them to the function.  So this
technique cannot in general be applied to open code.  We can use it for miscops
that implement functions, and could also use it for open-coded (possibly
implicit) functions such as Symbol-Value that combine the check with the
operation.

PROFILING

It would be nice to have a general profiling mechanism like the one MIPS uses
to gather dynamic instruction statistics.  This could be used for user-level
tuning, but it would be especially appropriate for analyzing the compiler's
performance for tuning or publication.

The basic mechanism is that we associate a counter with each basic block, and
increment the counter when we execute that block.  To make use of these
per-block counts, there must also be some additional information about what is
in each block.

We implement the counters by having a 32-bit I-Vector pointed to by the
profiled component's constant pool.  When we enter the component, we stash this
in a known, permanently allocated register.  We also have another permanently
allocated register that is used as a scratch location to load the current count
into and increment.

Probably a useful kind of per-block information is to have, for each VOP used
in the block, the name, the number of times used, and the generator number that
was used.  We represent this by having a simple-vector parallel to the counters
vector.  Each entry points to per-block information, perhaps represented as a
simple-vector of VOP names and use counts, with the generator number only being
specified when it isn't 0.  We also have the start PC for each block so that we
can use the debug information to find out what the block is doing.

VOP usage information is probably most interesting as far as the compiler is
concerned.  As long as we have accurate cost information, we can directly
determine what percentage of time is spent doing each VM operation, instead of
having to guess from instruction usage statistics.  These results could be used
either to tune performance-critical generators, or to rewrite code so as to
avoid using expensive operations.

This information could also be useful for testing purposes.  If the VOP usage
signature of a test changes, then this is a warning that a pessimization may
have been introduced by a compiler change.  We could also detect generators
that aren't used by our test suite.

It should be easy to hack the assembler to dump instruction usage information
too, although this would be of more interest to architecture types.
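To make the intent concrete, here is a minimal sketch of how the per-block data
might be combined after a profiled run.  The names and representation are
hypothetical: BLOCK-COUNTS stands for the counters I-Vector, BLOCK-INFO for the
parallel simple-vector whose entries are modeled here as lists of
(vop-name use-count [generator]), and VOP-COST for a function giving the
assumed cost of a VOP.

    (defun summarize-vop-usage (block-counts block-info vop-cost)
      "Return an alist of (vop-name . estimated-cost), weighting each block's
    static VOP usage by the number of times the block was executed."
      (let ((totals (make-hash-table :test #'eq)))
        (dotimes (i (length block-counts))
          (let ((count (aref block-counts i)))
            (dolist (use (aref block-info i))
              (destructuring-bind (name uses &optional generator) use
                (declare (ignore generator))
                ;; Dynamic cost of this VOP in this block = executions of the
                ;; block * static uses * per-use cost.
                (incf (gethash name totals 0)
                      (* count uses (funcall vop-cost name)))))))
        (let ((result '()))
          (maphash (lambda (name cost) (push (cons name cost) result)) totals)
          (sort result #'> :key #'cdr))))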
LINK TABLE

There isn't anything particularly good about the current link table format.  It
has gratuitously bad VM locality, since the link table entries for related
functions are scattered about randomly in the hash table.  It is also extremely
expensive (~25 ms) to do set-definition, since it requires a scan of the entire
link table.

What we do is allocate a new cell in the symbol that points to the head of the
chain of link-table entries for that symbol.  A link-table entry has the
following format:
    0: argument count for this entry
    1: pointer to next entry for this symbol, or 0 if none
    2: environment for call
    3: entry PC for call

Link table entries are stored in a special system table area so that we can
pass around random pointers to them.  We could squeeze the argcount and the
next pointer into a single word if we really wanted to.

This representation makes it trivial to find all the link table entries for a
given symbol, so set-definition can win.  It also maintains some kind of VM
locality, since link table entries will be allocated in load order rather than
randomly.  We also win in load-time VM locality, since we don't have to fault
in a moby hash-table in order to find the load-time fixups.

It might seem that this would take more space than the current representation,
but we can actually expect a 20% space reduction, since we need only allocate
as many link table entries as are used.  Although the total number of symbols
with link table entries is less than 20%, we can get away with allocating a
cell for each symbol, since we gain lots of space by flushing the hashtable.

FAST SYSTEM FUNCTION CALL

Adding the notion of a constant function would allow a super-fast calling
sequence.  A constant function is a function which may not be redefined.  We
put the function object and entry PC in our constants vector.  We know which
registers it uses, so we might not need to save any.  We know how many values
it returns.  We know if it changes SP.  Calling a simple constant function may
be nearly as fast as a miscop call.

Here is a call of a constant function known not to trash any of our registers.
Assume ENV and PC were saved at function entry.  We don't have to do anything
on return.
    lm   env-save, ENV, offset    ; Get new ENV and PC
    balr PC, PC

This would hair up GC some, but not too much: all you have to do is look out
for function-objects followed by PC's while scavenging function space.

Easier and perhaps more worthwhile would be a way to define miscops in Lisp.
There would be some Defun-Miscop form or something.  This would allow basic
"lispy" primitives such as MEMQ, plist hackery, LENGTH, and whatnot, to be
implemented in portable Lisp code with no loss in efficiency.  Generic
arithmetic and perhaps even storage allocation would also be possibilities for
rewriting in Lisp.

The code would be required to obey certain restrictions so that the miscop
calling convention can be used.  The callee-saves convention means that we have
to be careful to restore all the registers we saved on any exit from the code.
The easiest way to do this would be to require that non-local exits never be
done within the dynamic extent of the function.  This is probably somewhat too
restrictive, since we need to be able to signal errors, and they can throw.  If
we always restore the registers around any full calls that we do, then we can
always call a function to do arbitrary things that may need to be done.  We
would have to do the same thing before signaling the error in an inline error
check.
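To make the idea concrete, here is what one of those "lispy" primitives looks
like as portable Lisp code.  Since the Defun-Miscop form is deliberately left
unspecified above, this is written as an ordinary DEFUN; the point is that the
body closes over nothing and does no non-local exit, so it meets the
restrictions just described and could be compiled with the miscop convention.

    (defun memq (item list)
      "Return the first tail of LIST whose car is EQ to ITEM, or NIL if
    there is none."
      (do ((l list (cdr l)))
          ((null l) nil)
        (when (eq item (car l))
          (return l))))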
The other distinguishing characteristic of the miscop call convention is that
no environment is passed.  In addition to the obvious restriction that a miscop
function couldn't close over its environment, this also makes it difficult to
reference non-immediate constants.  Fortunately, the link-table allows us to do
almost any kind of global function call.  It is probably acceptable to require
the global variables we need to be explicitly specified as "built-in" when we
make the kernel core, causing the symbols to be placed at a known location.  A
global variable access mechanism also provides a way to get non-immediate
constants: just do a DEFCONSTANT.

MULTIPLE THREAD SUPPORT

When we support multiple threads under Mach, we will have to scrounge a THREAD
register.  This will point to a structure which contains the thread-specific
context which doesn't fit in registers:
    Current special binding block
    Active catch pointer
    Any number stacks
    Information for management of the resources for this thread:
        Base and size of stacks.
        Descriptive info (name...)
        Policy info (priority...)
        OS level handle on the thread
        Synchronization info such as what lock the thread is waiting on, if any.

The THREAD pointer could actually be the base of the thread's control stack,
with the frequently used info being stored before any real stack stuff.  This
would allow fast checks for continuation validity, since both the top and
bottom of the cstack will be in registers.  This is useful, since a non-local
GO into another thread's stack would cause flaming death.

We will also have to modify code which manipulates resources shared between
threads so that it does synchronization.  The big issue performance-wise is the
alloc-table: when there is no lock on the table, we want to get at it real
fast.  Contention for the alloc-table could be reduced by having multiple
allocation spaces (and tables) for important types.  You could just grab a
table off of a queue and put it back when you are done.

It will be necessary to make all miscops "interruptible", since they will have
to be able to run concurrently.  Probably the only resource shared at the
miscop level is the alloc-table.  Note that software interrupts are going away,
so strictly speaking interrupts won't happen, but a scheduling break can happen
at any time.  The only big problem I see is with miscops which manipulate raw
pointers into objects.  The system must either avoid GC while such a pointer is
in use, or adjust the raw pointer if the object is moved.  Probably the fix is
to allow raw pointers in boxed registers as long as the real object is also in
a boxed register.  This would allow random pointer hacking to be open-coded as
well.  Being preempted during object allocation would still be a problem, but
the allocator will have to get a lock on the alloc-table before it can do
anything, and GC can't run without the alloc-table.

Of course, we will have to go to deep binding when we support threads.  We can
eliminate the binding stack by threading the BIND blocks together on the
control stack.  This will simplify the problem of allocating stack for each
thread.  Deciding on the exact implementation of deep-binding is non-trivial,
since there are many different ways it could be done.  The scheme used on the
S1 seems like a reasonable start, though.  A BIND block looks like this:
    . . .
The value cells are just allocated on the stack like a lexical variable.  The
BIND block is delimited by the prev pointer.  The reason for keeping a pointer
to the cell instead of just using the stack location directly is that it lets
us cache binding lookups.
Whenever we look up a variable, we link a LOOKUP frame into the bindings.  This
lookup frame contains the result of our search and looks just like a BIND frame
as far as the search is concerned.  The theory is that this will reduce the
average distance searched, since commonly used variables will commonly have
been used somewhere near you on the stack.  Since we look up the address of the
value cell rather than the value proper, we only have to do the lookup once per
function.  Note that there is no obligation to link in a LOOKUP block; if it is
unlikely or impossible that any function we call will reference any of the
variables that we reference, then not making a LOOKUP block is a good idea.

The S1 has an important optimization of variable lookup.  It is based on the
observation that many special variables are never actually bound.  If a
variable is known not to be bound, we can just go directly for the value cell.
One solution to the problem would be to introduce a new GLOBAL variable kind
which cannot be bound, but we can do nearly as well by noticing the lack of
bindings at run time.  The S1 keeps a count of bindings in a cell in each
symbol; if the count is zero at lookup time, we don't have to do any lookup.  I
think that it would work just as well to have an EVER-BOUND flag which is set
and never reset.  One advantage of this approach is that there is no need to do
anything when a binding is removed.  On the S1, THROW needs to scan up the
bindings, decrementing the count, which requires a way to distinguish BIND
frames from LOOKUP frames.

There is a trick which might be a win if BIND and LOOKUP blocks have a large
enough average size.  The idea is to use a hash filter to determine if a given
symbol might be in a particular block.  Each symbol name is hashed to a fixnum
with a few (3 or 4) random bits on.  When a block is built, the OR of these
hash values is placed at the beginning.  When looking for a symbol, we can skip
a block if any of the bits for the symbol are zero in the hash for the block.
All these hash codes are computed at compile time, of course.  When binding an
unknown symbol, we just use all 1's.  We can't delimit the block with the prev
link, since that has to be next to the hash so that we can skip the block
without going to the end.  A block would look more like this:
    . . .
In a block of less than six or so entries, we can reject it with about 90%
probability.  The probability improves as the block size decreases, but the
advantage decreases as well.  The actual advantage also depends on the dynamic
characteristics of the lookup process, since if it turns out that a lookup is
usually satisfied on the previous block, we will never reject any blocks.
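Here is a minimal Lisp sketch of the filtered lookup just described, purely for
exposition.  The representation is hypothetical: each BIND or LOOKUP block is
modeled as a list whose first element is the OR of the per-symbol filters,
whose second element is the prev link, and whose remaining elements are
(symbol . value-cell) pairs.

    (defun symbol-hash-filter (symbol)
      ;; A fixnum with a few pseudo-random bits on, derived from the symbol
      ;; name.  In the real system this would be computed at compile time.
      (let ((h (sxhash (symbol-name symbol))))
        (logior (ash 1 (mod h 24))
                (ash 1 (mod (ash h -5) 24))
                (ash 1 (mod (ash h -11) 24)))))

    (defun lookup-value-cell (symbol blocks)
      "Search the chain of BIND/LOOKUP blocks for SYMBOL's current value cell,
    skipping any block whose hash filter shows the symbol cannot be present.
    Returns NIL if the symbol is not bound anywhere on the chain."
      (let ((filter (symbol-hash-filter symbol)))
        (do ((blk blocks (second blk)))
            ((null blk) nil)
          ;; Reject the block quickly unless all of SYMBOL's filter bits are on.
          (when (= filter (logand filter (first blk)))
            (let ((entry (assoc symbol (cddr blk) :test #'eq)))
              (when entry
                (return (cdr entry))))))))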
Bizarre optimization on deep-binding: If we notice that a function which we
call in a local call is referencing a variable we bind, we can pass in the
value cell pointer as an argument so that it doesn't have to do the lookup.  It
is questionable whether this would help real code even with massive block
compilation, but it sure would win for STAK.

We could actually generalize this optimization to be less obviously bogus.
What we do is shallow-bind the *location* of the current value cell for each
special variable which is both referenced and bound.  At each entry-point to
the block, we build a specbind frame for the variables bound in the block (it
doesn't need to be all of them).  Initially all the value-cell pointers are to
the looked-up current values.  Within the block, we keep track of the current
value cell by shallow-binding it in some known location.  This could be in a
register or in the specbind frame itself.  Internally to the block, we can do a
special reference by indirecting through the magic location, and can do a
binding by saving the old value in our frame and then storing the location of
the new cell in our frame.  When we do a full function call, we move any value
cell pointers which are in registers into the correct place in the specbind
frame.  This could make STAK *faster* than normal shallow-bound
implementations, and might well help real programs also.

TYPE SYSTEM

Hash-cons type structures: When we define a type, we define the equality and
hash functions for each slot (default EQUAL and SXHASH).  We automatically
generate a "constructor" function that takes all the slot values as fixed
arguments and returns the preexisting equivalent type if there is one, or makes
a new one if not.  Doing this makes type equality be EQ and reduces type
consing.  We could have a family of special comparison and hashing functions in
order to get an efficiency win.

We can just use a simple-vector as an open hash-table.  Things are simple since
we never delete anything.  We must store the hash in the type object so that we
can rehash without having to know how to hash the type object, but having the
hash is also useful for a quick collision test.

It is probably undesirable to hash-cons function types, since comparisons of
function types are something of a marginal proposition.  This won't hurt
efficiency, since function types are rarely seen as "normal" types by ir1opt
type inference.  Hash-consing function types would be difficult due to the
complexity of function types, and would probably also substantially increase
the amount of garbage tied down by the hash table, since there will potentially
be a lot more function types than other types in a program with full
declarations.  On the other hand, if all the types are placed in the
environment by Proclaim, and cannot be cleared, then the types won't be
GC'able anyway.  In this case, we would reduce consing by hash-consing.

There isn't any need to consider the nature of a structure type when
hash-consing it.  We can always use the same type for the same named structure,
and replace the Defstruct-Description when the structure is redefined.  (Could
cause problems with bootstrapping, though.)

Have a cache of type-specifier => type structure translations: We use a
fixed-size equal/eq hash cache.  We descend into lists up to a small fixed
depth such as 3, and eq hash when we hit an atom (or bottom out?).  The fixed
recursion can just be written as nested loops.  We use two parallel open hash
vectors, one for the specifier and one for the type object.  We must flush the
cache whenever a translation may become invalid, e.g. if someone does a
DEFTYPE.  [Or a DEFSTRUCT?  Could cause a hairy type to become non-hairy...]
A sketch of such a cache appears below.

Union, Intersection and Intersects should be methods in addition to Subtypep.
Should we unfold the Numeric-Type and/or Array-Type objects?  We could keep the
Type objects mashed together, but still have the necessary distinct Type-Info
structures to represent the hierarchy.

Ultimately the compiler type stuff should completely replace all the existing
type hackery.  We define Typep, Type-Of and Subtypep by doing the necessary
translations between type specifiers and type descriptors.  Defstruct will be
fitted directly into the system, with the Defstruct-Description becoming a Type
structure.  Structures contain the type descriptor as their first element,
rather than the type name.
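As promised above, here is a minimal sketch of the specifier translation cache.
SPECIFIER-TYPE stands for the (expensive) function that really parses a
specifier into a type structure, and the cache names and size are hypothetical.
It is a direct-mapped cache: a collision simply overwrites the old entry, and
the whole cache is flushed when a translation may have become invalid.

    (defconstant type-cache-size 512)

    (defvar *cached-type-specifiers*
      (make-array type-cache-size :initial-element nil))
    (defvar *cached-type-objects*
      (make-array type-cache-size :initial-element nil))

    (defun specifier-cache-hash (spec depth)
      ;; Descend into lists up to DEPTH, EQ-style hashing when we hit an atom
      ;; or bottom out.
      (if (and (consp spec) (plusp depth))
          (logand most-positive-fixnum
                  (+ (specifier-cache-hash (car spec) (1- depth))
                     (* 31 (specifier-cache-hash (cdr spec) (1- depth)))))
          (sxhash (if (consp spec) 'list spec))))

    (defun cached-specifier-type (spec)
      "Translate SPEC to a type structure, going through the cache."
      (let ((i (mod (specifier-cache-hash spec 3) type-cache-size)))
        (if (equal (svref *cached-type-specifiers* i) spec)
            (svref *cached-type-objects* i)
            (let ((type (specifier-type spec)))   ; the real translator (assumed)
              (setf (svref *cached-type-specifiers* i) spec
                    (svref *cached-type-objects* i) type)
              type))))

    (defun flush-type-cache ()
      ;; Called whenever a translation may have become invalid, e.g. after a
      ;; DEFTYPE.
      (fill *cached-type-specifiers* nil)
      (fill *cached-type-objects* nil))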
We implement the interpreter versions of slot accessors by having generic
accessor and setter functions that close over the slot description.  The
generic functions examine the slot description to find the types to check the
arguments for and the offset of the slot within the structure.  The compiler
can special-case calls to these magical functions.  We can live with this bit
of inefficiency in order to get the major win of always having accessors that
are both compact and safe.  Since we can generate inline code that is smaller,
faster and just as safe, we would never call the accessor function in compiled
code unless a switch was thrown to cause slot accesses to be late bound.

Whenever we directly represent a type in compiled code, we use the type
descriptor rather than the name.  We have Typep and Check-Type miscops that
take an object and a type descriptor.  The Typep miscop just calls the typep
method out of the type descriptor.  The Check-Type miscop calls the typep
method and then signals an error if it returned false.

We also want to have fast structure type checking.  We can have a
Structure-Typep miscop which takes the type descriptor as an argument.  If the
object is a structure, we grab out the type descriptor and compare it to the
test type.  If they are EQ, we win.  Otherwise, we grab the includes list out
of the object's type and see if the test type is a member.  The structure and
EQ tests could quite reasonably be open-coded.  We could also do a quick test
for the includes being null, which would let us punt the member test in the
common case of a structure with no includes.  We could do the member test
faster if the includes were represented as something other than a list, e.g. 0
if there are no includes and a vector with 0 at the end when there are.  We
also want a Check-Structure-Type op.

We probably want a distinct major type for structures, both to speed up the
structure test and to prevent random aref'ing of structures.  Could omit the
G-Vector header and have GC look at the defstruct description (heh heh heh...)

Probably some amusing bootstrapping problems in getting a kernel core started.
Type descriptors are structures which have type descriptors which...

(defstruct type
  ;;
  ;; The type-info structure which contains the constant information about this
  ;; type.
  info
  ;;
  ;; The hash of this type structure, so we can rehash without recomputing the
  ;; hash.  This is also useful for a quick collision test.
  hash)

#### Definition of the subtypep methods needs to be rethought...

(defstruct type-info
  ;;
  ;; The name of this type, for printing.
  name
  ;;
  ;; The depth in the type hierarchy that this type appears at.  This is used
  ;; to quickly determine if a subtype relationship might exist, and what that
  ;; relationship must be if it does exist.  This is null in non-hierarchical
  ;; types such as Union.
  (depth nil :type fixnum)
  ;;
  ;; The most specific type which always includes this type.  This serves as a
  ;; guard on the type1 method.  The type1 method is never called with a type2
  ;; that isn't a subtype of the Supertype.
  (supertype *wild-type* :type (or type null) :read-only t)
  ;;
  ;; A list of all the types which have this as their supertype.  This forms an
  ;; exhaustive partition of this type.
  (subtypes () :type list)
  ;;
  ;; Function which computes subtypep when this type is the first arg.
  (subtypep-type1 nil :type function :read-only t)
  ;;
  ;; Function which computes subtypep when this type is the second arg.
  ;; Subtypep always calls this function first, and it is responsible for
  ;; calling the type1 method.
  (subtypep-type2 #'vanilla-type2 :type function :read-only t)
  ;;
  ;; Function that computes Typep for this type.  Object is the first arg and
  ;; the Type object is the second arg.
  (typep nil :type function)
  ;;
  ;; Function which returns a Common Lisp type specifier representing this type.
  (unparse nil :type function :read-only t)
  ;;
  ;; True if this type has a fixed number of members, and as such could possibly
  ;; be completely specified in a MEMBER type.  This is used by the MEMBER
  ;; subtype methods.
  (enumerable nil :type (member t nil) :read-only t))

Function call code sequences:

@subsection[Named Call Sequence]
The @value(DinkyMachine) code to perform a function call using the link table
is:
@begin(Example)
    stm   CONT, start-reg, -nregs   ; Save registers
    cau   NL3, link-tab-high
    lm    NL3, ENV, link-tab-low    ; Get entry and env
    balr  PC, MISC-PC               ; Go
@end(Example)
First, we save as many registers as are live at the time of the call.  Next, we
load the contents of the link table entry.  We must use two instructions, since
the @value(DinkyMachine) cannot generate an absolute address in one
instruction.  After we have the link table entry, we jump to the entry point,
storing the return PC in PC.

@subsection[Anonymous Call Sequence]
Assuming FUN is a register which has already been checked to be a function
entry, the @value(DinkyMachine) code to call an unknown function is:
@begin(Example)
    stm   CONT, start-reg, -nregs   ; Save registers
    lm    FUN, ENV, 1               ; Load code, offset.
    a     MISC-PC, ENV              ; (?) Compute entry point.
    inc   MISC-PC, offset
    lr    ENV, FUN                  ; Set up call ENV.
    balr  PC, MISC-PC               ; Go
@end(Example)
This is similar to the named call, but we read stuff from the function entry
and compute the entry point.

@subsection[Tail Recursive Call]
Tail recursive calls are identical to normal calls, except that the return PC
is explicitly supplied, rather than being computed by BALR.  We just move the
return PC into PC and BR to the function.

@section(Returning from a Function Call)
@label(Return)
@index(Return)
Returning is fairly simple, since we implement function call tail-recursively,
guaranteeing that any stack values are left directly on top of the caller's
stack frame.  The caller is responsible for receiving or discarding the values.
This isn't very onerous in the usual case of discarding the values, since this
normally only requires moving CONT + a constant into CS.

@subsection[Return Value Passing]
Return values are passed in the same locations that arguments are, with the
first three in registers and the rest on the top of the stack.  However, the
values count is handled somewhat differently than the argument count.  There
are two value-passing conventions: single value and variable value.

@subsection[Single Value Return Sequence]
If we are returning exactly one value, then the sequence is simple.  We put the
value in A0 and return at 4 bytes after the return PC.  Assuming the value is
already in A0 and the return PC is in the register SAVE-PC, the code is:
@begin[example]
    inc   SAVE-PC, 4    ; Mark as single-value.
    br    SAVE-PC       ; Go for it.
@end[example]
The increment of the PC could possibly be combined with any
register-to-register move done on the PC.

@subsection[Variable Value Return Sequence]
If it is possible that we return other than one value, then we set any unused
arg registers to NIL, place any extra values on the top of the stack, and
return directly at the return PC.
In the variable value case, assuming the register values are already in place
and there are no stack values, the code sequence looks like this:
@begin[example]
          ; Default unsupplied register values.
    lis   NARGS, count    ; Set number of values.
    br    SAVE-PC         ; Go.
@end[example]
If there are stack values, then there may be a miscop call that BLT's the
values down on top of the current frame.  Nothing clever, just Return-Values,
taking Start, End and Count arguments.  This miscop will only be needed when
the compiler didn't initially allocate the values at the beginning of the
frame, which is usually possible when the number of values being returned is
known.

@subsection[Single Value Entry Sequence]
The return point for receiving a single value is fairly simple, since all it
needs to do is blow away any stack values that might be there:
@begin[example]
    cal   CS, CONT, size            ; Restore CS.
    lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
This sequence must immediately follow the balr instruction, since the callee
adds to the return PC to get the correct entry point.  Note that in the
single-value return case, we don't execute the CAL instruction, although it
would be harmless to do so.

@subsection[Multiple Value Entry Sequence]
The return point for receiving two or three values is also simple.  If
variable values are returned, we discard any stack values.  If a single value
is returned, we default the unsupplied values:
@begin[example]
        b     mv-tag                    ; If MV's, skip.
        cau   A1, nil-h                 ; Default unsupplied values.
        b     done
mv-tag: cal   CS, CONT, size            ; Flush stack values.
done:   lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
If there are stack values, then we emit code to NIL out the unsupplied values
using a miscop call:
@begin[example]
        b     mv-tag                    ; If MV's, skip.
        cau   A1, nil-h                 ; Default register values.
        b     done
        lis   NARGS, 3                  ; Set value count.
mv-tag: cal   CS, CONT, size            ; Set CS for desired values.
        dec   NARGS, nvals              ; Number to default.
        miscop default-values           ; Default them.
done:   lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
Default-Values just sets to NIL the -NARGS cells under CS.
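For exposition only, here is a Lisp model of what the Default-Values miscop
does, treating the control stack as a vector indexed upward and CS as the index
just past the top; the real miscop of course operates directly on the machine
stack and registers.

    (defun default-values (stack cs nargs)
      "Model of the Default-Values miscop.  NARGS is the number of values
    returned minus the number desired, so it is negative when values are
    missing; the -NARGS cells under CS are set to NIL."
      (loop for i from (+ cs nargs) below cs
            do (setf (aref stack i) nil))
      stack)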