Some mumblage about implementation issues for the RT implementation:

RUNTIME ERROR SIGNALLING

From the user's viewpoint, a nice way to signal run-time errors in open-coded
stuff is to build a call to the safe interpreter definition of the function.
This function then signals the error.  This is more user-friendly, since it
shows the function and arguments on the stack.  Furthermore, if the interpreter
definition has whizzy error correction, then it will also apply to compiled
code.

Unfortunately, the type-checking strategy we have chosen is seriously
incompatible with this, since type checking is separated from the operation.
Although we could perhaps to some degree keep track of which operation is
responsible for the check, this isn't good enough, since the type check may be
done on one argument before the other arguments have been evaluated.  Since the
arguments don't exist yet, we can't pass them to the function.  So this
technique cannot in general be applied to open code.  We can use it for miscops
that implement functions, and could also use it for open-coded (possibly
implicit) functions such as Symbol-Value that combine the check with the
operation.

PROFILING

It would be nice to have a general profiling mechanism like the one MIPS uses
to gather dynamic instruction statistics.  This could be used for user-level
tuning, but it would be especially appropriate for analyzing the compiler's
performance for tuning or publication.

The basic mechanism is that we associate a counter with each basic block, and
increment the counter when we execute that block.  To make use of these
per-block counts, there must also be some additional information about what is
in each block.

We implement the counters by having a 32-bit I-Vector pointed to by the
profiled component's constant pool.  When we enter the component, we stash this
in a known, permanently allocated register.  We also have another permanently
allocated register that is used as a scratch location to load the current count
into and increment.

Probably a useful kind of per-block information is to have, for each VOP used
in the block, the name, the number of times used, and the generator number that
was used.  We represent this by having a simple-vector parallel to the counters
vector.  Each entry points to per-block information, perhaps represented as a
simple-vector of VOP names and use counts, with the generator number only being
specified when it isn't 0.  We also have the start PC for each block so that we
can use the debug information to find out what the block is doing.

VOP usage information is probably most interesting as far as the compiler is
concerned.  As long as we have accurate cost information, we can directly
determine what percentage of time is spent doing each VM operation, instead of
having to guess from instruction usage statistics.  These results could be used
either to tune performance-critical generators, or to rewrite code so as to
avoid using expensive operations.

This information could also be useful for testing purposes.  If the VOP usage
signature of a test changes, then this is a warning that a pessimization may
have been introduced by a compiler change.  We could also detect generators
that aren't used by our test suite.

It should be easy to hack the assembler to dump instruction usage information
too, although this would be of more interest to architecture types.
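To make the intent concrete, here is a minimal sketch of how the per-block data
might be combined after a profiled run.  The names and representation are
hypothetical: BLOCK-COUNTS stands for the counters I-Vector, BLOCK-INFO for the
parallel simple-vector whose entries are modeled here as lists of
(vop-name use-count [generator]), and VOP-COST for a function giving the
assumed cost of a VOP.

    (defun summarize-vop-usage (block-counts block-info vop-cost)
      "Return an alist of (vop-name . estimated-cost), weighting each block's
    static VOP usage by the number of times the block was executed."
      (let ((totals (make-hash-table :test #'eq)))
        (dotimes (i (length block-counts))
          (let ((count (aref block-counts i)))
            (dolist (use (aref block-info i))
              (destructuring-bind (name uses &optional generator) use
                (declare (ignore generator))
                ;; Dynamic cost of this VOP in this block = executions of the
                ;; block * static uses * per-use cost.
                (incf (gethash name totals 0)
                      (* count uses (funcall vop-cost name)))))))
        (let ((result '()))
          (maphash (lambda (name cost) (push (cons name cost) result)) totals)
          (sort result #'> :key #'cdr))))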
LINK TABLE

There isn't anything particularly good about the current link table format.  It
has gratuitously bad VM locality, since the link table entries for related
functions are scattered about randomly in the hash table.  It is also extremely
expensive (~25 ms) to do set-definition, since it requires a scan of the entire
link table.

What we do is allocate a new cell in the symbol that points to the head of the
chain of link-table entries for that symbol.  A link-table entry has the
following format:
    0: argument count for this entry
    1: pointer to next entry for this symbol, or 0 if none
    2: environment for call
    3: entry PC for call

Link table entries are stored in a special system table area so that we can
pass around random pointers to them.  We could squeeze the argcount and the
next pointer into a single word if we really wanted to.

This representation makes it trivial to find all the link table entries for a
given symbol, so set-definition can win.  It also maintains some kind of VM
locality, since link table entries will be allocated in load order rather than
randomly.  We also win in load-time VM locality, since we don't have to fault
in a moby hash-table in order to find the load-time fixups.

It might seem that this would take more space than the current representation,
but we can actually expect a 20% space reduction, since we need only allocate
as many link table entries as are used.  Although the total number of symbols
with link table entries is less than 20%, we can get away with allocating a
cell for each symbol, since we gain lots of space by flushing the hashtable.

FAST SYSTEM FUNCTION CALL

Adding the notion of a constant function would allow a super-fast calling
sequence.  A constant function is a function which may not be redefined.  We
put the function object and entry PC in our constants vector.  We know which
registers it uses, so we might not need to save any.  We know how many values
it returns.  We know if it changes SP.  Calling a simple constant function may
be nearly as fast as a miscop call.

Here is a call of a constant function known not to trash any of our registers.
Assume ENV and PC were saved at function entry.  We don't have to do anything
on return.
    lm   env-save, ENV, offset    ; Get new ENV and PC
    balr PC, PC

This would hair up GC some, but not too much: all you have to do is look out
for function-objects followed by PC's while scavenging function space.

Easier and perhaps more worthwhile would be a way to define miscops in Lisp.
There would be some Defun-Miscop form or something.  This would allow basic
"lispy" primitives such as MEMQ, plist hackery, LENGTH, and whatnot, to be
implemented in portable Lisp code with no loss in efficiency.  Generic
arithmetic and perhaps even storage allocation would also be possibilities for
rewriting in Lisp.

The code would be required to obey certain restrictions so that the miscop
calling convention can be used.  The callee-saves convention means that we have
to be careful to restore all the registers we saved on any exit from the code.
The easiest way to do this would be to require that non-local exits never be
done within the dynamic extent of the function.  This is probably somewhat too
restrictive, since we need to be able to signal errors, and they can throw.  If
we always restore the registers around any full calls that we do, then we can
always call a function to do arbitrary things that may need to be done.  We
would have to do the same thing before signaling the error in an inline error
check.
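To make the idea concrete, here is what one of those "lispy" primitives looks
like as portable Lisp code.  Since the Defun-Miscop form is deliberately left
unspecified above, this is written as an ordinary DEFUN; the point is that the
body closes over nothing and does no non-local exit, so it meets the
restrictions just described and could be compiled with the miscop convention.

    (defun memq (item list)
      "Return the first tail of LIST whose car is EQ to ITEM, or NIL if
    there is none."
      (do ((l list (cdr l)))
          ((null l) nil)
        (when (eq item (car l))
          (return l))))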
The other distinguishing characteristic of the miscop call convention is that
no environment is passed.  In addition to the obvious restriction that a miscop
function couldn't close over its environment, this also makes it difficult to
reference non-immediate constants.  Fortunately, the link-table allows us to do
almost any kind of global function call.  It is probably acceptable to require
the global variables we need to be explicitly specified as "built-in" when we
make the kernel core, causing the symbols to be placed at a known location.  A
global variable access mechanism also provides a way to get non-immediate
constants: just do a DEFCONSTANT.

MULTIPLE THREAD SUPPORT

When we support multiple threads under Mach, we will have to scrounge a THREAD
register.  This will point to a structure which contains the thread-specific
context which doesn't fit in registers:
    Current special binding block
    Active catch pointer
    Any number stacks
    Information for management of the resources for this thread:
        Base and size of stacks.
        Descriptive info (name...)
        Policy info (priority...)
        OS level handle on the thread
        Synchronization info such as what lock the thread is waiting on, if any.

The THREAD pointer could actually be the base of the thread's control stack,
with the frequently used info being stored before any real stack stuff.  This
would allow fast checks for continuation validity, since both the top and
bottom of the cstack will be in registers.  This is useful, since a non-local
GO into another thread's stack would cause flaming death.

We will also have to modify code which manipulates resources shared between
threads so that it does synchronization.  The big issue performance-wise is the
alloc-table: when there is no lock on the table, we want to get at it real
fast.  Contention for the alloc-table could be reduced by having multiple
allocation spaces (and tables) for important types.  You could just grab a
table off of a queue and put it back when you are done.

It will be necessary to make all miscops "interruptible", since they will have
to be able to run concurrently.  Probably the only resource shared at the
miscop level is the alloc-table.  Note that software interrupts are going away,
so strictly speaking interrupts won't happen, but a scheduling break can happen
at any time.  The only big problem I see is with miscops which manipulate raw
pointers into objects.  The system must either avoid GC while such a pointer is
in use, or adjust the raw pointer if the object is moved.  Probably the fix is
to allow raw pointers in boxed registers as long as the real object is also in
a boxed register.  This would allow random pointer hacking to be open-coded as
well.  Being preempted during object allocation would still be a problem, but
the allocator will have to get a lock on the alloc-table before it can do
anything, and GC can't run without the alloc-table.

Of course, we will have to go to deep binding when we support threads.  We can
eliminate the binding stack by threading the BIND blocks together on the
control stack.  This will simplify the problem of allocating stack for each
thread.  Deciding on the exact implementation of deep-binding is non-trivial,
since there are many different ways it could be done.  The scheme used on the
S1 seems like a reasonable start, though.  A BIND block looks like this:
    . . .
The value cells are just allocated on the stack like a lexical variable.  The
BIND block is delimited by the prev pointer.  The reason for keeping a pointer
to the cell instead of just using the stack location directly is that it lets
us cache binding lookups.
Whenever we look up a variable, we link a LOOKUP frame into the bindings.  This
lookup frame contains the result of our search and looks just like a BIND frame
as far as the search is concerned.  The theory is that this will reduce the
average distance searched, since commonly used variables will commonly have
been used somewhere near you on the stack.  Since we look up the address of the
value cell rather than the value proper, we only have to do the lookup once per
function.  Note that there is no obligation to link in a LOOKUP block; if it is
unlikely or impossible that any function we call will reference any of the
variables that we reference, then not making a LOOKUP block is a good idea.

The S1 has an important optimization of variable lookup.  It is based on the
observation that many special variables are never actually bound.  If a
variable is known not to be bound, we can just go directly for the value cell.
One solution to the problem would be to introduce a new GLOBAL variable kind
which cannot be bound, but we can do nearly as well by noticing the lack of
bindings at run time.  The S1 keeps a count of bindings in a cell in each
symbol; if the count is zero at lookup time, we don't have to do any lookup.  I
think that it would work just as well to have an EVER-BOUND flag which is set
and never reset.  One advantage of this approach is that there is no need to do
anything when a binding is removed.  On the S1, THROW needs to scan up the
bindings, decrementing the count, which requires a way to distinguish BIND
frames from LOOKUP frames.

There is a trick which might be a win if BIND and LOOKUP blocks have a large
enough average size.  The idea is to use a hash filter to determine if a given
symbol might be in a particular block.  Each symbol name is hashed to a fixnum
with a few (3 or 4) random bits on.  When a block is built, the OR of these
hash values is placed at the beginning.  When looking for a symbol, we can skip
a block if any of the bits for the symbol are zero in the hash for the block.
All these hash codes are computed at compile time, of course.  When binding an
unknown symbol, we just use all 1's.  We can't delimit the block with the prev
link, since that has to be next to the hash so that we can skip the block
without going to the end.  A block would look more like this:
    . . .
In a block of less than six or so entries, we can reject it with about 90%
probability.  The probability improves as the block size decreases, but the
advantage decreases as well.  The actual advantage also depends on the dynamic
characteristics of the lookup process, since if it turns out that a lookup is
usually satisfied on the previous block, we will never reject any blocks.
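Here is a minimal Lisp sketch of the filtered lookup just described, purely for
exposition.  The representation is hypothetical: each BIND or LOOKUP block is
modeled as a list whose first element is the OR of the per-symbol filters,
whose second element is the prev link, and whose remaining elements are
(symbol . value-cell) pairs.

    (defun symbol-hash-filter (symbol)
      ;; A fixnum with a few pseudo-random bits on, derived from the symbol
      ;; name.  In the real system this would be computed at compile time.
      (let ((h (sxhash (symbol-name symbol))))
        (logior (ash 1 (mod h 24))
                (ash 1 (mod (ash h -5) 24))
                (ash 1 (mod (ash h -11) 24)))))

    (defun lookup-value-cell (symbol blocks)
      "Search the chain of BIND/LOOKUP blocks for SYMBOL's current value cell,
    skipping any block whose hash filter shows the symbol cannot be present.
    Returns NIL if the symbol is not bound anywhere on the chain."
      (let ((filter (symbol-hash-filter symbol)))
        (do ((blk blocks (second blk)))
            ((null blk) nil)
          ;; Reject the block quickly unless all of SYMBOL's filter bits are on.
          (when (= filter (logand filter (first blk)))
            (let ((entry (assoc symbol (cddr blk) :test #'eq)))
              (when entry
                (return (cdr entry))))))))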
Bizarre optimization on deep-binding: If we notice that a function which we
call in a local call is referencing a variable we bind, we can pass in the
value cell pointer as an argument so that it doesn't have to do the lookup.  It
is questionable whether this would help real code even with massive block
compilation, but it sure would win for STAK.

We could actually generalize this optimization to be less obviously bogus.
What we do is shallow-bind the *location* of the current value cell for each
special variable which is both referenced and bound.  At each entry-point to
the block, we build a specbind frame for the variables bound in the block (it
doesn't need to be all of them).  Initially all the value-cell pointers are to
the looked-up current values.  Within the block, we keep track of the current
value cell by shallow-binding it in some known location.  This could be in a
register or in the specbind frame itself.  Internally to the block, we can do a
special reference by indirecting through the magic location, and can do a
binding by saving the old value in our frame and then storing the location of
the new cell in our frame.  When we do a full function call, we move any value
cell pointers which are in registers into the correct place in the specbind
frame.  This could make STAK *faster* than normal shallow-bound
implementations, and might well help real programs also.

TYPE SYSTEM

Hash-cons type structures: When we define a type, we define the equality and
hash functions for each slot (default EQUAL and SXHASH).  We automatically
generate a "constructor" function that takes all the slot values as fixed
arguments and returns the preexisting equivalent type if there is one, or makes
a new one if not.  Doing this makes type equality be EQ and reduces type
consing.  We could have a family of special comparison and hashing functions in
order to get an efficiency win.

We can just use a simple-vector as an open hash-table.  Things are simple since
we never delete anything.  We must store the hash in the type object so that we
can rehash without having to know how to hash the type object, but having the
hash is also useful for a quick collision test.

It is probably undesirable to hash-cons function types, since comparisons of
function types are something of a marginal proposition.  This won't hurt
efficiency, since function types are rarely seen as "normal" types by ir1opt
type inference.  Hash-consing function types would be difficult due to the
complexity of function types, and would probably also substantially increase
the amount of garbage tied down by the hash table, since there will potentially
be a lot more function types than other types in a program with full
declarations.  On the other hand, if all the types are placed in the
environment by Proclaim, and cannot be cleared, then the types won't be
GC'able anyway.  In this case, we would reduce consing by hash-consing.

There isn't any need to consider the nature of a structure type when
hash-consing it.  We can always use the same type for the same named structure,
and replace the Defstruct-Description when the structure is redefined.  (Could
cause problems with bootstrapping, though.)

Have a cache of type-specifier => type structure translations: We use a
fixed-size equal/eq hash cache.  We descend into lists up to a small fixed
depth such as 3, and eq hash when we hit an atom (or bottom out?).  The fixed
recursion can just be written as nested loops.  We use two parallel open hash
vectors, one for the specifier and one for the type object.  We must flush the
cache whenever a translation may become invalid, e.g. if someone does a
DEFTYPE.  [Or a DEFSTRUCT?  Could cause a hairy type to become non-hairy...]
A sketch of such a cache appears below.

Union, Intersection and Intersects should be methods in addition to Subtypep.
Should we unfold the Numeric-Type and/or Array-Type objects?  We could keep the
Type objects mashed together, but still have the necessary distinct Type-Info
structures to represent the hierarchy.

Ultimately the compiler type stuff should completely replace all the existing
type hackery.  We define Typep, Type-Of and Subtypep by doing the necessary
translations between type specifiers and type descriptors.  Defstruct will be
fitted directly into the system, with the Defstruct-Description becoming a Type
structure.  Structures contain the type descriptor as their first element,
rather than the type name.
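As promised above, here is a minimal sketch of the specifier translation cache.
SPECIFIER-TYPE stands for the (expensive) function that really parses a
specifier into a type structure, and the cache names and size are hypothetical.
It is a direct-mapped cache: a collision simply overwrites the old entry, and
the whole cache is flushed when a translation may have become invalid.

    (defconstant type-cache-size 512)

    (defvar *cached-type-specifiers*
      (make-array type-cache-size :initial-element nil))
    (defvar *cached-type-objects*
      (make-array type-cache-size :initial-element nil))

    (defun specifier-cache-hash (spec depth)
      ;; Descend into lists up to DEPTH, EQ-style hashing when we hit an atom
      ;; or bottom out.
      (if (and (consp spec) (plusp depth))
          (logand most-positive-fixnum
                  (+ (specifier-cache-hash (car spec) (1- depth))
                     (* 31 (specifier-cache-hash (cdr spec) (1- depth)))))
          (sxhash (if (consp spec) 'list spec))))

    (defun cached-specifier-type (spec)
      "Translate SPEC to a type structure, going through the cache."
      (let ((i (mod (specifier-cache-hash spec 3) type-cache-size)))
        (if (equal (svref *cached-type-specifiers* i) spec)
            (svref *cached-type-objects* i)
            (let ((type (specifier-type spec)))   ; the real translator (assumed)
              (setf (svref *cached-type-specifiers* i) spec
                    (svref *cached-type-objects* i) type)
              type))))

    (defun flush-type-cache ()
      ;; Called whenever a translation may have become invalid, e.g. after a
      ;; DEFTYPE.
      (fill *cached-type-specifiers* nil)
      (fill *cached-type-objects* nil))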
We implement the interpreter versions of slot accessors by having generic
accessor and setter functions that close over the slot description.  The
generic functions examine the slot description to find the types to check the
arguments for and the offset of the slot within the structure.  The compiler
can special-case calls to these magical functions.  We can live with this bit
of inefficiency in order to get the major win of always having accessors that
are both compact and safe.  Since we can generate inline code that is smaller,
faster and just as safe, we would never call the accessor function in compiled
code unless a switch was thrown to cause slot accesses to be late bound.

Whenever we directly represent a type in compiled code, we use the type
descriptor rather than the name.  We have Typep and Check-Type miscops that
take an object and a type descriptor.  The Typep miscop just calls the typep
method out of the type descriptor.  The Check-Type miscop calls the typep
method and then signals an error if it returned false.

We also want to have fast structure type checking.  We can have a
Structure-Typep miscop which takes the type descriptor as an argument.  If the
object is a structure, we grab out the type descriptor and compare it to the
test type.  If they are EQ, we win.  Otherwise, we grab the includes list out
of the object's type and see if the test type is a member.  The structure and
EQ tests could quite reasonably be open-coded.  We could also do a quick test
for the includes being null, which would let us punt the member test in the
common case of a structure with no includes.  We could do the member test
faster if the includes were represented as something other than a list, e.g. 0
if there are no includes and a vector with 0 at the end when there are.  We
also want a Check-Structure-Type op.

We probably want a distinct major type for structures, both to speed up the
structure test and to prevent random aref'ing of structures.  Could omit the
G-Vector header and have GC look at the defstruct description (heh heh heh...)

Probably some amusing bootstrapping problems in getting a kernel core started.
Type descriptors are structures which have type descriptors which...

(defstruct type
  ;;
  ;; The type-info structure which contains the constant information about this
  ;; type.
  info
  ;;
  ;; The hash of this type structure, so we can rehash without recomputing the
  ;; hash.  This is also useful for a quick collision test.
  hash)

#### Definition of the subtypep methods needs to be rethought...

(defstruct type-info
  ;;
  ;; The name of this type, for printing.
  name
  ;;
  ;; The depth in the type hierarchy that this type appears at.  This is used
  ;; to quickly determine if a subtype relationship might exist, and what that
  ;; relationship must be if it does exist.  This is null in non-hierarchical
  ;; types such as Union.
  (depth nil :type fixnum)
  ;;
  ;; The most specific type which always includes this type.  This serves as a
  ;; guard on the type1 method.  The type1 method is never called with a type2
  ;; that isn't a subtype of the Supertype.
  (supertype *wild-type* :type (or type null) :read-only t)
  ;;
  ;; A list of all the types which have this as their supertype.  This forms an
  ;; exhaustive partition of this type.
  (subtypes () :type list)
  ;;
  ;; Function which computes subtypep when this type is the first arg.
  (subtypep-type1 nil :type function :read-only t)
  ;;
  ;; Function which computes subtypep when this type is the second arg.
  ;; Subtypep always calls this function first, and it is responsible for
  ;; calling the type1 method.
  (subtypep-type2 #'vanilla-type2 :type function :read-only t)
  ;;
  ;; Function that computes Typep for this type.  Object is the first arg and
  ;; the Type object is the second arg.
  (typep nil :type function)
  ;;
  ;; Function which returns a Common Lisp type specifier representing this type.
  (unparse nil :type function :read-only t)
  ;;
  ;; True if this type has a fixed number of members, and as such could possibly
  ;; be completely specified in a MEMBER type.  This is used by the MEMBER
  ;; subtype methods.
  (enumerable nil :type (member t nil) :read-only t))

Function call code sequences:

@subsection[Named Call Sequence]
The @value(DinkyMachine) code to perform a function call using the link table
is:
@begin(Example)
    stm   CONT, start-reg, -nregs   ; Save registers
    cau   NL3, link-tab-high
    lm    NL3, ENV, link-tab-low    ; Get entry and env
    balr  PC, MISC-PC               ; Go
@end(Example)
First, we save as many registers as are live at the time of the call.  Next, we
load the contents of the link table entry.  We must use two instructions, since
the @value(DinkyMachine) cannot generate an absolute address in one
instruction.  After we have the link table entry, we jump to the entry point,
storing the return PC in PC.

@subsection[Anonymous Call Sequence]
Assuming FUN is a register which has already been checked to be a function
entry, the @value(DinkyMachine) code to call an unknown function is:
@begin(Example)
    stm   CONT, start-reg, -nregs   ; Save registers
    lm    FUN, ENV, 1               ; Load code, offset.
    a     MISC-PC, ENV              ; (?) Compute entry point.
    inc   MISC-PC, offset
    lr    ENV, FUN                  ; Set up call ENV.
    balr  PC, MISC-PC               ; Go
@end(Example)
This is similar to the named call, but we read stuff from the function entry
and compute the entry point.

@subsection[Tail Recursive Call]
Tail recursive calls are identical to normal calls, except that the return PC
is explicitly supplied, rather than being computed by BALR.  We just move the
return PC into PC and BR to the function.

@section(Returning from a Function Call)
@label(Return)
@index(Return)
Returning is fairly simple, since we implement function call tail-recursively,
guaranteeing that any stack values are left directly on top of the caller's
stack frame.  The caller is responsible for receiving or discarding the values.
This isn't very onerous in the usual case of discarding the values, since this
normally only requires moving CONT + a constant into CS.

@subsection[Return Value Passing]
Return values are passed in the same locations that arguments are, with the
first three in registers and the rest on the top of the stack.  However, the
values count is handled somewhat differently than the argument count.  There
are two value-passing conventions: single value and variable value.

@subsection[Single Value Return Sequence]
If we are returning exactly one value, then the sequence is simple.  We put the
value in A0 and return at 4 bytes after the return PC.  Assuming the value is
already in A0 and the return PC is in the register SAVE-PC, the code is:
@begin[example]
    inc   SAVE-PC, 4    ; Mark as single-value.
    br    SAVE-PC       ; Go for it.
@end[example]
The increment of the PC could possibly be combined with any
register-to-register move done on the PC.

@subsection[Variable Value Return Sequence]
If it is possible that we return other than one value, then we set any unused
arg registers to NIL, place any extra values on the top of the stack, and
return directly at the return PC.
In the variable value case, assuming the register values are already in place
and there are no stack values, the code sequence looks like this:
@begin[example]
          ; Default unsupplied register values.
    lis   NARGS, count    ; Set number of values.
    br    SAVE-PC         ; Go.
@end[example]
If there are stack values, then there may be a miscop call that BLT's the
values down on top of the current frame.  Nothing clever, just Return-Values,
taking Start, End and Count arguments.  This miscop will only be needed when
the compiler didn't initially allocate the values at the beginning of the
frame, which is usually possible when the number of values being returned is
known.

@subsection[Single Value Entry Sequence]
The return point for receiving a single value is fairly simple, since all it
needs to do is blow away any stack values that might be there:
@begin[example]
    cal   CS, CONT, size            ; Restore CS.
    lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
This sequence must immediately follow the balr instruction, since the callee
adds to the return PC to get the correct entry point.  Note that in the
single-value return case, we don't execute the CAL instruction, although it
would be harmless to do so.

@subsection[Multiple Value Entry Sequence]
The return point for receiving two or three values is also simple.  If
variable values are returned, we discard any stack values.  If a single value
is returned, we default the unsupplied values:
@begin[example]
        b     mv-tag                    ; If MV's, skip.
        cau   A1, nil-h                 ; Default unsupplied values.
        b     done
mv-tag: cal   CS, CONT, size            ; Flush stack values.
done:   lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
If there are stack values, then we emit code to NIL out the unsupplied values
using a miscop call:
@begin[example]
        b     mv-tag                    ; If MV's, skip.
        cau   A1, nil-h                 ; Default register values.
        b     done
        lis   NARGS, 3                  ; Set value count.
mv-tag: cal   CS, CONT, size            ; Set CS for desired values.
        dec   NARGS, nvals              ; Number to default.
        miscop default-values           ; Default them.
done:   lm    CONT, start-reg, -nregs   ; Restore registers.
@end[example]
Default-Values just sets to NIL the -NARGS cells under CS.
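For exposition only, here is a Lisp model of what the Default-Values miscop
does, treating the control stack as a vector indexed upward and CS as the index
just past the top; the real miscop of course operates directly on the machine
stack and registers.

    (defun default-values (stack cs nargs)
      "Model of the Default-Values miscop.  NARGS is the number of values
    returned minus the number desired, so it is negative when values are
    missing; the -NARGS cells under CS are set to NIL."
      (loop for i from (+ cs nargs) below cs
            do (setf (aref stack i) nil))
      stack)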