-*- Dictionary: int:design; Package: C -*-

Structure stuff:

Layout:
 -- Contains information needed by the runtime system for type tests & GF dispatching.
 -- Changes whenever an open-coded accessor would be invalidated.
 -- There may be more than one, because the type was redefined in-core, or because the compiler created it. Normally there is not more than one valid layout.

Class:
 -- Is the "name" of a defined type.
 -- Holds the current run-time info about this type, including the current layout.
 -- Not duplicated at compile-time when redefined for bootstrapping; only the layout and layout-info are duplicated. See INFO TYPE COMPILER-LAYOUT.

Defstruct description & other meta-info:
 -- Used by the compiler (and by introspective tools?)

Dylan:
 -- Naming problems: we'd like to be able to support class names that aren't Lisp symbols in the dumper. One solution would be to continue to use load-time-value to reference Dylan classes (the bootstrapping issues can be ignored for now.)

New Alien/dynamic extent:

Staged transforms (delayed inline expansion)
Variable stack TNs (SPECIFY-TN-SIZE TN Size => Modified-TN)
[B.t.w., Alien variables are dynamic extent, and should inhibit tail recursion.]

Note that variable TNs are just what we need for dynamic-extent allocation. In e.g. move-from-single, we can allocate a stack TN, initialize it in heap format, and return the pointer. Actually, we might want different VOPs for moving from dynamic-extent vars (a new feature of VOP arg restrictions?) In most cases, we wouldn't even need new IR2 convert methods. We could just define STACK-CONS, give it a dynamic-extent restriction, and a lower cost than CONS.

Extend VOP temp syntax with :SIZE and :ENVIRONMENT {T | NIL} (incompatible with to/from.) This will only work for objects whose size is a compile-time constant (but could be things such as fixed-length vectors.) Since the stack allocation is associated with the allocator (rather than the variable), dynamic-extent variables could still be dynamically typed (i.e. we could store either a stack CONS pointer, a stack FLOAT pointer, or any random descriptor in the same variable.)

We don't want to substitute away let bindings of alien r-values with Lisp representations, since that would actually be a pessimization (inhibiting register allocation.) We really can't do it anyway, since we have no way of knowing whether the alien variable is side-effected after the lambda binding. What we *do* want to do is recognize and propagate alien expressions that are compile-time constant (addressing expressions.) We don't exactly want to substitute them, since that would cause redundant computation. But we do want uses of the let var to realize that this expression is a known alien value (and avoid consing an alien-value structure.)

Since an alien value contains only a SAP and a Type, and the type should be compile-time constant, we can propagate the Alien type in the LAMBDA-VAR-TYPE, and change the actual variable value to be the SAP. We would then know that the primitive-type of an Alien-type value is actually SAP. (The alien-type must be a compile-time constant, and not merely known to be some sort of Alien.) We would change any non-Alien-operator references to the Alien-type var into references to (sap-alien sap type), constructing a true alien-value if that reference isn't eventually discovered to be in an Alien context.

Debug info:

Some way to find the real function from an XEP so that e.g. FUNCTION-DEBUG-FUNCTION can do something sensible.
Could handle function and keyword names in general by having a per-debug-info simple-vector map of names that we indirect through. Or even if we didn't use a simple-vector map, having one "dictionary" of names would probably simplify the debug-info format. For objects that will be in the core anyway, a pointer representation is quite compact. It would be, in effect, 5 bytes + amortized vector overhead. The worst-case overhead is 17 bytes, which is not bad. And for the common case of just one name, we could store the name directly in the debug-info, instead of storing a vector. (The component string name can be a special value.)

Note that if we did in general represent names as pointers, variable names would still want to be representable in a packed binary form, since we don't want to have to keep all those symbols around. We could have both a "packed dictionary" and a "pointer dictionary"; there might be some usefulness in indirecting to get variable names. But since we already have the existing variable format, perhaps it's not worth changing.

Definition/redefinition:

Value namespace:
    special
    constant
    unknown
    alien variable
    documentation

Function namespace:
    assumed function
    function, where-from
    macro
    special-form
    slot accessor
    inline, inlinep
    alien operator
    documentation

Setf namespace:
    assumed function
    setf function
    setf macro (inverse or method)
    slot setter
    inline
    documentation

Type namespace:
    builtin
    deftype
    defstruct
    documentation

Move macros.lisp into the C package.

V2 Python:

SSA IR1 optimization
Path splitting, loop splitting
Byte code (compact executable IR1 representation)
Dynamic compilation (dependency recording/who calls)
Dynamic call profiling (call count and order)
    Used both to drive dynamic compilation and for locality (load order) optimization.
Demand driven loading
    Debug info left in fasl file
Profile driven loading
Global optimization mode
    Inlines or block-compiles functions and methods according to the in-core definitions and profile info. Leaves dependency info in case the bindings are different at load-time.

Byte code:

Global map has descriptors for referenced global variables (and methods), and integrates with the dependency recording mechanisms.
Local map has variable descriptors (name, type), and indices map directly onto interpreted stack frame slots.
Constant pool has types, random constants.
Function map has external entry points (offsets, ???), and information used to reconstruct LET lambdas.

How to handle continuation substitution (non-stack)? We would like not to inhibit it, since we would like to do all kinds of IR1 optimization before extracting the byte-code and then proceeding to generate IR2.

Retain the IR1 interpreter for debugging? We need some way to step through the guts of macroexpansions...

In addition to the obvious "do something" code, the byte stream represents:
 -- Type assertions (wherever the continuation has type-check true)
 -- Non-obvious local derived-type information (from TRULY-THE or dynamic type inference.)
 -- BIND and ENTRY nodes

Use debug-info-like representations for functions, blocks, etc. Probably should share code and representations with debug-info. Byte-code could be considered just another aspect of debug-info.

cleanups, nlx-info?
tail-set, tail-p?
lexenv, source-path?

Interpreted debugging:

To implement stepper semantics, we need a mechanism for getting at the return values from forms. This seems impossible in compiled code, so this would be a place where the interpreter/compiler distinction would be apparent.
(Actually, if the form was a call and we knew it, then we could access the return values from the passing locations.) I suppose something like:

    Monitor-Result Debug-Function Source-Location Function
        When a value is returned out of the form designated by Code-Location, call Function on the list of values returned.

    code-location-blocks code-location => (name . code-location)*
    code-location-go-tags code-location => (name . code-location)*
    code-location-variables code-location => debug-variable*
    code-location-functions code-location => debug-function*

Note that with code-location-form-number, we have to map the source path to the form number, and presumably map it right back again. Oh well... Maybe we should provide an interface for grabbing the source, which can do the obvious thing with interpreted code, and can try to grab the source out of the file if compiled. Note that we can fairly easily cache the source-path => form-number translation by entering the paths in an EQUAL hashtable as we do the tree walk. [We can actually use the same code that the compiled debug-info dumper uses.]

Paths:

We can represent a path by its state vector (or function) which maps each IF to T, NIL, don't-care or not-executed. A path would only have don't-care for an IF if we merged the paths coming off of that IF, either because they weren't interestingly different, or because we were lazy.

Advanced optimizations (flow analysis, code motion and loop optimization):

Consider code motion of control structure as a way to optimize OO programs and programs with other automatic inlining. Basic Zadjek and whatever is a good start, but in the delayed-binding OO environment, we may want to consider some other things.

First of all, the conditionals due to delayed binding are highly predictable, so it is interesting to make a loop customized for all the probable branches. Consider the "state vector" of a segment to specify {true, false, unknown} for each conditional in the segment. This is an exceedingly compact representation of a particular branch configuration that it might be interesting to customize. For any tests that can be moved to the head, we can make customized versions for all those possible values. In reality, we will probably make only one customized version (or one for each interesting class?) Possibly we would choose a few interesting state vectors (always including all-unknown) and then simultaneously do flow analysis for all these subgraphs. Probably incremental flow analysis will be useful also.

The other thing to consider is that in a late-binding environment (and to some degree in all symbolic programming) potentially loop-invariant conditionals can't be moved out of the loop because of possible side-effects (rather than because of data-dependency on a loop variable.) Yet in many cases, the loop will run a long way (or all the way) without these side-effects actually happening. In particular, consider the DEFMETHOD side-effect vs. the inline caching code. As long as the inline-cached code has no DEFMETHOD effect, we can make a specialized loop that only checks the cache at the loop head. This requires some sort of "induction" step. In general, the loop actually still can have side-effects, but we have to check the cache any time we actually do a side-effect that might invalidate the cache. If the cache is invalidated, then we must branch out to the general-case version of the loop.

Note that this might usefully apply to some symbolic loops. In particular, consider a "search and destroy" loop.
Many times the loop may run without doing anything much (no side-effects), but sometimes it will do some arbitrary full call which might have any sort of side-effect.

I guess the general idea is that some expressions may be "mostly invariant", that is, invariant but for dynamically rare side-effects. Instead of inhibiting invariant motion of these expressions, we allow motion, but validate the motion after each offending side-effect. If the expression isn't conditional, we just re-evaluate it. If the expression is conditional, then we must evaluate it, and possibly punt out to a general-case loop (if this loop assumed the other sense of the conditional.) Liveness-analysis-based dead expression elimination might become useful here, since some invariant variables might be dead after a particular side-effect (consider multiple side-effects back-to-back.) There is no point in revalidating a cache if it is about to be invalidated immediately.

Note that this loop-splitting will pretty badly mangle the flow graph and IR. Jumping into the middle of the general loop is especially nasty. In addition to updating flow info, we have to effectively assign the specialized loop's variables to the general loop's variables (and compute any additional invariants the general loop has cached.) Maybe we would do sets, then redo SSA conversion. Similarly, we would have to somehow transfer over the values of any continuations whose values had been computed but not yet received (urk...) There could be some problems with putting the aggregate flow graph back into SSA, though, since it will be highly non-reducible. I don't know if this can be done more compactly than having a separate general loop for each entry point.

Note about SSA:

IR1 is typically nearly in SSA already. The only variables that need to be frobbed are set variables. We could put IR1 in SSA in a way that represents joins as tail local calls (i.e. the multi-let concept). In other words, we use SSA conversion to recast conventional iteration into functional iteration. Any set becomes a let of the new value, and when we join two sets, we just do two tail calls to a multi-let that binds the join variable to whichever of the introduced let variables reaches it. This has nice properties:
 -- There is true equivalence between conventional iteration and functional iteration without writing each optimization twice (once for set variables and once for local calls.)
 -- We convert set variables into unset variables so that:
    - Optimizations can stop worrying about the correctness implications of assignments.
    - Programs that use gratuitous assignments won't be penalized.
After SSA conversion, the basic unset-variable optimizations can go to town. A rough source-level analogy is sketched below.

Re: the "multi-let" idea. Call it an "assignment". It has >= 2 calls, but no return. Environment semantics are like let, but call arguments are treated pretty much like a non-let local call. When the number of calls drops to 1, it becomes an ordinary let.

A local call is changed to an assignment if <= 1 call is non-TR. The body is spliced in at the non-TR call, much like a let is now. All the TR calls become a combination whose successor is the function head. Note that if we eagerly convert all tail local calls (not in XEPs) to a direct control transfer, then this provides a natural way to drive automatic conversion to an assignment. When conversion of a TR call leaves one or fewer non-TR calls, then we make it an assignment. (Or when a call is deleted.)
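As a rough source-level analogy for this recasting (the real transformation happens on IR1; the code below is only an illustrative sketch, not what the compiler literally produces):

    ;; Conventional iteration with a set variable:
    (defun count-up (n)
      (let ((i 0))
        (loop
          (when (> i n) (return i))
          (setq i (+ i 1)))))

    ;; The same function recast functionally: the SETQ becomes a binding
    ;; of the new value, and the join point becomes a tail local call
    ;; (the "multi-let"/assignment) that binds the join variable.
    (defun count-up (n)
      (labels ((iter (i)
                 (if (> i n)
                     i
                     (iter (+ i 1)))))
        (iter 0)))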
Until (and if) we convert to a multi-let, the TR calls will look a bit odd, since their CONT will still be the return-result, but their successor will be the function head. But that's currently what TR local calls look like after LTN. We need to be totally sure that the calls will be TR, though, which requires checking whether there is any change in the cleanup that might cause the call to be non-TR. We already have tail-set stuff incrementally noticing when calls are TR (more or less; we don't currently care if the call is truly TR, just if it returns its result.) Early conversion is a good idea, since we would like tail calls to be like assignments even when the function needs its own frame.

Do we delete the bind node and calls of an assignment if all the variables are deleted? I guess there's no reason why not...

Cleverer IR1 type-inference/optimization?

It seems that we really would like to know something about loops in IR1. Probably IR1 is the place to do loop induction for strength reduction and type inference. It might also be worthwhile to have IR1 do the flow graph frobbing necessary to allow loop invariant optimization in IR2. This would preserve the property of the flow graph being (mostly) unmodified in IR2.

I guess, also, that if we are replicating code in a loop so that we can make the first exit test an invariant, we would also like to be able to do constraint propagation to exploit this. Constraint propagation currently gets confused when there is a loop. If we had dominators info, then we could ignore predecessors that are dominated by the current block. Currently such blocks never have any of the constraints at the loop entry, so we lose all the constraints in the body.

Actually, it is seeming more and more like we want to do much of the loop and code-motion stuff on IR1. This is because understanding loop indexes is part of type inference (and thus wants to be part of IR1), and to do this, it is necessary to determine loop-invariant expressions and prove the equality of expressions. For example, loop-induction variable stuff needs to know that some non-constant values such as array dimensions are loop-invariant, so that it can compute the stride before entering the loop. And any sort of strength reduction definitely wants to be done on IR1... It is also attractive to do invariant motion of type checks before type check generation, so that even hairy type checks can be moved.

A sort of compromise that may work well is to do common-subexpression and loop-invariant stuff on "movable" (effectless and unaffected) functions in IR1. This information is already in the function database, and the semantically most interesting functions (arithmetic, bounds-checking) are movable. Movable functions are so well-behaved that we can be much less rigorous in doing code motion. We will need reaching-definitions information for set variables, but side-effect analysis is unnecessary. We can move the expressions we are interested in by doing a tree walk that generates new source, then reconverting that source in the loop header. The old code will be cleaned up by normal dead-code deletion (assuming we actually succeed in replacing all references.) This does imply that expressions with conditionals never become invariants, but that is pretty standard. The nastiest issue we have to worry about is "safety": ensuring we never evaluate code that wouldn't have been evaluated in the unoptimized program.
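A minimal source-level sketch of roughly what such motion would produce, following the doc's own example of a loop-invariant array dimension (names are illustrative, and whether the call actually qualifies as movable depends on the function database):

    ;; Before: the dimension isn't a constant, but it is loop-invariant.
    (defun scale-by-width (a n)
      (dotimes (i n a)
        (setf (aref a i) (* (aref a i) (array-dimension a 0)))))

    ;; After invariant motion: the regenerated source binds the invariant
    ;; in the loop header; the old call dies as unreferenced code.
    (defun scale-by-width (a n)
      (let ((width (array-dimension a 0)))
        (dotimes (i n a)
          (setf (aref a i) (* (aref a i) width)))))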
IR2 code motion will still be interesting for moving stuff that is affected by side effects: loop invariant motion of special references and other data-structure accesses, common subexpression elimination of cddring, etc. Some IR2 dead code deletion is probably also worthwhile, mainly to eliminate unused variable initializations.

Source tree ideas:

Also, we need machine specific source directories under "code:". These would hold source files that are specific to a given architecture. The system configuration (compile, load, etc.) files should indirect to standard files in the machine specific directories so that there can be arbitrary machine specific files, but the common configuration stuff can be shared.

Split code into directories according to what package it lives in: kernel, system, vm, debugger, extensions, ... How far to go? Might as well make it totally consistent. Note that vm might eliminate the need for machine subdirectories in some directories (such as compiler). It should eliminate them all, I guess, since any code that knows about machine specific stuff is supposed to be in VM. And all O.S. dependent code should be under system or a top-level specific system interface (e.g. CLX). Of course, imported applications such as CLX and PCL may violate this purity by having VM hooks buried in otherwise portable code.

Some ideas for how to decompose functionality in a Common Lisp system:

System interface:
    Implementation specific interfaces: Unix syscalls, CLX, ...
    Communication: foreign call, IPC, ...
    Foreign datatype operations (and definition)
    Common Lisp: Open, *TERMINAL-IO*

The kernel: put some things together to make the system go.
    Error system
    System initialization
    REP loop
    Event handling/scheduling (SERVER)
    Allocation/GC
    Streams
    Pathnames
    Package system
    Text stream utilities: read, print, format, time parsing, unparsing
    Portably defined data structure utilities: list, string, sequence, array, hashtable, character, symbol, equality predicates.
    Numeric code: bignum, complex, ratio, float, irrational, random. (Some escape routines.)
    Loader: fast and slow.
    Type system: typep, type-of, subtypep, o-o stuff.
    Type definition: deftype, defstruct.
    Syntax: macros, special forms, declarations, lambda.
    Syntactic environment: what macros, declarations, etc., are in effect.

VM interface: non-CL ways of accessing the mechanisms of the implementation. When properly used, these mechanisms are portable across all versions of CMU CL. The primary use is in integrating environment tools into the system.
    Operations on internal data structures used by the evaluator: functions, environments, stack frames, ...
    Access to the syntactic environment.
    Access to the type system.
    Alternative and lower-level (piecemeal) access to the evaluator: compile-from-stream, eval-server.
    System building: save, delivery tools.
    Primitives needed to write escape routines.

The evaluator: interpreter and compiler. Assigns meaning to code.

VM implementation:
    Provides primitives
    Knows about the instruction set and object representation
    Disassembler
    Mechanisms for defining primitives

Environment tools:
    Debugger
    Inspector
    Editor
    Describe
    Profiler

System management:
    DEFSYSTEM

Purify:

Special-casing high-indegree functions by grouping them is a more robust technique than a locality walk, since paging due to these functions will be roughly the same across all programs/usage patterns. A locality walk will arbitrarily place high-use functions near some single use, penalizing all other users.
Note that functions with either a high in-degree or a high out-degree are more important in transitions than other functions, since there is a relatively high probability of a random path going through them.

Idea: when optimizing Lisp VM performance, the goal should not be so much to reduce the "working set" as to improve responsiveness. The thing is that a "balanced" VM system today is so lightly loaded that the concept of working set becomes relatively meaningless. For any given task, all your stuff should be resident. The problems arise on transitions, moving from one program to another. That is when you notice poor "locality"; but the problem is not so much that it increases the size of the resident set, as that it prevents lazy evaluation of changes in the resident set. For example, when you really notice that GC is breadth-first is when you try to do "count lines" after a GC.

What is the relationship between VM locality and conventional secondary search structures like B trees? It seems that locality is an optimistic assumption (it depends on locality being there), whereas a B tree is optimized to function well in the face of random accesses (and provides highly predictable access time.) There is definitely a difference between the indications of the two. To maximize locality, you try to place related objects on the same page. Whereas in a B tree, the non-leaf nodes contain relatively unrelated (or no) information. The interior nodes contain important structural information that we anticipate we will probably want again soon.

I guess another way to look at it is to say that B trees maximize locality in the presence of random reference patterns. We mostly give up locality on the data (leaves), but we ensure that at least we can find it quickly. The structural information shows good locality because the general pattern of references to the indexing information is determined primarily by the lookup algorithm, and is relatively independent of which data is actually being fetched. We concentrate our probes on a small portion of the data which we can pretty much count on being resident.

Given that locality has different recommendations than access minimization, when is each strategy appropriate? Broadly, I think that the answer is to enhance locality where it exists, and to assume randomness where accesses become random. At the instruction, procedure, and even conventional program level, there is a great deal of locality. But when we get to large interactive systems, a lot of random noise is injected into the large-scale behavior of the program. After using the debugger, do you go into the editor to fix the bug, or go into the mailer to report the bug?

[When things aren't related, there is no mileage in trying to put related things together. I guess another way to put this is that for parts of the call graph with no obvious control or data connection, similarity of graph shape may be a better criterion for use-together than any semantic relationship. But maybe not, or maybe not enough so to make any difference.]

This suggests two things:
 1] The code and data structures related to subsystem invocation should be grouped. (Command interpreters, command tables, packages)
 2] The data structures intimately related with a leaf routine should be grouped with that routine. (Function symbols, Hemlock Command structures.)

Notes about fan-in:

The call graph is not tree-like, or even very forest-like. More functions have zero or one call than any other number, but a significant minority have many, many calls.
These high fan-in functions are generally either:
 -- Trivial near-leaf library routines (reverse, etc.), or
 -- Recursions (error, etc.)
Note: the call graph for any program with recursions is not a tree. How significant this deviation is depends on how far back up the tree the recursions go. In the case of eval and error handling, the answer is way up.

Put specials together to minimize dirty pages.
Put function symbols with the function to eliminate unnecessary faults.
Separate syntactic symbols (macros and keywords.)
Strip out all the meta-compile-time info in small cores? (parsed vops)
Make the dumper share structure and g-vector constants?
Avoid directly descending large tables. Don't descend long vectors. What about Hemlock string tables? Have a specific exceptions list?

After doing the initial roots, transport all code objects by finding some roots. Ignore keywords, macros, slot accessors, and other non-run-time stuff. Scanning from functions (rather than symbols) is something of a problem, though, since any given code object might be garbage. Whereas, with interned symbols, we know they aren't going away.

Have Lisp code somehow precompute the ordering for purify, using its ability to traverse all sorts of data structures? (info environment, setf functions, etc.) Only these obviously reachable functions would be transported on the first cut. I guess functions named by interned symbols are a subset of obviously reachable code. I guess we don't want so much to precompute the ordering, as to precompute which code objects are reachable, and which are roots. We just need a couple of bits in the code object.

Core file size and small deliverables:

We could use statistical compression on compiled code, and then have a dynamic code cache. Some nasty shareability issues, though.

We could also have a compiler switch that enabled named calls to be compiled as direct jumps. A bit of magic in the loader and purify would fix up these references. With even more magic + prohibition of auto-map calls, we might be able to have a totally separate stand-alone call sequence that used raw return PCs, etc.

Will want some tools for groveling core files to find out what stuff got sucked in and why. For stuff not in the kernel core, this would be better handled by load-time notes of what files we are sucking in and why. Pre-loading of non-kernel files might also want to be a controllable process, rather than mindless loading of all things referenced.

Although small heavily-used programs will want to avoid mapping, many small programs might well run with the expectation that they will map in the core. Startup will still take less time than starting a full core and loading the fasl. In effect, the stand-alone core loads itself into the full core, which should be faster than fasl-loading, since load-time operations have already been done and objects have been created. The only work is resolving symbol references.

Note: we also need some mechanism for configuring out parts of REINIT-time system initialization. Do we process switches, load init files, open standard streams, etc.? Otherwise, this stuff will always have to be written into stand-alone cores and always have to be run before the user code gets a chance.

#|
Actually, I have a better idea than auto-loading that also has the advantage of being easier to implement. Instead of auto-loading whenever some unexpected function is called, just map in the whole core file and jump.
This could even be handled entirely in C by introducing a closure-like thing that we stick in the definition cell of these auto-map symbols. We would keep the idea of a minimal kernel core + eager loading of probable references + tree-shaking + save, but the improbable references would just map in the whole core. Included in these improbable references would be INTERN and GET-INFO-VALUE, so we don't need the package or info database in any form for a stand-alone core. This means we don't need to implement new file formats for these databases (or any associated unloading/loading code.)

Even though the stand-alone program has been running, we can mostly get away with just mapping in the full core, since its stuff will be in different generations (different slices). There are two areas that I can think of where the memory maps (or at least the information) for the stand-alone core and the full core will overlap, and hence need to be merged somehow:
 -- Storage allocator state.
 -- Symbols.

Merging the storage allocator state seems like it would be relatively straightforward, especially if we clean out the ephemeral generation before we save the full core. We would just have to make sure that the slice allocator in the stand-alone core doesn't allocate any slices used in the full core.

As for symbols, I think that the best idea is to scan all scavenged spaces at map time, replacing all references to symbols that are in the full core with the full-core pointer. A few system symbols that name variables holding global databases may need to be explicitly merged by special-case Lisp code. For example, if the package system did make it into the stand-alone core, then we would just add any user packages to the package list after the map. This could be done by having the C code recognize these static symbols and (after the map) call a function in the full core with the old version of the symbol. The idea of having a dense unique integer encoding of symbols based on the symbols in the full core might still be useful here, since we could map a symbol in the stand-alone core to the full-core symbol with a single indirection.

If the package system didn't make it into the stand-alone core, then we would enter the stand-alone symbols into the package system in the full core. We would stick information about package name, external-p, etc. in the symbol-package slot. Symbol-package would no longer be a simple cell accessor; it would be an auto-map function call.

It might also be nice to have an incremental core save capability. We could then have a "just Common Lisp" core, and an additional incremental core with the editor, compiler, etc. Although it would probably be possible to map in an incremental core after we have already mapped the base core and run for a while, I think that this isn't necessary. You would decide at map time whether to map just the base core or the whole thing (from an environment variable?)

I'm not quite sure what the wins of having a small base core would be. I suspect that the biggest might be psychological and administrative. Better to have a 5 meg file and say "this is all you need" (with a 15 meg file lurking in the background.) There would presumably also be some improvement in VM locality for user code. I guess there would also be a pretty big win for people developing large systems, since they could put their system in an incremental save file, and not blow the space & time to save a full core image.
During development, the stand-alone mechanism is inappropriate, since you want all the environment tools around. It seems an incremental save-file mechanism might be nearly as good as stand-alone cores for many of our users, and would be easier to implement.

For best efficiency, it would be nice to have a Mach hook for finding out if a page is still mapped to a file. But this wouldn't be necessary; we could read the core file and compare the bits with the in-core stuff.
|#

There are two major reasons why it is difficult to write a "linker" for Lisp:
 -- The Lisp linker/loader is extensible due to top-level code.
 -- Many things can potentially be referenced, but often aren't by simple programs (bignums, the error system, the debugger, etc.)

Rather than trying to mash Lisp into a static linking model, how about using some proven Lisp techniques:
 -- Garbage collection
 -- Auto-loading

The first step is to try to make a minimal kernel core. This need not be the same kernel core used for debugging new compilers (which would probably want the debugger, etc.) All this core needs is the ability to load stuff, combined with the ability to auto-load any parts of the system that are necessary. Probably the simplest thing to do is to not even give this core a top-level R-E-P loop, so that we don't have to worry about spurious references to READ, etc.

We will probably want a capability for "eager auto-loading", where, when a file being loaded references a particular variable, function or type, we automatically load the appropriate subsystem even when it hasn't actually been used at load time. This will minimize the amount of auto-loading that happens at run-time in the normal case. For most stuff that is implicitly referenced by the system, like generic arithmetic, bignums, the error system and the debugger, we would also have a mechanism for saying whether to pre-load that stuff.

Once we have done our loads and eager auto-loads, we do a tree-shaking GC to discard all the stuff that isn't actually referenced by the program. For truly minimal core sizes, we could burn our bridges at this point and throw away the loader. But in most cases, we will probably want to retain the ability to do auto-loads. If we do retain the ability to auto-load, we could actually "unload" some functions that had been loaded and whose names are referenced, but which we think we won't need. This would generally be better accomplished by not eagerly loading the functions in the first place, but unloading would be useful for subsystems normally used only at load-time.

Note that our auto-loading will need to work somewhat differently from traditional schemes, since we would like to be able to throw away unreferenced symbols. What we need is a mappable database file automatically built from a full core that maps symbol name and package to information about how to intern that symbol (home package, external-p, etc.) and the object files where the definitions of that symbol are to be found. This database is really an out-of-core extension of the package system. When you do INTERN (or FIND-SYMBOL), and the package system finds no in-core symbol, it would look in the package database before creating a new symbol (or returning NIL.)

It seems that the package database should also hook into the info database somehow. Ideally, if you say (info function documentation 'format), the system should auto-magically suck in the documentation for format.
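A minimal sketch of the INTERN/FIND-SYMBOL fallback described above (LOOKUP-DB-SYMBOL and INSTALL-DB-SYMBOL are hypothetical names for the out-of-core package database accessors, not existing functions):

    (defun db-intern (name package)
      (multiple-value-bind (symbol status)
          (find-symbol name package)
        (if status
            (values symbol status)
            ;; No in-core symbol: consult the mappable package database
            ;; before creating a fresh symbol.
            (let ((entry (lookup-db-symbol name package)))
              (if entry
                  ;; Create the symbol with the recorded home package and
                  ;; external-p, noting the files that define it so they
                  ;; can be auto-loaded later.
                  (install-db-symbol entry package)
                  (intern name package))))))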
Probably what we should do is have a method associated with each info type that says how to unload that kind of info into the external info database. This can be done either directly (for documentation) or by reference (a file to auto-load). We might want to keep the external package database and external info database somewhat separate, partly to preserve the scoping of info environments. We could have a new "saved info environment" that goes and looks in a file. In the saved info file, symbols would be represented by an integer ID assigned from the package database. This ID could be put in the symbol header (and used for hashing symbols as well as looking them up.)

Similarly, William's idea of being able to auto-load debug-info is a good one. We could (conditionally at load-time) reduce debug info to a fixnum offset into the fasl file where the debug-info is found.

There can be problems with automatically deriving auto-loads from a full core, since a particular file might require some sort of initialization to have been done, yet that initialization is in another file. For a (somewhat bogus) example, consider loading a Hemlock command that uses some character attribute not defined in that file. Automatically resolving these dependencies is intractable, but representing any such dependencies for system code we want to auto-load seems doable. (PROVIDE and REQUIRE?) Note that these inter-file dependencies are almost always through some explicit global namespace (not just groveling some random global variable), so an extensible auto-load mechanism might be able to handle most of these issues.

VM definition enhancements:

Make representation selection so that it can choose different versions of arbitrary VOPs, not just move VOPs. Useful so that we can decide to represent fixnum values as untagged because we want them that way.

Spill TNs that are VOP args when packing result load-TNs?

Check that SCs specified for a restricted temp are a subset of the SCs allowed by the primitive type. (Requires meta-compile-time primitive-type information.) Give a warning if an unbounded SC is allowed?

Named VOP temporary mechanism. Only wired temps?

Mechanism in VOP for automatically emitting different VOPs depending on policy?

If defining a :conditional VOP, ensure that the appropriate codegen-info args are defined. In a :conditional VOP, check that the result type is Boolean? We are assuming this now that we set the Predicate attribute. Previously we could have :conditional templates for non-predicates: these templates just wouldn't be used when not used as a predicate. But this probably isn't very useful.

In VOP*, do run-time checking that a legal number of arguments has been supplied.

We also want a named temp mechanism for defining particular wired TNs so that we won't have to do all that SC, offset bullshit. We would have a new Define-VOP option :Named-Temporary. This would have the same syntax as :Temporary, but would default options from the stuff globally associated with that name.

Compiler tuning:

Note that clearly more than 50% of the code size of the compiler is in the back end, and this bloat buys us little. More "interpreted" implementations of code generation could massively reduce this space overhead with little deterioration in run-time (especially given the potential for increased cache hit rates...) Some points:
 -- Emit functions are potentially very similar (could be cached across unrelated VOPs.)
 -- The operand loading part of generator functions is very similar, and could be split out of the generator proper.
    Could have one such function per operand SC restriction signature. Note that this would eliminate the need to recompile the generator when only the SC restrictions change. Might be important...
 -- Many VOP generators could be pattern-matched as "almost straight-line code", and translated using a "threaded code" approach.

It might be a win to represent always-live info for TNs by a per-global-TN bit vector indexed by block numbers. This could efficiently and consistently represent the liveness of "highly live" TNs like environment-live TNs and TNs allocated inter-routine. Also, perhaps we want to move much of the conflict info out of the TN and into a CONFLICTS structure. Local TNs would just point to a local conflicts structure, whereas global TNs would point to a global conflicts structure. The global conflicts would contain local conflicts structures for each block the TN is actually referenced in, as well as the always-live bit vector.

The idea is that there is a big space win in using 1 bit to represent that a TN is live in a block, rather than hundreds. Also, this block-indexed representation of the blocks the TN is live in can be directly compared with the SB always-live bit-vector to determine whether there are conflicts in those blocks.

Totally eliminating the representation of what TNs are live in a block is a lose, though, since we need this info for saving across calls, debug-info, etc. It is a pretty big lose to compute this info from the TN bit vectors, since we would have to iterate across all (or most) TNs, checking whether each is live in that particular block. It also makes the flow analysis step intractable, since it is hard to tell what set to propagate to the predecessors.

Alternately, we could have a per-block bit-vector indexed by global TN numbers. This would allow simple bit-vector operations for flow analysis, and make generating the live set more efficient (especially if we have a tense do-set-bits macro.) This would require the same number of bits as the per-TN bit vector (modulo greater header overhead for whichever is more numerous.) Unless we changed the pack conflict representation, this would make packing very difficult, though. I think the pack conflict representation will have good properties for localized packing, so I am reluctant to flush it.

Instead, maybe we should compute both kinds of bit-vectors simultaneously. We do the flow analysis using the TN-indexed bit-vectors, but use a clever bit-vector operation that simultaneously iterates over the newly set bits. Each newly set bit indicates a TN that should potentially have the block's bit set in the block-indexed vector. I guess we would want both IN and OUT sets, so we are up to 3 bits x global-TNs x blocks.

But actually, the lifetime flow analysis is such a small part of the total lifetime overhead that speedups in the flow part are quite unimportant. Much more significant would be reductions in the per-TN-ref overhead of the pre and post passes. [Though this might be reduced relatively much more by compiling for speed, since it is much more structure intensive.] Having the block-indexed TN always-live vectors is separable, and much more interesting, since it not only speeds up pack, but also simplifies the conflict info and seems to have good properties for localized packing. Localities would have a mask bit vector with a bit set for each block in the locality.
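For instance, a conflict query against such a mask might look like this minimal sketch (TN-ALWAYS-LIVE and LOCALITY-BLOCK-MASK are hypothetical accessors for the block-indexed bit-vectors):

    (defun tn-locality-live-blocks (tn locality)
      ;; Blocks of the locality in which the TN is always-live; any set
      ;; bit in the result marks a block where packing conflicts can
      ;; arise.
      (bit-and (tn-always-live tn)
               (locality-block-mask locality)))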
In addition to providing a quick test for whether a given block is in a given locality, it can also be AND'ed with the TN bit vectors to find the conflicts in that locality.

Eliminate the TN-REF-NEXT-REF thread. Instead, encode this information with a constant in the VOP-INFO. This can be done reasonably efficiently by still representing the reference order directly as "read this TN", "write that TN", but with TNs designated by a VOP-local number. Before scanning a VOP, we put all the args, results and temps in order in a temporary vector. The conflicts info represents TNs with indices into that vector. We would need some indicator to represent the grouped reads/writes of more operands. Although this saves space and time just by squeezing out a slot in the TN-REF, it is a big win mainly because it allows us to use special "temporary" TNs for VOP temps. These would have no associated TN-refs and only the bare minimum number of slots. Temporary TNs can also be used for load TNs when operand loading becomes implicit again.

[### Another possibility is to represent the intra-VOP conflicts explicitly, rather than having lifetime analysis infer conflicts from the reference order. We might even want to change DEFINE-VOP to allow the conflicts to be directly specified (while still supporting the current mechanism), since it would be easier to understand.]

Note that with temporary TNs, we can pack them during the lifetime analysis post-pass, not actually storing the lifetime information anywhere. If we did this, then a wired temporary TN would have no variable information, and could be compiled into generators as a constant.

In the general case, we could do well by factoring out low-use TN slots into a separate structure that is only tacked on when we need it. Also, some slots are easily computed from other slots or from the context in which the TN is used. [But actually, it seems that currently well more than half of all TNs are temps, and most temps are wired. Until this overhead is massively reduced, all other tuning will be well down in the noise.]

Type issues:

Do we want some way for hairy types to become non-hairy when we see an appropriate definition? We could remember hairy types, then clobber some slot with the real type when we discover it. Some glue code somewhere would then need to squeeze out these hairy types. (This seems related to handling circular type definitions.) On the other hand, this also opens some nasty questions about exactly what the semantics of type definition in Common Lisp are. DEFTYPE type stuff?

Fix type-union to handle numeric types correctly. This isn't as important as fixing intersection was, since we aren't producing "wrong" results. But it would be nice for type inference if we computed the union of overlapping ranges of numeric types with the same class and complexp values. [### Actually, not true: we can form an exhaustive partition of a numeric type using union, in which case the union is equivalent to the type, but would not be considered as such.]

Is complex type parsing right? The weirdness there may be a hangover from the old belief that rational was a subtype of (complex rational).

Handle FP overflow/underflow in round-numeric-type. Open float bounds are fucked in general. An open float bound should be converted to a closed bound of the closest actual possible float.

Simple and complex arrays are considered to be disjoint, even when the hairiness of the type forces the complex representation to be used for simple cases (e.g. non-vector simple arrays).
This is probably wrong, but we are saved for now because there is no way to get an array that is definitely complex. Probably the right fix is to wimp out on type operations between definitely simple and definitely complex arrays unless the simple array type is complete enough to guarantee that it has a simple representation. [### Or represent complex simple arrays in a way that remembers that they are really simple...]

What to do about the cross-product types that can be represented using a CType structure, but not with a type specifier? This mainly happens when some attributes are specified, but others aren't. For example, if we know a value is a float, but not whether it is complex or not. [### I guess we could represent this with a union type...]

What about the AND and OR specifiers? Always make a hairy type when a values type is involved? Perhaps union/intersect should continue to do what they currently do, but people should try not to call these ops on values types unless they are in a context where they can make sense of values types.

We should recognize (values &rest t) and similar unspecific values type specifiers as being equivalent to *. This might be done by canonicalizing such types to *. Also convert (values foo &rest t) to foo?

Note also that the syntax for keyword args in a function type has been clarified, and probably in a way that doesn't imply keywordification, so the parser and existing keyword types must be changed.

Fix up the named constant handling code to always evaluate the expression at compile time, rather than only evaluating when it is known to be constant.

Fix things up to correspond to the cleanup proposal. The main change is in replacing the inlinep values with a general-purpose integration level. The globaldb compiler environment support also needs to be rethought.

Generational/incremental GC support:

There is a problem with schemes for using page protection for controlled scavenging: given a page, you need to be able to find all the boxed cells on that page. This seems difficult/impossible, given that the heap contains a mixture of boxed and unboxed words, and the beginning of a page may fall in the middle of an object, so we can't parse things by skipping from header to header.

The most plausible solutions to this problem involve having two heaps. In the most basic scheme, boxed and unboxed objects would be allocated in separate heaps. Any objects that mix boxed and unboxed are problematical. But the only such object is a code object, and users don't directly hack on those. Code that creates code objects can follow special disciplines for GC's benefit. Code objects are never modified after creation, so they don't need to be scavenged. Code objects aren't consed and discarded at high rates, so it is unimportant that they fall within the generational scheme.

Instead of just having boxed and unboxed heaps, we can do something different (and perhaps better) if we observe that the real problem is unboxed objects that cross page boundaries. If a page starts with a boxed object, we can find any unboxed objects on the page by scanning for header objects. Two-word unboxed objects will never cross a page boundary, since all objects are dual-word aligned. This is somewhat useful for single-floats. Larger objects less than a page long can be forced to the next page boundary when they cross a page.
This really only takes a couple of extra ALU instructions and a conditional branch:

    (let ((new-free (+ free size)))
      (cond ((zerop (logand (logxor free new-free) page-mask))
             ;; Same page: just bump the free pointer.
             (prog1 free (setq free new-free)))
            (t
             ;; Crosses a page boundary: start the object on the next page.
             (let ((next-page (logand new-free page-mask)))
               (setq free (+ next-page size))
               next-page))))

Unboxed (and code) objects more than a page long would be allocated in a special area. This area would not need to be scavenged, so we would never protect it so as to cause faults. Actually, we don't really need this to be a separate contiguous range of memory as long as we know which pages are unboxed. This could be recorded in some data structure on the side, such as a simple linear table that we stick each unboxed page address in. The advantage of having a separate area is that we wouldn't have to force the free pointer to the next page boundary after allocating a large unboxed object in addition to beforehand. There could be as much as 2/3 wasted space if you alternately allocated a page+1 unboxed object and a 2-word object. GC could compact this, though. But it probably isn't worth this much trouble to get one contiguous heap, since we are already assuming some reasonable VM support. But it seems like it would be worth going for the scheme that allows small unboxed objects to be allocated in the same area as boxed objects, since it would give better reference locality, allow efficient inline number consing without requiring two free pointer registers, etc.

Note: we can use a guard page to mark the end of the heap without having to do Appel's gross hack of interpreting the instruction stream to complete initialization. We just have a dedicated register for holding the object we are currently initializing. If we fault, then we copy the area between the new object start and the free pointer, fixing up the new object start. Then when GC is done, we just resume execution and allow initialization to complete. This will work as long as uninitialized memory is zeroed (through zero-fill or whatever), since we can scavenge the partially initialized object without losing. If we guarantee that the first word we write will cause the fault, then it doesn't matter whether we were initializing an unboxed object or not. This is done by writing last-word-first and special-casing objects that the compiler doesn't know to be less than a page long (as in Appel).

For early generations it may be better to explicitly zero oldspace rather than taking a hit for zero-fill. Both spaces for the first ephemeral generation should be always resident anyway, so as long as Lisp can zero memory as fast as the OS, it is better to avoid the fault overhead. [And also the overhead of the syscalls to cause oldspace to be zero-filled...] Explicit zeroing is only a lose when it is probable that the memory will be paged out eventually, or (even worse) is currently paged out. In the first case we do an unnecessary disk write & read to page it out and then back in again (to cons on it). In the second case, we do yet another spurious read to read in the page so that we can zero it. For later generations, zero fill is definitely a big win.

A note about recording cross-generation writes: in addition to possibly being basically too inefficient, just making a list of all of the write addresses has the unfortunate property that the overhead depends on the use of writes, not the use of extensive consing. So programs that follow the traditional rule that side-effects are better than consing will be doubly screwed.
This is basically a violation of the rule that overhead for a facility should fall on the users of that facility. Programs that do extensive consing can suffer whatever penalty they want on writes, but programs that don't do lots of consing shouldn't be affected. In the fault-based scheme, low-consing programs should have minimal overhead. Initially some pages will be faulted to make them writable, but they will stay writable for a long time, since there is no point in protecting them again at least until the next flip.

Correctness issues:

Timing windows?
init-sb-vectors

Is the DOLIST var required to be bound to NIL during the exit form? This loses with type declarations, since you have to allow NIL even though no sane person exploits this quirk. Some lossage also with DOTIMES, if the index is required to be = to the bound. In any case, the DOLIST expansion is not as safe as it pretends to be, since it CAR's the cons before the ATOM exit test. (So it also can give the var an initial value of NIL.) Probably the cleanest thing to do would be to ditch the setting semantics for these forms, but it isn't clear that that is legal. Also it would be odd in comparison to DO, since that must set the vars. And with DOTIMES it would make it very difficult to automatically apply a declaration on the index to the index computation (although it doesn't totally work even now.)

A more troublesome situation which might be worth a warning is the use of the "dangerous" package functions [CLtL p. 173]; if our compiler sees a top-level call to SHADOW, SHADOWING-IMPORT, UNINTERN, or UNEXPORT, it purges the affected symbols from the table of objects already dumped, so that any subsequent references will be interned with the modified package. We found this necessary to avoid some strange and unexpected behavior.

Handle recursive types.

Currently we never emit type checks for the values of unused continuations. Do we believe in this?

In GTN analysis, should we be more careful than just assuming that no known function can be part of a TR loop? A new function attribute, perhaps?

How should we do argument type checking with :safe templates? It seems that :safe policies ought to be just as safe as :fast-safe. If we never do argument type checking on :safe templates, then this won't be the case. The out-of-line routine might do a check for a less specific type than the one declared, whereas an inline check will always be for the exact type. But how significant is this really? Note that we never do type checking on full call arguments. This seems reasonable. Yet how does a :safe template differ? Not at all that I can tell. In either case, aggressive type inference using operation-specific information could fail to detect a type error because some type assertion early in an expression is used in the inference, but not deemed important to check by the operation. This issue needs some thought, but isn't an immediate concern. But whatever the decision, it will be relevant to all known functions, not just to those that happen to be implemented by :safe templates.

Note that this issue interacts with source-level generation of hairy type checks, since whatever our policy is, it has to be anticipated well before LTN runs, since the source-level generation of type checks will happen at the end of the IR1 phases. LTN is too late to flush hairy type checks due to selection of a :safe template or failure to select any template.
Staged transforms and inlining:

Replace the block and component optimize flags with:
    (current-stage 0 :type index)
Optimization is done when the current stage reaches the final stage. The final stage is the (constant) exclusive upper limit on optimization stages. Each transform (or delayed inline expansion) has a stage. During optimization, we only apply transforms whose stage is equal to the current stage.

Let *current-stage* be what the component's current-stage was at the start of this optimization pass. We reoptimize a block when:
    (<= (block-current-stage block) *current-stage*)

At the start of the optimization pass over a component, we set the component current stage to the final stage. At the end of optimizing a block, the block current stage will be the next stage that we need to optimize the block during. We minimize this into the component current stage. At the end of the component optimization, the component current stage will be the next optimization stage we should do (or the final stage, if we are done.)

We maintain a constant *current-stage* during optimization to avoid excessive optimization repetition. Once we reach stage 1, we continue to optimize stage 1 blocks at stage 1. However, during optimization of a block, we locally make *current-stage* be the block's initial current-stage. This way, we will optimize at stage 0 any newly introduced blocks we come across.

Whenever we introduce new code into a block (i.e. mark it for reoptimization), we reduce the block's current-stage to 0. When we join blocks, we maximize the final stage and minimize the current stage. Whenever we update a block's current stage, we minimize the result into the component's current stage. When we split blocks, the new blocks retain the same stage.

At the start of optimizing a block, we set its current stage to the final stage. Whenever we optimize a node, we minimize its current stage (after optimization) into the block's current stage. NODE-REOPTIMIZE becomes NODE-CURRENT-STAGE. Instead of setting it to T when we do REOPTIMIZE-CONTINUATION, we set it to 0. If the NODE-CURRENT-STAGE is greater than *current-stage*, then we don't need to reoptimize that node. Whenever we optimize a non-combination node (or unknown combination), we set the NODE-CURRENT-STAGE to the final stage. We determine the new current stage of a combination to be the lowest stage transform that we delayed as premature. If there were no delayed transforms, then we use the final stage. In addition to IR1 transforms, delayed inline expansions are considered an implicit delayed transform (see below.)

The transforms list for a function is sorted with lower stages first. In transform definition, stages are represented by symbolic names (keywords) so that we can introduce new intermediate stages without disturbing existing transform definitions. We can also add a sub-stage that is used to determine the ordering of transforms for a particular function within a stage. So a stage spec would be something like :INITIAL or (:OBSCURING 2.5).

Our stages will be something like:

Source transform:
    Stuff that happens at IR1 conversion time. We will still inline-expand unknown functions at IR1 convert time; there is no point in delaying, since we have no other way to optimize the call. We also still do semi-inline expansion of functions with no transforms at IR1 convert time. These semi-inline calls are eligible for true inline expansion in the local-inline stage (see below.)
During conversion of semi-inline expansions, further semi-inline expansion should be discouraged.

Initial: Type inference and constant folding. Any transform whose result is clearer than the original form (probably the default stage for DEFTRANSFORM). Non-VM-specific array transforms. Arithmetic transforms. If the original function had a derive-type method, then we should expand into functions with DERIVE-TYPE methods (or very specific result types.) The functions that we expand into should have the same attributes (movable, foldable, flushable) as the original function.

Obscuring: Obscuring transforms: sequence transforms, VM-specific transforms. Transforms that convert a simple (type-inferencable, movable, flushable, etc.) operation into control structure or sub-primitives that prevent further optimization at the CL level. In particular, transforms that replace a side-effect-free call with code containing side effects on memory (e.g. CONCATENATE.) These transforms are more like specialized inline expansion than semantic simplification. They generally increase the amount of code. [Type check generation is interleaved with obscuring transforms -- see below.]

Known-inline: Semi-inline expansion of functions with transforms (but that weren't transformed.)

Local-inline: Inline expansion of local functions (not explicitly declared inline) has its own stage, since we want to consider true inline expansion after other optimizations. We then have a much better idea of how big the function is, how many places it is called in, whether those calls are in loops, etc. In particular, we can actually back out of a decision to semi-inline for functions with multiple local calls (if the function didn't shrink as much as we had hoped.) We just clobber the call back to a notinline full call.

Finalize: Any transforms we want to put off as long as possible (Alien transforms?) Type check generation?

We can selectively suppress certain stages (obscuring transforms and inline expansion) to save compile time. Type check generation is done on the same component scan as obscuring transforms. Before doing obscuring transforms on a block, we do type-check generation.

After optimization has been done, the complexity of the code in a function is representative of (an upper bound on) the complexity of code resulting from inline expansion. We compute the cost of each new function, and record the expansions and complexity of any functions that are candidates for semi-inlining. Any recorded expansions will be applied according to their recorded complexity. The better our understanding of the code, the more aggressive we can be about inline expansion without causing excessive code bloat. To do cost/benefit analysis, we need some idea of the time benefit as well as the space cost. We would want to take into consideration the speed and space policy at the call location, as well as dynamic frequency indicators such as loop depth at the call site. Profiling call-count information can be incorporated here. Note that unless the cost of a function is near zero, we would never inline it merely in hopes of avoiding a function call. The point of inline expansion is to allow better optimization.

Another cue for inline expansion would be if arguments are constant or defaulted to a constant. "Good" constants are:
-- Known functional arguments with transforms or templates.
-- T, NIL or a (non-argument-syntax) keyword.

Automatic inlining of code can potentially lose if there are macros in EVAL-WHEN COMPILE, and similar stuff.
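As a sanity check on the staging bookkeeping above, here is a minimal sketch (hypothetical structure and function names, not actual compiler code) of the block-level rules: revisit a block while its stage hasn't passed the pass's stage, knock it back to stage 0 when new code appears, and advance it to the lowest stage of any transform that was delayed as premature.

  ;; Minimal sketch of the per-block staging rules described above.
  ;; The names are hypothetical; the real structures would be BLOCK/COMPONENT.
  (defstruct opt-block
    (current-stage 0 :type fixnum)
    (final-stage 4 :type fixnum))   ; exclusive upper limit on stages

  (defun reoptimize-block-p (block pass-stage)
    ;; Revisit the block when its stage hasn't passed the stage of this pass.
    (<= (opt-block-current-stage block) pass-stage))

  (defun note-new-code (block)
    ;; Newly introduced code must be reconsidered starting from stage 0.
    (setf (opt-block-current-stage block) 0))

  (defun next-block-stage (block delayed-transform-stages)
    ;; The block's next stage is the lowest stage we delayed as premature,
    ;; or the final stage if nothing was delayed.
    (if delayed-transform-stages
        (reduce #'min delayed-transform-stages)
        (opt-block-final-stage block)))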
We should back out of inline expansion if there are any errors or warnings during conversion. Automatic inlining should probably be suppressed for anything that has an IR2-convert method or templates. We can always conditionally inline particular known functions using IR1 transforms. Is there a problem with inline converting local functions with functions in their FENV that have been let converted?

###
#|
Analyze loaded system to determine compilation strategy, perhaps using profiling info. Break the system up into bite-sized pieces with lots of calls inside, and block compile each piece. Functions with little code are good candidates for inline expansion. Use debug info to find where definitions are when we want to compile them.
|#

We still need to arrange for :notinline references converted after the defun to reference the global-var rather than the functional. Also, it would be nice if there was some way for functions to be fully noticed, yet still called through the definition cell (necessary to implement normal non-block compilation). Probably we want to associate the function with the global var used for references to the variable. But we might want a different var for use with :notinline references, since :notinline means "assume nothing", whereas references to the variable would still want type checking (and thus would have the function type in the var).

[How about adding a Definition slot in the Global-Var that is set to the true definition when known. This allows all the right stuff to happen in the case where we aren't block-compiling, since IR1 finalize can still find the true definition in order to find out what type it really is. Notinline references are done differently: we make a new Global-Var for each notinline reference. This var is stuffed with default values, and isn't entered in *free-functions* (or anywhere else), so we can never acquire any assumptions about the var. This may cause trouble with the *inlinep* alist, though: once a var is :notinline, we never find the same var twice, so we can't find it in the alist. But it is probably wrong to key off of the var instead of the name anyway. Actually, if we do this, it seems that we don't need the Inlinep slot at all. This is because after IR1 conversion, the only inlinep value that matters is :notinline, and this would be represented by referencing a bogus var. On the other hand, if we went to a more general integration level concept, we might still want more information to control automatic inlining, which would be a post-pass to IR1 conversion. But probably not. It seems that we could get by with our inferred cost of inlining and the optimize tradeoffs in the policy.

Probably once we have smart inlining, the random :semi-inline and :always-inline values could be flushed, leaving only :inline, :notinline and NIL (default). The inline declaration just provides a local hint for how to evaluate the space/time tradeoff for that function: it doesn't tell you exactly what to do.

Misc tuning:

The "anchor-pointing" control optimization can be done even when the block contains stuff other than just the IF. We just have to copy all the code in the block. Obviously this could waste lots of space if always done, but it can also reveal optimization opportunities for constraint propagation, since we could have a copy of the code for the predecessors that have an available constraint, and another copy for the ones that don't. A real case where this can win is:
  (cond ((and foo bar))
        ((and foo baz))
        ((and foo bletch))
        ...)
Although users won't write this stuff too much, typecase can easily look like that. A structure predicate is (and structurep (or (eq test) hairy test)). Typecase could become just a chain of EQ tests if we know from a freeze-type declaration that there is no hairy case. And even if not, the structurep can be squeezed out (see the sketch below).

It seems fairly easy to do CASE optimization of integer (and perhaps character) tags into jump tables (or branch trees). This would be done in control analysis. Currently control analysis already successfully linearizes along else-chains. All it would have to do is notice a sequence of equality tests of the same (unset lexical) variable against character or integer constants. In addition to guarding against out of range values, we also add a guard to check if the operand is of the correct type (i.e. fixnum). There could be problems with doing this as late as control optimization. At the very least, we would have to explicitly do LTN on the guard code. Perhaps control analysis could be moved earlier in IR2. If the guard code might need to be optimized, then we have worse problems.

Let-convert XEP calls when there aren't any stray references to the XEP that might be converted. Doing this would require some additional mechanism for deciding that a closure reference is "hopeless": it could never be converted to a local call. If all closure references are hopeless, then we can let-convert the XEP calls. I guess a good test would be "argument to global function with no transforms". If we decided that a function was hopeless, then we could change all optional-dispatch entry-points to normal functions, and delete those with no references (should only be the last ep in a more-arg function).

Could change code generator for values defaulting to check if enough stack values are supplied, then emit the moves for all of them. If not enough values are supplied, then we call a miscop that defaults the appropriate number of values. This would be somewhat faster in the usual case where enough values are supplied, and would take less space. [But the number of receivers of more than three values is probably so small that the space is unimportant.]

Make IR1 conversion of special optionals less pessimal.

Make IR1 optimize cleverly use the Call attribute. [That is, get a worst case by combining the attributes of the actual functional args, rather than totally punting when Call is specified.]

Change TAGBODY (and BLOCK?) to preserve the drop-thrus in the original code. It would help control optimization if it somehow knew that IR1 explicit error code can't drop through, or at least that error code shouldn't be given a favorable drop-through. If we determined drop-throughs based on cost, then this would fall out if we gave error functions a high cost. But if we source-transform to %primitive, then the error calls become harder to recognize. I guess we could source-transform to another funny function that is known to be costly and takes the error code as an operand so that it could be defined to translate to an error miscop VOP. For IR1 flow analysis, though, it would be useful to know that a branch is truly impossible, rather than just unlikely. Perhaps we should do something about having someone know that throw and %continue-unwind don't drop through. Right now, we have to emit code (that will never be executed) which delivers 0 values to the result continuation. Would also be nice for error &c, so that IR1 flow analysis could know that those paths aren't possible.
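Returning to the structure TYPECASE point above, this is roughly the shape of code we would like such a dispatch to reduce to when the structure types are frozen and there is no hairy case (a hand-written illustration in portable Lisp; TYPE-OF stands in for whatever cheaper primitive fetches the structure's type tag):

  (defstruct point x y)
  (defstruct line start end)

  (defun shape-kind (s)
    ;; One STRUCTUREP-style test, then a chain of EQ tests on the type tag;
    ;; with enough constants the chain could become a branch tree or jump table.
    (if (typep s 'structure-object)
        (let ((tag (type-of s)))
          (cond ((eq tag 'point) :point)
                ((eq tag 'line)  :line)
                (t nil)))
        nil))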
[### Something else that we could do which would get most of the win of this would be to use a hash filter to determine whether a given structure (or CLOS instance) could be a subtype of a particular type. This scheme requires that the structure point to the type descriptor (or have the hash bits directly in the structure, in addition to the name). The hash of a type is derived from the SXHASH of the name, so that we avoid compilation order problems. We set a few randomized bits in the filter corresponding to every supertype of that type. To do the test, the compiler loads a (compile time constant) complement of the test type's hash, and tests whether the logior of the filter and the mask is -1. If not, the object is definitely not of this type. If true, we must do the actual scan up the inheritance chain. Note that in the common case of zero supertypes, this test will *always* work, effectively degenerating to a "has any supertypes" test. The chance of a collision (spurious match) increases with the number of supertypes, but remains small when the number of supertypes is small. Or I guess we could include the structure's direct type in the filter, and then always do the hash test, not trying for an EQ test first. This would increase the risk of collision, but would probably save time if the test is false. It might be interesting to discriminate between type checking (test always true) and typecase dispatching (often false.)]

Tuning (call):
-- Eliminate RETURN when there is only 1 non-tail call of a known function. [Generalize LET concept or a new kind of let?]
-- Stack allocation of lists, closures, tagged floats (?)
-- Don't pass old-cont in known calls when there is no unknown value usage (hence we can compute old-cont from our cont.)

A better way to represent simple call (call in the same environment): Instead of somehow pretending that the return of the callee is a predecessor of all of the call points, and letting the existing lifetime/conflict analysis stuff do its work, we should enrich the representation for conflict information. We add a new kind of global-conflict (:environment perhaps) that represents the fact that the TN conflicts with all TNs live in the specified function (and any other functions that the callee does a simple call of). This is kind of like always-live, but covers a larger extent. We then have to somehow annotate each simple function with the aggregate set of blocks that are in that function and its simple callees. Also, each simple function probably needs a back-pointer to all the :environment global conflicts for that function so that we can add all those TNs to the conflicts when finding the conflicts for a TN in that environment. The idea is that it is silly to use lifetime flow analysis to discover which blocks are in a called simple function, when there are simpler methods. This both saves the cost of flow analysis, and saves the cost of huge numbers of always-live conflicts. This also has the important property of keeping the flow-analysis sets pretty much the same size as when not doing inter-routine register allocation. In the old scheme, the set size in the simple function would effectively be multiplied by the number of calls to that function. Also, this opens the possibility of a simple IR2 representation for simple call. Instead of having to somehow force the call point to a block boundary, we expand the values for Vop-Info-Save-P to include something like :Simple.
The conflict analysis post-pass could notice this and use the current set of live TNs to set up the appropriate :environment conflicts and back-pointers.

It is worth thinking about how we are going to get the lifetimes right with "simple" call (call in the same environment). Things actually work pretty straightforwardly, much like TR call. The only sticky point is how we move the return values into the result continuation locations (possibly massaging them). This code would need to be in a block between the actual return from the function and the nominal return point (represented in IR1). This means we would need to introduce a block. One possibility would be to make this block explicit in the IR1 representation of local call from the beginning, but this would probably obscure useful information such as which calls are potentially TR. But if these blocks are introduced later on, they would most naturally be introduced in IR2 conversion; this would require some care, since people such as loop analysis may believe they already understand all existing blocks. I guess we want some sort of utility which, when given a node and some representation of where to link the block in, will make IR1 and IR2 blocks, using the node to make a new node for the new IR1 block that has the same source context. It would enter these blocks in all appropriate data structures. People could then freely emit whatever IR2 they wanted into the IR2 block, using the new node as the context.

[Another possibility: it isn't clear that anyone other than lifetime analysis wants a simple call to look like a control transfer. The weird flow graph will gag IR2 flow analysis, and the optimizations probably won't find anything interesting anyway. It won't be possible to do code motion out of the function into the caller (or at least to consider doing so would involve lots of extra work, since we would have to move the code to all callers.) The only possible win would be in side-effect analysis for code-motion of affected operations. Environment analysis has guaranteed that local functions don't access any variables that aren't arguments, so we don't have to look at the code, but if we did, we might discover that it didn't have any side-effects. So, if only lifetime analysis wants the call/return transfers to appear in the flow graph, it can instantiate them by special casing. The only thing we would have to do is force the local call to end an IR2 block so that the live TNs are easily available for flow analysis. This could be done during the pre-pass to lifetime analysis, where all other block-splitting is done. It is possible that the DFO computed without these arcs may work very badly for lifetime analysis with these arcs. If so, it would be necessary to compute a separate ordering of IR2-blocks.]

Pack reimplementation (localized packing):

The Costs, SC and Offset slots in the TN would probably all be flushed once loop-local packing is done, since this info would be loop-local. Associate a bit-vector with each loop that represents the set of blocks in the loop, including all nested loops and all code for functions called in the same environment. This allows a quick test for whether a particular block is within the loop we are currently packing, so we know whether conflicts in that block are relevant. To check for conflicts on a particular location in a given extent, we iterate over the global conflicts for the TN, using the bit-vector to tell whether the block is in the extent.
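A minimal sketch of that per-extent conflict test, with hypothetical structures standing in for the real TN and global-conflicts representation:

  ;; Each extent (loop) carries a bit-vector of the block numbers it contains;
  ;; a TN's conflicts matter only in blocks that lie inside the extent we are
  ;; currently packing.  (Hypothetical structures, for illustration only.)
  (defstruct gconflict
    (block-number 0 :type fixnum)
    (location-mask 0 :type unsigned-byte))  ; bit N set = conflicts with location N

  (defun conflicts-in-extent-p (tn-conflicts location extent-blocks)
    ;; TN-CONFLICTS is the TN's list of GCONFLICTs; EXTENT-BLOCKS is the
    ;; extent's bit-vector of block numbers.
    (some (lambda (c)
            (and (plusp (sbit extent-blocks (gconflict-block-number c)))
                 (logbitp location (gconflict-location-mask c))))
          tn-conflicts))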
We compute this set by doing a walk on the call/loop graph, adding the blocks directly in our loop into the set, then recursing on inferiors and OR'ing their sets into ours. Don't have to pack loops bottom-up breadth-first, since parallel loops are disjoint. We only have to pack all inferior loops before packing each loop: a post-order walk of the loop nesting. It also makes no difference which order we pack functions in as long as they have disjoint environments (modulo boundary conditions such as passing locations). If we have functions that share their environment, then the code in the function is nested within several loops. This causes problems, since packing loops surrounding one call would add conflicts in the called function which would interfere with packing more inner loops surrounding other calls. But this probably wouldn't cause much lossage, since loops within the called function would still get packed first. The only significant effect would be that one caller would be favored in allowing stuff to stay in registers over the call. And this is only an issue if we do implement such calls... Even dealing with this issue sub-optimally, it would surely be better than doing all that saving and restoring.

Constant caching: the current scheme of doing lifetime analysis for all constant TNs is losing pretty badly. Probably we want some kind of constant TN that we don't attempt to cache, and then we selectively emit cached constant TNs depending on policy and perhaps on a guess for the likely desirability of caching based on the number of references, loop depth, kind of constant (cost of loading), etc. For example, caching an immediate constant is probably rarely worthwhile (only when we can squeeze out any move from the cache register, since the immediate load would be as fast as the register-to-register move.) Note that we need to discover which TNs are immediate constants anyway so that we can give them special costs which allow packing in the constant SCs.

Perhaps have separately specified constant SCs for each primitive-type. These SCs would be in addition to the packed SCs when doing representation selection/pack. If a constant SC comes up best, then we don't attempt to cache. Note that specifying a generator restriction that restricts an argument to a constant SC causes costs to be computed that allow the operand TN only into the constant SCs. This would prevent caching. If you just want to emit an immediate variant of an instruction in the constant case, then the thing to do is to use only one generator that explicitly allows the immediate SC, and then do a sc-case off of the operand within the generator body. Of course, it is silly to inhibit caching of a constant just because someone isn't prepared to use the cache. Such users can still use the constant directly even when it is cached. I guess the way to reflect this would be to have the costs for the ref be all zeros, saying we don't care what the SC is, but when selecting the generator we still don't consider these other SCs to be possible. Specifying 0 move costs into the constant SCs would have the first effect, but would also allow the immediate generator to be used for any SC.

[### But it probably isn't worth trying to fix restrictions to constant SCs, since it doesn't seem to be useful. We only need the power when an operand being in a particular constant SC modifies the costs for other operands. This doesn't seem very likely. It might be useful to make the write-p slot encode more information.
How about making it a "Kind" field, having values such as :Argument, :Result, :Temp-Read and :Temp-Write?

If we pack from inner loops to outer, one loop at a time, then it is unimportant (undesirable?) to consider loop depth in costs. This simplifies cost determination stuff. This isn't a pessimization: in our new interpretation, the purpose of costs is determining the best representation for a value within a given pack extent (loop). Weighting by loop factors represented the desirability of favoring the inner loop if its FSC differed from that in the outer loop. In our new scheme, each loop can have its own FSC. Since at any given time, we only pack TNs at the same loop level, all TNs are equally "important". Even supposing we wanted to use some cost-based "importance" to determine packing order, the loop depth would play no role. I guess the old scheme did provide a way to trade off the enhanced importance of an inner loop with important things outside the loop, which the new algorithm is fundamentally unable to do. But it is unlikely that this loss offsets all the gains of more localized packing.

When packing in a loop, consider all conflicts for the TN in blocks in enclosed loops, as well as conflicts for blocks directly in that loop. This allows packing of inner loops to totally ignore TNs that aren't referenced within the current loop or its inferiors. In other words, when we pack a TN, we pack it in that loop and in all inferiors of that loop, instead of just in that loop.

The "voting" algorithms described in RAOC attempt to avoid the "bad" and favor the "good" by a local examination of the preference/conflict graph for the TN being packed. The basic problem is that at the time we go to pack a TN, we only know stuff about TNs that have already been packed; we don't know how the remaining TNs will be packed, so we don't know how our packing choice will affect the other TNs. In the case of "register allocation", we don't know which location (if any) we can pack that will result in less ultimate cost than in local savings. In the case of "preference optimization", we don't know which preferences we will be precluding by adding additional constraints, since we don't know where unpacked TNs are going to want to be preferenced to. Things seem to be especially yucky in the case of preferences; pack doesn't deal very well with n'th order preference chains.

It might be interesting to consider if we could determine good preference equivalence classes in the absence of scarcity constraints. This information can be used directly by unbounded pack, and also might be put to good use by finite pack in determining what preference relationships we would like to exist. The algorithm suggested for unbounded pack is of the right character, but it doesn't consider preference/conflict issues at all.

Another thing that we might do in a pre-pass to pack would be to try to get some sort of accurate picture of the resource demands of packing each TN; the number of TNs that it conflicts with and their aggregate cost. This could be used to get a pre-pack estimate of the TNs that should probably not be packed in registers so as to avoid precluding other packings. If we did this kind of stuff, the actual pack phase might look somewhat different. It might start to look more like the "graph coloring" algorithms. What we would do is find the sets of TNs that would like to be packed in the same location for preference reasons, disregarding scarcity constraints.
We would then attempt to pack the entire set of TNs in a single (scarce) location. If this isn't possible due to conflicts, then we either pack some subset of the TNs, allowing the remaining TNs to be packed in a single unbounded location, or we unpack some TNs already packed in a scarce location (moving them to their preference-set's corresponding unbounded location). We could also do some combination of unpacking and subset packing. What we might do is initially pack the subset of the TNs that we think have a modest resource demand, and then attempt to pack TNs with more conflicts later on, after we have packed everything else.

A central idea to this is that we assign a stack location to each preference set as though all of the TNs in the set are actually allocated on the stack. Then we can choose to pack any subset of the TNs in the preference set into a register, without worrying about interactions with stack packing. In particular, load-TN packing doesn't really have to do anything at all to eject a TN from a register. All it has to do is change the TN to use the stack location for its preference set. I guess this amounts to regarding registers as a cache for the stack. Perhaps this is a useful way to think about it?

### Is there some good order to consider TNs in that will cleverly exploit the locality in the code? For example, sorting TNs by the "time" they first become live. We might use a particular register to cache a particular preference set over a certain time interval.

We would never combine TNs into a preference set if they have different FSCs (or perhaps different FSC SBs), since we really don't want to honor those preferences. And of course, there are no conflicting TNs in a preference set.

One way that we might consider preference/conflict issues in an unbounded scheme would be to find sets of TNs that "want" to have their preferences honored, with only partial consideration of conflict issues. We would consider "attractive" preferences: preferences between TNs that don't conflict. We could then label TNs with the component they appear in in the transitive closure of this graph. When determining "attractive" preferences, we might want to try a little harder to consider preferences that seem "good", for example if a TN is preferenced to TNs that mutually conflict, then we would choose which preference we will attempt to honor, and (at least initially) disregard the others. The crucial point is that we want to throw out enough preferences so that there are a fairly large number of components in the closure of the graph, with each of these components representing TNs that are "fairly strongly" connected. How many preferences we need to throw out depends on how strongly connected the preference graph is once we throw out preferences with direct conflicts. If we consider local preferences at this point (which we would have to), then there are going to be lots of preferences. In this scheme, the local/global preference distinction becomes pretty meaningless. If conflicting TNs appear in a component, then we split the component up so that there are no conflicts. This would basically amount to doing packing separately on each collection of TNs that we believe have interesting preferencing relationships.

So the algorithm might look like this:
-- Find the "ideal preference set". This is a collection of TNs that we think would all like to be packed in the same location, but might contain some second-order conflicts.
-- Split the ideal preference set as necessary in order to eliminate internal conflicts. One algorithm might be to rank TNs by the total strength of their preferences within the ideal preference set, and then iterate over the TNs in the set in that order, adding all possible ones to an actual preference set, and rejecting TNs that conflict. Rejected TNs would go back to the beginning of the whole packing process.
-- Squish preference sets into scarce locations, as mumbled about above (subset packing/unpacking).

How about: Representation selection is handled by the initial FSC determination. We select generators before pack, and don't change them. [Which makes conditional packing feasible again. Oh well...] Maybe we want to assume that before pack we determine two SCs for each TN: a finite SC and an unbounded SC. The finite SC may be omitted if the unbounded SC is better. The unbounded SC may be omitted if the TN is restricted. The thing that we are giving up here is the possibility of attempting packing in several finite SCs before packing on the stack. This doesn't seem to be an important limitation.

Probably what we would actually do is have an FSC and an NFSC, which might be the same. We compute a "goodness" for each TN that is the difference in the costs of the FSC and the NFSC. We also compute a "badness", which is the sum of the LTN counts of all blocks that the TN is live in (giving some measure of the amount of code that the TN is live in). Then we compute the preference sets. We rank preferences by the benefit (move count * move cost in the FSC) divided by the sum of the "badness" of the preferenced TNs. We then iterate over the ranked preferences, putting the preferenced TNs in the same preference set when the TNs don't conflict and have the same finite and unbounded SCs (or had those SCs missing). But this can result in second-order conflicts within the preference set. We can either compute the aggregate conflicts for the set, and only combine when we don't conflict with any TN in the set, or we can remove conflicts after the fact.

Maybe it would be useful to have a sparse representation of the aggregate conflict set of the TNs in the preference set? Perhaps represented with global-conflicts structures (or similar). A list sorted by block number, with either :Live or a LTN conflict bit-vector. An advantage of this is that adding a TN to a preference set would basically involve merging the global-conflict lists, rather than actually computing the conflict set.

Using this information, we could consider simultaneously packing all the TNs in a preference set. The advantage would be that we could efficiently compute the aggregate conflict set from the merged global-conflicts lists, instead of having to iterate over the conflicts of each TN that we pack. We would iterate over the TNs in the preference set, packing each TN that doesn't have a conflict for the register in that register, and letting the rest go on the stack. Of course, this would give spurious conflicts on registers, since TNs that went on the stack would still have their conflicts in the corresponding register. But this might work well enough, especially if we pack the preference sets in an order determined by the aggregate ranks of the TNs in the set. This eliminates the business of "infinite ranks" and "urgent TNs", which I don't really understand or see how it works.
This really amounts to making all VOP temporaries "urgent" from the time they are allocated, eliminating the need to decide when the TN becomes urgent due to SCs filling up. But maybe we kind of have to really do the same thing anyway to keep track of SCs that non-restricted TNs have been unsuccessfully packed in [But no, since we can record that when the unsuccessful packing attempt is made. The problem with urgent TNs is that we need to anticipate failure of packing attempts so that we know a restricted TN's second choice SC is in fact unavailable, and thus must be packed in its first choice.] Note that there probably won't be many such TNs, since most VOP temporaries will be wired to specific locations, rather than packed.

Some kind of penalty for packing like the P@-(-) in RAOC is probably worthwhile. The idea is to disfavor packing TNs that have lots of competing expensive TNs in their conflict set. Using the average positive cost of packing conflicting TNs seems to be a reasonable basis, but we need a factor that indicates the probable number of TNs excluded.

Another possible angle to approach this from would be to use the per-location costs obtained from location selection, abandoning packing the TN in that SC if we can't find a location whose cost is more positive than the negated rank of the TN. [Or if we added the rank to each possible location, a location with a non-negative cost.] This approach seems like it is a lot more accurate than the one in RAOC, since it actually looks at individual locations.

Instead of permanently rejecting the TN from the SC, we could just knock its rank down somehow, in the hope that we would find we could pack it after all (as in RAOC). This seems like it might be a bit ugly, since we would have to (at least sort of) keep this penalty up to date. We don't have to do a very good job though. Although ideally we would like to reduce the penalty for conflict-set TNs whenever we pack a TN, we don't really have to. This may result in ranks being artificially low, but at least the TN will get a second chance eventually. We could just have a penalty for each TN that is initially 0. If we don't pack because there is no location good enough, then we store the cost for the best location as the TN penalty, add it to the rank, and re-rank. Since this would result in negative ranks, the effect would be similar to setting the TNs aside until all other TNs have had a chance.
|#

Here are the location selection techniques, roughly in increasing order of difficulty:

1] We favor locations containing TNs that we are preferenced to with the strength of the preference. We use both the global and local preferences in location selection. This is trivial: we just iterate over our preferences, voting for the locations of packed TNs.

2] Favor packing in locations that an unpacked TN that we *don't* conflict with and are preferenced to is preferenced to, in the hope that we can pack the other TN in that location and honor both preference links. The potential benefit is the sum of the strength of the preferences, but it isn't certain that we will be able to honor the preference, so we should reduce the benefit to reflect this uncertainty. [Not true that benefit is the sum, since we can hardly take credit for causing the second-order preference to be honored. The benefit is at most the first-order preference.]
For each unpacked TN that we are preferenced to and don't conflict with, we favor all the locations it is preferenced to and can be packed in (as determined by the per-location conflicts.)

3] Disfavor packing in locations that an unpacked TN you conflict with is preferenced to. If we pack in a location that a TN we conflict with is preferenced to, then the preference cannot be honored. The penalty is proportional to the preference strength, but we might want to make it less than the whole, since we might not honor the preference anyway. Finding the conflict set for global TNs is somewhat ugly. We have to iterate over the sets for all the blocks the TN is live in, rejecting duplicates. This is acceptable, since we have to know the full conflict set for the TN we are going to pack anyway. [This is kind of a special case of 4. It says that we disfavor locations extra-strongly if a conflict is preferenced to the location.]

4] Favor packing in locations that hold TNs that conflict with unpacked TNs that you conflict with. This is so that we don't gratuitously narrow down the packing opportunities of TNs we conflict with. If we pack in a location that TNs in our conflict set are excluded from, then we don't reduce the number of places they could be packed. The benefit is uncertain, but would be proportional to the rank of the excluded TN, since if we did exclude it the cost would be increased by that amount. Using the conflict set of the conflict set is probably not reasonable, since we don't have this information conveniently lying around -- fortunately, we don't need the second order conflict set. Instead, we iterate over the unpacked TNs in the conflict set, favoring locations that they cannot be packed in. We can easily determine this by indexing the per-location conflicts bit-vector with the TN number of the TN in the conflict set.

[To support packing rejection, we would want to change this rule to *disfavor* packing in locations that your conflicts *don't* conflict with. (i.e. disfavor locations that your conflicts can be packed in.)]

[The penalty can be the conflicting TN's rank divided by the number of locations in the SC that it might be packed into. Assuming we randomly assign to a location, this is the probability it will be assigned to that location. When a preference? It seems our information about all possible packings of the TN in the SC could be used to scale the preference penalty.]

The "assumption of infinite registers" used by PQCC seems like it may be a lose on the RT. The assumption that they make is that they can get away without allocating registers for operand loading, under the optimistic assumption that they won't be necessary. It looks to me like they get away with it primarily because many architectures don't require register operands in most cases, so even if the arg TN doesn't make it into a register, they won't have to allocate a load register. This is definitely not the case on a load-store architecture. A possible solution would be to get Pack to allocate operand loading TNs.

The only real trick in unpacking is choosing a TN to unpack that can be unpacked with minimum disruption and pessimization. We only need to consider unpacking TNs that conflict with the one we need to pack, and obviously we would prefer to unpack the least important such TN. We must avoid unpacking TNs that we may not be able to repack: TNs that are restricted to finite SCs. Probably this can be done with a simple modification of the location selection algorithm.
When we throw the switch, packing in a location which conflicts with an unpackable TN would be given an appropriate finite cost, rather than an infinite one. Once we have decided on a location, we eject any TN that we conflict with. It is possible that the load TNs for the victim will force unpacking of yet another TN, but this process should be rare and limited, since each unpacking increases the odds of successful load TN packing due to replacing TNs that have long lifetimes with short-lived load TNs.

#|
-- Consider doing assembly-level optimizations to eliminate unnecessary reg-to-reg moves, loads from stack or constant pool, and perhaps stores to the stack. Use TN liveness info to tell when a move is futile, etc. Perhaps do intra-block lifetime analysis and register allocation on the assembly code *after* code reorganization? In addition to making it easier to do better reorganization and getting some intra-vop optimization for free, this also eliminates the need to specify lifetimes in VOPs, which is perhaps the most error-prone part. I don't think this will require any instruction set info that isn't already needed for reorganization, i.e. readers and writers.
-- New assembler that does pipeline reorganization.

Figure out when it is actually an optimization to move register saving to writers. Current heuristic of doing it whenever there is a single writer loses in some cases (such as TAK) where the writes may be executed without the call happening. Probably the cleverest thing to do is to be more conservative about the motion, trying to move to some intermediate place that is provably good. This could be done using dominator info. If there is some save that dominates other saves and is dominated by the only write, then flush the unnecessary saves. Note that a restore can also be flushed when there is no reference to the register before a following restore. Possibly this could be integrated with the local packing algorithm. A TN could be packed on the stack within an inner extent, and still be in a register outside. We want to identify single-entry extents with finer granularity than loops. Traces perhaps? Anyway, this is all way down the road.
|#

Break code up into nesting of interesting parts:
  function (that has environment)
  simple function (registers allocated inter-routine)
  loop
  trace

We start allocating in the most probable trace of the innermost loop of a leaf in the call nesting. In a simple model, every TN would still have a save location and a normal location, which would be globally allocated much as they are now. The interesting action comes in with what is currently save-generation/load-TN allocation. We would then concern ourselves mainly with determining when in the trace to move between the save location and the main location. At any entry to the trace, we might require either one or both to be valid. Depending on the state of the TN in the trace we are coming from, we might have to emit a load or a save. Note that entries are always block starts, so we can place info about what we expect where in the ir2-block.

We would also need to allocate registers to load constants into, and will also occasionally need to allocate local copies of register TNs to satisfy an operand restriction when the main TN couldn't be globally allocated to a register. These TNs are analogous to the current load-TNs, and would be allocated on the fly during a pass through the trace. If we run out of registers, and thus can't globally allocate a register for the TN, then we have to start getting clever.
One thing we can do before resorting to on-the-fly packing would be to try to find a location which we can "globally" allocate the TN to just on this trace. This can win if the global allocation bottleneck is somewhere off-trace. But of course, that can fail. Ultimately we are reduced to some sort of on-the-fly load temp allocation and possibly localized spilling.

The on-the-fly allocator would try to do most of its work at calls, since they force spilling to be done. The trace is effectively divided by calls into a sequence of segments. Some TNs that are live in a given segment may not be used in that segment. There is then, of course, no point in restoring the register after the call, and the register is thus free for the duration of the segment. We could then load some constant or TN for as long as the duration of the segment. Of course, there is no point in allocating the register for longer than the interval between the first and last use of the TN, so a register might be used for several values in a given segment. Even so, there is always the worst case of being forced to spill a register for a load TN. We could do this a lot better than we currently are by trying to spill for as long as possible, since the next VOP will probably need a register too. In some cases, one of these load TNs could be live at a block boundary, and thus at a trace entry point. So on entry to a trace, it may be necessary to load constants and variables into load TNs.

Caching constants is an interesting issue. Lisp probably has more to gain from caching constants in registers than many other languages, since Lisp programs use lots of non-immediate constants. In addition to the localized elimination of duplicated loads of constants (which would fall out of load TN allocation), it would also be useful in some cases to move the loading entirely off-trace. If a constant we need is cached at the head of the trace, then it is justified to force any other entries to load the constant so that in the normal case no load is necessary.

Probably in loops without calls, we should move much more toward a conventional loop-based register allocation scheme. In a loop-w.o.c., we would try to "globally" cache constants for the whole loop, and try to allocate TNs for the duration of the loop when we can't allocate them globally. These loads could percolate out through several loops, since loading would be fobbed off on the trace containing the loop, which would try to move it out some more. In traces that don't correspond to a loop, it is not justified to force loading into the trace we branch off from, since that would at best save space, and would probably slow things down.

I see load TN packing as working in two passes over the trace. First a backward pass with lookahead to see if there are any earlier uses. This backward pass allocates the TNs and annotates the TN-refs. Then there is a forward pass intermixed with code generation which calls the load function when it first sees a TN ref claiming that a particular TN should be loaded into a particular load-TN. On the backward pass, we would drop all of our load TNs when we see a call. During this pass, we also determine the saving policy for globally allocated TNs (register-only, both, or stack-only). On the forward pass, we don't need to do anything special at calls for load-TNs, but we save live register-only TNs before the call, and load live "register-only" or "both" TNs after the call. I think that callee-saves registers wouldn't be too hard to fit into this scheme.
When we fail both to globally allocate and to trace allocate, we can spill a callee-saves register for the duration of the trace, and trace allocate in that register. Callee-saves somewhat diminishes the interest of save optimization (the segment concept), since you would hopefully be able to put the good stuff in preserved registers. [But not on an xx86...]

Alien cleanups:

There is something of a mess with alignment and address units in the alien code. Probably we eventually want to have a word be 32 bits, and eliminate all old 16 bit assumptions. Also, if the 32bit system accessors don't allow access on 16 bit boundaries, then they should be changed to take a 32bit offset.

Naturalize & Deport integer functions don't support all the accesses that the compiler transforms do. In particular, they don't support fields that cross a 16bit boundary. Also, William found some kind of bug in the general boolean accessor functions, I believe. (bit order lossage perhaps? Or maybe a bug in XXX-integer) Seems to be on access.

Alien cleanups: Give more substance to alien-types:
-- Require them to be defined (represented in an alien-value as a structure)
-- Make alignment be a property of a type (reduce or eliminate inferring of alignment by compiler)
-- Parameterize alien code for varying word sizes (alignment), address units and byte orderings.
-- Change byte/word/long-word terminology to be less historical. Perhaps have short-words, words and long-words, which would correspond to the local sizes of C shorts, ints and longs.
-- Use ASH and AND to find address/offset instead of truncate/*.

Also change the sap-ref operations to be named byte/short/word/long? Make both signed and unsigned variants of SAP refs. Change %primitive SAP refs to the function form. [At least for those truly referencing SAPs. Probably non-SAP uses should be flushed. In new obj format, use on strings, etc, is incorrect. Anyway, string refs should be just as efficient.] Have deftypes for byte/short/word/long? (a bit sleazy for byte, since it is a CL symbol)

And what do we want to guarantee about the resolution of SAPs? It seems real useful to be able to point to bytes on byte-addressable machines, but we might then have problems if we ported to a non-BA machine such as the Cray. [Of course, if the machine isn't BA, then the whole lowtag scheme goes out the window, but screwing lots of source code is even worse.] I guess that even on a non-BA machine SAPs could be byte pointers, but it wouldn't be very useful unless it was compatible with the local char* format (which should be doable.) We can assume our goal is interface to C here, since the code manipulating SAPs will be pretty Unix dependent.

Random stuff:

There are some problems with the current INFO interface. In particular, clearing info isn't very useful, since it may allow random garbage to be inherited through. What we really want in the clearing usage is to say "blast this back to the default", which may involve creating new entries to shadow inherited ones. The only true clearing operation that makes sense is to totally clear an environment (revirginize the compiler, etc.)

We need some kind of exportable interface to dicking with the INFO environment. With-Info-Environment, Make-Info-Environment, etc. Also, it seems a bit semantically bletcherous that some of a class's info types can be inherited while others are shadowed. It seems that the semantics we want is more like there being a "record" for each info class, with the types being the slots.
An info record is inherited or shadowed in its entirety. ### But not necessarily... What about inlinep/inline-expansion? Declaring a function to be inline shouldn't cause any global expansion to be forgotten.

Linking named calls:

I have come up with a named call linking scheme that seems to provide a big enough win over what we are currently doing to be worth implementing some time soon (i.e. next 6 months.) I think something more like the Maclisp link-snapping may be a good idea, i.e. have a general moderately slow call that can be smashed to be a direct jump.

These are the three main problems with general call that make the call/return control transfers complex:
-- The callee might move due to GC, so we have to fix up any hard references to the callee on GC.
-- The returnee might move due to GC, so we can't use a raw return PC or a hardware call instruction.
-- The callee might change due to redefinition, so we have to somehow allow for updating of hard references to a given function.

...but in a released (i.e. purified) system, none of these things happen. This suggests snapping links at purify time, rather than dynamically on a demand basis. Doing it at purify time allows us to be sure that neither the caller nor the returnee will move. It is also preferable for shareability reasons to snap all links at system build time, rather than waiting for calls to actually happen. Any calls to undefined functions would be left unsnapped (and a warning would be appropriate.) In other words, Purify will become a "linker" in the conventional programming language sense, but it will be possible to run programs before they are linked.

This lets us have maximally tense code in the caller, something like:
  cal LEXENV, pc-low
  cau LEXENV, pc-high
  balrx LEXENV, PC
  cal CONT, new-fp, 0
[In other words, an absolute branch using the hardware call instruction to supply the return PC. Except for multiple values hair and register saving, this is as good as what would be used in C.]

Single value return would be:
  load PC
  increment PC by N
  branch to PC
  load OCFP into CONT

Return PCs will look like fixnums if word aligned, or might look like other bogus things if not, but these bogus things can be ignored by GC because they would point into read-only space. Whatever they are, GC can just ignore them, since the code isn't going to move.

That part is simple enough, but how do we make it like Lisp? i.e., how do we represent unsnapped links and the information needed for snapping. Also, it would be nice if we could unsnap links to patch a purified core. And we want to be able to relocate code, not only so that we can GC it, but also so that purify can move stuff around. Although it's not necessary in principle, the ability to GC code is pretty fundamental to our approach to building systems.

An unsnapped link is represented by jumping to a "trampoline", which is a chunk of code that more-or-less fetches the definition and does a FUNCALL. There is at most one trampoline per function name (symbol, setf function, clos method, etc.) Trampolines can be made smaller and faster by incorporating more knowledge about the actual definition into the code (i.e. what is the definition, and is it a closure), but this requires GC to "scavenge" them. Trampolines are allocated in a special area, and are not directly subject to GC (but purify can reclaim them if the function isn't on the stack). This eliminates the callee side of the GC problem.
The returnee side is handled by having the trampoline convert the raw return PC into a code object pointer and fixnum offset. This pointer and offset are saved *in the returnee frame* in dedicated stack slots. The trampoline then does another call instruction to the target, passing a return PC pointing into the trampoline. At the return point in the trampoline, we reconstruct the original PC and jump to it (preserving the PC, PC+N values magic).

So this solves the basic GC and incremental redefinition problems, but we still don't know how to snap or unsnap links. This requires additional linkage info, which will be hung off the code object. This is just a simple-vector of function names alternating with fixnum offsets of the calls to the named function. There can be any number of offsets per name, as long as names can't be fixnums. This linkage information can be deleted after purify uses it, or it can be left around so that links can be unsnapped. Unsnapping a link requires a scan of all memory that might contain code objects, which is expensive, but not too expensive in the context of testing bug-fixes. Or Purify could invert the code object fixups into a who-calls database indexed by name. This would allow efficient redefinition and could also be used for a user level who-calls facility.

Compared to the old link-table scheme and similar indirection based approaches, a purified system (snapped links) is smaller and faster, whereas an unpurified system is larger and substantially slower. Allowing for redefinition in a purified core costs space, but is optional. A who-calls database is a fringe benefit. Probably we will want to make two cores whenever we do a release. One has full debug info and the who-calls database that allows redefinition. The other has small debug info and no support for redefinition. This would certainly help the space problem on the MIPS, and would also eliminate the need for some of the optimizations to static call we had discussed, since this would make every call better than the proposed static call. William or I should be able to pull this off in under a month (probably mostly William, since there's a lot of C work.) How urgent this is depends on what we find out from VOP space usage profiling.

About callout to C: And how does call back from C work? Probably we have some sort of glue operation that takes a C-XEP and returns a statically allocated thing that can be called as a C procedure pointer. A C-XEP will look rather like a normal function object (to minimize format complexity), but will expect incoming arguments in the C integer/float registers and on the stack top. The C-XEP will grab the args and then call the real internal function.

Interface is something like:
  DEFUN-FOR-C Name (Return-Type Option*) {(Arg-Name Arg-Type [Mode] Arg-Option*)}* Form*
In other words, almost the same as DEF-C-ROUTINE, except that you specify some code which becomes the body of the foreign-callable function. The args are accessed and bound to the named variables. If the return-type is not VOID, then the first value is the C return value. Additional values are used as the results for output reference parameters. Name is DEFPARAMETER'ed with the C procedure pointer, since the pointer isn't a valid function.

About "miscops": Miscops-in-lisp: miscops will be generated by the compiler, and mostly written in Lisp. But we don't really need to do this up front. Normal function calls will work fine, just not be quite as small&fast. What are we really saying with this "miscop" obsession?
Another funny thing is a miscop's ability to be uninterruptible w/o any overhead (by virtue of its address). But we could have a flag in the code object that indicates whether the code is interruptible. The interrupt handler would just look at the saved registers to find the interrupted code object, and then see if it is an interruptible code object. If so, the interrupt is serviced. If not, we hack the return PC to jump to interrupt service code.

[### Probably hard to do right, though. Witness the current system. It would probably be better to require the uninterruptible function to call some interrupt servicing primitive before it returns. This would check a flag to see if interrupts need to be serviced, and if so, service them. Probably easiest to require this primitive to be explicitly called so that it can be called before a tail call (if appropriate), or not called (if a tail-called function will service interrupts.)]

Query? What miscops *need* to be uninterruptible? Answer: allocation, allocation and allocation. Special bind/unbind. (Possibly unwind, but given all the cleverness the debugger will have to have to understand interrupted calls, it shouldn't be much harder to understand interrupted unwinds.) But all of these things we would really like to implement inline. So in a sufficiently clever system, we might not need uninterruptible functions. But once we had them, we might use them for things other than "miscops".

Really a "miscop" combines two things:
-- Static call using an immediate addressing mode with argument count checking done by the compiler.
-- Callee saves convention.
Only the second causes complications for miscops-in-lisp. Any function with fixed arguments could be called statically. We could have a STATIC-FUNCTION proclamation that created an additional "miscop XEP" and dumped information telling the compiler and loader about this. As long as a miscop looks like a normal code object (allocated statically), it is easy for the miscop XEP to locate the environment. We want to make miscops as much like normal functions as possible because that will both make the memory layout simpler and make miscops-in-lisp easier.

In a callee-saves convention, we could make the preserved registers be preserved by wiring environment TNs in them. But this would prevent us from saving the preserved registers and then using them (a no-save convention?)

[### Note that it would be easy enough for a "miscop XEP" to create an escape frame and save some of the preserved registers. But as long as we have some "scratch registers", we want to encourage the callee to use them. This could be done with some sort of SC magic: when compiling a "miscop" we would use different SC lists for primitive types, with two SCs replacing each register SC. These SCs would represent the scratch register/preserved register division, with preserved registers having an artificially inflated cost. Or perhaps less of an abuse would be to allow growable finite SBs. This would have some initial size, and would be grown one element at a time up to some finite limit.]

[### Perhaps we should incorporate a callee-saves component into the normal full call convention. This does seem to be a potential speed win. Then static call would be similar to full call except that we jump directly to the entry and possibly return using the known-values convention. It seems that actually callee-saves does mesh o.k. with Lisp, but does require some care.
The NLX issue is actually not a big problem: NLX will just have to save all preserved registers and restore them on entry. This hit is only on NLX, and is not much above that already required to handle the possibility of a NLX through C code. So we can make callee-saves work, but is it a good idea?
-- Inhibits tail recursion?
-- Will it be a speed win?
-- Debugger problems?
In its standard inflexible form, callee-saves is bad/impossible for Lisp. If you always save used registers at the beginning of the function and restore at the end, then you always get hit for the saving (even if the registers aren't used), and tail recursion is impossible (since you must restore registers.) But we can restore callee-saves registers *before* tail calls, and then do the tail call normally. Also, this stuff can interact with a trace-driven register allocator so that we save registers on entry to the trace where they are used. A nasty debugger problem is finding the values of variables that were allocated in callee-saves registers. We could try to encode in the debug-info which registers are saved where in what extent of the code. I guess something like a vector of pairs of sc-offsets (register and save location), and then in each code location there would be a bit-mask indicating which of these saves are actually in effect at that location. (Perhaps this could be combined with the "alternate location" stuff somehow?) [I guess we can think of alternate locations as the primary location being saved somewhere else.] So to find a value in a callee-saves register, we scan up the stack looking for the first frame in which that register is saved. (It must be saved somewhere, perhaps in an escape frame.) There is also some potential for it to be difficult to tell who is running at any given time. (e.g. in current miscop stuff, the code object is unchanged by a miscop call.) It seems that for the debugger's (and GC's?) benefit, we would want to establish the callee's code object on call, then reestablish the caller's code on return. That would be much like the current full call. The callee needn't put anything interesting in its stack frame (or even allocate it), but it must restore OLD-CONT on return. This is really the same as now.]

Scheduling assembler:

Provide basic operations to move an instruction from here to there, if possible. This is based on "passability" determined from side effects and operands. Instructions that have delay slots or constraints on adjacent instructions would have operations that do things like:
-- Try to optimize this instruction's neighborhood.
-- Make this instruction correct, inserting no-ops if necessary.
-- Test whether this instruction is correct (?)
Something like a sort-of table-driven peephole optimizer. We need some sort of instruction destructuring. We specify predicates on each instruction in an arbitrary window. Instructions may need to be annotated with how big their largest window is. This may be needed mainly for determining whether a code motion renders an instruction illegal. Maybe we don't want to mess with that, though. If we have the "make this inst correct" operation, we can do an "optimize" pass without worrying about whether it breaks anything, and then do a fixup pass that makes everything legal. It may even be that with the architectures currently under consideration, we don't need this fixup pass, since there is no constraint on motion other than passability.
Some machines may have constraints like two particular instructions not being allowed to be adjacent because some resource is "busy". We initially emit instructions with noops so that we can omit code reorganization to save time. Delete noops in favor of a non-execute instruction as a machine-dependent optimization on machines with non-execute flavors. Need to have some indication of whether an instruction is a branch, though, since branches can't really be moved, but stuff may be able to be moved into the delay slot. We could later add a DAG-based pre-allocation reorganizer that deals well with long soft interlocks (e.g. float arithmetic, RT memory access) and branch delay slots. We will always need this post-pass anyway, since new loads will be introduced by register allocation. It might be interesting (though it won't help unsafe benchmarks) to special-case conditional branches for run-time error checks. We can move stuff across error checks when rescheduling as long as it isn't affected by the data dependency modeled by the original CHECK-xxx VOP. We could also reconsider the possibility of doing rescheduling on the IR2. If we did this, we might want to make constant loading more explicit, though this might not be necessary with a register allocator that understands constant caching. Branch delay slot filling clearly can't be done on IR2, though, so it might want to be done in the post-allocation pass. It seems like delay slot filling benefits relatively little from cleverness, since the branch is relatively immobile. The two main techniques are:
-- Find an instruction before the branch which the test doesn't depend on, and
-- Find an instruction in the target or drop-through that can be spuriously executed. (Should favor the drop-through...)
In addition to seeming to be well solved by a simple scan, lots of the possible delay instructions will be introduced by register allocation (loads and stores).

CALL/LOOP ANALYSIS

Note that Stack Analysis introduces cleanup code after this phase runs. We have to stick the new VMR-Block into the appropriate loop. [### This should be rethought at some point. The current code for strange loops seems not to be totally debugged, and it isn't really clear that anyone cares. If we aren't using loop info for environment hackery, then we may want to redo stuff to be more useful for cost assignment and VMR optimizations. Probably what we really want to do is to compute the call graph before we do loop analysis. It seems that we should be able to use this information to avoid getting confused during loop analysis. We still represent the calls the same way in the flow graph, since this makes the actual control flow explicit for flow analysis. If we do a depth-first walk of the call graph, then we can find recursive calls using the depth-first numbering. Every recursion will involve an arc from a higher numbered node to a lower numbered one. If a function is the destination of such an arc, then we mark it as recursive. Nodes in the call graph ("functions") will be similar to loops. We do loop analysis on each function, [walking the call graph bottom-up?], using the function as the root for the loop nesting. We want to integrate the "loop depth" used for cost computation with the information from the call graph. For example, a non-recursive function would have the "depth" of the maximum of its call sites, and a recursive function would have 1+ this, since the recursion is a "loop". Probably we want the "function" node to be the VMR environment structure.
This will presumably also include some "loop" structure. We definitely need to be aware of tail recursions at this point. Tail recursive calls are represented at the flow graph level, and don't appear in the call graph at all. Tail recursions will be indistinguishable from explicit iteration. ]

This phase is optional, but should be done if speed or space is more important than compile speed. We annotate the VMR with information about loops. First we find the set of the blocks which dominate each block. Note that if we keep the set sorted by DFN, then they will be in order of dominance as well. We probably want to special-case local function call, since standard flow analysis won't realize that the return point of the call is dominated by the call point. The Loop structure contains a list of all the blocks in it, and also the loop depth. We arrange things so that Loops are reasonably nested, and have up and down pointers. We omit blocks contained within inner loops from our list. We also list the exit points from the loop. We must deal with "strange" loops (not just a concession to TAGBODY, since mutual recursion can create non-reducible flow graphs.) First we find the back edges and natural loops, and then we find the strange loops. A strange loop is a cycle which contains no back edges. It turns out that every strange loop contains a retreating edge which is not a back edge, and all such edges appear in strange loops. Strange loops are broken into segments, where each segment is the code in the loop which is dominated by a given entry point. The representation for a segment of a strange loop is about the same as for a natural loop. We can find the heads for the segments in a strange loop by doing a graph walk forward from the ascending non-back edge, recursing on edges which are not back edges. If we reach the start node or a node already in the loop, then we add all the nodes on our path to the strange loop as we unwind. Each block in the strange loop with predecessors outside of the loop is the head of a segment. When there are multiple back branches to the same start block, we consider all the code to be in the same loop. A loop is effectively defined by its head, rather than by any particular back branch. Loop nesting is determined by the dominance relation on loop heads. A loop is nested within another if the head of the inner loop is dominated by the head of the outer loop. Local call tends to cause the flow graph to be non-reducible. This shouldn't be a correctness issue, since other code should be assuming arbitrarily bizarre flow graphs. This is an efficiency issue, since it could inhibit the combination of the allocations for some functions, resulting in more environments being allocated than is really necessary. This can probably be fixed by special-casing local call during loop analysis. All we need to do to get the correct dominators is to union the dominators at the call site with the dominators at each return site. It is less clear how this should affect loop analysis. In some ways, it would make sense to treat local functions as being nested in all the loops that call them. On the other hand, if we are interested in the dominance properties of nesting, then it makes more sense for the code for the function to be in none of the loops. I guess that this conflict is due to different people wanting different things from loops. Some people are only interested in the importance of the code, which they determine from the loop depth.
Other people such as environment analysis are interested in proving that there are no cyclic paths, and want the dominance relations preserved. Other people are interested in doing "loop optimizations", and are primarily interested in entry and exit points. We may want to do funny things with the loop depth anyway, since we will want to bound it so that costs remain fixnums.

LOOP INVARIANT OPTIMIZATION

This phase is optional, but should be done whenever speed is more important than compile speed and space. We scan the loops from the inside out, moving VOPs to before the head of the loop when we can show that they compute the same value every time around the loop. Probably most loop invariant expressions will be due to code implicitly emitted by the compiler, the largest contribution being error checking code. We need to be more careful than lots of compilers, since we must guarantee that we don't evaluate the invariant expressions when they would not have been previously. For example, it would be unacceptable to move an error check out of the loop when the loop might run zero times. The simplest solution is to only consider VOPs in blocks that dominate all the loop exits. This is almost worthless, since most loops have the exit test at the head. What we have to do is guard the invariant code with a replication of the exit test. A simple but useful version of this general transformation can be implemented when the head is an exit. We do something like this:

    LOOP
      (if test (go EXIT))
      .
      .
      (go LOOP)
    EXIT

        ==>

      (if test (go EXIT))
      (go SKIP)
    LOOP
      (if test (go EXIT))
    SKIP
      .
      .
      (go LOOP)
    EXIT

What we do is remove invariants in the head block, then copy the entire head block and use it as a guard for the other invariants, which must dominate every other exit, but don't dominate the head. It makes absolutely no difference what the code in the head block is or what the exit test is. Of course, copying blocks can be a major space waste. We might want to inhibit copying of large blocks unless optimize space is 0. [### Note that this scheme can only move code out one loop level, since the guard on the invariant block prevents it from dominating the loop exits. This transformation can be regarded as a special case of the transform:
-- Given a loop with an invariant conditional, replicate the loop for each value of the condition and run the correct version.
The invariant test is "runs more than once". There is a potential exponential blowup here. To avoid excessive code bloat, we might have one copy of all N loops that is invariant optimized and for which we test that *all* levels of the loop will run more than once before jumping in. But that only works if the loop control of all loops is independent (as is true in the common nested-array-loop case.) Or something...] We determine what expressions are invariants by doing global flow analysis to find the reaching definitions for each TN. A VOP is a potential invariant when every arg is either constant, has all its definitions outside of the loop, or has as its only definition an invariant VOP inside the loop. Such a VOP can be moved out of the loop when we know that it isn't affected by any side-effects in the loop. This will be trivially true if the VOP isn't affected by any side-effects. A somewhat more general solution would be to take the union of the side-effects of all the VOPs in the loop, and only do the move when the VOP is not affected by any of them.
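Below is a toy sketch of that invariance test. The structure definitions are stand-ins invented for the example (the real VMR structures are richer); the point is just the rule that every argument must be constant, written only outside the loop, or written solely by an already-known-invariant VOP.

    ;;; Toy stand-ins for the real VMR structures.
    (defstruct tn
      constant-p   ; true for constant TNs
      writers)     ; list of VOPs that write this TN
    (defstruct vop
      block        ; the VMR block the VOP lives in
      args)        ; list of TNs read by the VOP

    (defun potentially-invariant-p (vop loop-blocks invariant-vops)
      "True when every argument of VOP is constant, has all of its
      definitions outside the loop, or has as its only definition a VOP
      already known to be invariant."
      (flet ((inside-loop-p (writer)
               (member (vop-block writer) loop-blocks)))
        (every (lambda (tn)
                 (or (tn-constant-p tn)
                     (notany #'inside-loop-p (tn-writers tn))
                     (let ((writers (tn-writers tn)))
                       (and (null (rest writers))
                            (member (first writers) invariant-vops)))))
               (vop-args vop))))

Candidates found this way would then be hoisted only when they aren't affected by any of the loop's side-effects, as described above.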
[### Actually, we don't need to use reaching-definitions information: we can just use the reference information for each TN. This won't detect some invariants that would be detected by using reaching-definitions, but that doesn't matter for any of the internal subexpressions that we care about, since they all have only one definition. The main case where this would happen is in the "only definition outside" case: it could be that only an invariant definition reaches the use even when there are definitions inside the loop. But this would only seem to happen if people are gratuitously recycling variables. If we don't need to use reaching-definitions, then reaching-definitions analysis could be relegated to a higher optimization level, and could be tuned for the move optimization case, where we only want to know what the definition is when there is a single one, and can get away with a simple indication of "many" if there is more than one. The move-deletion action also seems conceptually pretty similar to the stuff that register allocation does. Maybe it could be squeezed in there somehow? ] We can relax the "dominates all exits" requirement when we can prove that the operation can be successfully executed whenever its operands are available. Of course, any error checking code doesn't fall into this category. But we also need to show that "nothing bad will happen": system integrity will be maintained when the operation is done with arguments that it might not otherwise be done with. For one, we need to be sure that localized type information didn't influence our original decision to emit that particular VOP. Uses of THE, variable type declarations not attached to bindings, and function argument type restrictions must be disregarded, since in the original program, the code containing the declarations or call might not have been executed. In contrast, we can use type information derived from declarations for variable bindings. Since we eagerly check the types of variables, we know that any reference to the variable will satisfy the type. Read this to say we can use the variable's Leaf-Type: we have to be more careful about using the Ref's Node-Derived-Type, since ICR optimizations may discover local information about the variable's type. For example, type constraint propagation puts localized type information in the derived type. Note that the Leaf-Type is represented in the TN-Primitive type, accessible in VMR. How about this: For a VOP to be subject to aggressive code motion, it must be explicitly declared as such (i.e. :Movable T), and it must have primitive-type restrictions on any arguments that must be of a particular type for successful execution. Motion won't be done unless the TN-Primitive-Type of operands after the move satisfy the restrictions. "Successful execution" means that *no* invocation of that vop with *any* argument TNs with the specified primitive types will result in any sort of error or in any trap such as illegal memory access, floating point overflow, divide by 0. Also, the result generated must be in the representation demanded by the result TN. If there are any result type restrictions, then it must be harmless for them to be violated (as long as nobody uses the result.) For example, in a lowtag scheme, it is harmless for a fixnum output restriction to be violated. [But not in a hightag scheme, since a garbaged type code results from overflow.] Type coercion operations aren't subject to aggressive motion (i.e.
coerce-from-t), since coercion requires the argument to be the right type, yet coercion operations aren't ever emitted unless the argument type is T. Data structure accesses are o.k. as long as we establish that the operand is some sort of pointer (boxed in the case of boxed accessors). [But we can't bless the result of moved code as being of the correct type. If we do a bogus reference, we may pick up some immediate object, rather than the pointer we were expecting. So we can't do aggressive motion of a whole reference expression, only of the innermost reference.] Note that we could still do aggressive motion of structure accesses, even though the primitive type discards the exact structure type. All could translate to STRUCTURE, assuming that mixing up accessors doesn't result in death. [Systems that use protected pages to mark the end of the heap could get confused, though.] This means that the kinds of operations subject to aggressive motion are relatively limited. Considering the complexity and subtle issues involved, it seems questionable whether it is worth attempting. The only interesting case seems to be fixnum arithmetic, and this only works given assumptions about the legality of fixnum overflow. Aggressive motion of special references (including boundp checks) is probably culturally acceptable, but the correctness is dubious. We could do it in unsafe code, though. I guess there also wouldn't be any problem if we factored out the boundp check into a special check operation.

COMMON SUBEXPRESSION ELIMINATION

This phase is optional, but should be done if speed or space is more important than compilation speed. This phase scans each block and combines VOPs that compute the same expression. We scan forward, adding VOPs to some kind of hashtable if they have an attribute indicating that they are a candidate for a subexpression, and clearing out VOPs in the table that are killed by side-effects that we encounter along the way. Probably we want a separate table for unaffected VOPs so that we have fewer to scan when we notice a side-effect. We can deal with invalidating VOPs whose arguments are clobbered by looking at the TN-Refs for each TN that we see being used as a result. Probably many expressions should be killed by function call, since they won't be worth doing two memory references to save. We do this after all the other optimizations so that we can combine duplicated crud that may have been stuck at loop heads by previous optimizations. We could do global common subexpression elimination, but it seems like a lot more trouble than it's worth. Global common subexpression is computationally difficult, both because it uses flow analysis and because it involves creating and repeatedly scanning large sets of expressions. Local common subexpression probably gets us most of the win; common subexpression elimination is more of a space optimization than a time optimization anyway, since loop invariant elimination moves the common subexpressions out of loops. [### On the other hand, GCS could optimize stuff that safe loop invariant optimization is too wimpy to handle. The problem is that we can only optimize invariants that dominate the exit (unless we can somehow prove to ourselves that the operation cannot possibly result in an error.) Also, it isn't clear that symbolic programs really do spend most of their time burning in inner loops.
With GCS, if the expression has already been computed outside of the loop, then we know we can flush the evaluation; there is no need to worry about safety. Of course, we are already planning to do GCS-like optimization of type checks in the ICR type constraint propagation step, so we won't get any type checks this way. Special references and array index calculations (as well as garbage macro expansions) are a possibility, though.] Probably we would represent an "expression" by arbitrarily designating the first use we see of a particular VOP/argument combination as the "expression". A pre-pass would build sets of the operations generated, those killed by assignments, and a representation of the aggregate side-effects of the block. We may want to divide the set of available expressions according to whether the expression's VOP is affected by any side-effects. This is because during flow analysis of any block that has side-effects, we need to iterate over all of the possibly affected VOPs to see which expressions need to be killed (since we don't want the pre-pass to have to compute the set of all expressions killed by the side-effects.) It is likely that the set of available affectable expressions will be smaller than the unaffected set, since affected expressions tend to be killed by side effects.

Unbounded SB packing:

We use a different packing algorithm for packing in SBs that represent an abundant resource such as memory. If we know that all TNs that want to be packed in the SB can be packed in it, then we can optimize preference costs at the expense of storage usage. When we attempt to pack in an abundant SB, we just add the TN to a per-SB list and return success. Our algorithms might also exploit the statistical difference in the kind of TNs that get packed on the stack: overall, most TNs are local and will get packed in registers. The stack-allocated TNs are a small fraction of the total, and include a much larger percentage of global TNs. As in scarce Pack, we represent the conflicts for a location with a bit-vector that has a 1 for the number of every TN that cannot be packed in that location, but we index using new TN numbers assigned only to the TNs we are trying to pack on the stack. This should reduce the size of the vectors quite a bit, since the stack TNs will be a small fraction of the total. First, we attempt to honor all global preference links (ignoring local preferences for now). We find all the preference links between TNs in the SB, and rank them by strength. This should reject a lot of preferences, since many preferences will be to TNs that got packed in registers. We then process the preferences in order:
-- If one of the preferenced TNs has a location assigned, then we check the location's conflicts to see if we can pack in that location: if not, we skip this preference. If we pack, then we add the new TN's conflicts to the location's conflicts vector, computing them from the TN's conflicts information.
-- If both of the preferenced TNs have locations, then we see if the locations can be combined. This involves iterating over the TNs packed in one location and seeing if any of them are in the conflicts for the other location. We maintain a list of the TNs packed in each location so that we can do this iteration efficiently. The lists should be small during the initial preferencing pass, since the size is bounded by the number of TNs that can be combined for purely preference reasons, which should be O(1). We combine the conflicts vectors using bit-or.
-- If neither of the TNs has a location, and they don't conflict, then we assign a location to the pair, building a conflicts vector from the combined conflicts.
After we have done all we can with the preferences, we pack stuff together in the minimum amount of space:
-- Combine the already allocated locations where possible. Although the number of TNs packed in a location may become large at this point, testing for conflicts will still be reasonable, since we iterate over the small list of the location we are combining. We don't even need to update the TN list for the location we are combining into, since we won't ever try to combine it with anything.
-- Pack all the remaining TNs allocated in the SB. These are the TNs that didn't have any successful preference (probably a large fraction.) If a TN can't be packed in any existing location, then we make a new one.

VMR CONSISTENCY CHECKING

Block connectivity is similar to ICR.
Verify that TN-Ref linkages are correct.
Verify that TN-Ref Write-P slots are correct in args and results.
Check local and global preference links.
Check info about top-of-stack TNs. (whatever that may turn out to be...)
Check that the VOP list within each block is well-formed.
Check VOP arg counts?
Check for valid TN assignments w.r.t. function call.
Check loop analysis?
Check environment analysis: Do basic sanity checks such as making sure that directly referenced TNs are directly accessible in the current environment.
Check for references to uninitialized TNs: We could detect some lossage by looking for TNs which are referenced before they can possibly have been written. This could easily fall out of lifetime analysis. If a TN is uninitialized at the beginning of a block, and is read before it is written, then we have a problem.
Check packing legality: To the extent that it is possible, this is a good idea. We can certainly check for things like SC restrictions being satisfied. We could also check on the legality w.r.t. TN lifetimes, but it would be extra work. We could redo the live var analysis using simpler data structures and algorithms which are optimized for legality checking. After checking each block, we check that the new analysis agrees with the old, which will guarantee that the lifetimes are globally correct. Our checking algorithm would probably be based on data structures representing locations rather than TNs. Each location would have a flag indicating which live TN is in it, if any. If someone attempts to read a different TN in that location, then we know we have a problem. The validity checking might not be much less buggy than the real version, but hopefully the bugs won't intersect.
Check accuracy of effects information: Incorrect tables of information about effects and interfaces are a major source of lossage. We could check up on this by mumbling when a supposedly effectless operation is used for effect. Could also complain if a supposedly effectless operation is explicitly replicated in user code.
Could have a grab-bag of other checks: whenever we find a problem, we try to devise a check which would have detected it.
Cost assignment checking: Probably a good idea, since bad costs would cause subtle bugs because code would work, but slowly. Can we write something to check for efficiency bugs? Can check for things like:
-- TNs which aren't preferenced that we make moves between.
-- TNs that are preferenced, have disjoint lives, and get packed into different LOCs in the same SC.
-- Specific VOPs whose average cost is out of line with initial estimates.
-- Code sequences containing many VOPs with above average costs.
Could have a smarter, slower pack algorithm which tries to find better packings. If the better packing is enough better, then we complain. Could we treat repacking as a game? We could start out with a packing, and then move through the space of packings by moving one TN as our "move". We would consider the cost of repacking any TNs that we unpack to be the "enemy move".

OBSERVED BUGS AND POSSIBLE FIXES

Some bugs gleaned from the change logs of existing compilers, with possible ways to prevent them from occurring in this compiler:
Generators emitting bad assembler format: Make assembly emission be done with a macro that checks the instruction for validity at compile time.
Random address arithmetic being done wrong with magic numbers: Use only named constants and well-defined named functions in such expressions. Examples of functions are:
    words=>bytes  (* x 4)
    byte-index    (+ x y)
    skip-header   (+ x )
No magic numbers in generators anywhere.
Bugs with stack simulation: Let stack analysis worry about it.
Generators putting unboxed things in boxed registers and vice versa: Mainly a problem with temporaries used within the code for the VOP. This is most likely to happen in code which deals with boxing and unboxing numbers. Factor this code out and get it right. We can have a frob that will tell us all VOP generators that request unboxed TNs, so that these can be given an extra-careful going over.
General usage of wrong registers (magic register numbers): Make generators get their temporaries by requesting them as TNs. The declarative VOP info will cause them to be allocated and passed in. The requirements are specified in terms of SCs rather than magic registers. This should give us some freedom to move things around without having to change all the generators.
Problems with inadequate restrictions being specified on TNs: Have conservative defaults?
Functions that pass around pieces of LAP such as operand locations cause problems with callers not being up on the syntax of what is going on: Don't pass around syntax, pass around structures that represent semantics and can be checked.
Broken generators fail to return the results and similar stuff: Check that all the args and results are actually used as operands by the generated code. We could have temporaries be classified by whether they are always used or only sometimes used, and then could check that the always-used ones are always used.
Representation selection seems to be a huge source of problems: Does having lots of SCs and getting Pack to work for us really eliminate all these problems?
Discrepancies between the interface to a transform and the real definition: This is largely eliminated by specifying a type that the call must satisfy. If we forget to handle a keyword in a transform, then it just won't get transformed. We could have a frob that compares the types for transforms to the types for the definitions.
Other confusions between different implementations of the same thing: Macros vs. special forms. Functions vs. source transforms vs. transforms vs. open coding. Do they do the same thing? Are the differences gratuitous? Are we using the one we want? Do we have all the versions that we need?
If the same information is kept in more than one place then it is surely wrong somewhere: Don't duplicate information. Check that duplicates are consistent.
When decorating the representation, we forget to fill in some slots, and then garbage is interpreted in some random way: Choose defaults that are obviously broken: *empty*
Failure to realize that some stuff must be recomputed due to failing to realize that something has been invalidated or failing to propagate the invalidation.

PROFILING

It would be nice to have a general profiling mechanism like the one MIPS uses to gather dynamic instruction statistics. This could be used for user-level tuning, but it would be especially appropriate for analyzing the compiler's performance for tuning or publication. The basic mechanism is that we associate a counter with each basic block, and increment the counter when we execute that block. To make use of these per-block counts, there must also be some additional information about what is in each block. We implement the counters by having a 32-bit I-Vector pointed to by the profiled component's constant pool. When we enter the component, we stash this in a known, permanently allocated register. We also have another permanently allocated register that is used as a scratch location to load the current count into and increment. Probably a useful kind of per-block information is to have, for each VOP used in the block, the name, the number of times used, and the generator number that was used. We represent this by having a simple-vector parallel to the counters vector. Each entry points to per-block information, perhaps represented as a simple-vector of VOP names and use counts, with the generator number only being specified when it isn't 0. We also have the start PC for each block so that we can use the debug information to find out what the block is doing. VOP usage information is probably most interesting as far as the compiler is concerned. As long as we have accurate cost information, we can directly determine what percentage of time is spent doing each VM operation, instead of having to guess from instruction usage statistics. These results could be used either to tune performance-critical generators, or to rewrite code so as to avoid using expensive operations. This information could also be useful for testing purposes. If the VOP usage signature of a test changes, then this is a warning that a pessimization may have been introduced by a compiler change. We could also detect generators that aren't used by our test suite. It should be easy to hack the assembler to dump instruction usage information too, although this would be of more interest to architecture types.

MULTIPLE THREAD SUPPORT

When we support multiple threads under Mach, we will have to scrounge a THREAD register. This will point to a structure which contains the thread-specific context which doesn't fit in registers:
    Current special binding block
    Active catch pointer
    Any number stacks
    Information for management of the resources for this thread:
        Base and size of stacks.
        Descriptive info (name...)
        Policy info (priority...)
        OS level handle on the thread
        Synchronization info such as what lock the thread is waiting on, if any.
The THREAD pointer could actually be the base of the thread's control stack, with the frequently used info being stored before any real stack stuff. This would allow fast checks for continuation validity, since both the top and bottom of the cstack will be in registers. This is useful, since a non-local GO into another thread's stack would cause flaming death. We will also have to modify code which manipulates resources shared between threads so that it does synchronization.
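As a purely illustrative rendering of the context block listed above, it might look something like the sketch below. The structure and slot names are invented for the example, and in the real system this data would live at the base of the thread's control stack rather than in a heap-allocated structure.

    (defstruct (thread-context (:conc-name thread-))
      binding-block         ; current special binding block
      catch-pointer         ; active catch pointer
      number-stack-pointer  ; any number stack(s)
      ;; Resource-management info for this thread:
      control-stack-base    ; base and size of stacks
      control-stack-size
      name                  ; descriptive info
      priority              ; policy info
      os-handle             ; OS-level handle on the thread
      waiting-for)          ; lock this thread is waiting on, if any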
The big issue performance-wise is the alloc-table: when there is no lock on the table, we want to get at it real fast. Contention for the alloc-table could be reduced by having multiple allocation spaces (and tables) for important types. You could just grab a table off of a queue and put it back when you are done. It will be necessary to make all miscops "interruptable", since they will have to be able to run concurrently. Probably the only resource shared at the miscop level is the alloc-table. Note that software interrupts are going away, so strictly speaking interrupts won't happen, but a scheduling break can happen at any time. The only big problem I see is with miscops which manipulate raw pointers into objects. The system must either avoid GC while such a pointer is in use or adjust the raw pointer if the object is moved. Probably the fix is to allow raw pointers in boxed registers as long as the real object is also in a boxed register. This would allow random pointer hacking to be open-coded as well. Being preempted during object allocation would still be a problem, but the allocator will have to get a lock on the alloc-table before it can do anything, and GC can't run without the alloc table. Of course, we will have to go to deep binding when we support threads. We can eliminate the binding stack by threading the BIND blocks together on the control stack. This will simplify the problem of allocating stack for each thread. Deciding on the exact implementation of deep-binding is non-trivial, since there are many different ways it could be done. The scheme used on the S1 seems like a reasonable start, though. A BIND block looks like this: . . . The value cells are just allocated on the stack like a lexical variable. The BIND block is delimited by the prev pointer. The reason for keeping a pointer to the cell instead of just using the stack location directly is that it lets us cache binding lookups. Whenever we look up a variable, we link a LOOKUP frame into the bindings. This lookup frame contains the result of our search and looks just like a BIND frame as far as the search is concerned. The theory is that this will reduce the average distance searched, since commonly used variables will commonly have been used somewhere near you on the stack. Since we look up the address of the value cell rather than the value proper, we only have to do the lookup once per function. Note that there is no obligation to link in a LOOKUP block; if it is unlikely or impossible that any function we call will reference any of the variables that we reference, then not making a LOOKUP block is a good idea. The S1 has an important optimization of variable lookup. It is based on the observation that many special variables are never actually bound. If a variable is known not to be bound, we can just go directly for the value cell. One solution to the problem would be to introduce a new GLOBAL variable kind which cannot be bound, but we can do nearly as well by noticing the lack of bindings at run time. The S1 keeps a count of bindings in a cell in each symbol; if the count is zero at lookup time, we don't have to do any lookup. I think that it would work just as well to have an EVER-BOUND flag which is set and not reset. One advantage of this approach is that there is no need to do anything when a binding is removed. On the S1, THROW needs to scan up the bindings, decrementing the count, which requires a way to distinguish BIND frames from LOOKUP frames. 
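A toy model of that lookup is sketched below. Heap-allocated structures stand in for the stack-allocated BIND/LOOKUP blocks, the EVER-BOUND flag is modeled as a property on the symbol, and value cells are just conses; only the search order is meant to be faithful to the description above.

    (defstruct bind-block
      prev       ; enclosing BIND or LOOKUP block
      entries)   ; alist mapping symbol -> pointer to its value cell

    (defvar *bindings* nil
      "Innermost BIND/LOOKUP block (the chain threaded on the control stack).")

    (defun global-value-cell (symbol)
      ;; Toy stand-in for the symbol's global value cell.
      (or (get symbol 'global-cell)
          (setf (get symbol 'global-cell) (list nil))))

    (defun special-value-cell (symbol)
      "Return the current value cell for SYMBOL.  If the symbol has never
      been bound (the EVER-BOUND trick), go straight to the global cell;
      otherwise scan the BIND blocks for the innermost binding."
      (if (not (get symbol 'ever-bound))
          (global-value-cell symbol)
          (do ((blk *bindings* (bind-block-prev blk)))
              ((null blk) (global-value-cell symbol))
            (let ((entry (assoc symbol (bind-block-entries blk))))
              (when entry (return (cdr entry)))))))

Once found, the cell pointer could be cached by linking in a LOOKUP frame that repeats the (symbol . cell) entry, which is what makes later searches from deeper frames shorter.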
There is a trick which might be a win if BIND and LOOKUP blocks have a large enough average size. The idea is to use a hash filter to determine if a given symbol might be in a particular block. Each symbol name is hashed to a fixnum with a few (3 or 4) random bits on. When a block is built, the OR of these hash values is placed at the beginning. When looking for a symbol, we can skip a block if any of the bits for the symbol are zero in the hash for the block. All these hash codes are computed at compile time, of course. When binding an unknown symbol, we just use all 1's. (A toy sketch of this filter appears at the end of this section.) We can't delimit the block with the prev link, since that has to be next to the hash so that we can skip the block without going to the end. A block would look more like this: . . . In a block of less than six or so entries, we can reject it with about 90% probability. The probability improves as the block size decreases, but the advantage decreases as well. The actual advantage also depends on the dynamic characteristics of the lookup process, since if it turns out that a lookup is usually satisfied on the previous block, we will never reject any blocks.

Bizarre optimization on deep-binding: If we notice that a function which we call in a local call is referencing a variable we bind, we can pass in the value cell pointer as an argument so that it doesn't have to do the lookup. It is questionable whether this would help real code even with massive block compilation, but it sure would win for STAK. We could actually generalize this optimization to be less obviously bogus. What we do is shallow-bind the *location* of the current value cell for each special variable which is both referenced and bound. At each entry-point to the block, we build a specbind frame for variables bound in the block (doesn't need to be all). Initially all the value-cell pointers are to the looked-up current values. Within the block, we keep track of the current value cell by shallow-binding it in some known location. This could be in a register or in the specbind frame itself. Internally to the block, we can do a special reference by indirecting through the magic location, and can do a binding by saving the old value in our frame and then storing the location of the new cell in our frame. When we do a full function call, we move any value cell pointers which are in registers into the correct place in the specbind frame. This could make STAK *faster* than normal shallow-bound implementations, and might well help real programs also.
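Here is the toy sketch of the hash-filter test referred to above. The filter width (24 bits) and the use of SXHASH to derive the three bit positions are arbitrary choices made for the example; in the real scheme the per-symbol hashes would be computed at compile time.

    (defconstant +filter-bits+ 24)

    (defun symbol-filter-hash (symbol)
      "Hash SYMBOL's name to a fixnum with (up to) three random bits on."
      (let ((h (sxhash (symbol-name symbol)))
            (mask 0))
        (dotimes (i 3 mask)
          (setf mask (logior mask
                             (ash 1 (mod (ldb (byte 8 (* i 8)) h)
                                         +filter-bits+)))))))

    (defun block-filter-hash (symbols)
      "OR of the hashes of the symbols bound in a block.  Binding an
      unknown symbol would contribute all ones instead."
      (reduce #'logior symbols :key #'symbol-filter-hash :initial-value 0))

    (defun maybe-in-block-p (symbol block-hash)
      "False only when SYMBOL is certainly not bound in the block, i.e.
      when some bit of the symbol's hash is zero in the block's hash."
      (let ((h (symbol-filter-hash symbol)))
        (= (logand h block-hash) h)))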