design.txt -- design document

Compiler and Symbol Table Info

The compiler will build symbol tables into the run-time representation to
avoid redundant representations.

Also, dependencies must be represented so that when a class is redefined,
all subclasses are invalidated.

Globals are stored on atoms. Classes are simply global variables
bound to class objects. Symbols must be searched to find a symbol
name corresponding to a class. (We could store the name in the 
class, but there is the danger that the user might reassign the
symbol's value.)

STRUCTURES

Symbol is a Node with the following fields:
    name: a pointer to an FString, the name of the symbol
    value: the global variable binding
    function: the global function finding

Array is a Node. Arrays can be extended using an append operation.
Since Nodes are not naturally relocatable, Arrays are implemented
as an Array node with a pointer to the actual data. To extend the
array, the data is copied to a larger block of memory. The Array
node has the following fields:
    length: the length of the array
    data: pointer to a Node
Since all nodes have a built-in length count, we can tell how much
space is allocated in the data. The array length is different so
that we can allocate extra space to allow for growth. It is not
necessary to reallocate the array every time its length changes.
When memory is allocated to accomodate a growing array, the
new allocated size will be some factor, making the average cost
of extending an array linear rather than quadratic.

FClass is an Object with these fields:
    class: the class class
    super: the superclass
    symboltable: symbols object
    symbols: vector of atoms
    slots: size of instances 
     (this is the number of member variables, including inherited
      members, which in turn include the class pointer)

Symbols is a Dict mapping symbols to class members (vars and methods)
Every FClass points to an instance of Symbols
    lookup: symbol -> ismethod, index, params
    insert_method: symbol, params -> void
    insert_var: symbol -> void
    methods: vector (a Node) of method pointers

Frame is a Node (not an Object) with these fields:
    tos: top of stack, tells how many stack elements in frame
        (during procedure calls)
    pc: remembers next instruction to execute (set by a call instruction)
    prev: previous frame (a Frame)
    self: the object for this frame (an Object)
    method: the method for this frame (a Method)
    parameters: any number of slots for function parameters
        (the number of parameters is given by the method's signature)
    optional parameters
    keyword parameters
    rest array pointer (if any)
    dict dictionary pointer (if any)
    locals: any number of slots for local variables
        (the number of locals is given by the method's stack_offset)
    stack: during calls, the expression stack is stored in the frame
        (the size is given by tos)

Method is a Node (not an Object) with these fields:
    literals: Array of literals used in the code
    code: instructions (a Code)
    name: name (a symbol) of method
    parameters: 16-bit integer: how many positional parameters
    optionals: 16-bit integer: how many optional parameters
    keywords: 16-bit integer: how many keyword parameters
        (these parameters are in the symbol table; this number
         is used to tell which of the symbols are keyword
         parameters.)
    rest_flag: 8-bit boolean telling whether additional
        parameters are passed by array
    dict_flag: 8-bit boolean telling whether additional
        keyword parameters are passed in a dictionary
    default: default values (in an Array) for optional and keyword
        parameters
    line_nos: line numbers corresponding to instructions (a Code)
    stack_offset: location in stack frame to store copy of stack
    frame_size: how big should the frame be (including parameters,
        locals, and expression stack area)
    symbols: symbol table (a Locals) for this method:
    class: class to which this method belongs (a Class)

Locals is a Dict containing parameters and local variables:
    lookup: symbol -> index (or -1 for not found)
    insert_var: symbol, index -> void

ATOMS/SYMBOLS

A Symbol is a lisp-like atom. There is a hash table in the
Machine class that maps strings to symbols. Symbols contain
the string, a function pointer, and a value pointer.

Because atoms are common, we expect code to be full of them,
but this creates a problem for the loader, which must somehow
make sure that an instruction that references an atom resolves to
the right address for the atom. Our solution is to make a table
of atoms per compilation unit. We don't know at this point exactly
what a compilation unit is (a file, a set of files, a class, ...),
but we will assume that a compilation unit is at least a file.
Originally, we used class as the compilation unit and used a
per-class table of symbols. This created a problem for global
functions, which have no class. Therefore, we will make an
atom table for each file and share the table among all methods
in the table. Every method will have a pointer to the atom
table. Code that references an atom will actually use an integer
offset into the atom table.

INSTRUCTION SET

Instructions are 16-bit "byte codes" for a stack machine. The
machine uses a small expression stack (rather than registers).
To avoid the problem of stack overflow and/or preallocating
a large object for the stack (which might also create problems
for garbage collection), there is no large linearly addressed
call stack. Instead, every stack frame is an independent object.

Any values on the expression stack are copied to the current
stack frame before a call (to avoid the problem of expression 
stack overflow). The compiler is responsible for allocating
enough space in each stack frame to hold the largest possible
expression stack. The expression stack size can be determined
statically, and the size is zero if a function does not call
any other functions.

Instructions have a high-order, 4-bit "op-code" and a 12-bit
"argument" field. The op-codes are as follows:

The instructions types are the following:
    operation: the argument selects an operation, e.g. add or call
    load/store global: the argument selects an atom
    load/store instvar: the argument is offset into instance
    load/store local: the argument is offset into frame
    load small signed integer: the argument is a 12-bit signed int
    load literal: the argument is a 12-bit offset into a literals array
    load atom: the argument is a 12-bit offset into the symbols array
    local frame: the argument is a 12-bit offset into the symbols array
        and names the method. This invokes a method in the current object.
    frame: the argument is a 12-bit offset into the symbols array.
        This is like local frame except the method is invoked in the
        object on the expression stack rather than the current object.
    non-local load/store: the argument is a 12-bit offset into the
        symbols array, naming the field, which must be looked up in the
        object's class.
Note that there is no "load character literal", although a char
would fit into 12 bits with room to spare. I decided the "load 
literal", which loads a value from a table of literals would be
more general. Storing characters in the literal table is perhaps
wasteful, but chars are not a frequently-used data type, so the
overall cost should be low. Note also that there are no character
types per se, but there are 1-character strings.

METHOD INVOCATION

This has become complicated. Here are some issues leading to the design.
1) expression stack: there is an expression stack, which provides a
cheap place to store intermediate results of expressions, but we don't
want the expression stack to overflow, and we don't want to allocate a
huge expression stack. The expression stack could get huge if we
evaluated argument lists on the stack -- a procedure could have very
many arguments, e.g. LIST could take hundreds of arguments to build
a big list of constant values. Also, a call could have nested
procedure calls and be recursive. Therefore, the expression stack 
cannot hold all arguments; it should compute one at a time, and the
argument should then be stored somewhere. Secondly, the expression 
stack should be emptied before invoking a method and restored upon
return of the method to support recursion. I think of the stack as
a set of registers, so saving/restoring the stack is like saving and
restoring registers in a register-based architecture.

2) methods support several kinds of parameter passing. All parameters are
passed by value, but mutable "values" and the ability to return structures
allows various ways to work around the absence of call-by-reference.
The most common parameter passing is required positional parameters.
There are also keyword parameters, optional parameters, and the ability
to pass in a variable number of parameters. The variable case is handled
by passing in a vector of actual parameters as a single argument. Any
optional or keyword parameter can have a default value.

3) because method lookup can be dynamic, the caller does not know
what style of parameter passing will be used.

Solutions:

There has to be a place to place actual arguments while the argument
list is being evaluated. Therefore, a stack frame is allocated before
the first argument is evaluated. As each argument is evaluated, the
argument is stored into the stack frame.

Note that if there are nested function calls, you could be in the 
middle of computing a stack frame when you need to begin calling
another function, requiring a different stack frame. In this case,
you will end up with multiple stack frames (!) allocated. The stack
frame pointers are simply pushed on the expression stack, and we assume
that expressions are not nested so deep that this will be a problem. 
Also, we allocate frames from the heap, so we can allocate multiple
frames and make them the top-of-stack in sequence without copying.
(The static nesting level, not the dynamic level, determines the stack
requirements.) 

After the arguments are evaluated, the frame actually becomes the
top of the stack (frames are allocated from a heap, not from a
linear memory region). The expression stack is stored into the
caller's stack frame. Thus, the caller's frame does need room for
some temporaries, but the compiler can figure out the specific 
number of temporaries (zero if the method does not call another
method) and allocate just enough space for that particular method.
Upon return, the expression stack will have one value -- the return
value -- the expression stack is restored from the caller's frame
and the return value is moved to the top of the stack. Processing
continues in the caller.

Since the caller does not know how many arguments are expected by
the callee, or whether the callee wants a vector of arguments or
individual arguments, this must be handled at runtime. (An alternative
would be pass all parameters in vectors, but this would require an
extra allocation and free operation per call and a special set of 
op-codes to load and store parameters as opposed to local variables.
It's not a bad idea though. Instead, we will use relatively clever
instructions that "hide" the callee's protocol from the caller. This
will now be described.

The basic structure for a call is:
    push the object (unless it is implicitly "this")
    create a frame -- this single instruction, either instr_local_frame
        which pushes "this" or instr_frame. Both types of
        instructions include the index of the atom
        that names the method. The method is then looked up in the
        object's dictionary (the object is on the stack). The method
        is an object with code, debugging info, etc., and includes the
        number of required parameters, the number of keyword parameters,
        the number of positional parameters, and whether the rest of the
        parameters should be passed as a vector. An exception is also 
        raised if there is no
        receiving method found by the method lookup.  Based on the
        method's parameter count value, the caller sets up the frame
        and possibly an array for extra parameters, and an
        offset into the object. These two things are stored on the 
        expression stack, so the stack contains (from top down):
            index into object for the next parameter (or -1)
            the new frame pointer
        Class instantiation is a special case -- the compiler doesn't
        know whether a symbol denotes a class or function, so this is
        determined at frame creation time. What we want is to create
        an instance and then pass parameters to the init() method.
        If the class were actually a global function, here's what it
        would look like:
            def MyClass(p1, p2):
                var instance = new(MyClass)
                instance.init(p1, p2)
                return instance
        I think we can fake this as follows: 
            Modify the compiler to generate extra instructions at the
                head of init() methods to allocate an instance and
                store the instance pointer into "this"
            Modify the return instructions in init() to return "this"
            Modify compilation of class to store a pointer to the
                init() method as the function pointer on the class's
                symbol. If there is no init() method, then generate
                a little method that just allocates and returns an
                instance of the class.
    initialize parameters -- if there are default values, these are 
        copied from the method to the frame. Other values are initialized
        to NULL.
    for each parameter (left to right)
        code to evaluate the parameter
        param -- this instruction stores the top of stack into the
            frame or appends it to the array in the first local
            slot of the frame, OR...
        keyparam -- this alternate form of param expects a keyword
            (a symbol) and a value on the stack. It looks up the
            symbol in the method's symbol table (mapping symbols to
            locations in the stack frame). If the target is found and
            if it is actually a keyword parameter (determined by
            comparing the location to the range of locations allocated
            for keyword parameters), then the value is written to the
            frame, overwriting the default value for parameter.
    call the method -- this instruction expects an index and new 
        frame pointer on the expression stack. It pops the index. It
        remembers the new frame. The rest of the expression stack
        is copied into the current frame, using an offset stored in
        the current method. (The size of the stack is also stored
        so we know how much to restore.) 
        Once the expression stack is saved, the
        new frame is made the current frame, the PC is stored in one
        of the frames (the caller's frame would make sense) and the
        PC is set to the entry point in the called method.

Return is much simpler: the method leaves one value on the expression
stack (the return value).  The previous frame is located by following
a pointer in the current frame.  From the previous frame, the expression
stack is restored and the return value is placed at the top of the 
expression stack (it isn't left on the bottom). The previous PC and 
previous frame become current, and execution resumes in the previous 
method's code.

The virtual machine is written in C++ and is run by calling the 
Machine::execute() method. This method is not recursive, i.e.
the C++ stack is not used when a new Serpent method is called,
and execute() is not exited when a Serpent method returns. Therefore,
we need some other mechanism to cause execute to return. The
mechanism is a frame with a NULL program counter. After returning
from a procedure, if the machine PC is NULL, the machine returns.

Thus, to call a Serpent method from C++, you allocate a frame with
a NULL PC. Then simulate a procedure call. The methods defined in
call.cpp provide a convenient interface for creating a frame, 
passing in parameters, and then running a function.

FUNCALL

To call a function named by an atom: compiler treats funcall() as
special. After evaluating the first argument (which should evaluate
to an atom), execute OP_FUNCALL, which pushes NULL as the "object"
and calls op_frame to create a frame for the remaining arguments.
These are evaluated and stored with OP_PARAM into the frame. Finally,
OP_CALL calls the function.

SEND

Send works by taking an object and a method selector (an atom), 
followed by arguments. Like funcall, this is handled specially by
the compiler. After evaluating the first two arguments, execute
OP_SEND, which removes the atom from the stack and calls op_frame
to create a frame for the remaining arguments. As with funcall,
parameters are stored with OP_PARAM and finally the method is
called with OP_CALL.


Garbage Collection

Use the write-barrier technique.

Initially, objects are marked black. We want to find all reachable 
objects and mark them white. Objects that are in the process of being
marked white are marked gray. The invariant is that no white object
points directly to a black object. (But a white object can point to
a gray object and a gray object can point to a black object.) When
there are no gray objects left, then we know that no white objects
point to black objects, so all black objects can be freed.

During the marking phase, new allocations are white. Any store must
be checked: if storing a ptr into a white or gray object, if the 
pointer is black, mark it gray to make sure its pointers are marked
later. Each gray object in this phase is marked white after marking
the objects it references gray (or leaving them white).

During the scanning phase, where white objects are converted back to 
black and black objects are freed, the write checking should be 
disabled because storing a black object (that has just been converted
by the scan from a white object) into a white object (that will be
converted into a black object) is OK.

During the scanning phase, new allocations should go on the gray list.
If new allocations were black, they could be freed during the scan.
If new allocations were white, we might store black objects into them
after GC. Then in the next GC cycle, the white object would appear to 
be already marked, so the black objects it references would not be
marked and they would then be freed. Also, marking new things white 
during scan violates the "initially, objects are marked black" rule.

After scanning, there should be another phase to change gray (recently
allocated objects) into black.

(See Freeing Stack Frames for more information.)

Should we scan objects to locate and free black ones? We could keep
all objects on a linked list, but we really need objects sorted by
size. There could be multiple lists, one per size. Actually, there
would be 4 lists per size: the free list, the black, white, and gray
lists. During GC, objects would move from black to gray to white.
(We need the gray list so that we can scan gray objects.) When
gray is empty, we can move all black to free. In this scheme, we 
need to scan all different sizes, but we don't have to scan the
final white list changing back to black because we can just renumber
the colors.

The alternative is to not move objects between lists. Suppose we
keep the size of each object at the beginning of the object and 
we allocate objects sequentially from memory blocks. Then we can
scan all objects by going through each memory block from beginning
to end. We need a separate list or table (stack?) of gray objects.

Storage analysis:
   S different sizes
   N objects

Assuming we can squeeze the color into a spare byte somewhere:
For lists: 2N + 4S (doubly linked lists + table for sizes)
For scanning: N + S (size field + free lists)

Timing analysis:
   L time to move object between doubly-linked lists
   I time to insert and remove from singly-linked list/stack
   S time to scan an object
   M objects (including free ones)
   N links
   F freed objects

For lists: 
   allocate object: L
   GC = 2L(M-F) -- each reachable object is moved from white to 
                   gray to black

For scanning:
   allocate object: 0
   GC = I(M-F) + MS + IF
      -- each reachable object is put on/off gray list. Every 
         object is scanned. Freed objects are put on free list.

What are values of L, I, S?
L: x->prev->next = x->next 3
   x->next->prev = x->prev 1
   x->next = y->next 2
   y->next = x 1
   x->next->prev = x 1
   x->prev = y 1
= 9 load/stores

I: x->next = stack
   stack = x
   stack = stack->next
= 4 load/stores

S: ptr += ptr->size
   if (ptr->color) ...
   ptr->color = black
= 3 loads/stores + branch = 7

lists: 2L(M-F) = 18(M-F)
scanning: I(M-F) + MS + IF = 4(M-F) + 7M + 4F = 11M

11M = 18(M-F)
18F = 7M
F = (7/18)M
i.e. when free list is around half, linked lists work better
 by avoiding scanning

for GC of 100,000 objects, we're talking about a difference
of 18(M-F) - 11M = 7M-18F: max is 7M, min is -11M
11M = 1,100,000 loads/stores, which at 100M/s = 10ms

Conclusion: linked lists impose a space overhead of at least
one pointer and may not be faster. The scanning method seems
simpler so we should use it, but perhaps we can make the
marking interface abstract enough to allow a change in 
representation someday.

GC GLOBALS

We have to mark everything starting with the "globals" that
can be reached directly. The globals include:
    symbols -- the global symbol table
    expr_stack -- the expression stack
    frame -- the current stack frame

GC ALGORITHM

(Remember that initially, all is black)

gc_start: set initial_color to white
          node_make_gray on symbols and frame
          node_make_gray on each object on expr_stack

gc_mark: take object off gray_list, set color white,
          select on type, 
          and if any referenced object is black, make it
          gray

(repeat gc_mark until gray_list is empty)

gc_mark2: mark the stack again. If gray_list is not empty, repeat
          gc_mark, otherwise proceed to gc_mark3

gc_mark3: mark_white() to take care of integers, characters, etc.

gc_sweep: scan through all pools, freeing anything that is black
          during this phase, new allocations go on gray list

gc_sweep2: initial_color = black
           change gray objects to black objects

gc_final:           

FREEING STACK FRAMES

After implementing and measuring garbage collection, I discovered
that some undesirable behavior is possible: if there are many
short subroutine calls, then many stack frames are allocated.
Perhaps the worst case is a sort routine where two stack frames
are allocated to run a simple comparison function.

Stack frames are relatively large, so it is possible to allocate
a huge amount of space before the GC frees any stack frames. This
led, in practice, to over 10MB of storage for stack frames waiting
to be freed, and dominated the memory requirements for the program.
Note that more than half of this space would have been saved if the
compiler could keep track of the maximum number of slots needed
to save the stack. For these measurements, 16 slots were allocated
and rarely used, wasting as much as 64 bytes. A typical frame
size is 96 bytes.

There are two obvious solutions: (1) have return instructions free
stack frames. This has the disadvantage that explicitly freeing
storage is not part of the current GC model, and it rules out the
possibility that someone (a debugger or closure) might be holding
a reference to the stack frame even after the function returns.
(2) implement generational scavanging to quickly return stack
frames to free lists. This has the problem of even greater
complexity and perhaps more storage requirements for additional
data structures to keep track of generations.

For now, I will implement (1), explicit freeing of stacks. It
would be nice to keep these nodes completely separate from GC,
but stack frames point to other data and must be traversed in
the marking phase. So let's assume that frames are part of the
normal heap.

After implementing (1) (and it works!), I discovered that a
significant number of stack frames are still allocated because
frames are sometimes gray when a function returns. These do
get collected eventually, but you can still end up with kilobytes
of useless stack frame-sized memory chunks sitting on free lists.
Given that you are going to have wasted stack frames on freelists,
an even better solution (I think) would be to have explicit lists
of stack frames from which to allocate. These would always be
reachable and would always be marked, and would never appear on
the free list. A disadvantage in terms of speed would be that
frames should be cleared when they go on the stack or else things
they reference would be marked. Alternatively, the mark algorithm
could be modified to treat these frames specially. Perhaps frames
could have an "in use" bit to tell the marker when to mark a frame.

Assume that the OP_RETURN instruction frees the current frame.
The frame is one of:
    Gray -- GC is running and the frame is on the Gray list.
        If we free it, then the Gray list will be corrupted.
        We cannot remove it from the Gray list, so we should
        not free it now. The GC will not free it either on this
        cycle, but the next GC will claim it.
    Black -- either GC is not running or frame has not been
        marked yet. It is safe to put node on free list, except
        that there may be one or more Gray stack frames that
        point to this one. If we free it, then we will be in 
        trouble when later the GC tries to mark it. A solution
        is to make a special case: Do not mark (make gray) nodes
        that are already free. As a sanity check, the mark
        phase should never reach a free node except through the
        previous frame pointer of a stack frame. Note also that
        if we free this node, it could be reallocated. Then we
        will have a reference from a Gray stack frame to this
        newly allocated memory. Of course, the newly allocated
        memory will be White, so it doesn't change things whether
        we mark it or not.
    White -- GC is running and the frame will not be collected.
        It is not on the Gray list. It is safe to free it.

Now, consider what happens with the debugger: The debugger gets
a pointer to the stack frame of the offending routine. These
frames are reachable via the debugger stack frame so they will not
be collected. No OP_RETURN instruction will be executed within
these frames, so they will not be explicitly freed. This is the
right behavior, so we do not need any special handling to keep
a handle on stack frames.

Any code that keeps a handle on a stack frame and returns from it,
for example code that creates a closure, will have to make sure
the return instruction does not free the frame or the frames it
references. This could be a big problem, but right now there is no
way for this problem to arise.

Another potential problem is that stores into the frame pointer
of the virtual machine are not checked. Here's the scenario:
GC starts on a tall stack and the top frame is marked at the
beginning (because the frame pointer points to it). The machine
rapidly returns a few levels. The original top frame is freed, so
it will not be marked. The inner frames of the stack have nothing
pointing to them, so they'll be black. Local variables will not
be marked and even the stack frames might be freed.

To solve this, we need to make sure that, before GC completes,
the stack is completely marked, but we still have to do this
incrementally because the stack is too big to mark atomicly.
One solution is when we return, if the top frame is BLACK, 
mark it GRAY. This will force the GC to mark it and all stack
frames lower in the stack (frames are linked via the FRAME_PREV
field). 

Alternatively (and this is the implementation), we can have the
GC check the frame field of the Machine at the end of the mark
phase. If the top frame is black, mark it GRAY and continue marking.
This is similar to how we handle the expression stack -- it gets
rescanned every time the GRAY list becomes empty until there is 
nothing left that is GRAY.

STORAGE

The storage layout is like this:
    TAG -- 1 byte
    COLOR -- 1 byte
    SIZE -- 2 bytes
    NEXT -- 4 byte pointer to next gray or free object
    <storage>

Most objects are basically vectors of pointers, so 
GC can scan them all the same way. Exceptions include
longs, doubles, strings, and code.

DATA STRUCTURES

For real-time, a limiting factor is dealing with large arrays.
A 133MHz Pentium system copies 1GB in 33s, so the copy rate is
30MB/s, or 30KB/ms, or 8K pointers/ms.

It would be nicer to use incremental structures such as lists,
but lists have the problem of linear time to access them. This
time would be greater than the copy time measured above.

Therefore, the most general structure seems to be a variable
length array, implemented by a block of contiguous memory, and
expanded by copying to a larger chunk of memory. 

There should be an expand operation that gives an explicit size.
Otherwise, arrays should grow by some multiplicative factor when
data is appended and an overflow occurs.

In order to expand without changing addresses, arrays need a header
with a pointer to the actual data.

In addition to the array type, we need dictionaries. A dictionary
will be some sort of dynamic structure with log insert, lookup, and
delete time. Hash tables are another possibility, but my concern
is the time to rehash if the table grows too large. For now, the
dictionary is just an array of alternating keys and values, and
linear search is used for lookup.

Strings: zero terminated. Nothing fancy.

Longs: keep a set of preallocated longs in a table. Otherwise,
allocate them dynamically. All "longs" are 64-bit integers. The
implementation uses 32-bit ints for things like array size, 
stack size, frame offset, etc. for efficiency.

Doubles: similar to longs, but nothing preallocated. 

Boolean: use symbol instead.  The symbols are 't' and NIL.

Symbol: implemented as a dictionary. The key in the dictionary
is also stored in the symbol object, which also stores a global
variable value pointer and a global function pointer.

File: a standard FILE* pointer stored in a node. An exception is
that standard input and standard output might be implemented
using a text control in a graphical user interface, for example,
so there must be a way to circumvent the restriction to FILE*.
In serpent, all output goes through the Machine::put_string()
method which checks to see if the FILE* is equal to (FILE *)(1).
This is a special pointer value used to denote standard output.
If so, the string is printed by calling (*Machine::put_stdout)().
The put_stdout() method is a pointer, so you can modify Machine
to provide any type of standard output. Standard input works
the same way. The initial value of put_stdout is a function that
prints to machine->mach_stdout, and the default value of this
file is C's stdout.

To send stdout to another file, replace mach_stdout with a new
file pointer. To send stdout to an object that's not a FILE *, 
replace put_stdout with a new function pointer. In general,
put_stdout should call fputs unless the file pointer is a 
special value, e.g. 1, and mach_stdout should be (FILE *) 1.
(That way, you can still redirect to another file by storing
the file pointer into mach_stdout.)

Most output from programs (e.g. the "print" statement) goes to
the file stored in the symbol 'stdout'. If this value is not of
type FILE, e.g. if the user sets it to nil, it is reset to the
value of mach_stdout and an error message is written.

Example:
    print "hello world"
    calls op_print
    calls stream_print
    calls put_string_stdout
    calls put_string on get_stdout()
    put_string calls fp->write(string, len)
    Thus fp depends on get_stdout() which returns file from s_stdout
    s_stdout is initialized to Stdio, a subclass of Vfile
       and Stdio's flush (the actual output) calls:
           *(machine->console_puts) if it is non-nil
           stdlog->puts(), stdlog->flush() if stdlog is non-nil
    machine->console_puts is initialized to
        wxs_print_string, which calls wxs_print_string_1 (if
            WXS_STDOUT_WINDOW) and fputs(s, stdout)
        stdout_print_string, which calls fputs(s, stdout)
    machine->mach_stdlog is initialized using log_enable(filename)
    
Note: srplog should be eliminated and log_enable() should be called
by including debug.srp unless dbg_nolog is set.


CODE GENERATION:

Consider the following:

class stack:
    var stack
    var top
    def init():
        top = 0
        stack = [];
    def push(item):
        append(stack, item);

When stack: is read, create a new class named stack with no super:
    Add stack to classes and create a new class object

When var stack is read, create a new variable in stack:
    Add stack to the class's symbol table with an offset of 0

Similar for top

When init is declared:
    Create a new method object. Make the signature 0.
    When top = is read: remember the variable.
    When 0 is read: write code to push 0 (small int)
    Write code to save value into top.
    When stack = is read: write code to push new array.
    Write code to save value into stack.
    Write return statement.
    (note that "id=" is treated differently than "<expr>=")

When push is declared:
    Add push to methods dictionary
    Add item to locals at offset 0
    Make number of parameters 1

When append is called:
    See if append is declared as a local or inherited method
    Since it isn't, assume that it is global: make a global call
        after pushing stack and item

RETURN VALUES

Every statement yields a value. An expression can be a statement.
The value of the last expression is returned from a function if
no RETURN statement is encountered. 

To implement this, NULL is pushed on the expression stack when
a method is called. Every statement begins with a POP and then
pushes its value on the stack.

Procedures must end with a RETURN instruction so we don't "fall
off the end".

This gets more complicated with compound statements. Consider the
following:
def fn():
    if expr:
        a
    return b
and
def fn():
    if expr:
        a
    else:
        b
and
def fn():
    if expr:
        a
           
In the first case, b is returned, so there is no need to worry
about the value of the if statement. In the second case, the
complete statement lists inside the "then" and "else" parts
generate the function return value regardless of the value of
the boolean expression. In the third case, if expr is true, a
is returned, but if false, then NULL must be pushed and returned.

Let's figure out what the code should look like for the third
case:
    [evaluate expr]
    jmpifnot L1
    [evaluate a]
 L1:jump L2
    null
 L2:return

Alternatively:
    REPLNULL -- new opcode to replace top of stack with NULL
    [evaluate expr]
    jmpifnot L1
    [evaluate a]
 L2:return

To simplify code generation, we'll use the second approach.

METHOD/PROCEDURE CALLS

A problem is knowing what we're calling at compile time:
A method could be local, but we just haven't gotten to the
declaration yet -- this could be solved with forward declarations
but do we really want these? Another case is virtual methods:
they aren't declared except in subclasses that we haven't seen
yet.  Again, declarations are required to resolve this at 
compile time. Also, do we need/want to declare all globals
routines before their use?  What if there are cross-references
between different compilation units?

Probably a better approach is dynamic lookup at runtime. So 
then there are only two kinds of call: a method call where an
object is evaluated, then parameters, then a method is named by
indexing into the class's atom table. The object is searched for
the named method. If none if found, an error is raised. 

In the other type of call, there is syntactically no object
specified, so we either use this or we execute a global routine.
Global routines are like methods, but their "this" pointer is NULL.
After other params are
evaluated, we search current object for method matching the name.
If nothing is found, we search the global routines for a match.
If nothing is found there either, we raise an error.

BUILT-IN FUNCTIONS

Consider "open()", a built-in global function returning an opened
file object. There could be a method named "open", so we must call
it as if it is a non-built-in function. A special value is stored
as the method: a Node with a pointer to a function.

BUILT-IN CLASSES AND METHODS

Built-in classes have certain methods. It will be illegal to
declare these methods without a matching number of parameters.
This will allow the compiler to check the parameter count and
pass the parameters on the stack to a function just by emiting
a particular op-code. The op-code checks the object (first
paramter) type. If it matches, the built-in method is performed.
If not, the parameters on the stack are moved to a frame and
the method is invoked in the normal fashion.

FOR LOOPS

Compiling 
    for i = expr1 to expr2 by expr3: stmts

declare i as local variable if not already declared
expr1
store to local i
expr2 ; leave target on stack
expr3 ; leave increment on stack
OP_NIL ; default value for loop
jmp looptest
looptop:
pop ; get rid of value from previous iteration
stmts ; leave last value on stack
loopincr i ; this instruction increments loop var and puts in stack
looptest:
compare loopcount and branch to looptop 
    ; stack is returnval, increment, finalcount, loopcount
    ; this instruction compares stack[-3] to stack[-2] 
    ; if less or equal, branch to looptop
breaktohere:
fordone ; moves stacktop to stack[-3], then pop3

Compiling
    for i in a: stmts

declare i as local variable if not already declared
push a
push 0
REPLNIL -- default value in case of no iterations
jmp looptest
looptop:
store local i
pop
stmts
increment
looptest:
compare tos to length of next on stack, push a[tos] and branch to looptop
breaktohere:
pop

Compiling
    while expr: stmts

loopstart:
expr
jmpifnot endloop
smts
jmp loopstart

STREAMS

A common data type is an array of strings. Since arrays can be extended,
this is an easy way to build long strings, for example to implement
repr(). The flatten() function converts this to a single string by
concatenating all members. A print function exists to print one of these
arrays of strings (for internal use only for now).

"SHORT-CIRCUIT" CONDITIONALS

The operators "and" and "or" evaluate only until a false or true operand
is found. An example nested expression is (A and B) or C:

[evaluate A]
jmpifnot orbranch
[evaluate B]
jmpif truebranch
orbranch: [evaluate C]
jmpifnot falsebranch

No elegant, simple algorithm is popping out at me, and I thought of
a fairly simple way to deal with these expressions, having the
property that expressions always leave a value on the stack, which
means there is no special case for, say, a = b or c and d.
The trick is to make special conditional branch instructions that
support the conditional operators. For example, (A and B) or C:

[evaluate A]
cndand lab_1
[evaluate B]
lab_1: cndor lab_2
[evaluate C]
lab_2: 

In this scheme, "cndand" jumps to the label if the top of stack
is false, without popping the stack. "cndor" jumps to the label 
if the top of stack is true, without popping the stack.

This executes a few more instructions than necessary, but simplifies
code generation.

DEBUGGING

Every method has an (optional) array of shorts giving the line number 
corresponding to each instruction. This doubles the code size, but is
simple to implement. Imagine how great it will be when this becomes a
real problem!

Method names can also be retrieved at run-time, so it is easy to print
where any run-time error occurs.

Beyond that, we need a stack trace and it would be even better to
examine the contents of stack frames. The initial step in this 
direction will be a simple debugger, implemented in Serpent itself.

The mechanisms include the following:

When a run-time error is raised, the error routine will look for
a global procedure named runtime_exception. This procedure should
take two arguments: a string describing the error and a pointer to
the current stack frame. Machine::execute() returns when this 
procedure returns.

Within the runtime_exeception procedure, the following built-in
functions will be useful:
    frame_previous(frame) -- returns the previous frame
    frame_variables(frame) -- returns a dictionary containing all
        local variable bindings, including parameters
    frame_pc(frame) -- returns the pc for a frame
    frame_method(frame) -- returns symbol naming the frame's method
    frame_class(frame) -- returns symbol naming the 
        frame's method's class
    runtime_exception_nesting() -- returns integer count of the
        recursive nesting level of this instance of runtime_exception
        (initially 1)


A simple debugger has been defined with the following commands:
<id> -- print value of identifier <id>
- -- go down a stack frame
+ -- go back up a stack frame
? [<n>] -- print a stack trace, optionally limited to number n levels
= <expr> -- compile and evaluate expression <expr>. For now, note that
     expressions cannot be evaluated in the context of this frame, so
     we cannot compute expressions of variables in the frame.
* -- print the values of variables in the current frame


NESTED DEBUGGING

The machine will keep a count of how many debugging levels have been
entered. It will simply return after level MAX_DEBUG_LEVEL, e.g. 10.
The nesting level can be retrieved with the builtin function
runtime_exception_nesting().

Since there is no exception-handling facility (yet), there is no
way to "catch" errors or resume at any particular point in the
program. Instead, errors cause the stack to unwind all the way
back to some place where errors are explicitly reset. This happens
now in the compiler when the compiler is compiling an "interactive"
file, i.e. it is taking input from a command line interface.
The "interactive" flag is a parameter the caller can set.

The error_flag is also reset by the OP_EVAL instruction, which
returns true if execution completed without errors and false
otherwise.

Alternatively, in a windowing system like wxserpent, all code
is run from WX Windows callbacks, and at this level, errors 
should be reset.

Errors are also reset by the error routine if the debugger is
invoked. After the debugger completes, a special instruction,
OP_DBGEND, is executed to set the error flag, reset the running
flag, and zero the runtime_exception_nesting variable. This will
cause the C++ stack to unwind.

Question: what happens to stack frames when there is an error.
They are not explicitly freed. Are they GC'd?

COMPILATION ERRORS

The compiler compiles declarations and executes expressions from a
file or from some "interactive" input, e.g. stdin (with stdout). If a
compilation error occurs, the error is reported by report_error, and 
this sets compiler->error_flag. The compilation stops at this point
and recursively called parsing functions return out to the top-level,
compile_file(). If this is an interactive file, the error is cleared
and compilation continues. Otherwise, the result (true for success,
false for error) is passed back to op_load().

We want compilation to stop at all levels when a nested load fails.
If a file loads 10 other files and the first one isn't found, the
compilation process should stop and no more files should be loaded.
To accomplish this, the compiler needs to look at machine->error_flag
when the call returns and set compiler->error_flag accordingly.

Similarly, op_load() must inspect the result of compile_file()
(returned via compile_compile()) and set machine->error_flag
accordingly.

When an error is returned, perhaps recursively through nested
compile_file() and op_load() calls, the returns stop at the
instance of compile_file() where the interactive flag is set.
At this point, the error flags are cleared and compilation 
resumes at the first expression that begins in column one.

RUNTIME ERRORS AND FLAGS

There are many flags:
    running -- tells when to stop executing instructions and return
        running is initially true. It is set to false when there is
        an error or when a call from C++ completes and it is time to
        return to the calling C++ method or function.
    error_flag -- tells that an error occurred. Since the error is
        a member variable and not a local variable, errors 
        "stick" as the stack is unwound. Error propagation stops
        when the stack unwinds all the way to a compiler with an
        "interactive" level.
    compiler->error_flag -- similar to error_flag, but this keeps
        track of compiler errors only. It is generally a bug for
        the compiler to report two errors before exiting, and this
        flag keeps track of when compilation errors are reported.
    exit_flag -- causes machine to exit, even if you are in an
        interactive loop which clears error_flag and set running

Under wxserpent, we want to turn off callbacks from WXWindows when the
debugger is invoked. Otherwise, screen refresh code might get invoked
over and over again, recursively calling the debugger. This is done by
looking at the exception_nesting variable.


TRACING

Methods will have a new field: a flag indicating whether tracing is
enabled. The field will be a 32-bit boolean (not an FVal), where true
means trace.

When called and the trace flag is set, the frame is passed as a
parameter to method_trace(frame), a global function.

BINARIES

Goal should be to make loading a binary do the same thing as loading
the source. That means there can be immediate statements in the 
binary, and we need to keep track of what's in the file.

Things that can be in the file are:
    Methods and their code
    Classes
    Immediate statements

To save a method, write:
    method name in ascii, followed by zero, then 

[to be continued -- but I think we won't worry about this
 until compiling programs is a limiting factor]

CALLING C++ OR C CODE

Functions must take fixed number of parameters (for now).
Say you have a function "foo" with int and float args.
Initialization:
    mach->create_builtin(FSymbol::create("foo", mach), 2, stub_foo);

Glue function:
void stub_foo(Machine_ptr m)
{
    FFrame f = m->frame;
    FLong_ptr p1 = f->local(0).flong->type_check(m);
    FDouble_ptr p2 = f->local(1).fdouble->type_check(m);
    if (p1 && p2) {
        int64 p1c = p1->int64_data;
        double p2c = p2->double_data;
        int res = foo(p1c, p2c);
        m->push(FLong::create(res, m));
    }
}

To register init functions, create object, pass it a parameterless function:
    Builtin_init(init_fn);

THE GLOBAL METHOD

Normally, calls are made from within a method, and the call instruction expects
there to be a defined method to which the called function will return. This
raises the question, "who calls the first method?" To answer this, we provide
a function named "<global>". The interpreter is initialized as if global is 
executing.

ADDING EXTERNAL FUNCTIONS

The interface.ser program writes C code to extend the interpreter.
C comments are added as follows:

int my_c_fn(int a, float b);
/*SER long my_boa_fn[my_c_fn](long, float) PENT*/

Explanation:
The  is given inside /*SER ... PENT*/ comments. This is conveniently
located near the C declaration in case there are changes to be made, 
but the BOA interface specification is completely independent of the 
C declaration, and it is up to the programmer to keep them consistent 
(some minimal checking will be made when the interface is compiled).

The function name as it will appear in Serpent is given first. If this differs
from the function name in C, the C name is given in square brackets.

The types in the Serpent specification are a subset of C types, as follows:
    void -- for return values only, the Serpent function will return 'nil'.
    Type      Formal  Actual
    long      long    int64
    short     short   int64
    char      char    string; string must contain one character
    string    char*   string
    double    double  double
    float     float   double  
    bool      bool    any; nil is false, else true
    anylong   long    int64 or double; value is coerced to long
    anydouble double  int64 or double; value is coerced to double
    anyfloat  float   int64 or double; value is coerced to float
    anyshort  short   int64 or double; value is coerced to short

SCANNING AND PARSING

This is much more difficult than I first imagined. The basic 
idea I picked up from Python is to turn indentation into
specific INDENT and DEDENT tokens. What causes the complexity 
are the following rules:
    (1) Lines with only comments are ignored. This means you cannot base
determine indentationI on white space -- you have to check if
the white space is followed by a comment.
    (2) In interactive mode, you cannot really ignore empty lines because
then the user could never execute a command except by starting to enter
a new one! In interactive mode, typing a completely blank line generates
DEDENTS back to the left margin, completing any multi-line entry.

To simplify things, I decided to build a two-stage scanner. The low-level
stage mostly parses tokens but doesn't deal with INDENT/DEDENT and other
details. The high-level stage, called get_token() deals with indentation,
ungetting tokens, comments, etc. Here's a sketch of the algorithm:

initially, indent_dedent_mode = false
get_token():
    if tokens were ungotten, return them without further scanning.
    while not indent_dedent_mode: // skip blank lines
        scan_token() // scan the next token candidate
        if token->typ != token_spaces:
            break
        indent = position
        scan_token() // lookahead
        if token->typ == token_comment:
            scan_token() // skip comment
        if token->typ != token_newline:
            unscan() // save token for later
            indent_dedent_mode = true
        else:
            // ignore blank or comments-only line
    if indent_dedent_mode:
        if indent < stack[top]:
            top--
            return token_dedent
        elif indent > stack[top]:
            stack[++top] = position
            return token_indent
        indent_dedent_mode = false
        scan_token()
        if token->typ == token_comment:
            scan_token() // skip comment
    return token->typ

This does not work for the following reasons:
    (1) in interactive mode, if there are no spaces, we should generate
dedents. This is fixed in ALL CAPS in the next version:

get_token():
    if tokens were ungotten, return them without further scanning.
    while not indent_dedent_mode: // skip blank lines
        scan_token() // scan the next token candidate
        if token->typ != token_spaces:
            break
        indent = position
        scan_token() // lookahead
        if token->typ == token_comment:
            scan_token() // skip comment
        if token->typ != token_newline:
            unscan() // save token for later
            indent_dedent_mode = true
        ELSE IF INTERACTIVE_MODE AND POSITION == 1:
            UNSCAN() // SAVE NEWLINE FOR LATER
            INDENT_DEDENT_MODE = TRUE
        else:
            // ignore blank or comments-only line
    if indent_dedent_mode:
        if indent < stack[top]:
            top--
            return token_dedent
        elif indent > stack[top]:
            stack[++top] = position
            return token_indent
        indent_dedent_mode = false
        scan_token()
    if token->typ == token_comment:
        scan_token() // skip comment
    return token->typ

LIST COMPREHENSION COMPILATION

List compilations have this syntax:
     [ expr for x at i in list if pred ]
or
     [ expr for i = start to end by step ]
where "at i", "if pred" and "by step" are optional.  This is 
equivalent to:
    temp = array(0)
    for x at i in list:
        if pred:
            temp.append(expr)
To avoid creating temp, we can keep the array on the stack.
Then, to accomplish temp.append(), we'll make a new special
opcode.  For the 2nd form, the equivalent code is:
    temp = array(0)
    for i = start to end by step:
        if pred:
            temp.append(expr)
We might as well relax the syntax a little and allow the
following:
    [a, b, f(x) for x in list, g(y) for y = 0 to 10 if p(y), c, d]

The problem with list comprehension compilation is shown by the
following examples:
    [x]
    [x for x in a]
Let's assume x is not declared at the point where these 
expressions are encountered. In the first expression, x is 
compiled as a global variable.  In the second expression, x 
must be a local variable. x is implicitly declared just as it
would be in a for loop. The problem is that we have to compile
the expression "x" (which might in fact be a long, complex 
nested expression) without knowing whether this is a regular
list (and x is global) or this is a list comprehension (and x
is local).

A general solution is to make 2 passes: form a parse tree, then
come back and generate code. Serpent does not form parse trees,
so this would be a major rewrite. The alternative is to save
tokens starting with the beginning of "expr" but compile "expr"
as if it is just a normal expression. If we find "for", we 
set the code location to the beginning of "expr" so that code
generation will overwrite the code we just generated, we continue
parsing the "for" and "if" parts. Then we compile "expr" from the
saved token stream. Finally, we generate an instruction to append
the computed "expr" value to the array, which is on the stack.

All this has to be recursive. Consider:
    [x for x in [y for y in [z for z in ...]]]

Or, perhaps worse, "expr" can be a list comprehension:
    [[i for i = 0 to n] for n = 5 to 10] 
which generates a list of lists of length 5 to 10. Here the
compiler will recompile "i" in the first comprehension, then
recompile the entire compreshension when the second "for" is
reached.

The "replay" data structure is a stack of structures with
instruction addresses for where the code generation started
and a pointer into a list of tokens (marked "T"):

          [ i for i = 0 to n ] for n = 5 to 10 ]
          ^ ^
+------+  | |
|   T--+--+ |
+------+    |
|   T--+----+
+------+
|   .  |
|   .  |
|   .  |

Let's rewrite the tokens before replay. The above would become

          [ i for 
          ^ ^
+------+  | |
|   T--+--+ |
+------+    |
|   T--+----+
+------+
|   .  |
|   .  |
|   .  |

When the "for" is encountered, we back up to place at the top of
the stack and save the expression tokens (in this case just "i"),
marking the stack element with "E", and change the "for" to 
"lc-for" ("list-comprehension for") which we parse as a list 
comprehension, continuing to save tokens because the stack is 
not empty. 

          [ lc-for i = 0 to n ]
          ^ 
+------+  |
|   T--+--+
+------+
|   E--+--> i
+------+
|   .  |
|   .  |
|   .  |

When we reach the end of the lc-for-if expressions, insert the
expression on the top of the stack into the tokens:

          [ lc-for i = 0 to n lc-expr i ]
          ^ 
+------+  |
|   T--+--+
+------+
|   .  |
|   .  |
|   .  |

The next token is "for" so again we back up and replay, this
time from the next top of stack:

          lc-for n = 5 to 10 ]
          ^ 
+------+  |
|   T--+--+
+------+
|   E--+--> [ lc-for i = 0 to n lc-expr i ] 
+------+
|   .  |
|   .  |
|   .  |

          lc-for n = 5 to 10 lc-expr [ lc-for i = 0 to n lc-expr i ] ]
          ^ 
+------+  |
|   T--+--+
+------+
|   .  |
|   .  |
|   .  |

Finally, this sequence will parse and the stack can be popped.

NETWORKING

If NETWORK is defined, then Serpent is compiled with functions 
that allow network communication.

OUTPUT, STDERR, STDOUT, ETC.

Some general goals: error messages from Serpent should go to
stderr.

Other general output should go to stdout.

Both stdout and stderr should be initialized to correspond to Unix
stdout and stderr, but stdout and stderr should be global variables
that the application (in Serpent) can reassign to redirect output.

Since wxserpent on Windows does not have console output, the default
stdout and stderr should be duplicated to stdlog (if defined) and also
with _CrtDbgReport() on Windows to get output into the debugger window.

The "real" stdout and stderr will be represented by instances of Stdio
which will flush to machine->console_out and machine->console_err,
respectively, and also to _CrtDbgReport() on windows and to stdlog if
the file exists.

machine->console_out and machine->console_err output to stdout and
stderr, but since they are function pointers, they can be replaced,
e.g. to direct output to a window in wxserpent.

-----

The interpreter can print debugging info depending on flags.

There are two sources of console-like output: Serpent's stdout
file, which is directed to the console and debug trace information.

Both are handled in about the same way; differences are:
 2) If dbg_trace_output_disable is set (in Serpent), then under
    serpent, stdout is not copied to CrtDbg, and under
    wxserpent, stdout is not copied to stdout

Implementation: The initial stdout in Serpent is special and implemented
by a Stdio. All other files in Serpent are implemented by Vfile and go
directly to standard C FILEs using fputs, etc. So writes to Stdio mean
"console output" and the actual write operation is performed by
Stdio::flush as follows:
  1) output goes to console.
     A) in serpent, this is a simple function that calls fputs
     B) in wxserpent, this is wxs_print_string, which echoes to stdout
  2) if Windows and !dbg_trace_output_disable, output goes to CrtDbg console
  3) if srplog, output is written there
  4) if log_enable created a log file, output is written there

Other output is sent from internal functions to Machine::trace, which does:
  1) output goes to console:
     A) in serpent, this is a simple function that calls fputs
     B) in wxserpent, this is wxs_print_string, which echoes to stdout
  2) if Windows, output goes to CrtDbg console
  3) if srplog, output is written there


Output from Machine::trace goes to:
       CrtDbg console (Visual C++ debugger output window)
       stdout
       if a GUI text window is available, copy output there too


Summary of calls:
  Serpent print (op_print) calls
  Machine::stream_print calls
  put_string_stdout gets Vfile for stdout and calls
  put_string (which could be eliminated) calls
  Vfile::write calls either
    Cfile::putc which uses fputc
  OR
    Vfile::putc calls Vfile::flush calls
    (*Vfile::put_stdout)(buff, machine) calls
    wxs_print_string calls
    wxs_print_string_1 which handles output to wxWindow

  Machine::trace (used like printf) calls
    CrtDbgReport
  AND
    fputs(msg, stdout)
  AND
    fputs(msg, stdlog)

THREADS

(see sthread.h for many details.)

Currently running thread is machine->thread. Threads have links
forward and backward (next_runnable, prev_runnable).

Threads also have  (next_sibling, prev_sibling) links to maintain
a circular list (set) of children, a children link to one of the
children, and parent link from each child thread to its parent.


machine->thread ----------+
                          |
                          v
       +-- thread <===> thread <===> thread <===> thread --+
       |      ^                                     ^      |
       |      |                                     |      |
       +------+-------------------------------------+      |
              |                                            |
              +--------------------------------------------+

UNICODE

In 2021, Serpent began to adopt Unicode. Internal representation for
strings is similar: strings are either short strings with a length
count (byte) a zero terminator, and up to 4 bytes, stored in a single
64-bit SVal, *or*, regular strings are in the heap stored with:
        64-bit header with pointer, tag, color, and slots fields
        32-bit slot count (allowing this to be a long node)
        32-bit byte count (exact string length)
        32-bit char_len (number of UTF-8 characters in string)
        32-bit char_index (cached character index)
        32-bit byte_index (cached byte offset for char_index)
(Overhead per string is 28 bytes. If we did UTF-32, we'd eliminate
the need for 12 bytes of overhead and indexing would be linear, but
the space requirement is greater for all ASCII strings since our UTF-8
scheme can put 4-char ASCII strings into a single 64-bit SVal.)

Strings are always stored as valid Unicode UTF-8 codes, normalized
using NFC. This means that:
  - strings can no longer contain arbitrary binary. E.g. we need a new
        way to read binary file data: Probably read them into arrays of 
        ints, with 6-bytes per integer
  - all incoming strings must be checked and normalized
  - concatenated strings must be re-normalized
  - string comparisons of "logically" equivalent strings may not
        support human-oriented searching, e.g. the ligature "fi" will
        not match the 2-character pattern "fi". This requires 
        "compatibility" conversions such as normalization form NFKC.

Serpent has a nfkc() function to normalize NFC to NFKC.

SERPENT SOURCE CODE AND UNICODE

We can allow unicode characters (UTF-8) in Serpent literal strings:
  - existing code probably works except escape character "\" applied
        to a unicode character should yield the entire unicode 
        character.
  - need to apply NFC normalization to content of literal strings
Serpent identifiers should be restricted to prevent visual aliases
        where two symbols are different but visually indistinguishable.
  - identifiers are formed by NFKC normalization
  - symbols are also formed by NFKC normalization
  - PI can be represented by the greek symbol! 

UNICODE SUPPORT FUNCTIONS

We need functions for:
  - find next character
  - find previous character
  - check range of bytes for valid Unicode string
  - apply compatibility conversion
  - normalize Unicode string to NFC, options are:
    o libunistring -- good, but uses malloc and free
        not clear what to do with allocated results
        source code is large, but probably could substitute Serpent
            memory allocation for malloc/free
    o ICU -- seems to use UTF-16
    o utf8proc -- see utf8proc_NFC()
        + smaller, simpler, cleaner, but still uses malloc/free
        + see libmojibake, might be newer/better