6 September 1990


Formal Properties of an Algebra for Subsequences



Wilfred J. Hansen
Information Technology Center
Carnegie-Mellon University
Pittsburgh, PA 15213

Abstract.
This paper formally defines an algebra for sequences that has been shown elsewhere to have a number of desirable properties as a programming language component, especially for dealing with multi-media text. Unlike other algebras, the primitive values are not themselves objects, but are references to subsequences of underlying sequence objects. These subsequence reference values, called subseqs, are the domain of four functions: next(), start(), extent(), and base(). The paper shows that, in conjunction with standard control constructs, these functions are necessary and sufficient for all computations over subsequences. The functions are defined by means of composite symbolic expressions containing variables for both sequence elements and the end points of subseq values; the resulting proofs are remarkably straightforward. By adding the replace() function, the algebra can incorporate mutable values, though the formal system then requires the notion of the set of all subseqs referring to any given base. Note that the algebra is not a complete computational engine as would be a set of rewriting rules; nonetheless, the final section shows that regular expression pattern matching--a frequent task of rewriting rules--is easily expressed in the algebra.



1. Introduction
Sequences are fundamental to computing: a file is a sequence of records or characters, a record is a sequence of fields, a field is a sequence of characters. In numerical applications data in arrays are often processed sequentially. It is surprising, then, that little effort has been devoted to algebras for sequences. This paper describes formally one such algebra, demonstrates a proof mechanism for it, and shows that the algebra is necessary and sufficient for all computations over sequences. Later sections extend the algebra to mutable values and give examples.

The key feature of the algebra is that each value is not itself a sequence, but is instead a reference to a portion of an underlying sequence called its base; thus the value is an indivisible composite of a reference to the base and the location and extent of a subsequence within that base. In Figure 1, the base is "'Twas brillig and the slithy toves" and variables m, n, and p refer to subsequences. This type of data value is called a subsequence reference, or subseq, and can refer to as much as the entirety of its base or as little as the empty subsequence between two elements. Note that two empty subseqs are not equivalent unless they are at the same positions in the same base.

Figure 1. Illustration of three subseq values. The base sequence is "'Twas brillig and the slithy toves"; m refers to the subsequence "s bril", n refers to the empty subsequence between the first two letters of "the", and p refers to "toves".

The many advantages of substring references for programming--especially with multi-media strings--have been discussed elsewhere [Hansen, 1990a]. The strongest advantages are seen in returning values from functions, where the single substring reference value incorporates both the position and length of a substring. A substring isolated by one function can be passed directly as an argument to another function and that function can access not only the substring but also subsequent or prior characters of the source. Consider searching and parsing algorithms. With these, it is often desirable that the outcome of a search indicate the location of the result as a start for further processing, as for example in a loop that scans through the characters of a source string extracting a token at each iteration.

Unfortunately, in non-garbage-collected languages like C it is not possible to implement the subsequence reference algebra as a simple subroutine package without using unbounded memory. It is simple to define a struct containing the position and length so such values can be stored or passed to or from functions. However, if a function creates such a value and returns it, there is no good way to ensure that the storage allocated for the struct is ever deallocated.

The variety of sequence algebras and string facilities in existing programming languages is reviewed in [Hansen, 1990a]. Especially notable are the string processing languages SNOBOL4 [Griswold, 1971] and Icon [Griswold, 1983], neither of which implement subsequence references as a datatype. PostScript [Adobe, 1985] and Common Lisp [Steele, 1984] do implement subsequence references, but the three of the four functions of the subsequence algebra below cannot be expressed in these languages. It must be noted that all four languages offer extensive string processing capabilities beyond the scope of the algebra described here. It is not the intention of this algebra to supplant these capabilities, but rather to suggest a data type for strings on top of which other semantics can be added.

Formal treatments of string values sometimes present as their algebra the free monoid with concatenation. This adequately generates all string values, but cannot express substring references. The computation model in such treatments is often a set of rewriting rules with strings as the data values and the rules as the only mechanism; there are no if's, while's or functions. In contrast, the subsequence algebra is a representation for strings designed to be consistent with other data values in a programming language; the computational engine is the ordinary tools of if-then-else, while-do, and functions.

One string representation common to many programming languages is an array of characters with integers or pointers to indicate position and extent of substrings. From an algebraic standpoint this model is inferior because it requires an additional domain--integers or pointers--as well as the domain of sequences. In practice, the array/integer model of strings suffers from other deficiencies: Characters do not always fit in a single byte, or even two bytes, so array subscription may not access distinct characters; string values are limited in length to predeclared storage sizes; and it is not always intuitive--especially when modifying a program--as to whether an integer i refers to the i'th character or the preceding or following inter-character position. When a substring is to be returned by a function under the array/integer model the return value can easily be a reference to the start of the substring, but the length of the substring must be returned through some side-effect like a global variable or modification of an additional parameter to the function.

The icon language offers an entirely different paradigm for reference to substrings. The question mark operator establishes its left operand as a subject string to be scanned and its right operand is a statement (usually a compound statement) which advances an implicit pointer through the subject string. This is a dynamic notion of substring since it depends on the entire state of the computation to determine the current substring. To store a substring or pass it as a value, the program can either make a copy of its contents or store two references, one to each end of the substring. Neither is a single value referring to the substring within its base.

The subsequence algebra has been implemented three times so far. First in a research language [Hansen, 1987], then as part of the cT system for educational computing [Sherwood & Sherwood, 1988], and most recently in the Ness language [Hansen, 1989; Hansen, 1990b] as part of the Andrew system [Morris, et al., 1986]. (In these past implementations the declarator for subseq values is "marker".) A subset of Ness will be used for programming examples below. In addition to the functions of the algebra, the subset utilizes subseq declarations, assignment, if-then-else, while-do, predicates, compound statements, function definition, call (with call-by-value parameters), and return. Spare as it may be, this subset was sufficient to write a preprocessor from itself into C.

Section 2 defines the fundamental primitives, applications of which are shown in Section 3 where they are used to define other functions to access subsequences. Section 4 establishes that the algebra is sufficient to compute not only all subsequences of a given sequence, but also all functions. Section 5 shows how the algebra can be extended to cover modification of strings and section 6 presents a regular expression parser as an extended example. Rather than axiomatic definitions and proofs, subseqs are defined syntactically and proofs are done with symbol manipulation. This proof technique, and the elegance of the algebra, combine to render most proofs quite straightforward; indeed, the goal of this paper is not so much to prove results about the algebra, but rather to demonstrate its potential for simple expression of algorithms.

2. Subsequence Primitives
The referent of each value in the algebra is an underlying sequence of elements called its base. An element may be an object of any sort; for examples we will use the alphabet of ASCII characters as the elements.

Definition: The alphabet is the infinite set including alphabetic characters, numeric values, references to objects, and even references to subsequences.

In practice it is expected that the elements of the alphabet have "small" representations; large objects can be incorporated by storing references to them.

Definition: A sequence is an ordered collection of elements from the alphabet: a0, a1, a2, ... .

The algebra does not require that sequences be finite; the implemntation of a sequence could include a generator to produce tail elements on demand as they are accessed. This paper does not explore this potential, however, and treats all sequences as finite.

Each value in the algebra is itself a reference to a subsequence of a base sequence, so we call the value a subsequence reference value or subseq. For the formal exposition in this paper, the notation for subseq values must represent both the base sequence and the position within it of the referenced subsequence. We write base sequences in angle brackets like this

< s1 c >

Identifiers within the angle brackets are meta-variables which refer to individual elements (a, b, c, ...) or to sequences of zero or more elements (p, q, r, s1, s2, ...) . Quoted strings represent themselves. In most cases a subseq representation will also contain square brackets:

Definition: A subsequence reference representation is written as

< s1 [ s2 ] s3 >

where each si is a sequence of zero or more elements. The entire sequence < s1 s2 s3 > is the base. s2 is the referenced portion. If s2 has length zero, the subseq is empty.

It may appear that the definitions below require copying sequences; the intended interpretation--and the implementation of Ness--does not require copying sequences for any of these functions other than concatenation and replace().

Traditional programming constructs are described as follows for subseq values. In the descriptions, the construct is followed by an arrow, =>, and then the resulting value as a subsequence reference representation.

Constants return a subseq for the entire sequence. For character strings this is:

"some text" => < [ "some text" ] >

Concatenation (written with "~") constructs a new sequence by juxtaposing the referenced portions of the arguments (that is, the portion inside the square brackets) and returning a subseq for the entire new sequence.

< s1 [ s2 ] s3 > ~ < s4 [ s5 ] s6 > => < [ s2 s5 ] >

Comparison of subseq values is defined to be sequence comparison of the referenced portions. This operation is well defined only when there is an appropriate ordering relation defined for elements of the sequences, such as lexicographic ordering of characters. For any relational operator relop:

< s1 [ s2 ] s3 > relop < s4 [ s5 ] s6 > => < s2 > relop < s5 >

Assignment of subseq values assigns the subseq value, it does not copy the sequence. After the assignment

a := x

if the expression x has the subseq value < s1 [ s2 ] s3 > then a and x will each have that value after the assignment; they are separate values, but both refer to the same underlying base.

Printing a subseq value prints the referenced portion.

print(< s1 [ s2 ] s3 >) => <s2 > is printed

Declarations in the examples will utilize the keyword subseq:

subseq var1, var2, ...

Parameter passing is call by value and passes the entire subseq value, including the base. Thus if f has a formal parameter m and is called with f(p) where p is < s1 [ s2 ] s3 > then f can access any portion of the base of p. All parameters of functions defined in this paper are subseq values and parameter declarations are omitted.

The algebra of subsequence expressions has the following primitive operations. The effects of these functions on a typical character string are illustrated in Figure 2.


Figure 2. The four primitive functions. The base sequence for all examples is as in Figure 1. The subseqs below the base show the result of applying the various functions to m, n, and p.

Start(m) returns a subseq both of whose limits are at the beginning of the argument:

start(< s1 [ s2 ] s3 >) => < s1 [] s2 s3 >

Next(m) returns a subseq for the element following the argument. If the argument was at the end of its base, next() returns the end of the base.

next(< s1 [ s2 ] c s3 >) => < s1 s2 [ c ] s3 >
next(< s1 [ s2 ] >) => < s1 s2 [] >

Base(m) returns a subseq for the entire sequence surrounding the argument.

base(< s1 [ s2 ] s3 >) => < [ s1 s2 s3 ] >

With these first three functions, an expression for the first character of the base of subseq m is next(start(base(m))). In Figure 2 this would have as its value a reference to the initial apostrophe in the string.

Extent(m, p) produces a subseq extending from the beginning of its first argument to the end of its second argument. If the second argument ends before the start of the first, the result is an empty subseq at the end of the second argument. Extent() expects that its arguments be on the "same" base, that is, a base generated by a single constant or concatenation. If the bases differ, extent() returns a unique empty constant.

To formally define extent() we abuse the notation slightly by omitting some square brackets; missing right (or left) brackets appear somewhere to the right (or left) of the corresponding left (or right) bracket which does appear.

extent(< s1 [ s2 s3 >, < s1 s2 ] s3 >) => < s1 [ s2 ] s3 >
extent(< s1 s2 [ s3 >, < s1 ] s2 s3 >) => < s1 [] s2 s3 >
extent(< s1 [ s2 ] s3 >, < s4 [ s5 ] s6 >)
=> < [] >, where < s1 s2 s3 > /= < s4 s5 s6 >

As written, the formal definition does not require identical bases for the two arguments, only that the base sequences be equal. There is no way in the algebra to distinguish two subseqs on the same base from two subseqs on different but equal bases. However, implementations of extent() may return the empty sequence unless the two bases are indeed physically one.

Referring again to Figure 2, an expression for " slithy "--with spaces--is

extent(next(next(next(n))), start(p))

because next(n) is "h", next(next(n)) is "e" and next(next(next(n))) is the space before slithy. As an example with more general applicability, an expression to compute all but the first element of a subseq t is extent(next(next(start(t))), t).

The following example utilizes the algebraic operations introduced above. ExpandTabs() copies an input string and produces a modified version with tab characters replaced with enough spaces to move subsequent text to appropriate column positions. The main loop sequentially processes each character of the input, generating the output in variable new. For each cycle, c refers to the current character and m is the remaining unprocessed text. Each time a tab is encountered, its place in the new string is filled with extent(tab, base(eight)), which is the right number of spaces because variable tab is advanced by one character for each non-tab character and replenished with eight spaces when exhausted.

-- ExpandTabs(m) -- return a copy of m with tabs expanded to spaces so -- subsequent text is at position 9, 17, 25, 33, ... -- function ExpandTabs(m) subseq c -- current source character subseq result -- the output string subseq eight -- eight spaces subseq tab -- if c is a tab, replace it with -- extent(tab, eight) result := "" -- initial output is empty eight := next(start(" ")) -- 1st of 8 spaces tab := eight -- distance to 1st tab while m /= "" do c := next(start(m)) -- first character m := extent(next(c), m) -- rest of text if c = "\t" then -- tab: output the right number of spaces result := result ~ extent(tab, base(eight)) tab := eight -- restart tab cycle elif c = "\n" or next(tab) = "" then -- newline or tabstop result := result ~ c -- copy char tab := eight -- restart tab cycle else -- non-tab result := result ~ c -- copy char tab := next(tab) -- shorten tab end if end while return result end function

Note the absence of arithmetic for determining the current output position. As an exercise the reader is invited to write a version of ExpandTabs() that keeps track of tab position with integers. As another exercise, rewrite the function to do concatenation to result only when a tab is found or the end of the input is reached.

In order for a set of functions to be teachable and memorable, they should be based on an underlying set of functions that are as small as possible. The next section will present a number of additional subsequence functions all of which can be defined in terms of the above four primitives. To show that none of the four is redundant we have:

Theorem 2.1 (Necessity): The four primitive functions--start(), next(), base(), and extent()--are all necessary for computation with the algebra. That is, none can be expressed as a functional composition of the others.

Proof: We wish to demonstrate for each of the primitives, P, that there is no expression E which behaves as defined above for P and yet does not contain P in the expression E. We do this by exhibiting a particular value, v, and argue for each primitive that no such expression E exists that converts this value appropriately, and therefore no expression E exists which implements P for all arguments. The particular value v is

< "ab" [ "cd" ] "ef" >

Note first that

next(v) = < "abcd" [ "e" ] "f" >,
start(v) = < "ab" [] "cdef" >, and
base(v) = < [ "abcdef" ] >.

For extent() no E can exist because the value returned may have to be of any length and the other functions return only subseqs of zero, one, or all the elements of a base.

For base() no E can exist because none of the other functions can otherwise generate a subseq beginning before the left bracket of v.

For next() no E can exist because no other function can create a subseq that ends one character after the end of v.

For start(), note first that base() and next() return values that terminate at or after the end of their argument. Extent() returns a value which ends where its second argument does, but since that argument can only be derived from v, next(), base(), or extent(), there is no way it could be prior to the termination of v.

Since none of the functions can be expressed in terms of the others, all are necessary. QED

Experiments with various alternative sets of primitive functions have shown none more convenient for programming than the set above. For instance, by symmetry the algebra could be defined with previous() instead of next() or finish() instead of start(); but either would emphasize right-to-left processing instead of the more natural left-to-right processing. Next() could return an empty subseq after the following element, but algorithms often require elements rather than empty subseqs.

There is an algebra which is formally simpler than the one above, requiring three primitive operations instead of four. It utilizes next() and extent() and a third function we can call startofbase():

StartOfBase(m) returns a subseq for the empty sequence at the beginning of the entire sequence surrounding the argument.

startofbase(< s1 [ s2 ] s3 >) => < [] s1 s2 s3 >

Implementation of base(x) with the three primitives first has a loop to find e, the end of the base, and then returns the value extent(startofbase(x), e). Start() is less obvious:

function start(m)
subseq p -- an empty subseq at or before start(m) p := startofbase(m) while extent(p, m) /= m do
-- inequality means p precedes start(m) p := next(p) p := extent(next(p), p) -- p is still empty, -- but is one element further along in m
end while return p -- equality means p = start(m)
end function

The set of three functions, however, is less convenient for programming than the four primitives presented earlier.

3. Subsequence Functions
Given a subseq, many algorithms require computation of subseq values for nearby subsequences; within the algebra the expressions for such computations are usually concise. This section presents formal definitions for a number of nearby subsequences, exhibits expressions to compute the desired subseq, and proves that that expression does indeed have the desired value. As a first example, we compute finish(), the analog of start() which is the empty subseq following a subseq value.

finish(m) - Returns a subseq for the empty subsequence just after m:

finish(< s1 [ s2 ] s3 >) => < s1 s2 [] s3 >.

Lemma 3.1: An expression for finish(m) is start(next(m)).

Proof: If s3 is non-empty it contains a first element c and a continuation s4:

start(next(< s1 [ s2 ] c s4 >))
= start(< s1 s2 [ c ] s4 >)
= < s1 s2 [] c s4 >
= < s1 s2 [] s3 >

while if s3 is empty we have:

start(next(< s1 [ s2 ] >))
= start(< s1 s2 [] >)
= < s1 s2 [] >
= < s1 s2 [] s3 >

In both cases the expression computes the correct value. QED

Algorithms frequently need single element subsequences analogous to next, but in some other position relative to a given subseq; for example, the first element before the subseq and the first and last elements within the subseq. We begin with the single element subseq which starts at the same place as its argument:

front(m) - Returns a subseq for the first element after start(m), if there is one, otherwise m must be empty at the end of its base and this value is returned:

front(< s1 [ c s2 ] s3 > => < s1 [ c ] s2 s3 >
front(< s1 [] c s3 >) => < s1 [ c ] s3 >
front(< s1 [] >) => < s1 [] >

Lemma 3.2: An expression for front(m) is next(start(m)). (Proof omitted.)

Front() returns a single element regardless of whether its argument is the empty string, but sometimes it is preferable to have a function first() which is empty when its argument is. To define first() it is easiest to begin with rest(), a function to compute all but the first element. The implementation of this function exploits a detail of the definition of extent().

rest(m) - Returns a subseq for all elements of m other than the first; but if m is empty, so is rest(m):

(i) rest(< s1 [ c s2 ] s3 >) => < s1 c [ s2 ] s3 >
(ii) rest(< s1 [] s3 >) => < s1 [] s3 >

Lemma 3.3: An expression for rest(m) is extent(next(next(start(m))), m).

Proof: The proof of (i) has three cases depending on the lengths of s2 and s3. When s2 has at least one element we write s2 as < c2 s4 > and the proof proceeds thus:

rest(< s1 [ c c2 s4 ] s3 >)
= extent(next(next(start(< s1 [ c c2 s4 ] s3 >))),
< s1 [ c c2 s4 ] s3 >)
= extent(next(next(< s1 [] c c2 s4 s3 >)), < s1 [ c c2 s4 ] s3 >)
= extent(next(< s1 [ c ] c2 s4 s3 >), < s1 [ c c2 s4 ] s3 >)
= extent(< s1 c [ c2 ] s4 s3 >), < s1 [ c c2 s4 ] s3 >)
= < s1 c [ c2 s4 ] s3 >

In the second case s2 is empty and s3 has one or more elements, while in the third case s2 and s3 are both empty. In both cases the subseq is reduced to a empty subsequence at its former end. These cases can be verified by an argument similar to that of the first.

For part (ii) we observe that the first argument to extent() is next(next(start(m))), which cannot yield a subseq starting before the beginning of m and the second argument is m, which ends at the end of m. Since m is empty, the extent() must yield a subseq equivalent to m. QED

first(m) - Returns the first element of m, but if m is empty, so is first(m):

first(< s1 [ c s2 ] s3 >) => < s1 [ c ] s2 s3 >
first(< s1 [] s3 >) => < s1 [] s3 >

Lemma 3.4: An expression for first(m) is extent(m, start(rest(m))). (Proof omitted.) QED

last(m) - Returns the last element of m, but if m is empty, so is last(m):

(i) last(< s1 [ s2 c ] s3 >) => < s1 s2 [ c ] s3 >
(ii) last(< s1 [] s3 >) => < s1 [] s3 >

Lemma 3.5: Last(m) is computed by this function:

function last(m) if rest(m) = "" then return m else return last(rest(m)) end if end function

{Note: Although this definition requires a recursion, when the implementation of the algebra stores elements in contiguous memory last() can be computed more efficiently by scanning backward in the base sequence.}

Proof: Case (ii) follows trivially from the definition of rest().

Case (i) must be proved by induction on the length of m. If m has one element, then rest(m) is "" by the definition of rest(), so m is correctly returned as its own last element. If m has more than one element, we write it as < s1 [ c2 s2 c ] s3 >, the else clause is executed and we return last(< s1 c2 [ s2 c ] s3 >). By the inductive hypothesis, the latter expression gives the correct value. QED

As a tool for later sections and as an example of first() and rest(), consider a simple token scanner. The function token(s, chset) below will scan forward through the subseq s and find the first token which is composed of characters from chset, where chset represents the set of characters e3ach of which appears at least once anywhere in chset. The basic scheme is to look through the subseq for an element which is in chset and then find all adjacently succeeding characters also in chset. It is a convention for pattern matching functions that the range of the search is over s if s is non-empty or otherwise over extent(s, base(s)); it is another convention that the failure value is finish(s).

-- token(s, chset) -- Search s for a segment composed of elements of chset. -- If s is initially empty, search over extent(s, base(s)). -- Return a subseq for the entire segment. -- If no instances of characters from chset occur in s, -- the function returns finish(s). -- function token(s, chset) subseq t, startsegment t := first(s) if s = "" then s := extent(s, base(s)) end if while extent(t, s) /= "" and not isIn(t, chset) do t := next(t) end while startsegment := start(t) while extent(t, s) /= "" and isIn(t, chset) do t := next(t) end while return extent(startsegment, start(t)) end function -- isIn(c, chset) -- Checks to see if c is one of the elements of chset. -- Returns True if found and False otherwise. -- boolean function isIn(c, chset) while chset /= "" do if c = first(chset) then return True end if chset := rest(chset) end while return False end function

4. Sufficiency
With the aid of the subsequence functions, it is possible to write an expression for any subsequence of a sequence. To demonstrate this, consider the set of all subsequences of a sequence. This set consists of each instance of a subsequence starting at one position in the sequence and continuing to the same or another, later position. Here are functions to print all the subsequences of a sequence:

function printsubsequences(s) printsubsub(s, s) if s /= "" then printsubsequences(rest(s)) end if end function function printsubsub(t, s) print (extent(s, start(t))) if t /= "" then printsubsub(rest(t), s) end if end function

It is not difficult to show that printsubsequences(s) prints all subsequences of s. We begin with a lemma.

Lemma 4.1 (Tail Recursion): If P is an operation and Q is a sequence of zero or more variables each preceded by a comma, then the function f() defined by

function f(x Q) P(x Q) if x /= "" then f(rest(x) Q) end if end function

performs P once for each tail of x, including the final empty subsequence. That is, if x is < s1 [ c1 c2 . . . cn ] s3 > then P is executed for each of

< s1 [ c1 c2 . . . cn ] s3 >,
< s1 c1 [ c2 . . . cn ] s3 >,
. . .,
< s1 c1 c2 . . . [ cn ] s3 >, and
< s1 c1 c2 . . . cn [] s3 >.

Proof: If x is < s1 [] s3 >, then n is zero and P is executed once; the then clause is not executed because x = "". When n > 0 we argue by induction. A call to f() evaluates P once for the current value of x and then calls f recursively for rest(x). By the definition of rest, rest(x) is a tail of x so n is one less and the general case holds by induction. QED

Lemma 4.2: The call printsubsub(s, s) prints all subsequences of s that begin at start(s).

Proof: By the Tail Recursion Lemma, printsubsub(s, s) executes print(extent(s, start(t))) for t being each tail of s. This is exactly the subsets of s that begin at the beginning of s. QED

Lemma 4.3: The call printsubsequences(s) prints all subsequences of s.

Proof: By the Tail Recursion Lemma, printsubsequences(s) executes printsubsub(s, s) for s being each tail of the initial s. By the preceding Lemma, this call prints the subsequences beginning at each position within s, which is the entire set of subsequences. QED

Theorem 4.1 (Sufficiency): The subsequence algebra is sufficient to generate all subsequences of the base of a sequence.

Proof: By the preceding Lemma, if s is a sequence, the call printsubsequences(base(s)) will print all subsequences of the base of s. Since they are printed, they must have been generated. Since only the functions of the algebra have been used to operate on sequence values, those functions must be sufficient. QED

Note that the Sufficiency Theorem proves that all subsequences can be generated, but not that any specified subsequence or set of subsequences can be generated. The simplest way to do this is by showing that the algebra can simulate a Turing machine:

Theorem 4.2 (Universal Computability): The subsequence algebra is sufficient to compute any computable function.

Proof: By Church's Thesis, computable functions are those which can be computed by a Turing Machine or any equivalent machine. One such machine is the Two-Counter machine which requires simply two (very) large numbers to represent the state and the tape of a Turing machine. Since sequences in the algebra are of unlimited length, two of them can be utilized to simulate two counters--for example by a unary encoding--and thus a program in the algebra can simulate a Turing Machine. QED

If we pretend that efficiency matters for a Turing Machine, the simulation can be done more directly by utilizing a sequence for the tape with a subseq as the read head. The requirement for an infinite tape is, as usual, avoided by encoding only that finite portion of the tape which has or has had non-zero values. Each of the four operations of the Machine--write a one, write a zero, move left, and move right--can be implemented as a function with the tape as argument and producing a new tape. In order to have a subseq value on the new tape in the same position as on the old, we need a function to do the equivalent of copying a subseq along with its base:

-- nextPerSubseq(m, p) -- return a subseq for the single element c from base(m) -- such that the number of elements in extent(m, start(c)) -- is equal to the number of elements in p {or if p is -- too long, then c is next(base(m)) -- function nextPerSubseq(m, p) m := next(start(m)) while p /= "" do m := next(m) p := rest(p) end while return m end function

Here are the functions for the first and last of the four Turing machine operations; the other two are similar.

function WriteAOne(t) subseq allprevious := extent(base(t), start(t)) return nextPerSubseq(allprevious ~ "1" ~ extent(next(t), base(t)), allprevious) end function

function MoveRight(t) if next(t) /= "" then return next(t) else -- extend the tape with a new zero return nextPerSubseq(base(t) ~ "0", base(t)) end if end function

In both functions nextPerSubseq is employed so the newly created value that is returned will refer to the correct position on the tape. For even more efficiency--as long as we are pretending it matters, it would be better if we did not copy the entire tape at each iteration but instead modified the existing sequence. To do so requires that the algebra be extended as described in the next section.

5. Modifying Base Sequences with Replace()
In some practical applications, such as interactive text editors--and in some impractical applications like Turing Machines--it is desirable to be able to modify base strings. For this we extend the algebra with the function replace(x, y) which modifies the base of x. Replace() subsumes insertion and deletion: for insertion, an empty subseq is replaced with non-empty text; for deletion, a non-empty subseq is replaced with empty text. Together with a second new function, newbase, it is also possible to define concatenation in terms of more primitive operations.

Without extending the notation we can informally define replace() as:

replace(< s1 [ s2 ] s3 >, < s4 [ s5 ] s6 >)

=> < s1 [ s5 ] s3 >,
if < s1 [ s2 ] s3 > is not a constant
=> Error, otherwise

Replace(x, y) modifies the base of the subseq value x so the subsequence referred to by x now contains the referenced portion of string y instead of its former value. The value returned is a subseq for the new copy of the replacement value. All other subseq values on the same base as x are adjusted appropriately, as defined below.

This definition is inadequate because it does not describe the effect on other subseq values that refer to the same base. Informally we can say that replace(x, y) affects other subseqs on base(x) as though the value of y is inserted at the end of x and then x is deleted. Thus subseqs that begin after x are shifted along so they still refer to the same underlying text as they did and subseqs that span x will have new contents for the portion that was x. For subseqs which end at the end of x there are four cases, depending on whether each of x or the other subseq is empty or not; the action for each case is described in Table 1 and illustrated in Figure 1.
x is empty x is non-empty
other is empty other precedes insertion other follows insertion
other is non-empty other includes the insertion

Table 1. Adjustment of other subseqs that end where x does. The operation performed is replace(x, y). 'Other' is some other subseq value on the same base as x and ending where x ends. The contents of each cell describe the relation of 'other' to the inserted copy of y.

Rather than theoretical considerations, the basis of Table 1 is practical experience with the algebra. In most cases it has seemed preferable that empty subseqs remain empty after a replace. An empty subseq often indicates a position to begin further processing, so when the x and the other subseq are both empty the other is left preceding the insertion so the insertion will be processed. But when x is non-empty the other subseq originally followed x and should still follow the replacement.


Replace "EFGH" in abcd EFGH ijkl with "XYZ"

The subseq a / bcd / EFGHijkl becomes a / bcd / XYZijkl
The subseq a / bcdE / FGHijkl becomes a / bcd / XYZijkl
< The subseq a / bcdEFGH / ijkl becomes a / bcdXYZ / ijkl
The subseq a / bcdEFGHi / jkl becomes a / bcdXYZi / jkl
The subseq abcd / EFG / Hijkl becomes abcd / / XYZijkl
< The subseq abcd / EFGH / ijkl becomes abcd / XYZ / ijkl
The subseq abcd / EFGHi / jkl becomes abcd / XYZi / jkl
The subseq abcdE / FG / Hijkl becomes abcd / / XYZijkl
< The subseq abcdE / FGH / ijkl becomes abcd / XYZ / ijkl
The subseq abcdE / FGHi / jkl becomes abcd / XYZi / jkl
<> The subseq abcdEFGH / / ijkl becomes abcdXYZ / / ijkl
> The subseq abcdEFGH / i / jkl becomes abcdXYZ / i / jkl

Replace empty string between c and d in abcdef with "XYZ"

< The subseq a / bc / def becomes a / bc / XYZdef
The subseq a / bcd / ef becomes a / bcXYZd / ef
<> The subseq abc / / def becomes abc / / XYZdef
> The subseq abc / d / ef becomes abcXYZ / d / ef

Table 1. The effect of replace() on other subseqs. The replace performed for each group of lines is shown above the group. Each line shows an example of some other subseq on the same base and its value after the replacement. The base string is letters only; the spaces are for readability and the slashes indicate the extent of each other subseq value. In general the replacement is made by inserting the replacement at the finish of the replaced value and then deleting the replaced value. The <'s mark examples where the other subseq ends at the end of the replaced string and the >'s indicates examples where it begins there.

With replace() we can rewrite the Turing tape operations as

function WriteAOne(t) return replace(t, "1") end function

function MoveRight(t) if next(t) /= "" then return next(t) else return replace(next(t), "0") end if end function
In this case the functions return their argument after having modified its base, if necessary.

To formally model replace() we must introduce the notion of a memory with a collection of base strings each having some number of subseqs. The entire contents of memory is represented in curly braces:

{ . . . , < s1 [ s2 ] s3 >, . . . < s4 [ s5 ] s6 > , . . . }

To indicate that subseqs share the same base, we label the brackets of each subseq. Thus two subseqs on the same base in memory would be:

{ . . . , < s1 [j s2 [k s3 ]j s4 ]k s5 >, . . . }

where the subsequence referred to by j is s2 s3 and that of k is s3 s4. We require that the labels on all pairs of brackets be unique, so a label both identifies a base and refers to a subsequence within it.

For convenience in the formal definition, we introduce the notion of an "extended sequence" composed of intermixed brackets and elements from a base sequence. To distinguish the two forms of sequence the component items in an extended sequence are called "constituents"; each constituent is either an element of a base sequence or a left or right bracket which represents one end of a subseq on that base. For each label in a well-formed extended sequence there is one left bracket and one right bracket with that label and the left bracket precedes the right. The Greek letters appearing as meta-linguistic variables below take as values subsequences of extended sequences (but not references to these subsequences).

Using bracket labels to indicate subseq values, the function we wish to define is

replace(p, q) => r

which has two subseq arguments and returns a third. The memory representation prior to executing the function is

{ . . . < sb1 [p sb2 ]p sb3 >
. . . < sb4 [q sb5 ]q sb6 > . . . }

where

sbi is a subsequence of the extended sequence which includes all the elements of si and all adjacent brackets other than those explictly shown.

The subsequences p and q may overlap on the same base, so the formal definition must make a copy of q. It does so by introducing a new base string in the memory representation with an otherwise unused label denoted by x. The first rule just makes this copy:

(1) { . . . < sb1 [p sb2 ]p sb3 >
. . . < sb4 [q sb5 ]q sb6 > . . . }

=> { . . . < sb1 [p sb2 ]p sb3 >
. . . < sb4 [q sb5 ]q sb6 >
. . . < [x s5 ]x > }

For the case of a non-empty p, we rewrite sb2 as s7 c b7, where

c is the last element--that is, a non-bracket--in sb2 (by assumption, sb2 has at least one character)
s7 is all elements in sb2 that precede c
b7 is all brackets from anywhere in sb2, taken in their original order


The required effect is to remove s7 and c and place all the brackets of b7 after the insertion. Note that the copy x is deleted and the placement is shown for the result subseq r:

(2)

{ . . . < sb1 [p sb2 ]p sb3 >
. . . < [x s5 ]x > }

=> { . . . < sb1 [p [r s5 ]r ]p b7 sb3 >
. . . }

For an empty p we assume that sb2 is also empty and rewrite sb1 and sb3 extracting and splitting the set of all brackets between the last non-bracket element of sb1 and the first character of sb3:

sx1 is all of sb1 except the brackets on its right. If sb1 has no elements, sx1 is empty.
sx3 is all of sb3 except the brackets on its left. If sb3 has no elements, sx3 is empty.
b2 is all brackets between sx1 and sx3.


We rewrite b2 as b1 b3 where:

b1 is is all brackets from b2 which are right brackets or the left brackets of empty subsequences.
b3 is all left brackets from b2 which do not begin empty subsequences.

Since [a and ]a bound an empty subsequence between s1 and s3, they are constituents in b1. The third rule provides for inserting the replacement text s5 between b1 and b3, and again deleting the copy x and indicating the return value n.

(3) { . . . < sx1 b1 b3 sx3 >
. . . < [x s5 ]x > }
=> { . . . < sx1 b1 [p s5 ]p b3 sx3 >
. . . }

With replace() the function ExpandTabs of Section 2 can be reduced in length by a third: lines for the variable result are removed except the one for c = "\t" which becomes

replace(c, extent(tab, base(eight)))

As another example, take the problem of replacing newlines with spaces in order to wrap short lines into paragraphs. The function WrapLines below replaces most newlines with a single space, but uses two spaces if preceded by an end of sentence or leaves the newline in place if it is followed by white space. Spaces preceding the newline are removed as well.

-- WrapLines(m) -- convert newlines to spaces -- but leave newlines followed by whitespace -- and use two spaces after end of sentence -- function WrapLines(m)
subseq s while True do
s := token(m, "\n") -- collect consecutive newlines if s = "" then
-- no more newlines exit function
end if while previous(m) = " " do
-- remove preceding spaces replace(previous(m), "")
end while if rest(m) /= "" then
-- leave multiple newlines alone
elif next(m) = " " or next(m) = "\t" then
-- next line starts with whitespace -- leave newline
elif token(previous(m), ".?!;:") /= "" then
-- end of sentence replace(m, " ") -- replace with two spaces
else
replace(m, " ") -- replace with one space
end if m := extent(finish(s), m) -- consider the rest of m
end while
end function

In some applications that create new sequence values it is crucial to start with an empty mutable value into which text can be inserted with replace(). For this the algebra is extended again with a primitive read-write constant:

newbase() => < [] >

Creates a new, modifiable and empty, base string.

With newbase() we can now redefine concatentation and introduce an append operation:

s ~ t == base(replace(end(replace(newbase(), s)), t)).

The result contains the concatentation of the referenced segments from s and t.

v ~:= t == v := base(replace(finish(base(v)), t)), (

where v is a variable).

The value of variable v is appended with t and v is given the entire result as its new value.

With append output values can be constructed by beginning with

result := newbase()

and appending values with

result ~:= expression ... .

As an optimization, the expression

a ~:= b ~ c
is interpreted as

a ~:= b a ~:= c

Append has the side-effect-free advantages of concatenation, but can be implemented without allocating space for another copy of result on each iteration. We can illustrate this scheme with a function to collect footnotes--delimited with curly braces--from throughout a text and construct a new text with the footnotes at the end.

-- CollectFootnotes(m) -- Footnotes in m are delimited with { and } -- Return a copy of m with the braces removed and their -- contents collected at the end of the text. -- function CollectFootnotes(m)
subseq l, result, footnotes result := newbase() footnotes := newbase() while True do
l := token(m, "{") l := extent(l, token(extent(l, m), "}")) if l = "" then
result ~:= m ~ footnotes return result
end if footnotes ~:= "\n\n"
~ extent(next(next(start(l))), start(last(l)))
result ~:= extent(m, start(l)) m := extent(next(l), m)
end while
end function

6. Regular Expression Parser
One fruitful area of research on strings--especially with rewriting-rule approaches--has been in parsing algorithms. Since we have already shown the subsequence algebra can compute any computable function, it can obviously perform parsing. Nonetheless, regular expression parsing is a good example of the algebra in action for strings; the functions come out quite short and simple. Moreover, the regular expressions can be conveniently represented with subsequences containing nested subsequences as elements.

A regular expression parser has as its arguments a subject string to be matched against and a pattern describing what is to be found. For this example we assume that the match must be found at the beginning of the subject; it is a trivial extension to check each position to see if the pattern begins there. (There are several non-trivial algorithms for finding the pattern at an arbitrary location without doing too much backtracking.)

Four forms of regular expression are recognized by the example function:

Direct match: the pattern character must match the current subject pattern.
Concatentation: two or more regular expressions are listed and the subject must match each in turn.
Alternation: the subject must match one of a sequence of patterns.
Repetition: the regular expression specifies one subsidiary expression which is matched against the subject as often as possible.

(In some other formulations, a repetition matches only as many instances as are necessary to make the remainder of the pattern match.)

For our purposes direct matches will be individual elements and concatenations will be a sequence of patterns. For example, the pattern "abc" is the concatentation of three direct matches for a, b, and c and will match a subject if it begins with those three characters.

To describe the representation of alternation and repetition we need to introduce the notion of subseq objects: this is a subseq value which has been encapsulated as an object so it can serve as an element in a sequence. Subseq objects will be used in the representation of patterns where non-direct regular expressions are components of other regular expressions. In formal definitions a subseq object will be underlined to indicate that it is a unit.

When sequences contain subseq objects we need two additional operations, isseq and contains. The boolean function isseq(), when applied to a subseq returns True if the first element of the subseq is an embedded subseq object:

isseq (< s1 [ < s2 [ s3 ] s4 > s5 ] s6>) => True
isseq (< s1 [ c s2 ] s3 >) => False
isseq (< s1 [ ] s3 >) => False

If isseq() yields True, the embedded subseq can be extracted with the contents() function. If its argument does not begin with a subseq object, contents() behaves the same as first():

contents (< s1 [ < s2 [ s3 ] s4 > s5 ] s6>) => < s2 [ s3 ] s4 >
contents (< s1 [ c s2 ] s3 >) => < s1 [ c ] s2 s3 >
contents (< s1 [ ] s3 >) => < s1 [ ] s3 >

Now we can specify the representation of regular expressions as sequences:

Direct: the character represents itself. It cannot be * or |.
Concatenation: a sequence of representations of regular expression.
Alternation: a sequence whose first character is | and whose remaining elements are each the representation of a regular expressions.
Repetition: a sequence whose first element is * and whose remaining elements are a regular expression. (These remaining elements need not be an embedded subseq object).

Of the four functions below for a regular expression pattern matcher, the first,--rex_match--serves only to determine what sort of expression the pattern is and to call one of the other functions to process it. The other three each handle one form of regular expression; direct patterns are handled within rep_cat, the parser for concatenations.

-- rex_match(s, pat) -- Checks extent(s, base(s)) to see -- if it begins with an instance of pat. -- If found, returns a subseq for the match, -- otherwise returns finish(s). -- An empty value at start(s) is returned if -- pat is empty or -- pat is a repetition and no instances are found. -- function rex_match(s, pat) if first(pat) = "*" then -- repetition return rex_rep(s, rest(pat)) elif first(pat) = "|" then -- alternation return rex_alt(s, rest(pat)) else -- concatenation return rex_cat(s, pat) end if end function

-- rex_rep(s, pat) -- Scan s for multiple instances of pat. -- Return the entire matched part of s. -- function rex_rep(s, pat) subseq m m := start(s) while True do m := rex_match(m, pat) if m = "" then return extent(s, m) else m := finish(m) end if end while end function

-- rex_alt(s, patlist) -- Check s to see if it begins with -- one of the patterns in patlist. -- Return the subsequence corresponding -- to the first successful match. -- If there is no match, return finish(s). -- function rex_alt(s, patlist) subseq m while patlist /= "" do m := rex_match(s, contents(patlist)) if m /= "" then return m end if -- succeed patlist := rest(patlist) end while return finish(s) -- fail end function

-- rex_cat(s, patlist) -- Check s to see if it begins with -- the sequence of patterns in patlist. -- If so, return the corresponding initial -- subsequence of s; otherwise return finish(s). -- If pat is empty, returns start(s). -- function rex_cat(s, patlist) subseq m m := front(s) while patlist /= "" do if isseq(patlist) then m := rex_match(m, contents(patlist)) if m = "" then -- fail return finish(s) end if else -- is a character: try to match it if m /= first(patlist) then return finish(s) -- fail end if end if m := next(m) patlist := rest(patlist) end while return extent(s, start(m)) -- succeed end function

Note that throughout these functions the pattern argument is processed with first and rest so the argument could be a subsequence of some larger sequence. In contrast, the subject is permitted to extend past the finish of the original argument to the end of its base. The reader is challenged to revise these functions to restrict the pattern match to the extent of the original value of s.

Summary and Conclusion
This paper has presented an algebra for subsequence expressions, employing as primitives the functions base(), start(), next(), extent(), concatenation, constants, and appropriate definitions for assignment and comparison. The four functions were shown to be necessary and sufficient for computing all subsequences of a given base sequence. Section 5 showed that the algebra can be extended to permit modification of base strings and that the effect on other values referring to the modifed base can be well defined. The final section exhibited a regular expression pattern matcher utilizing sequences both for the subject string and the data structure of the pattern.

Acknowledgements. This work was catalyzed by Bruce Sherwood's search for a string sublanguage for the cT educational computing environment. I am indebted to him, Judy Sherwood, David Andersen, the Center for Design of Educational Computing, and cT users. Initial development of these ideas was done while in residence at the Department of Computer Science, University of Glasgow, with the support of the Science and Engineering Research Council (grant number GR/D89028), both under the direction of Malcolm Atkinson. The work has also benefitted from conversations with Kieran Clenaghan, David Harper, Joe Morris, John Launchberry, Mark Sherman, and John Howard.


References

[Adobe, 1985] Adobe Systems, Inc., Postscript Language: Reference Manual, Addison-Wesley, (Reading, Mass., 1985).

[Griswold, 1971] Griswold, R. E., J. F. Poage, and I. P. Polonsky, The SNOBOL4 Programming Language, Prentice-Hall (Englewood Cliffs, 1971).

[Griswold, 1983] Griswold, R. E. and M. T. Griswold, The Icon Programming Language, Prentice-Hall (Englewood Cliffs, 1983).

[Hansen, 1987] Hansen, W. J., Ness - Reference Manual, Computer Science Dept., Univ. of Glasgow, 1987.

[Hansen, 1989] Wilfred J. Hansen, "Ness Language Reference Manual", available on X-windows distribution tape as .../contrib/andrew/atk/ness/doc/nessman.d, Information Technology Center, Carnegie-Mellon Univ., 1989.

[Hansen, 1990a] Hansen, Wilfred J. "Programming Language Support for Multi-Media Text with an Algebra for Subsequences," Information Technology Center, Carnegie Mellon Univ. (March 1990).

[Hansen, 1990b] Hansen, Wilfred J., "Enhancing documents with embedded programs: How Ness extends insets in the Andrew ToolKit," Proceedings of 1990 International Conference on Computer Languages, March 12-15, 1990, New Orleans, IEEE Computer Society Press (Los Alamitos, CA, 1990), 23-32.

[Morris, 1986] Morris, J., Satyarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., Smith, F. D. "Andrew: A distributed Personal Computing Environment," Comm. ACM, V. 29, 3 (March, 1986) 184-201.

[Sherwood, 1988] Sherwood, B. A., and Sherwood, J. N., The cT Language. Stipes Publishing Company, (Champaign, Illinois, 1988).

[Steele, 1984] Steele, G. L., Jr., Common Lisp: The Language, Digital Press (Bedford, MA, 1984).