30 November 1991

Subsequence References: First-Class Values for Substrings

Wilfred J. Hansen
ATK Consortium
School of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Abstract. Arrays of characters are a basic data type in many programming languages, but strings and substrings are seldom accorded first-class status as parameters and return values. Such status would enable a routine that calls a search function to readily access context on both sides of a return value. To enfranchise substrings, this paper describes a new data type for substrings as a special case of one for general subsequences. The key idea is that values are not sequences or references to positions in sequences, but rather references to subsequences. Primitive operations on the data type are constants, concatenation, and four new functions--base, start, next, and extent--which map subsequence references to subsequence references.

This paper informally presents the data type, demonstrates its convenience for defining search functions, and shows how it can be concisely implemented. Examples are given in Ness, a language incorporating the new data type which is implemented as part of the Andrew Toolkit.

Keywords: strings, substrings, sequences, subsequences, programming language design, string searching, document processing, Andrew Toolkit, ATK, Ness

Text and string operations are increasingly important as word processing becomes one of the most common computer applications, especially for non-technical people. Despite the importance of strings, programming languages have offered no innovations in string data types or operations since the introduction of pattern matching and substr which happened at least as early as COMIT [Yngve, 1963] and PL/I [IBM, 1965], respectively. The most recent innovations are in control structures; in particular, Icon [Griswold, 1990] has introduced goal-directed evaluation. This integrates string scanning with the rest of the language, whereas pattern matching is a sub-facility in earlier languages like COMIT and SNOBOL4 [Griswold, 1971].

In some languages strings and string expressions are first-class values in that they are eligible to appear in all contexts that permit any other expression, including assignment, function parameter, and return value. Substrings, however, do not have first class status. To represent one usually requires multiple assignments and multiple variables with the concomitant increase in complexity and potential for errors. Records, structs, or other composite values can sometimes be utilized, but this still requires the programmer to be aware of the details.

First-class values are especially important in the applicative programming style, which eschews side effects and thus has little room for values of other classes. Although applicative programming has sprung from a desire to prove properties of programs, it can also play a role in making it possible for more people to program. [Hughes, 1989]

This paper defines and demonstrates subsequence references, a first-class data type for subsequences in general and substrings in particular. In this data type, each value is a reference to a subsequence of a base sequence, so each single value represents an entire subsequence. The presentation in this paper is informal; for a formal definition of the data type, see [Hansen, 1989a].

Subsequence references are but one of several models of string values:

Atomic. In the atomic model of strings each string value is a distinct object with no accessible constituents. Operations and string functions return values that are effectively new strings with no relation to other existing strings. Given the right implementation, atomic strings can certainly be first-class, however, they are not always simple to use. Problems arise for parsing and searching operations because the result of a search must report not only the string which matched, but also its position. For instance, a calling routine may need to test punctuation adjacent to a returned value.

Although there are no major languages with a pure atomic model of strings, the possibility has been demonstrated by Eli [Glickstein, 1990]. In this Lisp-like language, search functions return a list of three strings whose concatenation will recreate a value equivalent to the original subject string. The middle element of the list satisfies the search specification.

Starting no later than XPL [McKeeman, 1970], implementations of functions for atomic string values have not actually copied strings to produce new values. Instead each return value is a reference to a substring of one of the argument strings, which thus serves as a base for the value. This technique is also employed for the implementation described below, with the difference that the new data type offers functions for accessing elements of the base string outside the referenced substring.

Indexed. In the indexed model a string value is a pointer to a string or an integer index to an element of a string (usually the latter is an atomic string). Such values can easily be first-class since integers and pointers are themselves first-class, but substrings are not first class because two values are required to refer to an arbitrary substring. When a programmer takes a shortcut to utilize one variable as a basis for locating two or more substrings, there is a potential for off-by one errors. Programmer effort is also increased when forced to choose between the atomic and indexed models. The formal complexity of the language is increased by having recourse to a domain--integers or pointers--outside the domain of strings.

C is one well-known example of a language with indices implemented as pointers; a string value is a pointer to a string or a tail of a string. To refer to a subsequence in C a composite value is required. For instance a token scanner might utilize the type StringPiece:

typedef struct string {char *str; int len;} StringPiece;
StringPiece token(StringPiece s)  {
	... compute position and length 
	s.str = position;
	s.len = length;
	return s;
}

Here assignments to both s.str and s.len are required where only one value, the substring, is being manipulated. Moreover, there is no general way for a program to determine how many active StringPiece values refer to any given base string, so it may be impossible to do general storage management, as discussed below in section 5.

Integer indices to strings are found in many languages, including PL/I and Fortran. These are not entirely equivalent to pointers when a string is copied: integer indices into the original will refer to the same positions in both copy and original, while pointers will refer only to a position in one string or the other.

Icon. Icon provides a special operator, ?, for pattern scanning. The expression

s ? operations

first makes string value s be the value of the global variable &subject and then executes the operations. The current position in the subject is given by another global variable, &pos, which initially has the position before the first character. Each of the string operations upto, many, find, any, match, and bal examines &subject starting at &pos and returns the integer index of the position just beyond a successful match. Extracting a matched substring as a new atomic value is done by two other functions, tab and move. These two functions (and only these two) have the side effect of advancing &pos. Thus it is common to write expressions such as

t := tab(many(letters))

wherein the position function many computes the position after an initial substring composed of letters and the tab operator first advances &pos across the substring and then returns a copy of that substring so it can be assigned to t.

Icon functions can generate a sequence of values, for example, many(letters) generates the positions of each run of letters in the subject. Icon's goal-directed evaluation mechanism initiates backtracking in appropriate contexts, which consumes successive values until one satisfies the search. For instance, s ? { tab(many(letters)) & ="()" } determines if s contains a run of letters followed by (); backtracking tries each value of many(letters) in turn until one is found followed by () or no more can be found.

Not only does Icon have the complexity of offering both the atomic and indexed models of string values, but there are separate sets of functions for each. There is a potential for confusion between tab and move, which return atomic string values, and many, upto, and the other functions which return indices. Indeed, it would be interesting to know if omission of required tab and move operations is a common error in Icon programs.

Subsequence references. Each subsequence reference value incorporates both a subsequence value and the position of that value within its base. When given first-class status by a programming language, such are ideal to return from parsing or other searching/scanning operations. As discussion of the various Algorithms below will show, it is common to utilize a single variable both for its value and its position. In all cases, other string models would require additional variables and assignment statements.

Although internally more complex than integers, subsequence references simplify programming by reducing the required number of concepts. Atomic values and indices are subsumed by the single model and there is no need for recourse to a domain outside of subsequences. Moreover, subsequences largely eliminate the need for separate data types for character and string values, both of which can be represented as references to appropriate subsequences.

Subsequences can be viewed as a generalization of Icon's &pos and &subject; however subsequences can appear in any context, not just within string scanning. Moreover, subsequence references can be used in iterations requiring scanning two strings in parallel. For example, an interpreter for a pattern matching language would need to advance simultaneously across one sequence representing the subject and another representing the pattern.

The subsequence reference data type has been implemented in two languages: Ness [Hansen, 1989b; Hansen, 1990] and cT [CDEC, 1989]. Both were originally implemented under the Andrew Toolkit (ATK) [Morris, 1986; Palay, 1988], although cT has recently been re-implemented. The capability range of ATK is illustrated by this paper: a single file with various embedded objects created using ATK's ez text editor. Examples below are given in Ness, the implementation of which permits typographic styling and embedded objects in program text and constants; the program fragments below were compiled and executed with the styles as shown.

1. A Data Type for Subsequences

To define the subsequence reference data type we assume an arbitrary set of elements--for instance the set of all characters in all fonts and sizes--and define these terms:

A sequence is a finite, ordered collection of elements from the arbitrary set. Before each element and after the last is a "position".

A subsequence reference or subseq is a triple [b, l, r] where b is a sequence and l and r are positions in b.

The subseq [b, l, r] is said to refer to the elements of b between the two positions l and r. If l and r are the same, the subseq is said to be empty. Note that empty sequences are not all equivalent; they may differ as to their locations within their bases. The elements in a sequence are imagined to be in a horizontal line with earlier elements to the "left" of later elements.

Figure 1 shows three subseq values, m, s, and p, defined on the base sequence "'Twas brillig and the slithy toves" and referring respectively to "s bril", an empty subsequence, and "toves".

Figure 1. Three subseq values on a base sequence. The base sequence is shown between < and >. The end positions of subseq value are shown as arrows pointing at positions, i. e., between elements of the base.

The most basic operators defined for subseq values are start, base, next, and extent as illustrated in Figure 2 and defined thusly:

start(x) - Returns the empty subseq at the beginning of its argument. Specifically the value is on the same base as x and has both positions the same as the leftmost of the two positions in x.

base(x) - Returns a subseq for the entire base of x. Specifically, the return value is on the same base as x and has the two positions at opposite ends of that base.

Figure 2. Four primitive functions. The subseq values below the base show the result of applying the primitive functions to m, s, and p. Values for extent(s, m) and extent(s, p) are given in the text.

In the Figure, start(m) is the empty subseq between a and s; and the values of base(m), base(p), and base(s) are each the entire sequence between < and >. To get an empty subseq at the beginning of x's base sequence, we can write start(base(x)). The opposite composition, base(start(x)), returns exactly the same value as base(x) because x and start(x) share the same base.

next(x) - This function returns a subseq for the element following x. Specifically, the base of the result is the same as that of x, one position is at the rightmost position of x, and the other position is one element further to the right in the base. If the argument x extends all the way to the end of its base string, then next(x) returns an empty subseq for the position at the end of the base.

Next(m) and next(s) in the figure are both single elements, while next(p) is empty. The element just after the beginning of x is next(start(x)), while the empty subseq at the end of x is start(next(x)). Next(base(x)) is the empty subseq at the end of base(x).

extent(x, y) - In general, returns a subseq for everything from the beginning of x to the end of y. Specifically, when x and y are on the same base, the result subseq is also on that base and has one position at the end of start(next(y)); the other position is either start(x) or start(next(y)), whichever is further left in the base. If x and y are on different bases, the result is an empty subseq on a unique empty base.

One non-empty result in Figure 2 is extent(s, p) which extends from s to the end of the base; conversely, extent(s, m) gives an empty subseq at the same position as extent(p,m). All of the base before m is extent(base(m), start(m)) and all of the base after m is extent(next(m), base(m)); observe that the result for both is shorter than base(m) even though it is one of the arguments.

To create new base values it suffices in this paper to provide constants and concatenation:

"..." - (where ... is a sequence of elements): a subseq constant. The value produced refers to a base sequence composed of the sequence of elements and has the two positions at the opposite ends of the base.

x _~ y - Tilde denotes concatenation and generates a new base string composed of copies of the subsequences referred to by x and y. The value returned is a reference to the new base string with positions at its opposite ends.

In terms of Figure 2, m _~ p is a subseq whose base is the new value "s briltoves" and whose positions are at the extremes of that base.

In examples below, subseq variables are declared with the form:

subseq m, p, s

Function arguments and return values are subseq values by default. Assignment of subseq values assigns the reference and does not depend on the contents of the base sequence of the source nor affect the base of the destination. Comparison, however, is best defined to compare the elements referred to, rather than the subseq values. With this definition, two subseqs x and y, not both on empty bases, are on the same base if base(x) = base(y) and extent(base(x), base(y)) = base(x).

Given a subseq value we can write simple loops to scan through a base string. If m refers to a blank, we can advance it to point to the nearest following non-blank with:

while m = " " do m := next(m) end while

Note that this loop will terminate with m referring to either a non-blank or to the empty subseq at the end of the base. Of course, if m originally referred to a non-blank, it would remain unchanged. Denoting the initial value of m as m₀, the loop invariant is that extent(m₀, start(m)) is all blanks.

Suppose m refers to a word, that is, consecutive non-blank characters with adjacent blanks on both ends. To find the next word we must first skip the blanks following m and then build a subseq referring to everything prior to the next blank. In Ness this is written as:

function nextword(m)

while next(m) = " " do m := next(m) end while
m := next(m)		-- first letter of word
while next(m) /= "" and next(m) /= " " do

-- another non-blank: include it in m
m := extent(m, next(m))
end while
return m
end function

The first while loop scans across all blanks after m and the second scans across all subsequent non-blanks to accumulate the word. The test next(m) /= "" checks to see if m ends at the end of its base, in which case it is deemed to be at the end of a word. When there is no word, nextword returns an empty string.

Even this brief example can illustrate the first-class nature of subseq values; that is, that a subseq value returned by one function can be directly passed as an argument to another. For instance the statement

if m = "function" then 

addToTable(nextword(m), functiontable) 
end if

will pass to addToTable the exact word returned from nextword without recourse to extraneous code like global variables, additional arguments, or side effects as would be required in C. A further advantage of subsequence reference values becomes apparent if addToTable must consider the context of the word in its base string; this would be impossible with atomic string values.

The operators start and next are asymmetric with respect to text order in that one moves from left-to-right and the other returns the leftmost position in its argument, while the corresponding operators for the reverse direction are non-primitive. This asymmetry reflects a decision to engineer the primitives for the most common order of examining text. Indeed, some implementations utilizing multiple encoding widths for characters may impose a performance penalty for right-to-left traversal.

2. Non-primitive Functions

With the primitive functions as a foundation we can write expressions for interesting subsequences relative to a given value. Commonly used functions include those identified in Figure 3 and defined in Table 1.

Finish is analogous to start and also produces an empty subseq, but at the other end of its argument. Functions front, first, last, and previous all produce subseqs for single elements analogously with next. Rest(x) returns a subseq one element shorter than x.

Figure 3. Non-primitive functions. For the subtle difference between first and front, see Table 1.

Function	Definition	Expression
finish(x)	the empty string at the position where x ends	start(next(x))
front(x)	the element which starts where x starts (even if x is empty)	next(start(x))
rest(x)	all of x past its first element (empty if x is empty)	extent(next(front(x)), x)
first(x)	first element of x, but empty if x is empty	extent(x, start(rest(x)))
last(x)	the last element in x, or x itself, if empty	{see text}
previous(x)	the element preceding x	last(extent(base(x), start(x)))

Table 1. Non-Primitive Functions. The function named in the first column and defined in the second can be implemented with the expression given in the third.

Figure 4 illustrates further the non-primitive functions of Table 1. Here variable m has the same properties as q in Figure 3 and variables s and p show the results for empty and one element values, respectively. First and front differ in their values only for the empty subseq s; in this case, first(s) is s itself and front(s) is the element next(s).

The expressions for rest and first exploit the definition of extent. When x has one or more elements, the value of next(front(x)) begins just after the first element, so the expression for rest produces a value extending from just after that first element to the end of x. When x is empty, its end precedes the start of next(front(x)), so the result for rest is the empty subsequence at the end of x, which is the same value as x itself, as per the definition of rest. The same trick applies in the definition of first, which gives the single element preceding rest(x) if x is non-empty, but otherwise x itself.

Figure 4. Examples of non-primitive functions.

With the aid of rest we can write an elegant, recursive definition of last:

function last(x) 

if rest(x) = "" then return x
else return last(rest(x)) end if

end function

The then case applies when x is initially empty or when the recursion has descended to the last element; otherwise the else case recurs to compute the last element in the rest of x, which will be the same as the last element of x. In practice, of course, last is implemented more directly.

3. Searching Strings

It is common in string algorithms to scan a string looking for a substring satisfying some property described with a regular expression, a context free grammar, or some more general scheme. Such advanced pattern matching is beyond the scope of this paper, but a few simple, Icon-like search operations will serve as valuable examples and tools for the extended example in the next section.

The search operations below each have two arguments, a subject and a specification. By convention, the subject argument determines the range to be searched according to this rule: If the subject argument is non-empty, the range is the subject; but if the subject is empty, the range extends from the subject to the end of its base. Since a successful search always yields a non-empty subsequence, failure is indicated by returning an empty subseq value, usually the one at the end of the subject argument. By these conventions, if the range is to extend from the position given by empty subseq p to the end of its base, the calling function can choose that the failure value be either p or finish(base(p)) by passing as the subject either p or extent(p, base(p)), respectively.

In the descriptions below, when the second argument--the specification--is obj, the match must be an exact match, element-for-element; but when it is set, the string is treated as a set. A typical set value is "0123456789" for the set of all digits. These functions are illustrated in Figure 5.

search(subj, obj) - Scans the range from left to right looking for an instance of obj and returns a subseq referring to the first such instance encountered. If none is found, the function returns finish(subj). match(subj, obj) - If the range has obj as its initial elements, a subseq for those elements is returned; otherwise the function returns finish(subj). span(subj, set) - Returns a subseq for start(subj) and all contiguous succeeding elements of the range which are elements from set. If front(subj) is not in set, the function returns start(subj). token(subj, set) - Returns a subseq for the leftmost contiguous subsequence in the range which is composed of elements from set. If none is found, the function returns finish(subj).

trim(subj, set) - Returns a subseq for all of the range except for any trailing elements which are in set. If all elements of the range are in set, the value start(subj) is returned.

Figure 5. Examples of search functions. Note that when m is the search subject, the value of token does not extend beyond finish(m) and "igh" is not found.

With the search operations we can now rewrite nextword of Section 1 more briefly as

function nextword(m)

m := finish(span(finish(m), " "))
return extent(m, start(search(m, " ")))
end function

If variable wordCharacters gives a complete set of characters allowed in words, nextword could be simply


token(finish(m), wordCharacters)

The three versions of nextword differ in subtle respects. The token(...) version returns a subseq for a word composed solely of elements found in wordCharacters, which might be just the letters or even a smaller set; the other two define a "word" as any non-empty sequence of characters between spaces. The version in Section 1 will find every such word, but the span...search version just above will find the last word only if it is followed by a space.

To define the search functions in terms of the primitive operations, we begin with a support function which searches a string src looking for an instance of a character c. If found, a subseq for the instance is returned, otherwise the value returned is an empty subseq at the end of src:

function findchar(src, c)

while src /= "" and c /= first(src) do

src := rest(src)
end while
return first(src)
end function

This function illustrates one form of loop; at each cycle around the while loop, src is one character shorter by virtue of the call on rest(src). The loop ends if either src becomes empty or its first character matches c. The loop invariant is that c is not equal to any character in extent(src₀, start(src)), where src₀ denotes the initial value of src.

Algorithm 1 expresses span in terms of findchar and the primitive operations. The first if-then sets s to be the range of the search as defined by the search conventions. The loop, as above, calls rest(s) at each step to shorten s by one element. As one example of multiple usage of a subsequence reference, note that variable s is utilized for its first character with first(s), its position with start(s), and its extent with rest(s).

function span(subj, set)

subseq s	-- the search range
s := subj
if s = "" then  s := extent(s, base(s))  end if
while findchar(set, first(s)) /= "" do

s := rest(s)
end while
return extent(subj, start(s))
end function

Algorithm 1. Span. Returns a subseq for all elements following start(subj) which are in set. The loop invariant is that all elements in extent(subj, start(s)) are elements of set.

Sometimes a loop advances through a string in steps longer than one character at a time, as illustrated by the search function in Algorithm 2. Each cycle of the while loop calls findchar to find the first character of obj and then calls match to determine if all of obj has been found. If so the appropriate value is returned, but if not, there is no point to re-checking extent(s, f), so s is set to everything after f. If s becomes empty, findchar returns an empty subseq and the loop exits via the test of f = "".

Algorithm 2 again illustrates multiple usage of subsequence reference values; the contents of f are tested and the positions of both ends appear in expressions. Since both f and m are on the same base as s, the new value of s can be set to begin after f and the value of m can be returned, each retaining its position in the original base string.

function search(subj, obj)

subseq s	-- the search range
subseq f	-- a location in subj of first(obj)
subseq m	-- the result of matching obj at f
s := subj
if s = "" then  s := extent(s, base(s))  end if
while True do

f := findchar(s, first(obj))
if f = "" then return finish(subj) end if 	-- fail
m := match(start(f), obj)
if m /= "" then return m end if   -- succeed
s := extent(finish(f), s)
end while
end function

Algorithm 2. Search. Find the leftmost occurrence of obj in the search range. An invariant of the loop is that a substring matching obj does not begin in extent(subj, start(s)).

In the Ness implementation, search is implemented with a far faster non-linear search [Sunday, 1990]. The details of match, token, and trim are left as exercises to the reader.

4. A practical string processing problem

A practical problem will further illustrate the subsequence reference data type. Some pre-ANSI C compilers allow a permissive syntax for preprocessor statements which permits white space and extraneous text in some contexts. To convert old C source code to ANSI, a program is needed to squeeze out the white space and add appropriate comment delimiters. The reformatting is done by a function NormalizeLine(line) which returns a possibly-modified copy of its argument.

A line must be normalized if it has the form:
<WS> # <WS> <key> <WS> <text> <WS> <newline>
where <WS> is optional whitespace, <key> is one of five words, and <text> is arbitrary text. When <key> is if, ifdef, or ifndef the output is
# <key> <space> <text> <newline>
but when <key> is else or endif the output for empty <text> is
# <key> <newline>
and for non-empty <text> is
# <key> <space> /* <space> <text> <space> */ <newline>
Note that the latter case must strip existing comment delimiters from <text>. If the input line does not have one of the expected forms, it is returned unchanged.

The essence of NormalizeLine is to first locate <key> and <text> in the input line and then--if the line is in one of the above forms--build the output from those values. For building the output, the algorithm relies on a table giving the output for each of the five key values. Since Ness has neither arrays nor structures such tables are commonly represented in programs as strings where the key is specially delimited so it can be found with a search. Then associated data is taken from subsequent fields. For NormalizeLine, a suitable table is

subseq keytable  :=  

"<if>;#if\n;#if ;\n;"
~ "<ifdef>;#ifdef\n;#ifdef ;\n;"
~ "<ifndef>;#ifndef\n;#ifndef ;\n;"
~ "<endif>;#endif\n;#endif /* ; */\n;"
~ "<else>;#else\n;#else /* ; */\n;"

In this table, the key is surrounded in < > and followed by three fields delimited with semi-colons. The first field is returned if the input line has only the # and keyword. When <text> is non-empty, the return value is that text bracketed with the second and third fields.

Algorithm 3 utilizes keytable to accomplish the task of building the output. The omitted portions of NormalizeLine determine that the line is a preprocessor statement, set k to the keyword, and set t to the text after the keyword. The call on search then locates k in keytable. The last three statements utilize the fields following the key in keytable to build the appropriate return value.

function nextfield(f)		-- Return the field following field f

f := finish(next(f))			-- skip semicolon after f
return extent(f, start(search(f, ";")))
end function

function NormalizeLine(line)

{...set k to the keyword and t to the text after k...}
fix := search(keytable, "<" _~ k _~ ">")  -- find k in keytable
-- use fix and t to build the result
if t = "" then  return nextfield(fix)  end if
fix := nextfield(nextfield(fix))		-- skip to prefix field
return  fix _~ t _~ nextfield(fix)
end function

Algorithm 3. Fragment of NormalizeLine function. This fragment constructs the output line from the fields in keytable.

The usage of function nextfield in Algorithm 3 must be emphasized. Since its return value is an entire field, that value is suitable for concatenation to construct the result as in the final return statement. But since its value is a subsequence reference, it can also serve as argument to a function (in this case, itself) to locate and return the next following field, as in nextfield(nextfield(fix)). In other languages, a composite value would be required as with typedef StringPiece in the introduction, and even then the body of nextfield would have a number of additional assignment statements.

The full NormalizeLine function appears in Algorithm 4. The scanning portion of the algorithm tracks with variable t the current position of the scan within the line. Skipwhite advances past spaces and tabs; it could easily be written in a language with the index model of strings since it effectively returns a position. The algorithm begins by skipping whitespace, checking for a "#", and skipping more whitespace. Next, k is set to <key> by spanning subsequent letters and <key> is sought in keytable with the result being assigned to fix. The trim expression gives t a value extending from the first non-whitespace character after <key> to the last non-whitespace character on the line. The remainder follows Algorithm 3, with the addition of code to remove redundant comment delimiters.

-- Return the empty subseq following t and any adjacent white space
function skipwhite(t)

return finish(span(finish(t), " \t\b"))
end function

function NormalizeLine(line)

subseq k, fix, t, u
t := skipwhite(start(line))
if next(t) /= "#" then return line end if
t := skipwhite(next(t))
k := span(t, 
"abcdefghijklmnopqrstuvwxyz"
)	-- get <key>
fix := search(keytable, "<" _~ k _~ ">")
if fix = "" then return line end if
t := trim(extent(skipwhite(k), line), " \t\b\n")
-- build the result from 'fix' and 't'
if t = "" then  return nextfield(fix)  end if
fix := nextfield(nextfield(fix))    -- skip to prefix field
u := search(t, "*/")		-- check for comment delimiters
if u /= "" and previous(last(fix)) = "*"  then

if match(t, "/*") /= "" then  t := skipwhite(next(first(t)))  end if
t := trim(extent(t, start(u)), " ")
end if
return fix _~ t _~ nextfield(fix)
end function

Algorithm 4. NormalizeLine.

As experiments the author and one of the referees wrote versions of NormalizeLine in C and Icon. The code is considerably longer in C because more variables are required to keep track of both ends of segments of the input line and because of the C tradition of in-line loops for string scanning. The C code also has an egregious bug in that it returns memory allocated in three different ways. For an unchanged line the original input is returned; for a #else or #endif with no comment, a constant is returned; and otherwise newly allocated memory is returned. Since the calling routine cannot easily distinguish these cases, it will probably not deallocate the allocated memory, leading to a core leak proportional to the number of preprocessor statements.

The Icon version was very elegant because it could take advantage of the ? string scanning operator, backtracking, failure, and Icon's table and record features. By its nature, however, the problem did not exploit Icon's goal-directed execution mechanism. In general, the string scanning operations were scattered in the program among the operands to other functions. Assignments could have reduced this convolution, at some expense in elegance.

5. Implementation

Before considering implementation of subsequence references as part of a programming language, we must consider whether it can be provided instead as a library package. This is problematical in C, and will be challenging even in a language designed to support packages.

The biggest problem is management of string storage: to conserve space, a base sequence can and should be deleted when no subseq refers to it. In a low-level language like C which does not provide storage management, the string manager must keep track of all subseqs to manage their space and that of the base sequences. However, this may not be possible. Consider expressions where a string value returned from one function is passed as argument to another, as in this fragment from Algorithm 4:

extent(skipwhite(k), line)

Skipwhite allocates a subseq and returns either it or a pointer to it. If the latter, which routine deallocates the subseq memory and when? In either case, how can the base sequence memory be deallocated? The simplest option is to never free the memory occupied with subseqs and base sequences, but this profligacy may lead to excessive paging or program termination. Subseqs and base strings constitute a non-circular structure and can be managed with references counts instead of more general and costly tools. However, if subsequence references are implemented in a higher level language with storage reorganization, such as Modula-3 [Cardelli, 1988], the more costly tools are likely to be used.

When successfully implemented as a package, subsequence references must be defined in terms of some more primitive notion of strings provided in the language itself. This means two separate string mechanisms with an attendant loss in efficiency and increase in the number of concepts a program reader may encounter.

Finally, a string facility requires lexical support for string constants and a primitive operation--preferably syntactic--for concatenation.Function notation is, of course, possible, but it is more awkward for multiple concatenations as in the assignment to keytable in Algorithm 4. Few languages are flexible enough to be extended in these directions. Moreover, if strings are to support typographic styles and embedded objects there is the further requirement that the program development support system also support editing of programs containing such constants. Thus program editing must move closer to word processing and programs can begin to have typographic styles themselves.

Given the deficiencies of a library implementation, it is reasonable to assume that we are implementing subsequence references as the native tool for strings in a high level programming language. Although this paper has not claimed interest in efficiency, it turns out that subsequence references can be implemented without undue overhead. Should the data type become widely used, appropriate compiler optimizations or tailored hardware will ensure acceptable execution speed.

Central to an implementation are data structure definitions for subseq values and base strings. Additional data structures are needed for typographic styles and embedded objects if they are to be supported by the implementation. The Ness implementation relies on the Andrew ToolKit (ATK) data structures, so little new data structure design was necessary.

In ATK, base strings are stored as a physical sequence of characters in a space larger than the string itself. The unused space is retained as a gap within the base at the position of the most recent modification (since ATK and Ness allow modification of base sequences) [Hansen, 1987]. If consecutive modifications tend to happen at nearby locations, the overhead of copying characters to make changes is imperceptible.

One alternative to physically sequential storage is a list of characters. Insertions and deletions are far faster, but space requirements mushroom and performance degrades with increased paging as the list gets fragmented among many pages. Experience has demonstrated that sequential text storage reduces paging sufficiently to offset the cost of copying strings when changes are made. Of course there are numerous intermediate data structure designs with linked lists of elements each having a physically sequential block of text. We have not tried this approach because the physically sequential approach has been satisfactory.

Section 1 noted that the implementation of subsequence references is similar to many string packages, dating back at least as early as [McKeeman, 1970]. In that work only a position and length were recorded for each string, whereas the minimal implementation of an immutable subseq is three words: a pointer to the base, and representations of the leftmost and rightmost positions. When the elements are stored consecutively, integers suffice to indicate positions, so a subseq value can be described in C as:

struct subseq {

struct basestring *b;/* the base */
int l, r;/* the leftmost and rightmost positions */

};

A fourth word is required if the implementation chooses to implement "reference counting" by linking together all subseqs on each base. The algorithms below assume that a struct basestring has at least the fields l and r to indicate its two end positions.

Operations on subsequence references are most succinctly implemented by modifying their argument, so, in an applicative environment, the code compiled to pass a subseq as an argument copies its value. Suppose these values are struct subseqs as above with x as the first argument and y the second. Then the four primitive functions can be compiled as if they were:

start:  x.r = x.l;


base:  x.l = x.b->l;   x.r = x.b->r;


next:	  x.l = x.r;   if (x.r < x.b->r)  x.r++;


extent:  if (x.b != y.b) *x = <UniqueEmptyString>;

else {x.r = y.r;   if (x.l > y.r)  x.l = y.r;}

In two places integer comparison determines the ordering of positions; this would have to be changed if base strings were not stored consecutively. Similarly, the ++ in next must be modified if elements occupy more than one byte. Conceptually, these are the only places in the implementation that require knowledge of the implementation of base sequences.

For a simple assignment like

m := start(s)

an optimizing compiler will copy s into m and then set m.r= m.l. If we are compiling for the IBM RISC architecture, the entire statement could take as few as three instructions: a load-multiple and a store-multiple to transfer the subseq representation and an additional store to m.r. For assignment of an expression with more operators, the overhead of copying the initial value is distributed among them all and is thus proportionately lower, so the average number of instructions per primitive is low.

6. Discussion

ASCII has sufficed for example strings above, but the implementation described clearly permits more general sequences. There is room here for only brief mention of a few possibilities:

Invariants. In writing invariants, it is often desirable to express attributes of subsequences. As shown in various of the examples above, invariants can be succinctly expressed with subsequence expressions.

Integer indices. While subsequence references permit ignorance of the indexed model of strings, it does not preclude it. For instance, the primitives are sufficient to implement nextn(s, n) as a function which applies next a total of n times starting with the initial value s. Then the traditional function substr(s, i, j)--which returns the substring of s starting with the i'th character and extending for j characters--is expressible as

extent(nextn(start(s), i), nextn(start(s), i+j)).

By defining substr in terms of next, there is no question that the indices refer to elements rather than byte positions in an array.

Arrays. With nextn and a simple trick, sequences can subsume one-dimensional arrays. The trick is to address the array by a reference to the empty element at its start. Thus if A is an array, nextn(A, i) returns the i'th element of the array. A more elaborate language could offer bracketed subscripts as syntactic sugar for array access.

Formatted text. Operations such as applying and testing typographic styles are conveniently defined with subsequence references because a style naturally applies to a subsequence of the text. In the Ness implementation, functions are provided to apply styles, remove them, interrogate the style of text, and traverse the text in sections delimited by style change boundaries.

Embedded Objects and Multi-media. The contents of sequences need not be restricted to characters. Because Ness is implemented under ATK it permits elements which are raster images, spreadsheets, and numerous other objects. As ATK is extended to provide objects for voice recordings, video clips, and other multi-media components, these too will be potential elements of Ness sequences.

Mutability. Much literature has been devoted to applicative versus imperative programming. The subsequence data type has been presented in a purely applicative form above, but a sequence modification operator can easily be defined (and not quite so easily implemented). Replace(s, r) modifies the base sequence of s so the portion that was referred to by s contains a copy of r. Insertions and deletions are special cases of replacement. In some applications--for instance, text editors--where the base may have numerous subseqs referring to its parts, such modification can be a useful tool because the other subseqs remain attached to the base whereas if a new base were constructed it would have no subseqs on it. Replace is also useful when it can be employed to avoid creating new strings; the overhead of copying strings is not too bad, but the overhead of allocating memory and paging-in non-adjacent strings can be large.

General sequences. Subsequence references can subsume Lisp lists if subseq values can be objects embedded in sequences. The car() function would be re-interpreted to extract an embedded subseq as itself.

Unbounded sequences. Potentially infinite sequences can be defined with a function to generate successive elements. The subsequence data type is entirely adequate to deal with such sequences because only the next operator need access further along in the sequence--it would call the generator if necessary. The base operator would return a general sequence including a reference to the generator function.

Many syntactic, semantic, and implementation questions concerning subsequence references are as yet unanswered. However, by yielding a first-class type for subsequences, the subsequence reference data type holds promise of reducing program complexity and thus making it possible for more people to exploit computers.

Acknowledgments. Bruce Sherwood was a bountiful source of enthusiasm and encouragement; I am indebted to him, Judy Sherwood, David Andersen and others at the Center for Design of Educational Computing, Carnegie-Mellon University, who implemented and utilized subsequence references as part of cT. This work began with the support of the Science and Engineering Research Council (Britain, grant number GR/D89028) and the Department of Computer Science at the University of Glasgow, under the stimulating direction of Malcolm Atkinson. Support for the Ness implementation in ATK was provided by the IBM Corporation and the Information Technology Center, Carnegie Mellon University under then director Alfred Z. Spector. Referee C patiently made numerous helpful comments.

References

[Cardelli, 1988] Cardelli, Luca, J. Donahue, L. Glassman, M. Jordan, B. Kalsow, G. Nelson, Modula-3 Report, Research Report 31, Digital Systems Research Center, (Palo Alto, CA, 1988). [CDEC, 1989] Center for Design of Educational Computing, CMU, The cT^tm Programming Language, Falcon Software (Wentworth, NH, 1989). [Glickstein, 1990] R. Glickstein, "Lisp Primitives in ELI, the Embedded Lisp Interpreter", available on X-windows distribution tape as .../contrib/andrew/overhead/eli/doc/procs.doc, Information Technology Center, Carnegie-Mellon Univ., 1990. [Griswold, 1971] Griswold, R. E., J. F. Poage, and I. P. Polonsky, The SNOBOL4 Programming Language, Prentice-Hall (Englewood Cliffs, 1971). [Griswold, 1990] Griswold, R. E. and M. T. Griswold, The Icon Programming Language, 2nd Edition, Prentice-Hall (Englewood Cliffs, 1990). [Hansen, 1987] Hansen, W. J., Data Structures in a Bit-Mapped Text-Editor, Byte (January, 1987), 183-190. [Hansen, 1989a] Wilfred J. Hansen, "The Computational Power of an Algebra for Subsequences", CMU-ITC-083, Information Technology Center, Carnegie-Mellon Univ., 1989. [Hansen, 1989b] Wilfred J. Hansen, "Ness Language Reference Manual", available on X-windows distribution tape as .../contrib/andrew/atk/ness/doc/nessman.d, Information Technology Center, Carnegie-Mellon Univ., 1989. [Hansen, 1990] Hansen, Wilfred J., "Enhancing documents with embedded programs: How Ness extends insets in the Andrew ToolKit," Proceedings of 1990 International Conference on Computer Languages, March 12-15, 1990, New Orleans, IEEE Computer Society Press (Los Alamitos, CA, 1990), 23-32. [Hughes, 1989] Hughes, J., "Why Functional Programming Matters," Computer Journal 32, 2 (1989). [IBM, 1965] IBM Corporation, PL/I Language Specifications, C28-6571-0, IBM Corp., Data Processing Division, (White Plains, NY, 1965). [McKeeman, 1970] McKeeman, W. M., J. J. Horning, and D. B. Wortman, A Compiler Generator, Prentice-Hall, Inc. (Englewood Cliffs, 1970). [Morris, 1986] Morris, J., M. Satyarayanan, M. H.Conner, J. H. Howard, D. S. H. Rosenthal, and F. D. Smith. "Andrew: A distributed Personal Computing Environment," Comm. ACM, V. 29, 3 (March, 1986), 184-201. [Palay, 1988] Palay, A. J., W. J. Hansen, M. Sherman, M. Wadlow, T. Neuendorffer, Z. Stern, M. Bader. and T. Peters, "The Andrew Toolkit - An Overview", Proceedings of the USENIX Winter Conference, Dallas, (February, 1988), 9-21. [Sunday, 1990] Sunday, D. M., "A Very Fast Substring Search Algorithm," Comm. ACM 33, 8 (Aug, 1990) 132-142. [Yngve, 1963] Yngve, V. H., "COMIT," Comm. ACM 6, 10 (Mar, 1963) 83-84.