Wyvern: a Language for Engineering Mobile and Web Applications

Abstract. This document describes the rationale for the Wyvern programming language targeted at potential users of the language. It will grow to include a specification for Wyvern as well.

Motivation

Better programming languages have revolutionized software development--from Fortran, which freed programmers from assembly language, through Java, which brought type safety and garbage collection to the masses, and JavaScript, which made the web come alive. Yet current tools for building applications for the web and for mobile devices, two of the most vibrant software sectors today, are woefully inadequate. The problems are numerous and significant:

Diversity of Languages and Artifacts. Web applications today are written as a poorly-coordinated mishmash of artifacts written in different languages, file formats, and technologies. For example, a web application may consist of JavaScript code on the client, HTML for structure, CSS for presentation, XML for AJAX-style communication, and a mixture of Java, plain text configuration files, and database software on the server. This diversity increases the cost of developers learning these technologies. It also means that ensuring system-wide safety and security properties in this setting is difficult.
Weak static safety. Many of the technologies used above--including JavaScript, text files, and server language choices such as Python and Ruby--provide little checking at compile-time or configuration time that the system is well-formed. Even basic consistency properties--such as the existence of a class named in a configuration file--remain unchecked unless ad-hoc tools are built that are specific to a web framework.
Reduced developer productivity. Because of the lack of static safety and consistency checks, even simple errors must be tediously diagnosed based on often-inadequate run-time error messages. Languages without types also make it more difficult to provide the kind of IDE support, such as auto-completion (which is, at best, much more limited in dynamically-typed languages), that developers rely on to work productively when building on third-party libraries and components.
Poor coordination within and across organizations. Languages without types make coordination within and across organizations more difficult, because developers cannot use types to see how components developed by others are supposed to be used. This, in turn, leads to more productivity problems, as well as additional defects and vulnerabilities.
Insecure language and library constructs. Many of today's languages base their language and library constructs on concepts that are convenient and/or efficient to implement, but which may have poor safety and security properties. For example, the default integer type in C and Java is of limited width and has wrap-around semantics--but developers usually use this type to represent mathematical integers that have a limited range. If program input--possibly from an attacker--ever causes an integer to exceed the intended range, the program silently computes the wrong result rather than flagging the error. While numbers are handled more securely in Python and JavaScript, these languages have similar problems via generic string types that do not easily capture the formatting expectations that are essential to combating injection attacks. Collectively, these problems make it more difficult to find program errors, and create an opportunity for a program error to turn into a security vulnerability.
Low-level abstractions. Today's programming languages support abstractions that are too low-level, making it more difficult to assure safety and security, and requiring programmers to write a great deal of unnecessary boilerplate code. The problems include describing data models at a low level, programming distributed communication within a system out of low-level primitives, the lack of any system-level design view, and finally the fact that system-wide security properties cannot be explicitly expressed (in part because the other abstractions above are too low-level). The level of abstraction in the language in turn means that reasoning about high-level system properties becomes extraordinarily difficult, and consequently those properties are easy to violate and hard to assure.

Existing industrial or research languages have made progress on some of these problems, but the solutions remain inadequate. For example, Ruby on Rails demonstrates that a single language and platform can, through the judicious use of internal domain-specific languages (DSLs), express a rich variety of artifacts, including code, presentation, navigation structure, and other features. However, developers should not have to give up the safety of typed languages--indeed, types are essential to improving the coordination among the artifacts used to describe a web or mobile application. Furthermore, integration should not be supported only on the server side, as with Ruby on Rails, but across the client and server.

Overview

A Wyvern is a two-legged, winged dragon. The Wyvern language emphasizes security, and just as treasure guarded by a Wyvern ought to be secure, so should be programs written in the Wyvern language.

Goal. The goal of Wyvern is to be an excellent programming language for engineering web and mobile applications. While the area of focus is important, the language is really driven by engineering needs. Engineers understand the need to balance multiple factors: with respect to a language, those factors include developer productivity, assurance of the end product, and run-time efficiency.

Target audience. Wyvern is targeted at software engineers who are developing applications for web and mobile platforms. Today, these developers are likely to be writing code in JavaScript on the client and in languages such as Python or Java on the server. Assurance, productivity, and efficiency are all important to our target audience.

Approach. Wyvern begins with a simple core language with good support for object-oriented programming as well as functional abstractions. It builds on this to address the challenges outlined above through a number of strategies:

Flexible syntax. While most existing web and mobile applications are cobbled together from multiple artifacts written in diverse languages and notations, Wyvern will, like Ruby, have a syntax flexible enough for developers to express all these artifacts within a single programming language, using an internal DSL strategy. Internal DSLs will also support declarative security policy expression within the language.
Strong, extensible typechecking. Like statically typed languages such as Java, Wyvern will have a strong static type system that can aid in ensuring basic safety and security properties of programs, as well as provide the productivity and coordination benefits of types. Unlike the Java base language, but reminiscent of Scala's compiler plugins or Java's annotation processors, Wyvern's static checking system is extensible with additional rules for checking the artifacts defined in internal DSLs within Wyvern.
Secure-by-default language and library constructs. Wyvern will provide built-in datatypes that are secure by default. The default integer type is a mathematical integer with unlimited range. Programmers can express range-limited integers by providing a limit, which the system then verifies (either statically or dynamically). When declaring a variable of type string, developers specify the string's expected format. Database access is done through an internal DSL for querying, and queries cannot be insecurely constructed from strings.
High-level abstractions for architecture and data. Leveraging Wyvern's support for DSLs, programmers will be able to describe the architecture of an application within the language, have it be enforced by the type system, and have it be implemented by the compiler and runtime. The architecture will permit description of distributed programs, with custom protocols for communication supported through code generation via Wyvern's extension facilities. Data structures in Wyvern can be described at the level of abstraction of databases, facilitating database integration, querying, and rich semantic constructs such as bidirectional relationships between objects. These abstractions support analysis of security properties that are data- or architecture-dependent.
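The range-limited integers described under "Secure-by-default language and library constructs" can be illustrated with a small sketch. It is written in Python rather than Wyvern, the class name RangedInt is hypothetical, and it models only the dynamically checked case; how Wyvern would verify limits statically is not shown here.

```python
class RangedInt:
    """Hypothetical range-limited integer with dynamic checks (illustration only)."""

    def __init__(self, value, low, high):
        if not low <= value <= high:
            raise OverflowError(f"{value} is outside [{low}, {high}]")
        self.value, self.low, self.high = value, low, high

    def __add__(self, other):
        # Python integers are unbounded, so the sum itself never wraps;
        # the constructor then re-checks the declared limit.
        return RangedInt(self.value + int(other), self.low, self.high)

    def __int__(self):
        return self.value


byte = RangedInt(250, 0, 255)
print(int(byte + 5))       # still within the declared range
try:
    byte + 6               # an 8-bit wrap-around type would silently yield 0
except OverflowError as err:
    print("rejected:", err)
```

The point of the sketch is that exceeding the intended range is flagged as an error at the point where it happens, rather than producing a silently wrong result.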

Properties

Wyvern's design should have the following properties, which facilitate the overall goal of Wyvern as a language for engineering web and mobile applications. For each property we attempt to provide a basis for judging whether the language design adequately fulfills the property. This basis is ideally objective, but may be subjective in many cases by necessity. The properties are:

Simplicity. Keeping the design of Wyvern simple will allow developers to learn it easily, making them productive quickly. Furthermore, a simple language means it is easier to verify important properties of the language and of programs written in the language. More pragmatically, simplicity will allow a small team to develop a usable prototype of Wyvern quickly. Although simplicity can come at the cost of expressiveness, other language designs have shown that providing extensibility via libraries can often provide the best of both worlds, and this is Wyvern's strategy as well.
Wyvern's goal is to be of comparable simplicity to languages such as Smalltalk, Self, Python, Scheme, Forth, or Lua. This sets a high bar, especially as Wyvern is to be statically typed and all these languages are dynamic. Among statically typed languages, Wyvern should be similar in simplicity to C, ML, or the first version of Java--though these examples are not fully satisfying from the simplicity point of view. Simplicity in the design should be reflected in simplicity of the syntax, dynamic semantics, and static semantics.
Readability. Reading source code is more important to engineers than writing it, especially when evolving software in a team setting. Programs in Wyvern should be as succinct and as clear as programs in Python or Smalltalk, existing languages that emphasize readability.
Flexibility. In order for Wyvern to express declarative information that typically goes in configuration files, Wyvern must have flexible syntax and semantics. The language's syntax should be at least as good for defining internal DSLs as is the syntax of Ruby or Python. Test cases for the flexibility of the language include expressing the structure of a web page, the architectural structure of an application, or make-like build dependencies.
Safety. Wyvern will be type- and memory-safe. It will have good support for contracts and unit tests, with all contracts dynamically checkable and some statically checked. The design of the language and its libraries will provide a strong defense against the top 5 OWASP vulnerabilities (as of 11/15/11: injection, XSS, broken authentication and sessions, insecure direct object references, and CSRF). The flexibility of the language's syntax for expressing DSLs will be complemented by extensibility of static and dynamic checking to ensure that programs written in those DSLs are sensible and correct.
Agility. Change tasks in Wyvern should result in editing deltas that are better than editing deltas for similar change tasks in Java. Ideally, Wyvern will come close to the agility of Python in this respect, though meeting Python's standard may be difficult for a statically-typed language without non-local type inference.
Interactivity. Wyvern will support a read-eval-print loop (and read-eval-display on web/mobile platforms). Execution information will be available for debugging or reflective use; overall, the language should be as debuggable as Smalltalk.
Modularity. Modularity is critical to separate reasoning, and therefore separate development--both of which are critical in a large-scale engineering context. Any complex construct (i.e., one with repeating parts) in Wyvern can be given a name and reused. Wyvern's type system is modular, checking a module based only on that module's source code and the interfaces of other modules. Module interfaces are complete in that compilation will never fail if the module behind an interface is swapped for another module that matches the same interface. For safety, Wyvern's extension facilities cannot affect code that does not enforce them. Similarly, reflection, casts, and other dynamic constructs cannot be used to violate interface restrictions.
Efficiency. The performance of Wyvern should scale with the patterns of usage. Programs that are isomorphic to Java or C should have execution times that are within 10% of equivalent programs written in those languages (in the case of C, the comparison point includes the overhead of a conservative garbage collector). Programs that use more advanced features may be slower, but should still be within a 2-3x factor of C or Java performance.
Data Persistence and Distribution. Wyvern will support high-level declarations of the persistence and distribution strategy for data, which can be implemented with customized semantics; this is generally important in the target domain. Data declarations will support expressing simple high-level relational models, including bidirectional relationships and one-to-many relationship constraints; these are essential for defining adequate security policies.

Research Goals

Wyvern is intended to be a useful, practical language, but also to be a means to investigate scientific questions. Through the design, implementation, and evaluation of the Wyvern language, we hope to pursue research in the following areas:

Types and productivity. We hypothesize that typechecking enhances developer productivity when working on legacy code, in groups, or programming against unfamiliar libraries. The mechanisms behind this effect include documentation, better IDE support, and (to a lesser extent) finding defects early. To achieve these benefits, types must be explicit (for documentation benefits), modular (so they can be locally checked), and natural (so they match developer intuitions). Investigating whether these hypotheses are true will involve a variety of user studies; researchers elsewhere are also working on related questions.
Architecture. How can we provide a language that can express highly dynamic distributed architectures yet is helpful in reasoning and implementation? How can we support a wide variety of custom protocols which can integrate efficiently with legacy systems? What architectural properties must be enforced to enable security reasoning, and what are the right abstractions for enforcing these?
Security. What is the right policy language for expressing the most important web and mobile application security properties? What are the right abstractions for expressing expectations at module boundaries in a lightweight, usable way? What is the right tradeoff between static and dynamic checking that provides a high level of assurance at low cost in both programmer expression and run-time overhead? How can architectural design be leveraged, and what consistency properties are necessary to do so successfully?
High-level data models. How can we integrate rich database models with object-oriented programming? Can we leverage this higher level of abstraction to express and verify security properties? Can we make code cleaner and shorter, and can we do so without imposing a high run-time cost?
Flexible syntax, extensible semantics and typechecking. How can a syntax support a wide variety of DSLs, yet still make it easy to express each one? What are the right interfaces to provide so that each DSL can define its semantics and typechecking strategy, and integrate well with other DSLs and with the base language?
Modularity. What should a modern module system look like? The specification of a module should include not just a type and documentation for it, but contracts and tests. Modules should be versioned, and versions should include automatically-checked compatibility information. Modules should be given global names that map to logical URLs where the module is stored, so the module can be acquired from anywhere (and safely cached due to version IDs). Code should be mobile. Different configurations should be supported at low coding overhead, with either static or dynamic choice of configuration.
Resource optimization in mobile devices. In a joint project with the SEI, we want to investigate resource optimization. This could include moving computation between a device and cloud resources to optimize time, power, and quality; getting data from different data sources according to quality and availability; or reconfiguring the UI of an application to manage energy consumption. The project will leverage enhancements in architecture and modularity described above.
Other areas. Concurrency and Functional-OO integration are topics that any modern language must handle well. While Wyvern's design in this area is still somewhat open, we hope to build on recent work by group members while also breaking new ground.

Lexical Structure

Goals

The lexical structure of Wyvern is intended to fulfill the following goals:

Simplicity. Wyvern lexical analysis should be simple, so it is easy to implement in the core (and auxiliary) tools. Simplicity also enhances readability (next).
Readability. It should be easy for programmers to understand the structure of Wyvern programs. The language should avoid problematic cases, such as the if statement in C, in which the indentation used may indicate something different from the parser's semantics, and lexical analysis as well as parsing can play a role in avoiding this. Readability is particularly important in the context of an extensible language (next).
Extensibility. It should be easy to define where a DSL begins and ends in a way that does not restrict the content of the DSL.
Familiarity. To enhance understandability, Wyvern uses standards from previous languages where they are compatible with Wyvern's goals.

Design and Rationale

Wyvern is a whitespace-sensitive language. This can be implemented in a fairly simple and clearly specified way, as demonstrated by the Python language and Adams et al.'s POPL 2013 paper on principled parsing of whitespace-sensitive languages. Many programmers, including ourselves, feel that whitespace sensitivity enhances readability. It definitely avoids issues in matching parentheses and curly braces, and avoids the if statement ambiguity in C. Finally, whitespace indentation levels provide a convenient way to delimit DSLs, while placing few restrictions on the DSL. In particular, anything at all can appear in a DSL as long as it is indented relative to the surrounding text.

As a secondary point, whitespace sensitivity fits nicely with Wyvern's goal to support web programming, as several other languages in this space are whitespace sensitive.

Wyvern provides C-style and single-line comments. Line continuations are as specified in Python and in the C preprocessor (in C, newline characters are significant in macro definitions).

As the Python and C approach to line continuations seemed slightly ad-hoc, we considered alternatives such as allowing a line continuation when the next line was indented a specific amount. However, we are also using indentation to denote blocks and to delimit DSLs. We felt it would be ambiguous, overly restrictive, and/or too confusing to use indentation for two different purposes.

Specification

The input to lexical analysis is a stream of ASCII characters (but see the extensions below). The output of lexical analysis is a stream of tokens of the following kinds:

IDENTIFIER indicating an identifier. The identifier name is associated with the token.
NUMBER indicating an integer constant. The numeric value is associated with the token.
STRING indicating a string constant. The string value is associated with the token.
NEWLINE indicating a new logical line.
INDENT indicating an increase in the indentation level.
DEDENT indicating a decrease in the indentation level.
LPAREN, RPAREN, LBRACK, RBRACK, LBRACE, and RBRACE, which stand for (, ), [, ], {, and }, respectively.

Comments. Wyvern supports C-style and single-line comments. In a C-style comment, all characters between a starting /* and an ending */ are ignored. C-style comments cannot be nested.

In single-line comments, characters from a starting // to the end of the line are ignored. However, the newline character at the end of a single-line comment is not part of the comment and remains visible to the rest of lexical analysis.
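The comment rules above can be sketched as a small scanner. This is an illustration in Python, not part of the Wyvern tool chain; the function name strip_comments is hypothetical, and the decision to leave string literals untouched (so that // inside a string is not treated as a comment) is an assumption consistent with the string definition later in this section.

```python
def strip_comments(text):
    """Remove /* ... */ and // comments; hypothetical helper, not Wyvern's lexer."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("/*", i):
            # C-style comments do not nest: the first */ closes the comment.
            end = text.find("*/", i + 2)
            if end == -1:
                raise SyntaxError("unterminated /* comment")
            i = end + 2
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the end-of-line sequence.
            while i < n and text[i] not in "\r\n":
                i += 1
        elif text[i] == '"':
            # Assumption: comment markers inside string literals are ordinary text.
            end = text.find('"', i + 1)
            if end == -1:
                raise SyntaxError("unterminated string literal")
            out.append(text[i:end + 1])
            i = end + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(repr(strip_comments("a /* x /* y */ b // c\nd")))
```

Because nesting is not supported, the inner /* in the example is just comment text, and the newline after // c survives so that a NEWLINE token can still be generated.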

Lines and line joining. Our specification for explicit and implicit line joining is taken from part of the Python reference. A physical line is a sequence of characters terminated by an end-of-line sequence. An end-of-line sequence is one of: the ASCII LF character, the ASCII sequence CR LF, or the ASCII CR character.

Two or more physical lines may be joined into logical lines using backslash characters (\), as follows: when a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following line, forming a single logical line, deleting the backslash and the following end-of-line sequence. A line ending in a backslash cannot carry a comment. A backslash does not continue a comment.
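As an illustration of physical lines and explicit joining, here is a minimal Python sketch. The function names are hypothetical, and the sketch assumes comments have already been removed, so it does not implement the rule that a backslash inside a comment does not continue the line.

```python
import re


def physical_lines(text):
    """Split on LF, CR LF, or CR; CR LF is tried first so it counts once."""
    return re.split(r"\r\n|\n|\r", text)


def join_explicit(lines):
    """Join each physical line ending in a backslash with the line after it."""
    logical, pending = [], None
    for line in lines:
        prefix = "" if pending is None else pending
        if line.endswith("\\"):
            # Delete the backslash and the end-of-line sequence, then continue.
            pending = prefix + line[:-1]
        else:
            logical.append(prefix + line)
            pending = None
    if pending is not None:
        logical.append(pending)  # input ended right after a continuation
    return logical


source = "total = 1 + \\\r\n    2\nprint(total)"
print(join_explicit(physical_lines(source)))
```

Note that the indentation of the continuation line is kept as ordinary whitespace inside the logical line; only the indentation of the first physical line matters for block structure.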

Newline characters are ignored inside matching parentheses, square brackets, or curly braces, as in Python. Following the Python spec, implicitly continued lines can carry comments. The indentation of the continuation lines is not important. Blank continuation lines are allowed.
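Implicit joining can be sketched by tracking bracket depth. The Python below is a simplified illustration with hypothetical names; it assumes comments were stripped beforehand, and treating an unclosed bracket at the end of input as an error is an assumption, since the text above does not say what should happen.

```python
OPENERS, CLOSERS = "([{", ")]}"


def join_implicit(lines):
    """Merge lines while inside (), [], or {}; sketch only."""
    logical, pending, depth = [], [], 0
    for line in lines:
        in_string = False  # strings cannot span lines, so reset per line
        for ch in line:
            if ch == '"':
                in_string = not in_string
            elif not in_string and ch in OPENERS:
                depth += 1
            elif not in_string and ch in CLOSERS:
                depth = max(depth - 1, 0)
        pending.append(line)
        if depth == 0:
            # Keep the first line's indentation; continuation indentation
            # is not significant and blank continuation lines are dropped.
            first, rest = pending[0], pending[1:]
            parts = [first.rstrip()] + [p.strip() for p in rest if p.strip()]
            logical.append(" ".join(parts))
            pending = []
    if pending:
        raise SyntaxError("unclosed bracket at end of input")  # assumed behavior
    return logical


print(join_implicit(["f(a,", "", "      b)", "g()"]))
```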

A logical line that contains only spaces, tabs, formfeeds, and possibly a comment is ignored (i.e., no NEWLINE token is generated).

Indentation. Our specification for indentation is adapted directly from Python's. Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line.

A line's indentation is denoted by the sequence of spaces and tabs preceding the first non-blank character of a line. Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation.

Although the language specification permits both tabs and spaces in defining indentation, different editors display tabs in different ways, so it is recommended that programmers (and Wyvern editors) not use tabs in Wyvern files.

The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows.

Before the first line of the file is read, the empty string "" is pushed on the stack; this will never be popped off again. Each string pushed on the stack will always have the previous string on the stack as a prefix, with at least one whitespace character added. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is longer, it is pushed on the stack, and one INDENT token is generated. If it is shorter, it must be one of the strings occurring on the stack; all strings on the stack that are longer are popped off, and for each string popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each string remaining on the stack that is longer than the empty string.
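The stack algorithm above can be sketched in Python as follows. This is an illustration rather than Wyvern's implementation: the function name is hypothetical, the rest of each line is left untokenized as a placeholder LINE token, and comments are assumed to have been stripped already.

```python
def indentation_tokens(logical_lines):
    """Return (kind, text) pairs with INDENT/DEDENT/NEWLINE inserted."""
    stack = [""]          # the empty string is pushed first and never popped
    tokens = []
    for line in logical_lines:
        body = line.lstrip(" \t")
        if not body:
            continue      # blank logical lines generate no tokens
        indent = line[:len(line) - len(body)]
        top = stack[-1]
        if indent == top:
            pass          # same level: nothing happens
        elif len(indent) > len(top) and indent.startswith(top):
            stack.append(indent)
            tokens.append(("INDENT", ""))
        elif indent in stack:
            while stack[-1] != indent:
                stack.pop()
                tokens.append(("DEDENT", ""))
        else:
            raise SyntaxError(f"inconsistent indentation: {line!r}")
        tokens.append(("LINE", body))   # placeholder for the line's real tokens
        tokens.append(("NEWLINE", ""))
    while len(stack) > 1:               # end of file: close every open block
        stack.pop()
        tokens.append(("DEDENT", ""))
    return tokens


for kind, text in indentation_tokens(["type Stack", "    meth pop() : Int?", "type Link"]):
    print(kind, text)
```

Comparing indentation strings rather than column counts is what makes mixed tabs and spaces an error instead of a silent misreading: a line whose indentation is neither an extension of the top of the stack nor equal to an entry on it is rejected.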

Whitespace after the first non-whitespace character of a line serves as a delimiter between tokens.

Other tokens. An identifier is a sequence of characters that begins with a letter or underscore, and contains letters, underscores, and digits. An identifier may also be a sequence of operator characters, which include =, <, >, !, ~, ?, :, &, |, +, -, *, /, ^, and %. Operator identifiers may not contain the comment sequences /* or //.

A number is a sequence of digits. A string begins with ", includes any number of non-" characters but no end-of-line sequences, and ends with a ".
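The token definitions above translate fairly directly into regular expressions. The following Python sketch tokenizes a single logical line under those definitions; the names TOKEN, BRACKETS, and tokenize_line are hypothetical. Since the specification lists no token kinds for characters such as . and , the sketch rejects them rather than guessing.

```python
import re

# One alternative per token kind from the specification above.
TOKEN = re.compile(r'''
    (?P<IDENTIFIER>[A-Za-z_][A-Za-z_0-9]*|[=<>!~?:&|+\-*/^%]+)
  | (?P<NUMBER>[0-9]+)
  | (?P<STRING>"[^"\r\n]*")
  | (?P<PUNCT>[()\[\]{}])
  | (?P<SKIP>[ \t\f]+)
''', re.VERBOSE)

BRACKETS = {"(": "LPAREN", ")": "RPAREN", "[": "LBRACK",
            "]": "RBRACK", "{": "LBRACE", "}": "RBRACE"}


def tokenize_line(line):
    """Tokenize one logical line whose indentation and comments are already handled."""
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at column {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "NUMBER":
            tokens.append((kind, int(text)))      # numeric value travels with the token
        elif kind == "STRING":
            tokens.append((kind, text[1:-1]))     # value without the quotes
        elif kind == "PUNCT":
            tokens.append((BRACKETS[text], text))
        elif kind == "IDENTIFIER":
            tokens.append((kind, text))
        pos = m.end()                             # SKIP (whitespace) only delimits
    return tokens


print(tokenize_line('x1 <= f(42) + "hi"'))
```

Operator sequences such as <= come out as IDENTIFIER tokens, matching the rule that an identifier may be a sequence of operator characters.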

Likely extensions

Support for Unicode, character encodings (default UTF-8), and (probably) Unicode escapes.
Suspending lexical analysis mostly within a DSL, except to strip off leading indentation and to determine the scope of the DSL using that indentation, so the DSL can define its own lexical analysis. Character encodings, Unicode escapes, and line continuations probably have to be processed by Wyvern before the DSL is entered.
Associating comments with particular program elements. In particular a comment in a line that has no non-comment tokens is associated with the following token. A comment in a line that has a token in it is associated with the non-comment token following the comment, except when there are no following non-comment tokens on that line, in which case the comment is associated with the previous non-comment token. A non-comment token may have more than one comment associated with it; in this case the comments are ordered by their occurrence in the input stream.
Support for character, rational, hexadecimal constants, etc.
Specify the handling of whitespace characters other than tab, space, return, and linefeed. E.g. formfeed?
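The comment-association rule proposed in the list above can be made concrete with a short sketch. This is Python, with a hypothetical function name and a hypothetical input shape (each line is a list of (kind, text) pairs, with kind "COMMENT" for comments). Comments at the very end of the input with no following token are left unattached here, since the rule does not cover that case.

```python
def associate_comments(lines):
    """Map each non-comment token index to its comments, in input order."""
    tokens = []      # texts of non-comment tokens, in input order
    attached = {}    # token index -> list of comment texts
    waiting = []     # comments that belong to the next non-comment token
    for line in lines:
        has_token = any(kind != "COMMENT" for kind, _ in line)
        last_on_line = None
        for pos, (kind, text) in enumerate(line):
            if kind != "COMMENT":
                tokens.append(text)
                last_on_line = len(tokens) - 1
                if waiting:
                    attached.setdefault(last_on_line, []).extend(waiting)
                    waiting = []
            else:
                followed = any(k != "COMMENT" for k, _ in line[pos + 1:])
                if not has_token or followed:
                    waiting.append(text)   # goes to the following token
                else:
                    # No token follows on this line: use the previous token.
                    attached.setdefault(last_on_line, []).append(text)
    return tokens, attached


example = [
    [("COMMENT", "c1")],
    [("IDENTIFIER", "x"), ("COMMENT", "c2"), ("IDENTIFIER", "y"), ("COMMENT", "c3")],
]
print(associate_comments(example))
```

In the example, c1 attaches to x, while both c2 and c3 attach to y, ordered by their occurrence in the input.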

Wyvern Core Language

This section defines the core constructs of the Wyvern language, together with their dynamic semantics and typechecking. We scope the "core constructs" as the constructs that are sufficient to define the compiler and to serve as the seed for the entire standard library (given also the extension interface and the foreign library interface). We describe built-in constants, but their interpretation is delayed to the discussion of the Wyvern Standard Library, below.

Goals

Understandability trumps simplicity and elegance when these conflict. E.g. o.m should generally not have effects, and the only exceptions should be obvious in the type (e.g. m is identifiably a property, not a function). Actually we still need to make this be true!

Types

Types in Core Wyvern consist of object types, function types, tuples, and option types. The 0-ary tuple "Unit" has only one value, written ().

Declarations in Wyvern include types, methods, values, and variables. A block of consecutive type and method declarations may be mutually recursive. On the other hand, values and variables are only in scope after their declarations.

An object type in Wyvern is declared with the keyword type, and consists of a set of method and property signatures. An object type can be referred to anywhere in its scope using the name.

A method signature is declared with the keyword meth, and consists of a method name, method arguments, and method result type. The method arguments are a comma-separated list of pairs of each argument's name and type. If a result type is not specified then it is type Unit.

To be implemented later. The method result type may be a simple type, or it may be a tuple. If a tuple, the result type is enclosed in parentheses and includes a comma-separated list of pairs of a tuple element's name and type. Support first-class tuples in pattern matching and argument passing.

A property signature is declared with the keyword prop (for properties that are only readable) or var (for properties that can be directly written). Property signatures consist of the property name and type.

Future note. We probably want some way of saying that a property is immutable (not just missing a write accessor) but we have postponed the decision about whether this goes in the type system or a specification.

A function type is written A -> B, where A is the argument type and B is the result type. An option type is written T?, and indicates that a value is either present at type T, or the value is null. ? binds more strongly than ->. In the future, when polymorphic types are added, T? may be syntactic sugar for something like Option[T] or T Option.

Example syntax for types is shown below:

type IntCell
    var contents : Int
	
type Stack
    prop top : Int?
    meth push(element : Int)
    meth pop() : Int?
	
type StackFactory
    meth make() : Stack
    meth makeWithFirst(firstElement : Int) : Stack

type ListUtilities
    meth map(f : Int -> Int, l : IntList) : IntList
    // meth map(f : Int -> Int) : IntList -> IntList // curried version
    // meth map(f : Int -> Int)(l : IntList) : IntList // curried version with sugar

Classes, Methods, and Fields

A class consists of a set of methods, fields, and class methods.

A method consists of a method signature, as described above, plus a method body. A method body is either a single expression after an = symbol, or a sequence of statements which is indented and starts on the following line. A class method is identical to a method except for the use of the keyword class, and it defines a method on the class object rather than on an object instantiated from the class.

A field consists of the keyword var (for mutable fields) or val (for immutable fields), a name, a type, and an optional initialization expression. If the initialization expression is omitted, the field is initialized to the empty constant (for numbers, strings, and options), or else must be initialized in any object construction expression.

A class may be ascribed an object type and (separately) a class type. Object type ascription constrains the type of objects generated by the class. From outside the class body, the class appears to have only the elements in the ascribed type. Class type ascription constrains the type of the class itself. From outside the class, the class appears to have only the class methods mentioned (as ordinary methods) in the class type. If multiple class or object types are ascribed, the actual ascription is the type-theoretic intersection of the ascribed types (i.e. it has the union of the members in the ascribed types).

Each class defines a type that contains only that class's implementation. This type is the principal type of the class if no type is ascribed; otherwise, the type is a subtype of the ascribed type but has no additional members.

Future: public/private as syntactic sugar? What is the signature of a package? Can you distinguish (A) type members that refer to a particular type implementation from (B) type members that do not? Then type members are the general syntax. ML abstype? Do we write abstype as "class" in the signature of a package?

To consider: class fields? Seems like a bad practice generally, but some uses are OK (e.g. to support hash-consing a.k.a the Flyweight pattern). Default constructors. Destructors. Type members, bounding, and instantiation. Case of. Comprises. Tagged. Subtyping. Inheritance or delegation. Default method parameters, useful in particular for constructor calls?

In the present design, in order to access a field f (or method m) on the receiver, you must use "this.f". We may allow f to be used directly, but then we must use Newspeak's "lexical search first" rules to avoid capture (see Modules as Objects in Newspeak).

The standard name for class methods that act as constructors or factory methods is make.

Statements and Expressions

Statements start on a line, and formally include all following lines at the same or greater level of indentation. A val or var declaration consists of the keyword val or var, a name, an optional type, and an initialization expression. If the type is absent, it is inferred as the most precise type of the initialization expression. Val declarations define a read-only, let-bound variable, scoped to the statement starting on the following line at the same indentation level. Var declarations define a mutable variable with the same scope.

Expressions include variable reads and assignments, first-class functions, function applications, property reads and assignments, object creations, and method calls.

If a method is called or a property is accessed on null, using the special selector .?, the result is null. We may instead use some other form of syntactic sugar, such as "propNull x x.getOption()" where both x and the result of getOption are option types.
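The null-propagating behavior of the .? selector can be modeled with a short sketch. This is Python, not Wyvern: None stands in for null, the helper name select_or_null is hypothetical, and Link mirrors the class used in the examples in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    data: int
    next: Optional["Link"]


def select_or_null(receiver, member):
    """Model of receiver.?member: a null receiver yields null."""
    if receiver is None:
        return None
    return getattr(receiver, member)


chain = Link(1, Link(2, None))
second = select_or_null(chain, "next")
third = select_or_null(second, "next")     # the chain ends here, so this is None
print(select_or_null(second, "data"))
print(select_or_null(third, "data"))       # propagates None instead of failing
```

The result of such a selection is itself an option value, which is why chained selections stay safe without explicit null checks.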

Methods may be called with named parameters, either with x:5 syntax (which avoids a clash with =) or with x=5 syntax if := is used for assignment instead.

To consider: where can new be used? Only inside the class? Only if no ascription (which would hide the new "operation") has been used? May want to autogenerate standard constructors in some cases. Notion of a principal constructor used for pattern matching. Rob: can we make "meth m(x) = e" sugar for "val m = fn x => e"?

If class Link is defined with a method make, can we use the shorthand Link(0,null) in place of Link.make(0,null)? Constructors must be defined explicitly for each class; there are no defaults as in Java. We anticipate that IDEs will make this less painful.

Example syntax for classes, methods, fields, statements, and expressions is shown below:

class StackImpl
    implements Stack
    class implements StackFactory

    var list : Link?

    meth top() = list.data

    meth push(element)
        list = Link(element, list)

    meth pop()
        val result = list.data
        list = list.next
        result

    class meth make() = new StackImpl

    class meth makeWithFirst(firstElement)
        new StackImpl
            list = Link(firstElement, null)


class Link
    val data : Int
    val next : Link?

    class meth make(d:Int, n:Link?) = new Link(data=d, next=n)
	
// a package-level method (method of the package object, if we have one)
meth stackClient()
    val s = StackImpl.make()
    s.push(5)
    print(s.top())
    val addOne : Int -> Int = fn(x:Int) => x+1
    print(addOne(s.pop()))

Constants

Built-in constants include integers, decimal numbers, floating-point numbers, strings, characters, and null. The empty constants for each type are as follows: for String, ""; for numbers, 0; for option types, null.

Packages

Signature ascription at the package level. What does "public" mean--public to package, or to file, or to class? Hierarchical signature ascription.

Reflection

Mirrors? Need something that is secure.

Wyvern Module System

Requirements

These requirements list the software engineering properties we want the module system to provide. First, we list the properties supported by the initial design described here:

Representation independence, information hiding, and visibility control. It should be possible to hide aspects of a module that are only relevant to its representation. Substituting one module implementation for another should not break client typechecking, if both modules implement the type the client imports.
Hierarchical namespace. Hierarchical entities should be able to provide modules organized into a similarly hierarchical namespace. Libraries should be subdividable hierarchically.
Explicit dependencies. It should be easy for humans and tools to determine and control the dependencies of a module.
Documentation. It should be easy for humans and tools to determine what a module provides.
Safe extension. It should be possible to extend a module signature without breaking clients that import that signature. Safe extensions include providing additional functionality in an implementation, or broadening the signature of an implementation.
Scoped names. It should be possible to use two constructs defined with the same name, by distinguishing their scopes.
Separate compilation. It should be possible to compile a module using only the interfaces of other modules. In the sense of Harper and Pierce, true separate compilation should be possible (with appropriate type annotations), and incremental compilation must be supported in any case.
File-based development. Source code should be storable in the file system, divided among multiple files. File-based version control should work.
Web-based module references. URLs (or URIs with a specified convention for retrieving them) should be usable to specify modules used.
Provide relative and absolute import. Absolute import should be based on a canonical name of a module (and eventually its version number, see below). Modules that can be reached by absolute import should be self-describing. Relative import is relative to an abstract space that maps to the file system in the implementation, equivalent to the right-hand side of a URL. Files imported via relative import need not be self-describing. The semantics of relative import is equivalent to syntactic insertion of a nested block. We provide relative import partly to enable configuration files that are not in Wyvern format at the top level.
Support syntactic and semantic extension. Code within a module should be able to provide language extensions--either for DSLs in nested blocks, or for modular keyword-based extensions of the expression and declaration syntax. An extension defines syntax, typechecking rules, and execution semantics. A module may only use syntax extensions defined in modules it imports (not its own extensions) in an absolute form. If we eventually provide recursive modules (see below), a module cannot use extensions defined in modules it is recursive with.
Headless modules. A Wyvern source file may leave out the module declaration, which is convenient for script-like files. In this case it may only be used by other Wyvern files through relative import.
Reusable types and signatures. It should be possible to declare a name for any type or module signature so that it can be reused.
Convenient visibility control. It should be possible to provide visibility control without declaring all public elements twice (e.g. once in the module and once in its signature). For reasons such as signature reuse and separation of concerns, of course, such duplicate declarations may sometimes be beneficial.

Some requirements we intend to support in future extensions of the module system:

Hierarchical visibility control. Sub-modules can communicate with each other, but hide that information from external clients.
Dynamism. It should be possible to dynamically load a module without breaking anything. It should be possible to choose between implementations of a module interface at run time. This is required in framework designs, and is poorly supported by Java.
Versioning. Tools should automatically update the version of a module when building and when publishing. Tools should automatically track type-based and test-based version compatibility, although developers should be able to specify additional incompatibilities (e.g. that are not visible due to inadequate tests). Imports should be able to specify a version, with the semantics that more recent compatible versions are permitted. Multiple versions of a module should be able to coexist in a running system.
Traits. Support interfaces that have default implementations, as traits. This includes at least methods, and perhaps fields. Extensions of a trait with methods that have a default implementation should be supported, without breaking modules that implement that trait (but do not implement the new method).
Runtime representation independence. The runtime system should not be usable to break the representation independence property (e.g. through type tests, casts, or reflection).
Good support for both abstract data types and object types. Rationale: usefulness of both is widely known in practice, see Jonathan Aldrich's Onward! 2013 essay.
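The versioning requirement above says that an import naming a version also admits more recent compatible versions. A minimal Python sketch of that selection rule follows. It assumes versions are tuples of integers and that the compatibility relation is supplied as a set of (old, new) pairs, which in practice would come from the type-based and test-based tracking described above. The name select_version is hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Version = Tuple[int, ...]


def select_version(
    requested: Version,
    available: Iterable[Version],
    compatible: Set[Tuple[Version, Version]],
) -> Optional[Version]:
    """Pick the most recent available version that can stand in for `requested`.

    `compatible` holds (old, new) pairs meaning "new is compatible with old".
    Returns None when neither the requested version nor a compatible newer
    one is available.
    """
    candidates = [
        v for v in available
        if v == requested or (v > requested and (requested, v) in compatible)
    ]
    return max(candidates, default=None)


available = [(1, 0), (1, 1), (1, 2), (2, 0)]
compatible = {((1, 0), (1, 1)), ((1, 0), (1, 2))}  # 2.0 is not compatible with 1.0

print(select_version((1, 0), available, compatible))  # (1, 2)
print(select_version((2, 0), available, compatible))  # (2, 0)
print(select_version((3, 0), available, compatible))  # None
```

Keeping the compatibility relation as explicit data, rather than deriving it from version numbers, leaves room for developers to declare incompatibilities that the tools cannot detect.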

Some additional requirements we may eventually want to consider:

Renaming. Support convenient renaming and possibly other trait operations.
Tool-supported best practices. For example, production code should not declare dependencies that it does not use (see Go), although it is OK to violate this temporarily during development.

Design

A formal description of the design can be found in the Module System section of the core-language document. We choose to model modules essentially as objects with additional infrastructure that gives each module a name (denoted by a URL) and allows the module to import other modules by URL. Modules include type members, so we must add type members to the public interface of objects. Object type members therefore now include both defs and type declarations. Types can be arrow types, a named type in scope, or a type denoted by a path ending in a type.

In order for the type system to be sound, module paths that lead to a type must be constant, i.e. return the same type component each time they are evaluated (see Harper and Pierce's module system chapter in ATTAPL for details). We therefore insist that a path start with an (unchangeable) variable, and that every field in the path be constant. Constant becomes an annotation that can decorate a def in a public object (or module) type. A constant def can only be implemented by a val declaration form (i.e. the declaration of a field that cannot be assigned after initialization).
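The path restriction can be illustrated with a small checker. This Python sketch is an assumption-laden model, not the formal rule from the core-language document: object types are plain dictionaries, and the names Member, Variable, and is_stable_type_path are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Member:
    constant: bool              # True if the def carries the constant annotation
    object_type: Optional[str]  # name of the member's object type; None for a type member


@dataclass(frozen=True)
class Variable:
    mutable: bool               # True for var, False for val
    object_type: str


ObjectTypes = Dict[str, Dict[str, Member]]


def is_stable_type_path(
    path: List[str], variables: Dict[str, Variable], types: ObjectTypes
) -> bool:
    """Check that path = [x, f1, ..., fn, T] starts at a val and crosses only constant fields."""
    if len(path) < 2:
        return False
    root = variables.get(path[0])
    if root is None or root.mutable:
        return False
    current = root.object_type
    for name in path[1:-1]:
        member = types.get(current, {}).get(name)
        if member is None or not member.constant or member.object_type is None:
            return False
        current = member.object_type
    last = types.get(current, {}).get(path[-1])
    return last is not None and last.object_type is None  # must end in a type member


types = {
    "Collections": {
        "stacks": Member(constant=True, object_type="StackModule"),
        "cache": Member(constant=False, object_type="StackModule"),
    },
    "StackModule": {"Stack": Member(constant=True, object_type=None)},
}
variables = {
    "c": Variable(mutable=False, object_type="Collections"),
    "d": Variable(mutable=True, object_type="Collections"),
}

print(is_stable_type_path(["c", "stacks", "Stack"], variables, types))  # True
print(is_stable_type_path(["c", "cache", "Stack"], variables, types))   # False: cache is not constant
print(is_stable_type_path(["d", "stacks", "Stack"], variables, types))  # False: d is a var
```

The two rejected paths are exactly the cases in which evaluating the path twice could yield different objects, and hence different type components.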

The module declaration construct includes the name of the module, which takes the form of a URL. The module can optionally be ascribed a type, which means that external modules importing it see it as an object of that type, and any members of the module not mentioned in that type are hidden. A module then has a series of import statements and a series of declarations.

Import statements include a URL from which the module is imported. If the URL is relative, the location of the current module is used as the starting point, following the convention used in HTML links. An import may optionally be ascribed a type. If so, that type must be a supertype of the imported module's type, and within the importing module, the imported module has the type ascribed in the import statement. An import may be given a short name, by which it is known from within the module. If no short name is given, the last name in the URL is used (this is the same convention used in the Go language).
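The resolution and naming conventions just described can be sketched with Python's standard URL utilities. The function names and the example.org URLs are hypothetical, and whether a file extension should be stripped from the default short name is left open here because the text does not say.

```python
import posixpath
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_import(current_module_url: str, import_url: str) -> str:
    """Resolve an import URL against the importing module's URL, as HTML links do."""
    return urljoin(current_module_url, import_url)


def short_name(resolved_url: str, explicit: Optional[str] = None) -> str:
    """Return the explicit short name, or default to the last name in the URL."""
    if explicit is not None:
        return explicit
    path = urlparse(resolved_url).path.rstrip("/")
    return posixpath.basename(path)


current = "http://example.org/lib/collections/stack"  # hypothetical module URL
resolved = resolve_import(current, "../util/link")
print(resolved)                   # http://example.org/lib/util/link
print(short_name(resolved))       # link
print(short_name(resolved, "L"))  # L
# Absolute import URLs are returned unchanged.
print(resolve_import(current, "http://example.com/json/parser"))
```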

Import statements may be used to import modules defined in other languages, such as JavaScript or Java. In the case of JavaScript, the type ascribed is used to give the module a type in Wyvern; this type is not checked (for now). The source language in an import may only be Wyvern, JavaScript, or Java for now, though others may be supported in the future. The source language is inferred from the MIME type of the URL (which in turn is based on the file extension in most URL infrastructures).
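Inferring the source language from a MIME type amounts to a table lookup. The following Python sketch assumes the MIME type arrives as a Content-Type value. The table entries are assumptions, and the Wyvern entry in particular is hypothetical, since the text does not name a MIME type for Wyvern sources.

```python
from typing import Dict

# Assumed MIME types; the Wyvern entry in particular is hypothetical.
LANGUAGE_BY_MIME_TYPE: Dict[str, str] = {
    "text/x-wyvern": "Wyvern",
    "text/javascript": "JavaScript",
    "application/javascript": "JavaScript",
    "text/x-java-source": "Java",
}


def source_language(content_type: str) -> str:
    """Infer the source language of an import from its Content-Type value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return LANGUAGE_BY_MIME_TYPE[mime_type]
    except KeyError:
        raise ValueError(
            f"unsupported source language for MIME type {mime_type!r}"
        ) from None


print(source_language("text/javascript; charset=utf-8"))  # JavaScript
print(source_language("text/x-wyvern"))                   # Wyvern
```

Rejecting unknown MIME types outright matches the statement that only three source languages are admitted for now.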

Declarations may be annotated as public. If a module is not ascribed a signature explicitly, a signature is generated from the public members of the module. Any member that is not public is implicitly not a part of its type. The same rule holds true for new statements used to initialize arbitrary objects. If a signature is ascribed to a module, either the module must have no members declared public, or all the elements exposed in the signature must be declared public (with subtypes of their types in the signature). Note that we choose private as the default rather than public in order to "nudge" developers toward hiding things unless they should be exposed, rather than the other way around.
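The interaction between public annotations and ascribed signatures can be sketched as a small check. In this Python illustration, types are plain strings and string equality stands in for subtyping unless a different is_subtype function is passed in. Both simplifications are assumptions, as are the names Decl and module_signature. The sketch also applies the subtype check whether or not any member is declared public, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Decl:
    name: str
    type: str
    public: bool = False  # private is the default


Signature = Dict[str, str]  # member name -> type


def module_signature(
    decls: List[Decl],
    ascribed: Optional[Signature] = None,
    is_subtype: Callable[[str, str], bool] = lambda sub, sup: sub == sup,
) -> Signature:
    """Return the signature clients see, enforcing the public/ascription rule."""
    public = {d.name: d.type for d in decls if d.public}
    if ascribed is None:
        return public  # signature generated from the public members
    by_name = {d.name: d.type for d in decls}
    for name, sig_type in ascribed.items():
        if name not in by_name:
            raise TypeError(f"{name} is in the signature but not in the module")
        if not is_subtype(by_name[name], sig_type):
            raise TypeError(f"{name} : {by_name[name]} is not a subtype of {sig_type}")
        # If any member is declared public, every exposed member must be public.
        if public and name not in public:
            raise TypeError(f"{name} is exposed by the signature but not declared public")
    return dict(ascribed)


decls = [Decl("push", "Int -> Unit", public=True), Decl("list", "Link?")]
print(module_signature(decls))                                    # {'push': 'Int -> Unit'}
print(module_signature(decls, ascribed={"push": "Int -> Unit"}))  # {'push': 'Int -> Unit'}
```

Ascribing a signature that exposes list would be rejected here, because the module declares some members public but not that one.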

For now, there is a one-to-one correspondence between modules and files. This will quickly be broken when we support relative import.

Later extensions of the module system will have the following features:

Hierarchical composition of modules and ascription of the containing module.
Dynamic module loading.
Module versioning.
Relative import and non-self-describing modules, including modules in a DSL.
A more disciplined design for handling imports from other languages, especially dynamically typed languages such as JavaScript. This may be related to F#'s type providers.

Wyvern picture by Zigeuner. Picture made for the Blazon Project of French-speaking Wikipedia. CC-BY-SA-3.0.