Abstract. This document, aimed at potential users of the Wyvern programming language, describes the rationale for the language. It will grow to include a specification for Wyvern as well.
Better programming languages have revolutionized software development--from Fortran, which freed programmers from assembly language, through Java, which brought type safety and garbage collection to the masses, and JavaScript, which made the web come alive. Yet current tools for building applications for the web and for mobile devices, two of the most vibrant software sectors today, are woefully inadequate. The problems are numerous and significant:
Diversity of Languages and Artifacts. Web applications today are written as a poorly-coordinated mishmash of artifacts written in different languages, file formats, and technologies. For example, a web application may consist of JavaScript code on the client, HTML for structure, CSS for presentation, XML for AJAX-style communication, and a mixture of Java, plain text configuration files, and database software on the server. This diversity increases the cost of developers learning these technologies. It also means that ensuring system-wide safety and security properties in this setting is difficult.
Weak static safety. Many of the technologies used above--including JavaScript, text files, and server language choices such as Python and Ruby--provide little checking at compile time or configuration time that the system is well-formed. Even basic consistency properties--such as the existence of a class named in a configuration file--remain unchecked unless ad-hoc tools are built that are specific to a web framework.
Reduced developer productivity. Because of the lack of static safety and consistency checks, even simple errors must be tediously diagnosed based on often-inadequate run-time error messages. Languages without types also make it more difficult to provide the kind of IDE support, such as auto-completion (which is, at best, much more limited in dynamically-typed languages), that developers rely on to work productively when building on third-party libraries and components.
Poor coordination within and across organizations. Languages without types make coordination within and across organizations more difficult, because developers cannot use types to see how components developed by others are supposed to be used. This, in turn, leads to more productivity problems, as well as additional defects and vulnerabilities.
Insecure language and library constructs. Many of today's languages base their language and library constructs on concepts that are convenient and/or efficient to implement, but which may have poor safety and security properties. For example, the default integer type in C and Java is of limited width and has wrap-around semantics--but developers usually use this type to represent mathematical integers that have a limited range. If program input--possibly from an attack--ever causes an integer to exceed the intended range, the program silently computes the wrong result rather than flagging the error. While numbers are handled more securely in Python and JavaScript, these languages have similar problems with generic string types, which do not easily capture the formatting expectations that are essential to combating injection attacks. Collectively, these problems make it more difficult to find program errors, and create an opportunity for a program error to turn into a security vulnerability.
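The difference between silent wrap-around and a flagged error can be made concrete with a small Python sketch. Python integers are unbounded, so the 32-bit wrap-around is simulated here; wrap_int32 and checked_add are hypothetical helper names introduced only for illustration and are not part of Wyvern or of any library.

```python
def wrap_int32(value):
    """Simulate two's-complement wrap-around of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def checked_add(a, b, low, high):
    """Add two integers, raising an error instead of leaving [low, high]."""
    result = a + b
    if not low <= result <= high:
        raise OverflowError(f"{a} + {b} is outside [{low}, {high}]")
    return result


MAX_INT32 = 2**31 - 1

# Wrap-around semantics: one past the maximum silently becomes the minimum.
wrapped = wrap_int32(MAX_INT32 + 1)

# Range-checked semantics: the same addition is reported as an error.
try:
    checked_add(MAX_INT32, 1, -(2**31), MAX_INT32)
    flagged = False
except OverflowError:
    flagged = True
```

In the first case the program continues with a large negative number; in the second the out-of-range input is detected at the point where it occurs.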
Low-level abstractions. Today's programming languages support abstractions that are too low-level, making it more difficult to assure safety and security, and requiring programmers to write a great deal of unnecessary boilerplate code. The problems include describing data models at a low level, programming distributed communication within a system out of low-level primitives, the lack of any system-level design view, and finally the fact that system-wide security properties cannot be explicitly expressed (in part because the other abstractions above are too low-level). The level of abstraction in the language in turn means that reasoning about high-level system properties becomes extraordinarily difficult, and consequently those properties are easy to violate and hard to assure.
Existing industrial or research languages have made progress on some of these problems, but the solutions remain inadequate. For example, Ruby on Rails demonstrates that a single language and platform can, through the judicious use of internal domain-specific languages (DSLs), express a rich variety of artifacts, including code, presentation, navigation structure, and other features. However, developers should not have to give up the safety of typed languages--indeed, types are essential to improving the coordination among the artifacts used to describe a web or mobile application. Furthermore, integration should not be supported only on the server side, as with Ruby on Rails, but across the client and server.
A Wyvern is a two-legged, winged dragon. The Wyvern language emphasizes security, and just as treasure guarded by a Wyvern ought to be secure, so should be programs written in the Wyvern language.
Goal. The goal of Wyvern is to be an excellent programming language for engineering web and mobile applications. While the area of focus is important, the language is really driven by engineering needs. Engineers understand the need to balance multiple factors: with respect to a language, those factors include developer productivity, assurance of the end product, and run-time efficiency.
Target audience. Wyvern is targeted at software engineers who are developing applications for web and mobile platforms. Today, these developers are likely to be writing code in JavaScript on the client and in languages such as Python or Java on the server. Assurance, productivity, and efficiency are all important to our target audience.
Approach. Wyvern begins with a simple core language with good support for object-oriented programming as well as functional abstractions. It builds on this to address the challenges outlined above through a number of strategies:
Wyvern's design should have the following properties, which facilitate the overall goal of Wyvern as a language for engineering web and mobile applications. For each property we attempt to provide a basis for judging whether the language design adequately fulfills the property. This basis is ideally objective, but may be subjective in many cases by necessity. The properties are:
Simplicity. Wyvern's goal is to be of comparable simplicity to languages such as Smalltalk, Self, Python, Scheme, Forth, or Lua. This sets a high bar, especially as Wyvern is to be statically typed and all these languages are dynamic. Among statically typed languages, Wyvern should be similar in simplicity to C, ML, or the first version of Java--though these examples are not fully satisfying from the simplicity point of view. Simplicity in the design should be reflected in simplicity of the syntax, dynamic semantics, and static semantics.
Readability. Reading source code is more important to engineers than writing it, especially when evolving software in a team setting. Programs in Wyvern should be as succinct and as clear as programs in Python or Smalltalk, existing languages that emphasize readability.
Flexibility. In order for Wyvern to express declarative information that typically goes in configuration files, Wyvern must have flexible syntax and semantics. The language's syntax should be at least as good for defining internal DSLs as is the syntax of Ruby or Python. Test cases for the flexibility of the language include expressing the structure of a web page, the architectural structure of an application, or make-like build dependencies.
Safety. Wyvern will be type- and memory-safe. It will have good support for contracts and unit tests, with all contracts dynamically checkable and some statically checked. The design of the language and its libraries will provide a strong defense against the top 5 OWASP vulnerabilities (as of 11/15/11: injection, XSS, broken authentication and sessions, insecure direct object references, and CSRF). The flexibility of the language's syntax for expressing DSLs will be complemented by extensibility of static and dynamic checking to ensure that programs written in those DSLs are sensible and correct.
Agility. Change tasks in Wyvern should result in editing deltas that are better than editing deltas for similar change tasks in Java. Ideally, Wyvern will come close to the agility of Python in this respect, though meeting Python's standard may be difficult for a statically-typed language without non-local type inference.
Interactivity. Wyvern will support a read-eval-print loop (and read-eval-display on web/mobile platforms). Execution information will be available for debugging or reflective use; overall, the language should be as debuggable as Smalltalk.
Modularity. Modularity is critical to separate reasoning, and therefore separate development--both of which are critical in a large-scale engineering context. Any complex construct (i.e., one with repeating parts) in Wyvern can be given a name and reused. Wyvern's type system is modular, checking a module based only on that module's source code and the interfaces of other modules. Module interfaces are complete in that compilation will never fail if the module behind an interface is swapped for another module that matches the same interface. For safety, Wyvern's extension facilities cannot affect code that does not enforce them. Similarly, reflection, casts, and other dynamic constructs cannot be used to violate interface restrictions.
Efficiency. The performance of Wyvern should scale with the patterns of usage. Programs that are isomorphic to Java or C should have execution times that are within 10% of equivalent programs written in those languages (in the case of C, the comparison point includes the overhead of a conservative garbage collector). Programs that use more advanced features may be slower, but should still be within a 2-3x factor of C or Java performance.
Data Persistence and Distribution. Wyvern will support high-level declarations of the persistence and distribution strategy for data, which can be implemented with customized semantics; this is generally important in the target domain. Data declarations will support expressing simple high-level relational models, including bidirectional relationships and one-to-many relationship constraints; these are essential for defining adequate security policies.
Wyvern is intended to be a useful, practical language, but also to be a means to investigate scientific questions. Through the design, implementation, and evaluation of the Wyvern language, we hope to pursue research in the following areas:
Wyvern is a whitespace-sensitive language. This can be implemented in a fairly simple and clearly specified way, as demonstrated by the Python language and by Adams et al.'s POPL 2013 paper on principled parsing of whitespace-sensitive languages. Many programmers, including ourselves, feel that whitespace sensitivity enhances readability. It definitely avoids issues in matching parentheses and curly braces, and avoids the if-statement (dangling else) ambiguity in C. Finally, whitespace indentation levels provide a convenient way to delimit DSLs, while placing few restrictions on the DSL. In particular, anything at all can appear in a DSL as long as it is indented relative to the surrounding text.
As a secondary point, whitespace sensitivity fits nicely with Wyvern's goal to support web programming, as several other languages in this space are whitespace sensitive.
Wyvern provides C-style and single-line comments. Line continuations are as specified in Python and in the C preprocessor (in C, newline characters are significant in macro definitions).
As the Python and C approach to line continuations seemed slightly ad-hoc, we considered alternatives such as allowing a line continuation when the next line was indented a specific amount. However, we are also using indentation to denote blocks and to delimit DSLs. We felt it would be ambiguous, overly restrictive, and/or too confusing to use indentation for two different purposes.
The input to lexical analysis is a stream of ASCII characters (but see the extensions below). The output of lexical analysis is a stream of tokens of the following kinds:
Comments. Wyvern supports C-style and single-line comments. In a C-style comment, all characters between a starting /* and an ending */ are ignored. C-style comments cannot be nested.
In a single-line comment, all characters from the starting // up to the end of the line are ignored. The end-of-line sequence that terminates the comment is not part of the comment, however, and remains significant to the rest of lexical analysis.
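The two comment forms can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the Wyvern lexer: strip_comments is a hypothetical name, and the choice to leave comment markers inside string literals untouched is an assumption, since the interaction between comments and strings is not spelled out here.

```python
def strip_comments(source):
    """Remove /* ... */ and // comments, keeping end-of-line characters
    that terminate single-line comments."""
    out = []
    i, n = 0, len(source)
    while i < n:
        if source.startswith("/*", i):
            # C-style comments do not nest: the first */ closes the comment.
            end = source.find("*/", i + 2)
            if end == -1:
                raise SyntaxError("unterminated /* comment")
            i = end + 2
        elif source.startswith("//", i):
            # Skip up to, but not including, the end-of-line sequence.
            while i < n and source[i] not in "\r\n":
                i += 1
        elif source[i] == '"':
            # Assumption: text inside a string literal is never a comment.
            end = source.find('"', i + 1)
            if end == -1:
                raise SyntaxError("unterminated string literal")
            out.append(source[i:end + 1])
            i = end + 1
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Because nesting is not supported, an input such as /* a /* b */ c */ ends its comment at the first */, leaving the trailing text c */ to be lexed as ordinary tokens.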
Lines and line joining. Our specification for explicit and implicit line joining is taken from part of the Python reference. A physical line is a sequence of characters terminated by an end-of-line sequence. An end-of-line sequence is one of: the ASCII LF character, the ASCII sequence CR LF, or the ASCII CR character.
Two or more physical lines may be joined into logical lines using backslash characters (\), as follows: when a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following physical line to form a single logical line, deleting the backslash and the following end-of-line sequence. A line ending in a backslash cannot carry a comment. A backslash does not continue a comment.
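A minimal Python sketch of physical-line splitting and explicit line joining follows. The function names are hypothetical, and the sketch makes a simplifying assumption: comments have already been removed and no string literal ends a line with a backslash, so any trailing backslash is treated as a continuation.

```python
import re

# An end-of-line sequence is CR LF, LF, or CR; CR LF is tried first so that
# it is consumed as a single terminator.
END_OF_LINE = re.compile(r"\r\n|\n|\r")


def physical_lines(source):
    """Split source text into physical lines."""
    return END_OF_LINE.split(source)


def join_explicit(lines):
    """Join physical lines that end in a backslash with their successor."""
    logical = []
    buffer = None
    for line in lines:
        continued = line.endswith("\\")
        piece = line[:-1] if continued else line
        buffer = piece if buffer is None else buffer + piece
        if not continued:
            logical.append(buffer)
            buffer = None
    if buffer is not None:
        # The last physical line ended in a backslash with nothing after it.
        logical.append(buffer)
    return logical
```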
Newline characters are ignored inside matching parentheses, square brackets, or curly braces, as in Python. Following the Python spec, implicitly continued lines can carry comments. The indentation of the continuation lines is not important. Blank continuation lines are allowed.
A logical line that contains only spaces, tabs, formfeeds, and possibly a comment is ignored (i.e., no NEWLINE token is generated).
Indentation. Our specification for indentation is adapted directly from Python's. Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line.
A line's indentation is denoted by the sequence of spaces and tabs preceding the first non-blank character of a line. Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation.
Although the language specification permits both tabs and spaces in indentation, different editors display tabs in different ways, so it is recommended that programmers (and Wyvern editors) not use tabs in Wyvern files.
The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows.
Before the first line of the file is read, the empty string "" is pushed on the stack; this will never be popped off again. Each string pushed on the stack will always have the previous string on the stack as a prefix, with at least one whitespace character added. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is longer, it is pushed on the stack, and one INDENT token is generated. If it is shorter, it must be one of the strings occurring on the stack; all strings on the stack that are longer are popped off, and for each string popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each string remaining on the stack that is longer than the empty string.
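The stack discipline described above can be sketched in Python as follows. This is a minimal illustration rather than the Wyvern implementation: indentation_tokens and the ("LINE", text) placeholder token are hypothetical names, and the input is assumed to be a list of logical lines from which comments have already been removed.

```python
def indentation_tokens(logical_lines):
    """Produce INDENT and DEDENT tokens from the leading whitespace of each
    logical line, using a stack of indentation strings."""
    stack = [""]  # the empty string is never popped
    tokens = []
    for line in logical_lines:
        body = line.lstrip(" \t")
        if not body:
            continue  # blank lines generate no tokens
        indent = line[: len(line) - len(body)]
        top = stack[-1]
        if indent == top:
            pass  # same level: nothing happens
        elif len(indent) > len(top):
            # A deeper level must extend the previous indentation string.
            if not indent.startswith(top):
                raise IndentationError("indentation does not extend the enclosing block")
            stack.append(indent)
            tokens.append("INDENT")
        else:
            # A shallower (or mismatched) level must already be on the stack.
            if indent not in stack:
                raise IndentationError("dedent does not match any enclosing block")
            while stack[-1] != indent:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append(("LINE", body))
    # At end of file, close every block that is still open.
    while len(stack) > 1:
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

Comparing indentation strings rather than column counts is what makes mixed tabs and spaces safe: a line indented with a tab never matches a block opened with spaces, so it is rejected instead of being silently misread.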
Whitespace after the first non-whitespace character of a line serves as a delimiter between tokens.
Other tokens. An identifier is a sequence of characters that begins with a letter or underscore, and contains letters, underscores, and digits. An identifier may also be a sequence of operator characters, which include =, <, >, !, ~, ?, :, &, |, +, -, *, /, ^, and %. Operator identifiers may not contain the comment sequences /* or //.
A number is a sequence of digits. A string begins with ", includes any number of non-" characters but no end-of-line sequences, and ends with a ".
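The identifier, operator, number, and string rules above translate directly into regular expressions. The Python sketch below covers only those four token kinds; tokenize is a hypothetical name, letters are taken to be ASCII letters, and comments are assumed to have been removed beforehand (so operator identifiers cannot contain /* or //). Any other character, such as a parenthesis, is reported as an error because its token kind is not described here.

```python
import re

TOKEN = re.compile(r'''
    (?P<NUMBER>[0-9]+)
  | (?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OPERATOR>[=<>!~?:&|+\-*/^%]+)
  | (?P<STRING>"[^"\r\n]*")
  | (?P<SKIP>[ \t]+)
''', re.VERBOSE)


def tokenize(line):
    """Split one logical line (without indentation) into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {line[pos]!r}")
        if match.lastgroup != "SKIP":  # whitespace only separates tokens
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Because an operator identifier is a maximal run of operator characters, an input such as x<=y yields the single operator <= rather than < followed by =.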
Types in Core Wyvern consist of object types, function types, tuples, and option types. The 0-ary tuple type Unit has only one value, written ().
Declarations in Wyvern include types, methods, values, and variables. A block of consecutive type and method declarations may be mutually recursive. On the other hand, values and variables are only in scope after their declarations.
An object type in Wyvern is declared with the keyword type, and consists of a set of method and property signatures. An object type can be referred to anywhere in its scope using its name.
A method signature is declared with the keyword meth, and consists of a method name, method arguments, and method result type. The method arguments are a comma-separated list of pairs of each argument's name and type. If a result type is not specified then it is type Unit.
To be implemented later. The method result type may be a simple type, or it may be a tuple. If a tuple, the result type is enclosed in parentheses and includes a comma-separated list of pairs of a tuple element's name and type. Support first-class tuples in pattern matching and argument passing.
A property signature is declared with the keyword prop (for properties that are only readable) or var (for properties that can be directly written). Property signatures consist of the property name and type.
Future note. We probably want some way of saying that a property is immutable (not just missing a write accessor) but we have postponed the decision about whether this goes in the type system or a specification.
A function type is written A -> B, where A is the argument type and B is the result type. An option type is written T?, and indicates that a value is either present at type T, or the value is null. ? binds more strongly than ->. In the future, when polymorphic types are added, T? may be syntactic sugar for something like Option[T] or T Option.
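The precedence rule can be illustrated with a tiny recursive-descent parser in Python. The class and function names are hypothetical, and two details are assumptions not stated above: parentheses are accepted for grouping, and -> is treated as right-associative. The sketch also silently ignores characters it does not recognize.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Named:
    name: str


@dataclass(frozen=True)
class Option:
    inner: object


@dataclass(frozen=True)
class Arrow:
    argument: object
    result: object


def parse_type(text):
    """Parse a type expression in which ? binds more strongly than ->."""
    tokens = re.findall(r"->|[A-Za-z_][A-Za-z_0-9]*|[?()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of type")
        pos += 1
        return token

    def atom():
        token = take()
        if token == "(":
            inner = arrow()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if token in ("?", ")", "->"):
            raise SyntaxError(f"unexpected {token!r}")
        return Named(token)

    def postfix():
        # ? is applied to the nearest atom before any -> is considered.
        node = atom()
        while peek() == "?":
            take()
            node = Option(node)
        return node

    def arrow():
        left = postfix()
        if peek() == "->":
            take()
            return Arrow(left, arrow())  # assumption: right-associative
        return left

    result = arrow()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return result
```

Under these rules Int -> Int? denotes a function that returns an optional Int, whereas an optional function has to be written with explicit grouping, as in (Int -> Int)?.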
Example syntax for types is shown below:
    type IntCell
        var contents : Int

    type Stack
        prop top : Int?
        meth push(element : Int)
        meth pop() : Int?

    type StackFactory
        meth make() : Stack
        meth makeWithFirst(firstElement : Int) : Stack

    type ListUtilities
        meth map(f : Int -> Int, l : IntList) : IntList
        // meth map(f : Int -> Int) : IntList -> IntList         // curried version
        // meth map(f : Int -> Int)(l : IntList) : IntList       // curried version with sugar
A class consists of a set of methods, fields, and class methods.
A method consists of a method signature, as described above, plus a method body. A method body is either a single expression after an = symbol, or a sequence of statements which is indented and starts on the following line. A class method is identical to a method except for the use of the keyword class, and it defines a method on the class object rather than on an object instantiated from the class.
A field consists of the keyword var (for mutable fields) or val (for immutable fields), a name, a type, and an optional initialization expression. If the initialization expression is omitted, the field is initialized to the empty constant (for numbers, strings, and options), or else must be initialized in any object construction expression.
A class may be ascribed an object type and (separately) a class type. Object type ascription constrains the type of objects generated by the class. From outside the class body, the class appears to have only the elements in the ascribed type. Class type ascription constrains the type of the class itself. From outside the class, the class appears to have only the class methods mentioned (as ordinary methods) in the class type. If multiple class or object types are ascribed, the actual ascription is the type-theoretic intersection of the ascribed types (i.e. it has the union of the members in the ascribed types).
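The relationship between intersecting types and taking the union of their members can be shown with a small Python sketch. Types are modelled as dictionaries from member names to signature strings; intersect_types, the Printable type, and the String type are hypothetical, and rejecting a member that appears with two different signatures is an assumption, since that case is not addressed above.

```python
def intersect_types(*types):
    """Combine ascribed types: the result has every member of every type."""
    merged = {}
    for members in types:
        for name, signature in members.items():
            if merged.get(name, signature) != signature:
                # Assumption: conflicting signatures are treated as an error.
                raise TypeError(f"conflicting signatures for {name!r}")
            merged[name] = signature
    return merged


stack_type = {"push": "Int -> Unit", "pop": "Unit -> Int?"}
printable_type = {"show": "Unit -> String"}  # hypothetical second ascription

combined = intersect_types(stack_type, printable_type)
```

An object of the combined type can be used wherever either ascribed type is expected, which is why having more members corresponds to the intersection rather than the union of the types.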
Each class defines a type that contains only that class's implementation. This type is the principal type of the class if no type is ascribed; otherwise, the type is a subtype of the ascribed type but has no additional members.
Future: public/private as syntactic sugar? What is the signature of a package? Can you distinguish (A) type members that refer to a particular type implementation from (B) type members that do not? Then type members are the general syntax. ML abstype? Do we write abstype as "class" in the signature of a package?
To consider: class fields? Seems like a bad practice generally, but some uses are OK (e.g. to support hash-consing a.k.a the Flyweight pattern). Default constructors. Destructors. Type members, bounding, and instantiation. Case of. Comprises. Tagged. Subtyping. Inheritance or delegation. Default method parameters, useful in particular for constructor calls?
In the present design, in order to access a field f (or method m) on the receiver, you must write "this.f". We may allow f to be used directly, but then we must adopt Newspeak's "lexical search first" rules to avoid capture (see Modules as Objects in Newspeak).
The standard name for class methods that act as constructors or factory methods is make.
Statements start on a line, and formally include all following lines at the same or greater level of indentation. A val or var declaration consists of the keyword val or var, a name, an optional type, and an initialization expression. If the type is absent, it is inferred as the most precise type of the initialization expression. Val declarations define a read-only, let-bound variable, scoped to the statement starting on the following line at the same indentation level. Var declarations define a mutable variable with the same scope.
Expressions include variable reads and assignments, first-class functions, function applications, property reads and assignments, object creations, and method calls.
If a method is called or a property is accessed on null, using the special selector .?, the result is null. We may instead use some other form of syntactic sugar, such as "propNull x x.getOption()" where both x and the result of getOption are option types.
Methods may be called with named parameters: use x:5 syntax (avoid clash with =) or use := for assignment
To consider: where can new be used? Only inside the class? Only if no ascription (which would hide the new "operation") has been used? May want to autogenerate standard constructors in some cases. Notion of a principal constructor used for pattern matching. Rob: can we make "meth m(x) = e" sugar for "val m = fn x => e".
If class Link is defined with a method make, can we use the shorthand Link(0,null) in place of Link.make(0,null)? Constructors must be defined explicitly for each class; unlike in Java, there are no default constructors. We anticipate that IDEs will make this less painful.
Example syntax for classes, methods, fields, statements, and expressions is shown below:
    class StackImpl
        implements Stack
        class implements StackFactory
        var list : Link?
        meth top() = list.data
        meth push(element)
            list = Link(element, list)
        meth pop()
            val result = list.data
            list = list.next
            result
        class meth make() = new StackImpl
        class meth makeWithFirst(firstElement)
            new StackImpl
                list = Link(firstElement, null)

    class Link
        val data : Int
        val next : Link?
        class meth make(d:Int, n:Link?) =
            new Link(data=d, next=n)

    // a package-level method (method of the package object, if we have one)
    meth stackClient()
        val s = StackImpl.Stack()
        s.push(5)
        print(s.top)
        val addOne : Int -> Int = fn(x:Int) => x+1
        print(addOne(s.pop()))
A formal description of the design can be found in the Module System section of the core-language document. We choose to model modules essentially as objects that have additional infrastructure that gives each module a name (denoted by a URL) and allows the module to import other modules by URL. Modules include type members, so we must add type members to the public interface of objects. Object type members therefore now include both defs and type declarations. Types can be arrow types, a named type in scope, or a type denoted by a path ending in a type.
In order for the type system to be sound, module paths that lead to a type must be constant--i.e. return the same type component each time they are evaluated (see Harper and Pierce's module system chapter in ATAPL for details). We therefore insist that a path start with an (unchangeable) variable, and that every field in the path be constant. Constant becomes an annotation that can decorate a def in a public object (or module) type. A constant def can only be implemented by a val declaration form (i.e. the declaration of a field that cannot be assigned after initialization).
The module declaration construct includes the name of the module, which takes the form of a URL. The module can optionally be ascribed a type, which means that external modules importing it see it as an object of that type, and any members of the module not mentioned in that type are hidden. A module then has a series of import statements and a series of declarations.
Import statements include a URL from which the module is imported. If the URL is relative, the location of the current module is used as the starting point, following the convention used in HTML links. An import may optionally be ascribed a type. If so, that type must be a supertype of the imported module's type, and within the importing module, the imported module has the type ascribed in the import statement. An import may be given a short name, by which it is known from within the module. If no short name is given, the last name in the URL is used (this is the same convention used in the Go language).
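Relative import resolution and the default short name can be sketched with Python's standard URL utilities. The function names and the example.org URLs are hypothetical. Whether a file extension is dropped from the short name is not specified above, so the sketch keeps the last path element unchanged.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_import(current_module_url, import_url):
    """Resolve an import URL against the importing module's own URL,
    in the same way a browser resolves a relative HTML link."""
    return urljoin(current_module_url, import_url)


def default_short_name(module_url):
    """Return the last name in the URL path, used when an import is not
    given an explicit short name."""
    path = urlparse(module_url).path
    return posixpath.basename(path.rstrip("/"))


current = "http://example.org/lib/collections/stack"  # hypothetical module URL
sibling = resolve_import(current, "list")
parent = resolve_import(current, "../util/io")
absolute = resolve_import(current, "http://other.example.org/net/http")
```

Here sibling resolves next to the importing module, parent resolves one directory up, and an absolute URL is returned unchanged.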
Import statements may be used to import modules defined in other languages, such as JavaScript or Java. In the case of JavaScript, the type ascribed is used to give the module a type in Wyvern; this type is not checked (for now). The source language in an import may only be Wyvern, JavaScript, or Java for now, though others may be supported in the future. The source language is inferred from the MIME type of the URL (which in turn is based on the file extension in most URL infrastructures).
Declarations may be annotated as public. If a module is not ascribed a signature explicitly, a signature is generated from the public members of the module. Any member that is not public is implicitly not a part of its type. The same rule holds true for new statements used to initialize arbitrary objects. If a signature is ascribed to a module, either the module must have no members declared public, or all the elements exposed in the signature must be declared public (with subtypes of their types in the signature). Note that we choose private as the default rather than public in order to "nudge" developers toward hiding things unless they should be exposed, rather than the other way around.
For now, there is a one-to-one correspondence between modules and files. This will quickly be broken when we support relative import.
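The visibility rule can be sketched in Python as follows. A module is modelled as a dictionary from member names to a pair of type and public flag; module_signature is a hypothetical name, and the required subtype check between a member's type and its type in the ascribed signature is deliberately left out of this sketch.

```python
def module_signature(declarations, ascribed=None):
    """Compute the signature that importers of a module see.

    declarations maps each member name to (type, is_public).
    ascribed, if given, maps member names to types.
    """
    public = {name: typ for name, (typ, is_public) in declarations.items()
              if is_public}
    if ascribed is None:
        # No ascription: the signature is generated from the public members.
        return public
    if public:
        # With public annotations present, everything the signature exposes
        # must be declared public.
        missing = sorted(set(ascribed) - set(public))
        if missing:
            raise TypeError(f"exposed but not declared public: {missing}")
    return dict(ascribed)


declarations = {
    "make": ("Unit -> Stack", True),
    "cache": ("Link?", False),  # private by default, hidden from importers
}
generated = module_signature(declarations)
```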
Later extensions of the module system will have the following features: