Introduction | In this lecture we will learn about the lowest level of the Java language: its tokens. We will learn how to recognize and classify every category of token (which is like classifying English words into their parts of speech). Towards this end, we will employ ourly new learned EBNF skilss to write and analyze descriptions for each category of token. In later lectures we will learn about a programming language's higher level structures: phrases (expressions), sentences (statements), paragraphs (blocks/methods), chapters (classes), and books (packages). |
The Family History of Java | Before going on to study Java, let's take a brief look, through quotes, at the languages on which Java was based, traveling back over 30 years to do so. |
Where it starts: C |
The earliest precursor of Java is C: a language developed by Ken Thompson
at Bell Labs in the early 1970s.
C was used as a system programming language for the DEC PDP-7.
C began achieving its widespread popularity when Bell's Unix operating
system was rewritten in C.
Unix was the first operating system written in a high-level language;
it was distributed to universities for free, where it became popular.
Linux is currently a popular (it is still free!) variant of Unix.
"C is a general-purpose programming language which features economy of
expression, modern control flow and data structures, and a rich set of
operators.
C is not a "very high level" language, nor a "big" one, and is not
specialized to any particular area of application."
- B. Kernighan/D. Ritchie: The C Programming Language
(Kernighan & Ritchie designed and implemented C) |
From C to C++ |
"A programming language serves two related purposes: it provides a vehicle
for the programmer to specify actions to be executed, and it provides a
set of concepts for the programmer to use when thinking about what can
be done.
The first aspect ideally requires a language that is "close to the
machine," so that all important aspects of a machine are handled simply
and efficiently in a way that is reasonably obvious to the programmer.
The C language was primarily designed with this in mind.
The second aspect ideally requires a language that is "close to the
problem to be solved" so that the concepts of a solution can be expressed
directly and concisely.
The facilities added to C to create C++ were primarily designed with this
in mind"
- B. Stroustrup: The C++ Programming Language (2nd Ed)
(Stroustrup designed and implemented C++) |
Java as a Successor to C++ |
"The Java programming language is a general-purpose, concurrent, class-based,
object-oriented language.
It is designed to be simple enough that many programmer can achieve fluency
in the language.
The Java programming language is related to C and C++ but it is organized
rather differently, with a number of aspects of C and C++ omitted and a
few ideas from other languages included.
It is intended to be a production language, not a research language, and
so, as C.A.R. Hoare suggested in his classic paper on language design,
the design has avoided including new and untested features.
... The Java programming language is a relatively high-level language, in that details of the machine representation are not available through the language. It includes automatic storage management, typically using a garbage collector, to avoid the safety problems of explicit deallocation (as in C's free or C++'s delete). High-performance garbage-collected implementations can have bounded pauses to support systems programming and real-time applications. The language does not include any unsafe constructs, such as array accesses without index checking, since such unsafe constructs would cause a program to behave in an unspecified way."
- J. Gosling, B. Joy, G. Steele, G. Bracha: The Java Language Specification
|
Overview of Tokens in Java: The Big 6 |
In a Java program, all characters are grouped into symbols called
tokens.
Larger language features are built from the first five categories of tokens
(the sixth kind of token is recognized, but is then discarded by the Java
compiler from further processing).
We must learn how to identify all six kind of tokens that can appear in
Java programs.
In EBNF we write one simple rule that captures this structure:
token <= identifier | keyword | separator | operator | literal | comment We will examine each of these kinds of tokens in more detail below, again using EBNF. For now, we briefly describe in English each token type.
|
The Java Character Set |
The full Java character set includes all the
Unicode
characters; there are 216 = 65,536 unicode characters.
Since this character set is very large and its structure very complex, in
this class we will use only the subset of unicode that includes all the
ASCII (pronounced "Ask E") characters; there are 28 = 256
ASCII characters, of which we will still use a small subset containing
alphabetic, numeric, and some special characters.
We can describe the structure of this character set quite simply in EBNF, using only alternatives in the right hand sides.
lower-case <= a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z In the special rule, the bracket/brace characters stand for themselves (not EBNF options nor repetitions) and one instance of the vertical bar stands for itself too: this is the problem one has when the character set of the language includes special characters that also have meanings in EBNF. White space consists of spaces (from the space bar), horizontal and vertical tabs, line terminators (newlines and formfeeds): all are non-printing characters, so we must describe them in English. White space and tokens are closely related: we can use white space to force the end of one token and the start of another token (i.e., white space is used to separate tokens). For example XY is considered to be a single token, while X Y is considered to be two tokens. The "white space separates tokens" rule is inoperative inside String/char literals, and comments (which are all discussed later). Adding extra white space (e.g., blank lines, spaces in a line -often for indenting) to a program changes its appearance but not its meaning to Java: it still comprises exactly the same tokens in the same order. Programmers mostly use white space for purely stylistic purposes: to isolate/emphasize parts of programs and to make them easier to read and understand. Just as a good comedian know where to pause when telling a joke; a good programmer knows where to put white space when writing code. |
Identifiers |
The first category of token is an Identifier.
Identifiers are used by programmers to name things in Java: things such as
variables, methods, fields, classes, interfaces, exceptions, packages, etc.
The rules for recognizing/forming legal identifiers can be easily stated in
EBNF.
id-start    <= alphabetic | $ | _ Although identifiers can start with and contain the $ character, we should never include a $ in identifiers that we write; such identifiers are reserved for use by the compiler, when it needs to name a special symbol that will not conflict with the names we write. Semantically, all characters in an identifier are significant, including the case (upper/lower) of the alphabetic characters. For example, the identifier Count and count denote different names in Java; likewise, the identifier R2D2 and R2_D2 denote different names. When you read programs that I have written, and write your own program, think carefully about the choices made to create identifiers.
Carefully avoid identifiers that contain dollar signs; avoid
|
Keywords |
The second category of token is a Keyword, sometimes called a
reserved word.
Keywords are identifiers that Java reserves for its own use.
These identifiers have built-in meanings that cannot change.
Thus, programmers cannot use these identifiers for anything other than their
built-in meanings.
Technically, Java classifies identifiers and keywords as separate categories
of tokens.
The following is a list of all 49 Java keywords we will learn the meaning of many, but not all,of them in this course. It would be an excellent idea to print this table, and then check off the meaning of each keyword when we learn it; some keywords have multiple meanings, determined by the context in which they are used.
Notice that all Java keywords contain only lower-case letters and are at least 2 characters long; therefore, if we choose identifiers that are very short (one character) or that have at least one upper-case letter in them, we will never have to worry about them clashing with (accidentally being mistaken for) a keyword. Also note that in the Metrowerks IDE (if you use my color preferences), keywords always appear in yellow (while identifiers, and many other tokens, appear in white). We could state this same tabular information as a very long (and thus harder to read) EBNF rule of choices (and we really would have to specify each of these keywords, and not use "...") looking like keyword <= abstract | boolean | ... | while Finally, assert was recently added (in Java 1.4) to the original 48 keywords in Java. |
Separators |
The third category of token is a Separator (also known as a
punctuator).
There are exactly nine, single character separators in Java, shown in the
following simple EBNF rule.
separator <= ; | , | . | ( | ) | { | } | [ | ] In the separator rule, the bracket/brace characters stand for themselves (not EBNF options or repetitions). Note that the first three separators are tokens that separate/punctuate other tokens. The last six separators (3 pairs of 2 each) are also known as delimiters: wherever a left delimiter appears in a correct Java program, its matching right delimiter appears soon afterwards (they always come in matched pairs). Together, these each pair delimits some other entity. For example the Java code Math.max(count,limit); contains nine tokens
|
Operators |
The fourth category of token is an Operator.
Java includes 37 operators that are listed in the table below;
each of these operators consist of 1, 2, or at most 3 special
characters.
The keywords instanceof and new are also considered operators in Java. This double classification can be a bit confusing; but by the time we discuss these operators, you'll know enough about programmig to take them in stride. It is important to understand that Java always tries to construct the longest token from the characters that it is reading. So, >>= is read as one token, not as the three tokens > and > and =, nor as the two tokens >> and =, nor even as the two tokens > and >=. Of course, we can always use white space to force Java to recognize separate tokens of any combination of these characters: writing >   >= is the two tokens > and >=. We could state this same tabular information as a very long (and thus harder to read) EBNF rule of choices (and we really would have to specify each of these operators, and not use "...") looking like operator <=   = | > | ... | >>= | instanceof | new
|
Types and Literals |
The fifth, and most complicated category of tokens is the Literal.
All values that we write in a program are literals: each belongs to one of
Java's four primitive types (int, double, boolean,
char) or belongs to the special reference type String.
All primitive type names are keywords in Java; the String reference
type names a class in the standard Java library, which we will learn much
more about soon.
A value (of any type) written in a Java program is called a literal;
and, each written literal belongs in (or is said to have) exactly one type.
literal <= integer-literal | floating-point-literal | boolean-literal Here are some examples of literals of each of these types.
The next six sections discuss each of these types of literals, in more detail. |
int Literals |
Literals of the primitive type int represent countable, discrete
quantities (values with no fractions nor decimal places
possible/necessary).
We can specify the EBNF for an int literal in Java as
non-zero-digit     <= 1|2|3|4|5|6|7|8|9 This EBNF specifies only decimal (base 10) literals. In Java literals can also be written in ocal (base 8) and hexidecimal (base 16). I have omitted the EBNF rules for forming these kinds of numbers, because we will use base 10 exclusively. Thus, the rules shown above are correct, but not complete. By the EBNF rules above, note that the symbol 015 does not look like a legal integer-literal; it is certainly not a decimal-numeral, because it starts with a zero. But, in fact, it is an octal-numeral (whose EBNF is not shown). Never start an integer-literal with a 0 (unless its value is zero), because starting with a 0 in Java signifies the literal is being written as an octal (base 8) number: e.g., writing 015 refers to an octal value, whose decimal (base 10) value is 13! So writing a leading zero in an integer can get you very confused about what you said to the computer. Finally, note that there are no negative literals: we will see soon how to compute such values from the negate arithmetic operator and a positive literal (writing -1 is exactly such a construct). This is a detail: a distinction without much difference. |
double Literals |
Literals of the primtive type double represent measureable quantities.
Like real numbers in mathematics, they can represent fractions and numbers
with decimal places.
We can specify the EBNF for a double literal in Java as
exponent-indicator   <= e | E This EBNF specifies a floating-point-literal to contain various combinations of a decimal point and exponent (so long as one -or both- are present); if neither is present then the literal must be classified as an int-literal. The exponent-indicator (E or e) should be read to mean "times 10 raised to the power of". Like literals of the type int, all double literals are non-negative (although they may contain negative exponents). Using E or e means that we can specify very large or small values easily (3.518E+15 is equivalent to 3.518 times 10 raised to the power of 15, or 3518000000000000.; and 3.518E-15 is equivalent to 3.518 times 10 raised to the power of -15, or .000000000000003518) In fact, any literal with an exponent-part is a double: so even writing 1E3 is equivalent to writing 1.E3, which are both equivalent to writing 1000. Note this does not mean the int literal 1000! Finally, all double literals must be written in base 10 (unlike int literals, which can be written in octal or hexadecimal) |
boolean Literals |
The type name boolean honors George Boole, a 19th century English
mathematician who revolutionized the study of logic by making it more
like arithmetic.
He invented a method for calculating with truth values and an algebra for
reasoning about these calculations.
Boole's methods are used extensively today in the engineering of hardware
and software systems.
Literals of the primitive type boolean represent on/off, yes/no, present/absent, ... data. There are only two values of this primtive type, so its ENBF rule is trivially written as
boolean-literal <= true | false In Java, although these values look like identifiers, they are classified as literal tokens (just as all the keywords also look like identifiers, but are classified differently). Therefore, 100 and true are both literal tokens in Java (of type int and boolean respectively). Students who are familiar with numbers sometimes have a hard time accepting true as a value; but that is exactly what it is in Java. We will soon learn logical operators that compute with these values of the type boolean just as arithmetic operators compute with values of the type int. |
char Literals |
The first type of text literal is a char.
This word can be pronounced in many ways: care, car, or as in
charcoal
(I'll use this last pronunciation).
Literals of this primitive type represent exactly one character inside
single quotes.
Its EBNF rule is written
character-literal <= 'graphic' | 'space' | 'escape-sequence' where the middle option is a space between single quotes. Examples are 'X', or 'x', or '?', or ' ', or '\n', etc. (see below for a list of some useful escape sequences). Note that 'X' is classified just as a literal token (of the primitive type char); it is NOT classified as an identifier token inside two separator tokens! |
String Literals |
The second type of text literal is a String.
Literals of this reference type (the only one in this bunch; it is not a
primitive type) represent zero, one, or more characters:
Its EBNF is written
string-literal <= "{graphic | space | escape-sequence}" Examples are: "\n\nEnter your SSN:", or "" (the empty String), or "X" (a one character String, which is different from a char). Note that "CMU" is classified just as a literal token (of the reference type String); it is NOT classified as an identifier token inside two separator tokens! |
Escape Sequences |
Sometimes you will see an escape-sequence inside the single-quotes
for a character-literal or one or more inside double-quotes for a
string-literal (see above);
each escape sequence is translated into a character that prints in some
"special" way.
Some commonly used escape sequences are
So, in the String literal "He said, \"Hi.\"" neither escape sequence \" acts to end the String literal: each represents a double-quote that is part of the String literal, which displays as He said, "Hi." If we output "Pack\nage", Java would print on the console Pack agewith the escape sequence \n causing Java to immediately terminate the current line and start at the beginning of a new line. There are other ways in Java to write escape sequences (dealing with unicode represented by octal numbers) that we will not cover here, nor need in the course. The only escape sequence that we wil use with any frequency is \n. |
The null Reference Literal |
There is a very simple, special kind of literal that is used to represent a
special value with every reference type in Java (so far we know only one,
the type String).
For completeness we will list it here, and learn about its use a bit later.
Its trivial EBNF rule is written
null-literal <= null So, as we learned with boolean literals, null is a literal in Java, not an identifier. |
Bounded Numeric Types |
Although there are an infinite number of integers in mathematics, values in
the int type are limited to the range from -2,147,483,648 to
2,147,483,647.
We will explore this limitation later in the course, but for now we will not
worry about it.
Likewise, although there are an infinite number of reals in mathematics,
values in the double type are limited to the range from In fact, there are other primitive numeric types (which are also keywords): short, long, and float. These types are variants of int and double and are not as widely useful as these more standard types, so we will not cover them in this course. Finally, there is a reference type named BigInteger, which can represent any number of digits in an integer (up to the memory capacity of the machine). Such a type is very powerful (because it can represent any integer), but costly to use (in execution time and computer space) compared to int. Most programs can live with the "small" integer values specified above; but, we will also study this reference type soon, and write programs using it. |
Comments |
The sixth and final category of tokens is the Comment.
Comments allow us to place any form of documentation inside our Java code.
They can contain anything that we can type on the keyboard: English,
mathematics, even low-resolution pictures.
In general, Java recognizes comments as tokens, but then excludes these
tokens from further processing; technically, it treats them as white space
when it is forming tokens.
Comments help us capture aspects of our programs that cannot be expressed as Java code. Things like goals, specification, design structures, time/space tradeoffs, historical information, advice for using/modifying this code, etc. Programmers intensely study their own code (or the code of others) when maintaining it (testing, debugging or modifying it). Good comments in code make all these tasks much easier. Java includes two style for comments.
Typically Java comments are line-oriented; we will save block-oriented comments for a special debugging purpose (discussed later). The EBNF rule for comments is more complicated than insightful, so we will not study here. This happens once in a while. |
Program are a Sequence of Tokens built from Characters |
The first phase, a Java compiler tokenizes a program by scanning its
characters left to right, top to bottom (there is a hidden end-of-line
character at the end of each line; recall that it is equivalent to white
space), and combining selected characters into tokens.
It works by repeating the following algorithm (an algorithm is a precise set
of instructions):
Also, the Java compiler uses the "longest token rule": it includes characters in a token until it reaches a character that cannot be included. Finally, after building and recognizing each token, the Java compiler passes all tokens (except for comments, which are ignored after being tokenized) on to the next phase of the compiler. |
Common Mistakes |
I have seen the following mistakes made repeatedly by beginning students
trying to tokenize programs.
Try to understand each of these subtle points.
|
A Simple Program |
The following program will serve as a model of Input/Calculate/Output
programs in Java.
Here are some highlights
|
////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////// // // Description: // // This program computes the time it take to drop an object (in a vacuum) // form an arbitrary height in an arbitrary gravitational field (so it can // be used to calculate drop times on other planets). It models a straight // input/calculate/output program: the user enters the gravitation field // and then the height; it calculates thd drop time and then prints in on // the console. // ////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////// import edu.cmu.cs.pattis.cs151xx.Prompt; public class Application { public static void main(String[] args) { try { double gravity; //meter/sec/sec double height; //meters double time; //sec //Input gravity = Prompt.forDouble("Enter gravitational acceleration (in meters/sec/sec)"); height = Prompt.forDouble("Enter height of drop (in meters)"); //Calculate time = Math.sqrt(2.*height/gravity); //Output System.out.println("\nDrop time = " + time + " secs"); }catch (Exception e) { e.printStackTrace(); System.out.println("main method in Application class terminating"); System.exit(0); } }
How Experts See Programs |
In the 1940s, a Dutch psychologist named DeGroot was doing research on chess
experts.
He performed the following experiment: He sat chess experts down in front of
an empty chessboard, all the chess pieces, and a curtain.
Behind the curtain was a chessboard with its pieces arranged about 35 moves
into a game.
The curtain was raised for one minute and then lowered.
The chess experts were asked to reconstruct what they remembered from seeing
the chessboard behind the curtain.
In most cases, the chess experts were able to completely reconstruct the board that they saw. The same experiment was conducted with chess novices, but most were able to remember only a few locations of the pieces. These results could be interpreted as, "Chess experts have much better memories than novices." So, DeGroot performed a second (similar) experiment. In the second experiment, the board behind the curtain had the same number of chess pieces, but they were randomly placed on the board; they did not represent an ongoing game. In this modified experiment, the chess experts did only marginally better than the novices. DeGroot's conclusion was that chess experts saw the board differently than novices: they saw not only pieces, but attacking and defending structures, board control, etc. In this class, I am trying to teach you how to see programs as a programmer sees them: not as a sequence of characters, but at a higher structural level. Tokens is where we start this process. |
Problem Set |
To ensure that you understand all the material in this lecture, please solve
the the announced problems after you read the lecture.
If you get stumped on any problem, go back and read the relevant part of the lecture. If you still have questions, please get help from the Instructor, a CA, a Tutor, or any other student.
|