Review: The Programming Process:
At the dawn of computing, programmers skipped the second step!
They knew the machine language
coding scheme for a particular computer, and actually wrote
object code programs in machine language
(binary numbers).
They put the machine code directly into the computer's memory,
pushed the "go" button, and
the computer executed the program that was in memory.
For example, to put the decimal value 97 into a register (on a standard Intel machine), instead of writing "int al = 97;" as in Java, you would look up the op code for loading a constant into register 0 (it's B0), convert 97 into binary, and then write:
10110000 01100001 (Hexadecimal: B0 61)
This was really a pain.
But beyond the inhumanity of it, programmer productivity can be
improved by automating the really boring, repetitive parts, such as
looking up instruction op codes and
converting decimal constants to binary.
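(That second chore, by the way, is now a single library call. A tiny sketch in Java, just to show the conversion an assembler automates:)
[Java:]
// The constant conversions early programmers did by hand:
System.out.println(Integer.toBinaryString(97)); // prints 1100001
System.out.println(Integer.toHexString(97));    // prints 61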
So pretty quickly people switched to using assemblers and
assembly languages.
An assembler is a program that lets you write in a more
human-readable language, but each line of code still corresponds
pretty much to a
machine language instruction.
So to use the same example, instead of "int al = 97;" you would write:
[Intel x86:] mov al, 97
which gets turned by the assembler into:
[Intel x86:] 10110000 01100001 (Hexadecimal: B0 61)
You'd also have an assembler manual that described the instructions for this computer's assembler:
[Intel x86:] MOV reg8,imm8 ; B0+r ib
So for a while, programmers used assembly languages, and were just really happy not to have to write lots of 1s and 0s (well, hexadecimal numbers, really).
But once computers were a little more powerful, the question arose:
Can we make life even easier for the programmer?
As an obvious example, you often want to execute a loop with a
variable starting with one value, going until it hits a limit, and
updating each time through the loop.
In assembly language, this takes at least four instructions (initialize
the variable, execute the body, update, and test-and-branch), and
there are lots of opportunities for typos (bugs).
Wouldn't it be nice to
have just one line of code do that?
[FORTRAN:]
      DO 1, I = 0, 7
        <body>
1     CONTINUE
Wow! Much less pain, and more programmer productivity!
A program that converts a more abstract (high-level) language into
machine code is called a compiler.
A related idea is that instead of compiling and then executing,
you could write a program that just executes the high-level language
as it reads it in. A program that does this is called an interpreter.
BASIC is an early example of a language that is frequently
interpreted.
[BASIC:]
50 FOR I = 1 TO N
60 LET TOTAL = TOTAL + I
70 NEXT I
Another (famous) insight:
"GO TO Statement Considered Harmful" (Dijkstra 1968).
Structured programming only lets you use well-defined
control constructs, and won't let you write "rat's nest" or "spaghetti" code.
(You may never have used a language with a GOTO statement.)
An example of reducing the power of a language to keep you
out of trouble.
Somewhat tangential, but really helpful: smart source code
editors as part of Integrated Development Environments (IDEs).
(We've been using Eclipse to write our Java programs.)
It's really nice to have the editor warn you when you type
something illegal!
Since the compiler/interpreter for a high-level language is really just a
program someone writes, there can be lots of them... and there are.
Hundreds!
Well, actually,
thousands (if you include less popular languages).
(If you have a year to spare, you can build your very
own new language.
The real trick is to get anyone else to really use it.)
I, Dr. Bob, counted up how many programming languages I actually
have known. Over fourteen.
But are there good reasons to have so many different languages?
After all, they're all Turing equivalent, right?
Well, there are reasons, some of them good, some of them maybe not
so great.
The question is now:
How would you like to tell the computer what it should do?
Imagine
you could say it however you wanted.
(Well, within reason. English is
not a choice (yet). English might not be a good idea anyway, since it's so
imprecise; that's why scientists use a lot of math, and mathematicians
use formal logic.)
Different people have different ideas of what they'd like in a
programming language, so you get different languages.
Some examples of varying purposes/situations and languages designed
for them: FORTRAN for scientific computation, COBOL for business data
processing, LISP for symbolic/AI programming, and C for systems programming.
Creating a new language typically involves two major steps:
Specify the language's syntax (grammar) and semantics (meaning).
Implement the language: write the actual compiler/interpreter.
Sometimes only one of these is done!
Some languages are never clearly defined; a few have been defined and never implemented.
Define your syntax: What your compiler/interpreter will actually receive
as input is just a big string of characters.
Dividing this up into significant "words" like
variable names, reserved words (like while), and numbers is called
tokenization or lexical analysis.
Example token definition:
integer ::= [+-]?['0'-'9']+
(This notation is a form of Backus-Naur Form (BNF), here extended with
the regular-expression operators ? and +.)
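To make this concrete, here is a minimal tokenizer check in Java (the class and method names are ours, purely for illustration), using the standard regular-expression library to recognize the integer token defined above:
[Java:]
import java.util.regex.Pattern;

public class IntegerToken {
    // A direct translation of: integer ::= [+-]?['0'-'9']+
    private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+");

    public static boolean isIntegerToken(String word) {
        return INTEGER.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIntegerToken("-42"));  // true
        System.out.println(isIntegerToken("4x2"));  // false
    }
}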
Once you have tokens, you need to define how they combine into
larger meaningful pieces, like arithmetic expressions.
Example piece of syntax:
expr   ::= ws term [ws ('+'|'-') ws term]
term   ::= ws factor [ws ('*'|'/') ws factor]
factor ::= '(' ws expr ws ')' | '-' ws expr | number
(Putting '+'/'-' at the top level and '*'/'/' below it gives
multiplication higher precedence than addition, as in ordinary arithmetic.)
Taking the series of tokens and analyzing it to get useful structures is called parsing.
Define your semantics somehow: This is often English plus
examples. Plus things like the conventions of arithmetic notation.
ML is unusual in having a true formal semantic definition.
Once you have the definition worked out, you typically build a
compiler/interpreter that takes source code that complies with your
specification, and
produces machine code that does what you asked for,
according to the semantics you defined. This is not easy.
One popular technique: attach procedures to the parsing rules.
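Here is a minimal sketch of that technique in Java (a toy, not any real compiler): a recursive-descent parser for the grammar above, where each grammar rule becomes a method whose attached procedure computes the value of the expression as it parses. Whitespace handling and error reporting are stripped out to keep it short.
[Java:]
public class ExprEval {
    private final String src;
    private int pos = 0;

    public ExprEval(String src) { this.src = src.replaceAll("\\s", ""); }

    // expr ::= term [('+'|'-') term]...
    private int expr() {
        int value = term();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            value = (op == '+') ? value + term() : value - term();
        }
        return value;
    }

    // term ::= factor [('*'|'/') factor]...
    private int term() {
        int value = factor();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            value = (op == '*') ? value * factor() : value / factor();
        }
        return value;
    }

    // factor ::= '(' expr ')' | '-' expr | number
    private int factor() {
        if (src.charAt(pos) == '(') {
            pos++;                    // consume '('
            int value = expr();
            pos++;                    // consume ')'
            return value;
        }
        if (src.charAt(pos) == '-') {
            pos++;
            return -expr();
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(new ExprEval("1 + 2 * (3 - 4)").expr()); // prints -1
    }
}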
Note that the language designer needs to figure out how to make all their
amazing new features actually work.
And all the features have to work together, in any legal combination.
Seemingly simple things can get tricky:
Even FORTRAN let you have lots of different variable names.
But real machines only have a small number of hardware registers.
So you need a symbol table to keep track of the variable names
in the program, and where you've actually stashed them.
Serious compilers try hard to do a smart job of register
allocation so that if you work on a particular variable a lot, it
stays in a register.
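At its simplest, a symbol table is just a map from names to locations. A toy sketch in Java (the names and locations here are made up for illustration):
[Java:]
import java.util.HashMap;
import java.util.Map;

public class SymbolTable {
    // Maps each variable name to wherever the compiler stashed it:
    // a hardware register, or an offset into memory.
    private final Map<String, String> locations = new HashMap<>();

    public void declare(String name, String location) {
        locations.put(name, location);
    }

    public String locationOf(String name) {
        String loc = locations.get(name);
        if (loc == null) {
            throw new IllegalStateException("undeclared variable: " + name);
        }
        return loc;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.declare("i", "register EAX");        // used a lot: keep it in a register
        table.declare("total", "stack offset -8"); // used less: memory is fine
        System.out.println(table.locationOf("total")); // prints: stack offset -8
    }
}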
Since this gets complicated, break it up into phases:
lexical analysis, parsing, semantic analysis, optimization,
and code generation (including register allocation).
All of this will give you a basic optimizing compiler for a
"normal" language.
But things can get even more complicated.
If you're building a serious new language for commercial
use, a big issue is portability: it should work on "all"
platforms, and work the same on all of them.
It should also be as highly optimized as possible.
These kinds of concerns led, in Java (and then Microsoft's .NET),
to the idea of compiling the language into bytecodes that run
on a virtual machine.
The front end that converts source code into bytecodes can be the same
on all platforms, thus achieving portability.
The Virtual Machine analyzes the bytecode program, and decides
whether any particular part of the program should be interpreted,
precompiled, or Just-In-Time (JIT) compiled, depending on what would
be the most efficient!
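For a taste of what bytecodes look like: compiling the Java statement int i = 97; and disassembling the result with the JDK's javap tool shows something like the following (the exact local-variable slot depends on the surrounding method):
[Java bytecode:]
bipush 97    ; push the constant 97 onto the operand stack
istore_1     ; pop it into local variable slot 1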
Standard libraries:
The language generally will include a bunch of predefined stuff beyond
the basic structure of the language (like
RobotWindow and String).
A major decision is how much and what kind of stuff you should
include in your standard libraries. This again depends on your goals.
One-pass versus multi-pass compilers: if you're allowed to textually
use something before it is defined, the language cannot be compiled in
one pass.
Might be okay, but some languages (Pascal) are careful to
avoid that.
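Java is one of the languages that allows it: the sketch below compiles fine even though greet is used, textually, before it is defined, which is part of why the Java compiler needs more than one pass.
[Java:]
public class ForwardReference {
    public static void main(String[] args) {
        greet();  // used here, textually before its definition below
    }

    // A strict one-pass compiler would have rejected the call above;
    // javac, in effect, makes an earlier pass to collect declarations.
    static void greet() {
        System.out.println("Hello!");
    }
}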
Lazy evaluation: Some languages let you work with infinite sets!
"How?!" you may ask.
Why, by only evaluating things that are actually used, "lazily".
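Modern Java can give a taste of this with its streams, which are evaluated lazily. In the sketch below the stream stands for the infinite set {0, 1, 2, ...}, but only the five squares we actually ask for ever get computed. (Languages like Haskell take this idea much further.)
[Java:]
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        Stream.iterate(0, n -> n + 1)        // conceptually infinite: 0, 1, 2, ...
              .map(n -> n * n)               // square each one... lazily
              .limit(5)                      // only five are ever computed
              .forEach(System.out::println); // prints 0 1 4 9 16
    }
}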
Functions as "first-class objects": Some languages let you treat functions as values, and pass them as parameters or stick them in arrays. Whoa.
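Java eventually joined this club too, with lambdas. A small sketch:
[Java:]
import java.util.function.IntUnaryOperator;

public class FirstClassFunctions {
    // A function that takes another function as a parameter.
    static int applyTwice(IntUnaryOperator f, int x) {
        return f.applyAsInt(f.applyAsInt(x));
    }

    public static void main(String[] args) {
        // Functions as values, stuck in an array...
        IntUnaryOperator[] ops = {
            n -> n + 1,  // increment
            n -> n * n   // square
        };
        // ...and passed as parameters to another function.
        System.out.println(applyTwice(ops[0], 5)); // prints 7
        System.out.println(applyTwice(ops[1], 2)); // prints 16
    }
}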
Parallelism: Greater parallelism allows computing speed to increase, but makes programming more complicated. Can we design high-level languages that let you write parallel programs without having to know lots of details about the parallel hardware it will run on?
Security/Reliability/Verifiability: These go together as "Trustworthy Computing". For crucial programs (such as autopilots), can we prove that the code is correct? Can we prove that software you want to download does not include viruses or spyware?
Not all languages are for giving the computer instructions. We already saw one example of a language for describing things (HTML):
[HTML:]
<html>
  <p>I'm <tt>avrim.pc.cs.cmu.edu</tt>; my primary user is
  <a href="http://www.cburch.com/">Carl Burch</a>.</p>
</html>
There are also languages for asking questions. The most well-known one is probably SQL. It is used for writing database queries:
[SQL:]
SELECT isbn, title, price, price * 0.06 AS sales_tax
FROM Book
WHERE price > 100.00
ORDER BY title;
This returns a list of books that cost more than 100.00, with an additional "sales_tax" column containing a sales tax figure calculated at 6% of the price.