TCP/IP: Inside the Internet

Textbook: since this section is inexplicably missing from this year's textbook, we'll make do with these notes, and the chapters handed out in class.

Note: A large part of this we will undoubtedly be deferring to the next session.

The Internet: Big, and Getting Bigger

     Number of host computers on the Internet

            ^                                Jan 2007: 433,193,199
            |                                Jan 2003: 171,638,297
100,000,000 |                                Jan 1999:  43,200,000
            |              16,000,000
 10,000,000 |                   #
            |        1,100,000  #
  1,000,000 |              #    #
            |              #    #
    100,000 |              #    #
            |      28,000  #    #
     10,000 |         #    #    #
            |         #    #    #
      1,000 |         #    #    #
            |   213   #    #    #
        100 |    #    #    #    #
            |    #    #    #    #
            +----+----+----+----+---->
                '81  '87  '92  '97

How does my message get through the Internet?
Also, I can send pictures and movies (and viruses) as attachments. How does that work?

Representing information: Bits and bytes

A bit is a single binary digit, either 0 or 1. Not the only way to build hardware, but the simplest.
Think:

electricity through wire = 1
no electricity through wire = 0

Bits are so small that they're inconvenient. So: a byte is a binary number contained in 8 bits:

00000000₍₂₎ = 0₍₁₀₎
11111111₍₂₎ = 255₍₁₀₎

A byte can represent/stand for/symbolize 256 things.
Which 256 things depends on the byte's type!
The most basic types are integers, reals, and characters(!)

(A kilobyte is 2¹⁰=1,024 bytes.
A megabyte is 2²⁰=1,048,576 bytes.)
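As a quick check of the arithmetic above, here is a small Python sketch. The helper name byte_value is ours, made up for illustration; it simply reads eight 0/1 characters as a binary number.

```python
def byte_value(bits: str) -> int:
    """Read a string of eight 0/1 characters as an unsigned binary number."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly eight 0/1 characters")
    return int(bits, 2)

print(byte_value("00000000"))  # 0
print(byte_value("11111111"))  # 255
print(2 ** 8)                  # 256 distinct values fit in one byte
print(2 ** 10, 2 ** 20)        # bytes in a kilobyte and in a megabyte
```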

Integers

Integers are represented fairly directly as binary numbers in N bytes,
with N depending on how new the computer is.
The only tricky part is representing negative integers;
twos complement is the name of the dominant method.
For our purposes, just think of using one bit to represent the sign.
Under that simplification one byte can hold -127 to +127; true twos complement holds -128 to +127.
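To see how the one-bit-for-the-sign picture differs from real twos complement, here is a rough Python comparison. Both function names are invented for this sketch.

```python
def sign_and_magnitude(byte: int) -> int:
    """The simplified picture: the top bit is the sign, the other 7 the size."""
    magnitude = byte & 0b01111111
    return -magnitude if byte & 0b10000000 else magnitude

def twos_complement(byte: int) -> int:
    """How one signed byte is read under twos complement."""
    return int.from_bytes(bytes([byte]), byteorder="big", signed=True)

for pattern in (0b00000101, 0b10000101, 0b11111111, 0b10000000):
    print(f"{pattern:08b}", sign_and_magnitude(pattern), twos_complement(pattern))
```

The positive patterns agree in both schemes; only the patterns with the top bit set are read differently.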

(As an aside, there are some situations where you need really big integers, and some languages like Lisp support that.)

Reals (or "doubles"!)

Real numbers are trickier, because you'd like to work with really big and really small numbers.
So we typically use floating-point numbers, which work like scientific notation, e.g., -1.6745 x 10⁴ or 3.142 x 10⁻¹⁴, except in base 2.
For this, the dominant scheme is the IEEE 754 standard.
In its 32-bit form, this uses:

1 bit for the sign
8 bits for the exponent, stored with 127 added so that it is never negative
23 bits for the fractional part of the mantissa

So 3.14 = 1.57 x 2¹ in 32 bits is

0 10000000 10010001111010111000011
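If you want to inspect a 32-bit float yourself, Python's standard struct module can pack a number into its four bytes. The helper float32_bits below is our own name; the sketch just splits the 32 bits into the three fields listed above.

```python
import struct

def float32_bits(x: float) -> str:
    """Return the sign, exponent, and fraction bits of x as a 32-bit float."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{as_int:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(float32_bits(3.14))   # 0 10000000 10010001111010111000011

sign, exponent, fraction = float32_bits(3.14).split()
print(int(exponent, 2) - 127)   # 1, because 3.14 = 1.57 x 2^1
```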

Characters

Characters are translated into bits using the ASCII standard (American Standard Code for Information Interchange - not that it matters).

letter  binary code      decimal
 'A'      01000001     64 +  1 = 65
 'B'      01000010     64 +  2 = 66
  :
 'Z'      01011010     64 + 26 = 90

And so on, to include lower case letters, punctuation, etc. So

01001001 00100000 01100001 01101101 00101110

is the string "I am."
(Languages with other character sets use other encoding schemes, many requiring more than one byte per letter. Unicode is the best-known example.)
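The decoding above can be checked in a few lines of Python. The last line uses UTF-8, one common Unicode encoding, as our own example of a letter that needs more than one byte.

```python
bits = "01001001 00100000 01100001 01101101 00101110"

# Turn each group of eight bits into one byte, then read the bytes as ASCII.
raw = bytes(int(group, 2) for group in bits.split())
print(list(raw))              # [73, 32, 97, 109, 46]
print(raw.decode("ascii"))    # I am.

print("é".encode("utf-8"))    # b'\xc3\xa9' (two bytes for one letter)
```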

The main point: it's all just a bunch of bytes!

Our goal

On one level... The Internet is a network of wires connected with computers.

On another... The Internet gives programs the capability to communicate between computers.

We'll see how the Internet bridges this gap.

Division of labor: Layers

To simplify the matter, we split the bridge into four layers:

application layer: transmits data between application programs (e.g., HTTP, SMTP, FishNet).
transport layer: works with packets but provides the illusion of a telephone-like connection between programs (e.g., TCP).
internetwork layer: "best-effort" delivery of packets across the Internet (e.g., IP).
physical layer: gets message through single physical network (e.g., Ethernet, DSL, cable-modem).

Headers

Each layer adds a header explaining how to handle the message at its level:

+----------+--------------+---------------+------------------------------+
| physical | internetwork | transport     |       application            |
|  header  |  header (IP) |  header (TCP) |         message (eg, HTTP)   |
+----------+--------------+---------------+------------------------------+
<--- front

Of the physical layer we will say little: It magically sends a message across a single network.
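The header picture can be mimicked with a toy Python sketch. The bracketed labels are placeholders we made up; real headers are binary fields, not text tags. Each layer puts its header on the front when sending, and the receiver peels them off in the opposite order.

```python
# Placeholder headers, purely for illustration.
PHYSICAL, INTERNETWORK, TRANSPORT = b"[PHY]", b"[IP]", b"[TCP]"

def send(message: bytes) -> bytes:
    """Wrap an application message, innermost header first."""
    for header in (TRANSPORT, INTERNETWORK, PHYSICAL):
        message = header + message
    return message

def receive(frame: bytes) -> bytes:
    """Strip the headers from the front, outermost first."""
    for header in (PHYSICAL, INTERNETWORK, TRANSPORT):
        if not frame.startswith(header):
            raise ValueError(f"expected {header!r} at the front")
        frame = frame[len(header):]
    return frame

frame = send(b"GET /index.html")
print(frame)            # b'[PHY][IP][TCP]GET /index.html'
print(receive(frame))   # b'GET /index.html'
```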

IP: Addresses and Routing

The IP software must figure out where a packet should go, and how to get it there.

Machines have two names: a mnemonic name (composed of words) for humans to remember:

  jasmine.bh.andrew.cmu.edu

and a 4-byte numerical IP address that is really used by the machines:

  128.2.124.152

The IP address is used to describe the destination of a message. The first two numbers, 128.2, indicate CMU's network domain, cmu.edu.
(CMU has so many machines that we now use other numeric domains in addition to 128.2.)
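Python's standard ipaddress module shows that the dotted form really is just four bytes. Writing "the first two numbers are 128.2" as the network 128.2.0.0/16 is our assumption for this sketch.

```python
import ipaddress

address = ipaddress.IPv4Address("128.2.124.152")
print(list(address.packed))   # [128, 2, 124, 152]

# Assumption: "first two numbers are 128.2" written as a 16-bit network prefix.
cmu = ipaddress.ip_network("128.2.0.0/16")
print(address in cmu)         # True
```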

(The hierarchical naming in both the mnemonic and numerical forms is very clever, but not essential to examine for our purposes.)

In order to send a message, the computer must first convert the mnemonic name into the real numerical IP address. This is called name resolution.

Since the Internet is big and always changing, the computer contacts a name server, part of the Domain Name System (DNS), to resolve the name. In principle, to find jasmine.bh.andrew.cmu.edu, your computer

contacts the edu domain name server,
which sends you to the cmu name server,
which sends you to the andrew name server,
which sends you to the bh name server,
which returns the IP address for jasmine.

Of course, resolving every name this way would make the .edu (or .com!) domain name server really busy, and each name resolution would take a long time.
So in practice the system uses caches: each computer in the chain stores the IP addresses it sees. This saves time and network traffic, and allows names to be resolved quickly most of the time. But if you're the first person in a while to try to contact a webserver in Zanzibar, it will take noticeably longer for the DNS to resolve the name.
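Here is a toy model of that chain of name servers with a cache in front of it. Everything in it is hypothetical: the nested dictionary stands in for the name servers, and nothing is sent over a network. It only illustrates why the second lookup of a name is cheap.

```python
# Hypothetical stand-in for the chain of name servers (no network involved).
NAME_SERVERS = {"edu": {"cmu": {"andrew": {"bh": {"jasmine": "128.2.124.152"}}}}}

class CachingResolver:
    def __init__(self, servers):
        self.servers = servers
        self.cache = {}      # names we have already resolved
        self.queries = 0     # how many name servers we have had to ask

    def resolve(self, name: str) -> str:
        if name in self.cache:
            return self.cache[name]
        labels = name.split(".")[::-1]       # edu, cmu, andrew, bh, jasmine
        server = self.servers[labels[0]]     # start at the edu name server
        for label in labels[1:]:
            self.queries += 1                # one question per server in the chain
            server = server[label]
        self.cache[name] = server            # the last answer is the IP address
        return server

resolver = CachingResolver(NAME_SERVERS)
print(resolver.resolve("jasmine.bh.andrew.cmu.edu"), resolver.queries)  # 128.2.124.152 4
print(resolver.resolve("jasmine.bh.andrew.cmu.edu"), resolver.queries)  # 128.2.124.152 4
```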

Who is in charge of domain names? Look here.

Routing

Since we aren't worrying about the physical layer, IP only needs to worry about routing between networks. Messages get between networks via gateways: computers that are members of more than one network and transfer packets between them.

The gateway computers have routing tables that tell them where non-local messages go. Consider the following relatively simple case:

                gateway                 gateway                 gateway
                10.0.0.5                20.0.0.6                30.0.0.7
                20.0.0.5                30.0.0.6                40.0.0.7

network 10.?.?.?        network 20.?.?.?        network 30.?.?.?        rest of Internet

The routing table of the middle gateway above might look something like this:

if destination is:    then route to:
10.?.?.?              20.0.0.5
20.?.?.?              local destination
30.?.?.?              local destination
else                  30.0.0.7
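The middle gateway's table translates almost directly into a lookup. This is only a sketch: it matches on the first number of the address because that is how the example networks are named, and the names ROUTES and next_hop are ours.

```python
# The middle gateway's table, keyed on the first number of the destination.
ROUTES = {10: "20.0.0.5", 20: "local destination", 30: "local destination"}
DEFAULT_ROUTE = "30.0.0.7"   # the "else" row

def next_hop(destination: str) -> str:
    first_number = int(destination.split(".")[0])
    return ROUTES.get(first_number, DEFAULT_ROUTE)

print(next_hop("10.4.5.6"))        # 20.0.0.5
print(next_hop("30.1.1.1"))        # local destination
print(next_hop("128.2.124.152"))   # 30.0.0.7
```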

These routing tables need to evolve over time. Periodically, gateways tell their neighbors about the best routes they know. If the recipient decides it needs to update its routing table, it tells its neighbors.
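One simple way such an update could work is sketched below. It assumes that "best" means fewest gateways crossed, which the notes do not specify, and the function name merge_advert is invented. The return value says whether the table changed, that is, whether this gateway should now tell its own neighbors.

```python
def merge_advert(table: dict, neighbor: str, advert: dict) -> bool:
    """Fold a neighbor's advertised routes into our table.

    table:  network -> (next hop, number of hops)
    advert: network -> number of hops as seen from the neighbor
    """
    changed = False
    for network, hops in advert.items():
        candidate = hops + 1                  # one extra hop to reach the neighbor
        if network not in table or candidate < table[network][1]:
            table[network] = (neighbor, candidate)
            changed = True
    return changed

table = {"20": ("local", 0), "30": ("local", 0)}
print(merge_advert(table, "20.0.0.5", {"10": 0, "20": 0}))   # True: learned network 10
print(table["10"])                                           # ('20.0.0.5', 1)
print(merge_advert(table, "20.0.0.5", {"10": 0}))            # False: nothing new
```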

As mentioned before, gateways do not guarantee delivery, only best-effort. They frequently drop packets, for a number of reasons:

the gateway is too busy
the gateway doesn't know a route to the destination
the packet has passed through too many computers (the net might have a loop!)
and other reasons

50% packet loss is not uncommon!

TCP: making IP look better

IP gives us

best-effort delivery
between computers

But our programs want

reliable delivery
between programs

This is a job for TCP (Transmission Control Protocol). TCP covers for the packet loss, reordering, and other nasty details of IP.

Ports

Since more than one program might want to use the Internet on a single computer, each program reserves a port when it wants to communicate. There are 65,536 possible port numbers (0 to 65,535), because a port number is stored in two bytes.

When a program establishes a TCP connection, it sends its port number, so that the other program knows how to find it to respond (its numerical internet address is already in the IP header).
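The two-byte limit is easy to see with Python's struct module, which packs a number into a fixed number of bytes. The format "!H" means one unsigned two-byte number in network byte order.

```python
import struct

packed = struct.pack("!H", 80)          # port 80 as two bytes
print(packed)                           # b'\x00P'
print(struct.unpack("!H", packed)[0])   # 80

try:
    struct.pack("!H", 65536)            # one past the largest port number
except struct.error as problem:
    print("does not fit in two bytes:", problem)
```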

Clients and Servers

A server is a program waiting for connections on a computer with a port reserved. Common servers have certain well-known ports reserved for them, so that other programs can easily find them and send them messages:

port    protocol
21      FTP
25      SMTP
53      DOMAIN
80      HTTP
1530    FishNet

A client reserves a port on its own computer and sends messages to the server by sending messages to the server's port-computer combination. Then the server can respond by sending messages to the client's port-computer combination. And they talk.

(Notice there's nothing wrong with a server or client talking to multiple programs using the same port.)
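Here is a minimal sketch of a client and a server talking, using Python's standard socket module with both programs on the same machine (address 127.0.0.1). The echo behavior and the helper serve_once are our own invention. Binding to port 0 asks the operating system to pick any free port, so the port numbers printed will vary from run to run.

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    """Accept one connection, read one message, and echo it back."""
    connection, _client_address = listener.accept()
    with connection:
        data = connection.recv(1024)
        connection.sendall(b"echo: " + data)

# The server reserves a port and waits for connections.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0: let the system choose a free port
listener.listen(1)
server_port = listener.getsockname()[1]
thread = threading.Thread(target=serve_once, args=(listener,))
thread.start()

# The client gets its own port automatically when it connects.
with socket.create_connection(("127.0.0.1", server_port)) as client:
    client_port = client.getsockname()[1]
    client.sendall(b"hello")
    reply = client.recv(1024)

thread.join()
listener.close()
print(server_port, client_port, reply)
```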

Reliable delivery

The basic approach to providing reliable delivery is straightforward, but things get complicated in order to be efficient.

The receiver sends an acknowledgement message (ACK) when it receives some data. If the sender doesn't get an ACK soon enough, it resends the data.

The packets in one connection are numbered in order to allow the receiver to be sure it has them all, and in the right order.
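The ACK-and-resend idea can be simulated without any network at all. The sketch below is a simplification, not real TCP: it sends one packet at a time, pretends a timeout happens instantly, and drops data packets and ACKs at random with a made-up loss_rate (which must be below 1, or nothing ever gets through).

```python
import random

def send_reliably(messages, loss_rate=0.5, seed=1):
    """Simulate acknowledge-and-resend over a channel that drops packets."""
    rng = random.Random(seed)     # seeded so that runs are repeatable
    delivered = []                # what the receiving program ends up with
    expected = 0                  # next packet number the receiver wants
    transmissions = 0
    for number, data in enumerate(messages):
        acked = False
        while not acked:
            transmissions += 1
            if rng.random() < loss_rate:
                continue          # data packet lost: sender times out and resends
            if number == expected:
                delivered.append(data)   # a new packet: hand it to the program
                expected += 1
            # A duplicate is not delivered again, but it is still acknowledged.
            if rng.random() < loss_rate:
                continue          # ACK lost: sender times out and resends
            acked = True
    return delivered, transmissions

delivered, transmissions = send_reliably(["how", "are", "you"])
print(delivered)        # ['how', 'are', 'you']
print(transmissions)    # at least 3; the exact count depends on the seed
```

The packet numbers are what let the receiver spot a resent packet it already has and avoid delivering it twice.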

One challenging problem is deciding how long to wait before giving up on acknowledgements.

The sender adapts based on what it has recently seen. Doing this well turns out to be quite complicated.

Let's skip that part.

Sliding window

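Our simple acknowledgement protocol is very slow, like a bucket brigade with only one bucket: the sender must wait for each ACK before it sends the next packet.

It would clearly be better to use many buckets at once, keeping several packets in flight.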

This is done with a sliding window:

     +-------------------+
+----|----+----+----+----|----+----+----+----+----+----+
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 |
+----|----+----+----+----|----+----+----+----+----+----+
     +-------------------+
	  A sliding window of size 4

The window size corresponds to the number of buckets.
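To get a feel for the speed-up, here is an idealized count of round trips. It assumes no packets are lost and that a whole window's worth of ACKs comes back together, which real connections only approximate; the function name round_trips is ours.

```python
def round_trips(packet_count: int, window_size: int) -> int:
    """Count round trips needed when window_size packets may be in flight."""
    base = 0      # lowest-numbered packet not yet acknowledged
    trips = 0
    while base < packet_count:
        # Send every packet in the window, then wait one round trip for ACKs.
        in_flight = range(base, min(base + window_size, packet_count))
        base = in_flight.stop     # the window slides past the acknowledged packets
        trips += 1
    return trips

print(round_trips(11, 1))   # 11 (one bucket)
print(round_trips(11, 4))   # 3  (four buckets)
```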

(Figure: sliding-window delivery.)

The TCP header

Armed with the information above, we can understand the actual TCP headers used in the Internet:

byte  0   source port
byte  2   destination port
byte  4   sequence number
          (tells which segment is sent)
byte  8   acknowledgement number
          (tells which segment has been received)
byte 12   header length
byte 12.5 ignore
byte 14   desired window size
byte 16   ignore

byte 20
   :      options
   :
byte ??
   :      application message
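Given that layout, the fixed 20 bytes can be pulled apart with Python's struct module. This is a sketch that reads only the fields named above and skips the "ignore" parts; the sample values are made up. The header length is stored in the top half of byte 12 as a count of 4-byte words.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Read the fields listed in the layout above from a raw TCP segment."""
    (source_port, destination_port, sequence, acknowledgement,
     length_byte, _ignored, window, _also_ignored) = struct.unpack(
        "!HHIIBBHI", segment[:20])
    header_length = (length_byte >> 4) * 4   # top 4 bits count 4-byte words
    return {
        "source_port": source_port,
        "destination_port": destination_port,
        "sequence": sequence,
        "acknowledgement": acknowledgement,
        "header_length": header_length,
        "window": window,
        "message": segment[header_length:],
    }

# A made-up segment: 20-byte header (length field 5) followed by a message.
sample = struct.pack("!HHIIBBHI", 49152, 80, 1000, 2000, 5 << 4, 0, 4096, 0)
sample += b"GET /index.html"
print(parse_tcp_header(sample))
```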

Application protocols

Okay, so TCP/IP gives programs the ability to communicate smoothly between computers. Now, what do we want to do with that?
Let's look at simple examples of two of the most popular Internet applications: the WWW and email.

HTTP: Web content

HTTP, the HyperText Transfer Protocol, is the basis for Web communication.
Suppose we point our browser at

		http://avrim.pc.cs.cmu.edu/index.html

This indicates to the browser that it should use HTTP to request the file /index.html from avrim.pc.cs.cmu.edu. So the browser uses TCP to open a connection to port 80 on that machine (since that is HTTP's well-known port number).

Once the connection is open, it sends the following message to the web server there:

	GET /index.html HTTP/1.1
	Accept: text/html

(This ends with a blank line.)
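On the wire, that request is just ASCII text with each line ended by a carriage return and line feed. The sketch below builds it; build_request is a name we made up. The optional Host line is not shown in the request above, but HTTP/1.1 servers generally expect one.

```python
def build_request(path: str, host=None) -> bytes:
    """Build the bytes of a simple GET request."""
    lines = [f"GET {path} HTTP/1.1"]
    if host is not None:
        lines.append(f"Host: {host}")    # expected by HTTP/1.1 servers
    lines.append("Accept: text/html")
    lines.append("")                     # the blank line that ends the request
    return ("\r\n".join(lines) + "\r\n").encode("ascii")

print(build_request("/index.html"))
# b'GET /index.html HTTP/1.1\r\nAccept: text/html\r\n\r\n'
```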

The server responds with a message like the following, and then closes the connection:

	HTTP/1.0 200 Document follows
	Server: CERN/3.0A
	Date: Mon, 11 Jan 1999 03:22:42 GMT
	Content-Type: text/html
	Content-Length: 115
	Last-Modified: Mon, 11 Jan 1999 03:17:24 GMT

	<p>I'm <tt>avrim.pc.cs.cmu.edu</tt>; my primary user is
	<a href=http://www.cburch.com/>Carl Burch</a>.</p>

The HTML (HyperText Markup Language) encoding seen here is not part of the network protocols, but rather a well-designed way of embedding addresses etc. invisibly into text.
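A browser has to split a response like the one above into its status line, its headers, and the document itself. Here is a rough sketch of that split; the function name is ours, and the sample is a shortened version of the response above with a placeholder body.

```python
def parse_response(text: str):
    """Split an HTTP response into status code, reason, headers, and body."""
    head, _, body = text.partition("\r\n\r\n")      # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body

# Shortened sample; the body here is a placeholder, not the real page.
sample = (
    "HTTP/1.0 200 Document follows\r\n"
    "Server: CERN/3.0A\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<p>hello</p>"
)
code, reason, headers, body = parse_response(sample)
print(code, reason)              # 200 Document follows
print(headers["Content-Type"])   # text/html
print(body)                      # <p>hello</p>
```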

SMTP: email

Most email on the Internet is transferred using SMTP (Simple Mail Transfer Protocol).

Suppose I'm spot@cburch.com working on avrim.pc.cs.cmu.edu and I tell my email program to send email to burch@andrew.cmu.edu. It uses TCP on avrim to open a connection to port 25 on andrew.cmu.edu. (In the exchange below, lines that begin with a three-digit code are sent by andrew; all other lines are sent from avrim.)

First andrew responds with a welcome message (220 codes let any program reading this know that it's a welcome message):

	220-andrew.cmu.edu ESMTP Sendmail 8.8.5/8.8.2
	220-Mis-identifying the sender of mail is an abuse of computing facilities
	220 ESMTP spoken here
	helo avrim.pc.cs.cmu.edu
	250 andrew.cmu.edu Hello AVRIM.PC.CS.CMU.EDU [128.2.185.114], pleased to meet you
	mail from: spot@cburch.com
	250 spot@cburch.com... Sender ok
	rcpt to: burch@andrew.cmu.edu
	250 burch@andrew.cmu.edu... Recipient ok
	data
	354 Enter mail, end with "." on a line by itself
	Arf, arf!
	.
	250 XAA21092 Message accepted for delivery
	quit
	221 andrew.cmu.edu closing connection
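The three-digit codes are what make the exchange easy for a program to follow. The sketch below (function names ours) splits a reply line into its code and text. A hyphen right after the code means more lines of the same reply follow; a space means this is the last one.

```python
def parse_reply(line: str):
    """Split a server reply into (code, more_lines_follow, text)."""
    code = int(line[:3])
    more_lines_follow = line[3:4] == "-"
    return code, more_lines_follow, line[4:]

def sent_by_server(line: str) -> bool:
    """In the exchange above, only the server's lines start with a code."""
    return line[:3].isdigit()

print(parse_reply("220-andrew.cmu.edu ESMTP Sendmail 8.8.5/8.8.2"))
# (220, True, 'andrew.cmu.edu ESMTP Sendmail 8.8.5/8.8.2')
print(parse_reply("220 ESMTP spoken here"))
# (220, False, 'ESMTP spoken here')
print(sent_by_server("mail from: spot@cburch.com"))   # False
```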

Current network research

malevolent traffic
mobile computers
wireless networks
guarantees on throughput, delay, etc
multimedia
speed
