Distributed Multimedia

Jon Crowcroft, UCL CS with non-trivial assistence from Mark Handley, Steve
Hailes, Nermeen Ismail, Angela Sasse and Ian Wakeman

MULTIMEDIA-WHAT IS IT?

-------------------------------------------------------------------------------

Throughout the 1960s, 1970s and 1980s, Computers have been restricted to
dealing with two main types of data - words and numbers. Text and Arithmetic
processing, Word Processing and Spreadsheets etc. etc.

Codes for numbers (binary, BCD, Fixed point etc., IEEE floating point), are
fairly well standardized. Codes for text (ASCII, EBCDIC, but also fonts, Kangi,
ppt etc.) are also reasonably well understood.

Higher level "codes" - links, indexes, references, and so on are the subject of
such standards as HTML, HyTime and so forth.

Now computers, disks and networks are fast enough to process, store and
transmit audio and video and computer generated visualization material as well
as text and graphics and data: hence the multimedia revolution

One thing about multimedia that cannot be overstated: It is big, like space in
the Hitchhiker's Guide to the Universe, it is much bigger than you can imagine.
Of course, I am not talking about the hype here, I am talking about the storage
transmission and processing requirements!

MULTIMEDIA-WHAT IS IT?

-------------------------------------------------------------------------------

   * Anything beyond letters and numbers, text and arithmetic
   * Grapics, Still Photos, Audio, Video, Animation
   * VR, Hypertext, Hypermedia
   * Some has Time Structure for user
   * Some has non-linear sequences, or choice
   * All Costs more to create and to use
   * If used well, has greater value than traditional "mono"-media

-------------------------------------------------------------------------------

Multimedia Source Characteristics

   * Spatially Self Similar
   * Temporally Self Similar
   * Amenable to compression
   * Large amounts of Redundancy
   * Similarity is structure - i.e. can use Compression to Aid searching

-------------------------------------------------------------------------------

Multimedia Access Patterns

   * Traditional Data Access patterns have strong temporal and spatial
     correlation
   * i.e. if you look at first page of a document, probably look at rest.
   * Multimedia access is not necessarily like that
   * Zapping, Searching, Rewinding etc!!!
   * Hyper links, all contradict this model

-------------------------------------------------------------------------------

"EVERY ENCODING IS A DECODING"

-------------------------------------------------------------------------------

The word "encoding" is often used as a noun as well as a verb when talking
about multimedia.

The first thing to understand about multimedia is the vast range of encodings
currently in use or development. There are a variety of reasons for this:

Codes for audio, video depend on the quality of audio or video required. A very
simple example of this is the difference between digital audio for ISDN
telephones (64Kbps PCM see later) and for CD (1.4Mbps 16 bit oversampled etc.).

Another reason for the range of encodings is that some encodings include
linkages to other media for reasons of synchronization (e.g. between voice and
lips).

Yet another reason is to provide future proofing against any new media
(holograms?).

Finally, because of the range of performance of different computers, it may be
necessary to have a "meta-protocol" to negotiate what is used between encoder
and decoder. This permits programs to encode a stream of media according to
whatever is convenient to them, while a decoder can then decode it according to
their capabilities.

For example, some HDTV (High Definition Television Standards) are actually a
superset of current standard TV encoding so that a "rougher" picture can be
extracted by existing TV receivers from new HDTV transmissions (or from paying
back new HDTV videotapes). This principle is quite general.

"EVERY ENCODING IS A DECODING"

-------------------------------------------------------------------------------

   * Even numbers and letters have an encoding: ASCII and IEEE Floating Point
   * Each new medium needs to be coded
   * The codings now involve possible relationships between different media
   * Compression, and Hierarchical encoding are also needed
   * Meta-languages (codes for codings) are required
   * First, lets look at some audio and video input forms and digital
     encodings.

ANALOG AND DIGITAL

-------------------------------------------------------------------------------

Digital Audio and Video all start life in the "Analog Domain". (Domain is used
in this context just to mean before or after some particular conversion). It is
important to understand the basic requirements of the media in time and space.
The analog domain is usually best understood in terms of the range of
frequencies in use for a particular quality. For sound, this means how low and
high a note/sound is allowed. For video, this translates into the number of
distinguishable colors. For video, we also have to consider the frame rate.
Video is similar to film in that it consists of a number of discrete frames.
You may recall seeing old films which were shot at a lower frame rate than is
used nowadays, and flicker is visible.

Both sound and image can be broken down at any instant into a set of basic
frequencies. This is the so-called "waveform". We can record all of the
frequencies present at anyone time, or we can choose to record only the
"important" ones. If we choose to record less than all frequencies, we get less
"fidelity" in our recording, so that the playback is less like the original.
However, the less we record, the less tape/recording media we need.

ANALOG AND DIGITAL

-------------------------------------------------------------------------------

   * Audio and Video start as waves
   * Waves need to be sampled digitally
   * We can do this "perfectly" by sampling twice as often digitally as the
     highest analog frequency
   * Or we can take advantage of human frailty and reduce the quality

What we have to work with - Input and Output

ANALOG BANDWIDTH

-------------------------------------------------------------------------------

Analog Audio is in the range 50Hz to 20KHz. Human speech is typically in the
range 1-3KHz, and the telephone networks have taken advantage of this since
very early days by providing only limited quality lines. This has meant that
they can use low quality speakers and microphones in the handset - the quality
is similar to AM radio.

The copper wires used for transmission were, in most systems luckily,
over-engineered. They are capable of carrying a signal at up to 16 times the
'bandwidth' of that used by pure analog phones from the home to the exchange
over a kilometer, and 300 times this bandwidth up to 100 meters. For the
moment, though, the "last mile" or customer subscriber-loop circuits have boxes
at the ends that limit this to what is guaranteed for ordinary audio telephony,
while the rest of the frequencies are used for engineering work.

ANALOG BANDWIDTH

-------------------------------------------------------------------------------

   * Transmission and storage of analog audio (and video) is reasonably
     familiar to all
   * Note though that we tolerate much lower quality audio transmission for the
     phone (3000 Hz) than for entertainment (30khz)
   * This brings home the economics of bandwidth
   * Only recently have transmission techniques got to the point where we might
     consider video down a telephone copper wire, and even then, only a few 100
     metres

TRANSFORMS

-------------------------------------------------------------------------------

An analog signal can be broken down into component signal frequencies. There is
a mathematical theorem due to Fourier, that in fact there are lots of ways of
doing this, but that one particular one, a set of frequencies made up of sine
and cosine waves, is sufficient to represent just about any real waveform.
There are others based just on Cosine, etc. etc.

If you transform a signal into base frequencies, then you can remove detail
from the image, simply by removing high frequency components. For audio, this
results in a lower quality sound, where the tone of the notes may have less
timbre. For video, this results in loss of fine granularity in a picture.

The discrete cosine transform is based on this idea and is fundamental to many
video compression schemes.

TRANSFORMS

-------------------------------------------------------------------------------

   * Fourier showed you could take any signal as a sum of base set of
     frequencies at given strengths
   * Leaving out components (zeroing coefficients for those frequencies)
     doesn't necessary degrade the result much.
   * This is the basis of many compression schemes
   * There are others - run length encoding and Huffman coding are two very
     simple ones.

DIGITAL SAMPLING

-------------------------------------------------------------------------------

You can take snapshots of a waveform as it changes in time, and represent what
you see as a number (or set of numbers). The sequence of numbers is now
something that a computer can store, process, transmit and receive. Such a
sequence is really what we call "multimedia data".

DIGITAL SAMPLING

-------------------------------------------------------------------------------

   * Snapshot of the input in time => sequence of values
   * If snapshot is sufficiently short, the values range can be small
   * Can be stored as a word or byte digitally
   * Au fond, this is digital multimedia - just more bits and bytes!

AUDIO SAMPLING

-------------------------------------------------------------------------------

Analog sound is made by creating waves in the air of compressed and rarefied
air. When such waves in the right frequency range (roughly 1-20KHz) hit the
human ear, we here notes with a particular timbre. By imposing other, complex
modulations on sound, we can form all kinds of neat sounds like speech. Most
speech is made up of sounds between 1 and 3KHz. There's a simple law due to
Shannon that tells us how many bits we need to store or send per second to
represent such a wave - so if we sample the analog signal that often, we have
the simplest possible representation of sound - this is Pulse Code Modulation
(PCM). Other techniques are possible - we could actually store a snapshot of
the frequencies present every instant, and their strengths. (i.e. do a spectrum
analysis of the incoming signal!).

Other things we might want to store about sound are positions (e.g. stereo or
quad image information for each source), and we might want some information
about the resonance and reverberation of the room/space it was originally made
in so that we can reproduce this for people in a different space relative to
different listeners positions at playback time.

This can all take quite a bit of data - the best standard for audio recording
now in use, CD Digital Audio, takes 1.4 Mbps.

AUDIO SAMPLING

-------------------------------------------------------------------------------

   * Audio quality ranges from a few Kbps, to 1.4Mbps (CD)
   * Spatial information can be costly (stereo could require twice the
     bandwidth) but can in some cases be stored more simply).
   * Source room resonance and qualities are usually abandoned, but may prove
     important in the future (VR, Games Telepresence, etc.)

COLOR (OR COLOR)

-------------------------------------------------------------------------------

There are several approaches to color processing:

  1.  Full Color
  2.  Pseudo Color
  3. Grayscale
       1. Color is very complex. Basically, light is from a spectrum
          (continuum), but we typically manipulate colors by manipulating
          discrete things like pens, or the colored dots of phosphor on a CRT
          which emit light of a given intensity but at a single color, when hit
          by an electron of a given energy. There are several ways of mixing
          discrete colors to get a new color that has the right appearance to
          the human eye. The human eye does not perceive a spectrum, but rather
          all colors as combinations of 3 so called primary colors, Red
          (435nm), Green (546nm) and Blue (700nm).
       2. These primaries can be added to produce secondaries, magenta, cyan
          and yellow. [The roles of primary and secondary are reversed in
          pigments, over those in light, since the concern of a dyemaker is
          concerned with which color is absorbed, rather than which is
          transmitted].

COLOUR (OR COLOR)

-------------------------------------------------------------------------------

   * Colour is tricky stuff
   * Most MM users use it too much
   * In natural situations, it is very rich
   * Human perception is not of spectrum, but of approximately RGB
   * So are most cameras now
   * Human mental perception is of a spectrum, though...

COLOR INPUT BY HUMANS

-------------------------------------------------------------------------------

The human eye can perceive a very wide range of colors compared with
grayscales. It actually has different sensors for color than for monochrome.
Color is detected by "cones", cells in the retina that distinguish a range of
different signals, while black and white (mono-chrome) is dealt with by rods.
Rods are actually sensitive to much lower light levels (intensity/power), and
are particularly good at handling motion. Cones are specialized to higher light
levels (hence why color vision doesn't work in dim light levels such as during
dawn/dusk/twilight).

COLOR INPUT BY HUMANS

-------------------------------------------------------------------------------

   * Eye/retina has Rods and Cones
   * Rods see greys and motion
   * Cones see color
   * Respond to 3 wavelengths, and perceive a mix

COLOR INPUT BY COMPUTERS

-------------------------------------------------------------------------------

A color input device such as a video camera has a similar set of sensors to
cones. These respond to different wavelengths with different strengths.
Essentially, a video camera is a digital device, based around an array of such
sensors, and a clock that sweeps across them the same way that the electron gun
in the back of a TV or computer display is scanned back and forth, and up and
down, to refresh the light emission from the dots on the screen. So, for a
single, still frame, a scan produces an array of reports of intensity, one
element for each point in the back of the camera. For a system with 3 color
sensor types, you get an array of triples, values of intensity of light of each
of the sensors at being a real. This is then converted into an analog signal
for normal analog recording. Some devices are emerging where the values can be
directly input to a computer rather than converted to analog, and then have to
be converted to digital again by an expensive frame grabber or video card.
Given the range of intensities the human eye can perceive isn't huge, they are
usually stored digitally in a small number of bits - most usually 8 per color -
hence a "true" color display has 24 bits, 8 bits each for R, G and B. RGB is
the most commonly used computing color model. CMY is just [1] - [RGB], and
vice, versa. [0,0,0] is black, and [255,255,255] is white.

COLOR INPUT BY COMPUTERS

-------------------------------------------------------------------------------

   * Input is usually a 2D array of triples
   * RGB = Red, Green, Blue
   * YUV = Chrominance, Luminance, Value
   * Similar to HSV = Hue/Saturation/Value (or Intensity)
   * CMY = Cyan Magenta, Yellow

COLOR OUTPUT BY COMPUTERS AND OTHER DEVICES

-------------------------------------------------------------------------------

Image or video output is just the reverse of input. Thus an area of memory is
set aside for the "framebuffer". Data written here will be read by the video
controller, and used to control the signal to the display's electron gun for
intensity of each of the colors for the corresponding pixel. By changing what
is in the framebuffer once per scan time, you get motion/animation etc. So to
play back digital video from disk, typically, you read it from disk to the
framebuffer at the appropriate rate, and you have a digital VCR!

"Video RAM" is not usually quite the same as other memory since it is targeted
at good row then column scans rather than true RAM access.

COLOR OUTPUT BY COMPUTERS

-------------------------------------------------------------------------------

   * Output to Framebuffer = VRAM, is n bits of each of RGB
   * If n=8, "True Color"
   * n < 8, can have color maps - values are indexes
   * Color maps lead to flicker or false color
   * n=1, monochrome
   * Greyscale displays can be hi-quality

VIDEO FRAMES

-------------------------------------------------------------------------------

An image received by the retina in the eye persists for a short while. A
sequence of images or frames, with small changes that impinge on the eye
sufficiently close together will give the illusion of a moving pictures. How
much of the picture changes between one image and the next affects how smooth
or how jerky the movement will appear. Frame rates of 10 per second and above
are enough to give reasonably realistic rendition of natural scenes. In fact,
the way that motion is perceived by the human brain means that less detail is
required in fast moving segments of a picture.

[Interlacing is a scan technique used to try to get the persistence of the
image higher without increasing the scan rate - basically, each alternate frame
time, odd or even lines are refreshed].

VIDEO FRAMES

-------------------------------------------------------------------------------

   * Eye and screen have persistence - image lasts a while
   * Screen is refreshed from framebuffer, can last "for ever"
   * Frame rates > 10 fps generally look 'smooth'
   * Frame rates > 20 fps capture fast motion
   * Eye perceives motion with less resolution than still images

OTHER COLOR SCHEMES

-------------------------------------------------------------------------------

There are other ways of storing color - rather than a set of discrete values
that are "added" by the eye, the Hue/Saturation/Value (a.k.a.
Hue/Saturation/Intensity) scheme stores three different values:

  1. frequency (hue/true-color),
  2. saturation, the amount a color is "diluted" by all the other colors or
     white
  3. Intensity

  1. This is useful, since we can process intensity separately. Conversion from
     RGB to HSV is pretty straightforward.

OTHER COLOR SCHEMES

-------------------------------------------------------------------------------

   * Can store real values if input is from spectrum analyzer
   * E.g. HSV
   * Hue = frequency
   * Saturation = dilution
   * Value = intensity

HYBRID ANALOG VIDEO SYSTEMS

-------------------------------------------------------------------------------

Early video on computers was (and still is in some cases) provided by a hybrid
approach. Basically, any computer with an bitmap display could have a dual port
into the video controller, and the signal that was used to drive the display
intercepted for portions of the scan of the CRT, and an external video signal
used. This would result in perfect video in a sub-area of the display. The only
problem is that the video was at no stage digitised, and is therefore not
amenable to capture and processing.

A later version of this trick (dare one say hack) is to digitise the external
video, and write it into a dual ported framebuffer (the video memory that the
controller scans to update the display). However, if the video card was a
replacement for the actual standard computers video RAM, often it would be that
the access to read the part of the framebuffer holding the video would be
significantly slower than full video speed (so much so that even a single still
frame grab might not be feasible from the CPU.

Another hybrid approach to multimedia is where storage devices are used that
have hybrid recording tracks - this (contrary to the hacks above) is(was)
genuinely useful. Where a high quality film or sound track might be put onto
very high density mag tape, a separate index track might be put alongside,
digitally.

This could be used by editing systems to create edit sequences, so that a
mix-down of the analog track could be performed many times without making any
generational copies, until the editor is satisfied., Then the final actual mix
down from the master tape to a master copy (e.g. CD, or analog vinyl!) could be
done automatically. If the density/quality of the master material is very high
and precludes use of compression (say due to lack of technology or money) this
is a very useful technique.

HYBRID ANALOG VIDEO SYSTEMS

-------------------------------------------------------------------------------

   * Hybrid systems combine analog with digital
   * Often used to "dual port" a screen for external input
   * Sometimes use dual port VRAM/Framebuffer
   * Most useful for digital indexing of stored high quality analog
   * Can provide no-copy editing facilities

STILL IMAGES

-------------------------------------------------------------------------------

There are many many bitmap or other image formats. Since many are functionally
equivalent, this has led to a plethora of tools to convert betwixt and between
them - GIF, TIF, WMF, PPM, PBM, etc. etc.

The main compressed still image form for quality multimedia is based on the
JPEG standard, but since this is also used for Video ("motion JPEG") we discuss
this below.

STILL IMAGES

-------------------------------------------------------------------------------

   * Still digital image formats are many
   * GIF is currently proprietary, compressed form
   * TIF, TIFF, PPM and PBM are all Public Domain
   * WMF is commonly used
   * JPEG is standard, very good for Photos/Natural scenes, high quality
     compression, though lossy

INPUT MEDIA FORMATS

-------------------------------------------------------------------------------

There are two main audio encodings in common use in the world:

CD (Compact Disk) and PCM (Pulse Code Modulation). PCM is from the telephony
world, and is described with other audio encodings later. CD is from the
entertainment business.

Most common video encodings are based on those from the TV industry and
standards world.

INPUT MEDIA FORMATS

-------------------------------------------------------------------------------

   * Audio typically arrives as 64Kbps PCM or 1.4Mbps CD
   * Video typically in Common Interchange Format, but...
   * Differs in aspect ration (height * width) for:
   * NTSC, SECAM, PAL, etc.

PAL/NTSC/SECAM

Before you can digitize a moving image, you need to know what the analog form
is, in terms of resolution and frame rate. Unfortunately, there are 3 main
standards in use. PAL is used in the UK, while NTSC is used in the US and in a
modified form in JAPAN, and SECAM is used in France and Russia. The differences
are in number of lines, framerate, scan order and so forth.

   * PAL
   * NTSC
   * SECAM

PAL/NTSC/SECAM

   * PAL used in UK
   * NTSC in USA and Japan
   * SECAM in France and Russia
   * Differ in lines, framerate, interlace order and so on

HDTV

High Definition TV has yet to make it into standards. One problem is that the
technology has moved quite quickly, so although the Japanese and Americans were
ready to roll with a double resolution standard a few years back, noone would
accept this as they foresaw a short lifetime for an inferior technology.

HDTV

   * Not wide standard yet
   * Too high data rate for current computer storage, processing or
     transmission
   * Standard TV as sub-sample would have been nice (DMAC etc.)

DATA COMPRESSION

-------------------------------------------------------------------------------

Devices that encode and decode as well compress and decompress are called
CODECs or CODer DECoders. Sometimes, these terms are used for audio, but mainly
they are for video devices.

A video CODEC can be anything from the simplest A2D device, through to
something that does picture pre-processing, and even has network adapters build
into it (i.e. a videophone!). A CODEC usually does most of its work in
hardware, but there is no reason not to implement everything (except the a2d
capture:-), in software on a reasonably fast processor.

The most expensive and complex component of a CODEC is the
compression/decompression part. There are a number of international standards,
as well as any number of proprietary compression techniques for video.

DATA COMPRESSION

-------------------------------------------------------------------------------

   * Data (files etc.) typically compressed using Huffman codes or Run Length,
     or clever statistical rules such as Lempel-Ziv
   * Audio and Video are loss tolerant, so can use cleverer compression that
     discards some information
   * Compression of 400 times is possible on video - useful given the base
     uncompressed data rate of a 25 fps CIF image is 140Mbps
   * A lot of standards for this now
   * Some good proprietary techniques
   * Note that lossy compression of video is not acceptable to some classes of
     user (e.g. radiologist, or air traffic controller).

Video compression

VIDEO COMPRESSION

-------------------------------------------------------------------------------

Video compression can take away the requirement for the very high data rates
and move video transmission and storage into a very similar regime to that for
audio. In fact, in terms of tolerance for poor quality, it seems humans are
better at adapting to poor visual information than poor audio information. A
simple minded calculation shows:

1024 x 1024 pixels,

   * 3 bytes per pixel (24 bit RGB)
   * 25 Frames per second

yields 75Mbytes/second, or 600Mbps - this is right on the limit of modern
transmission capacity.

Even in this age of deregulation and cheaper telecoms, and larger, faster
disks, this is profligate.

On the other hand, for a scene with a human face in, as few as 64 pixels
square, and 10 frames per second might suffice for a meaningful image.

   * 64x 64 pixels
   * 3 bytes per pixel (24 bit RGB)
   * 10 Frames per second

yields 122KBytes/Second, or just under 1 Mbps - this is achievable on modern
LANs and high speed WANs but still not friendly!

Notice that in the last simple example, we did two things to the picture.

   * 1. We used less "space" for each frame by sending less "detail".
   * 2. We sent frames less frequently since little is moving.

  1. This is a clue as to how to go about improving things. Basically, if there
     isn't much information to send, we avoid sending it. Spatial and temporal
     domain compression are both used in many of the standards.

VIDEO COMPRESSION

-------------------------------------------------------------------------------

1024 x 1024 pixels,

   * 3 bytes per pixel (24 bit RGB)
   * 25 Frames per second

yields 75Mbytes/second, or 600Mbps!!!

   * 1. We could use less "space" for each frame by sending less "detail".
   * 2. We could send frames less frequently since little is moving.

LOSSY VERSUS LOSSLESS COMPRESSION

-------------------------------------------------------------------------------

If a frame contains a lot of image that is the same, maybe we can encode this
with less bits without losing any information (run length encode, use logically
larger pixels etc. etc.). On the other hand, we can take advantage of other
features of natural scenes to reduce the amount of bits - for example, nature
is very fractal, or self-similar:- there are lots of features, sky, grass,
lines on face etc., that are repetitive at any level of detail. If we leave out
some levels of detail, the eye (and human visual cortex processing) end up
being fooled a lot of the time.

LOSSY VERSUS LOSSLESS COMPRESSION

-------------------------------------------------------------------------------

   * If area of input doesn't change, don't send it
   * If area of input doesn't change much, don't send it
   * If moving area is detailed, could send "fuzzy" version
   * If still area has detail, could send this slower than large features
   * All depends on human frailty!

HIERARCHICAL CODING

-------------------------------------------------------------------------------

Hierarchical coding is based on the idea that coding will be in the form of
quality hierarchy where the lowest layer of hierarchy contains the minimum
information for intelligibility. Succeeding layers of the hierarchy adds
increasing quality to the scheme.

This compression mechanism is ideal for transmission over packet switched
networks where the network resources are shared between many traffic streams
and delays, losses and errors are expected.

Packets will carry data from only one layer, accordingly packets can be marked
according to their importance for intelligibility for the end-user. The network
would use these information as a measure of what sort of packets to be dropped,
delayed and what should take priority. It should be noted that priority bits
already exist in some protocols such as the IP protocol.

Hierarchical coding will also be ideal to deal with multicasting transmission
over links with different bandwidths. To deal with such problem in a
non-hierarchical encoding scheme, either the whole multicasting traffic adapts
to the lowest bandwidth link capabilities thus degrading the video/audio
quality where it could have been better or causing the low link to suffer from
congestion and thus sites affected will lose some of the intelligibility in
their received video/audio. With hierarchical coding, low level packets can be
filtered out whenever a low bandwidth link is encountered thus preserving the
intelligibility of the video/audio for the sites affected by these links and
still delivering a better quality to sites with higher bandwidth.

Schemes that are now in relatively commonplace use include H.261 for
videotelephony, MPEG for digital TV and VCRs and JPEG for still images. Most
current standards are based on one simple technique, so first lets look at
that.

HIERARCHICAL CODING

-------------------------------------------------------------------------------

   * Last idea was that levels of detail can be sent at different rates or
     priorities
   * Can be useful if there are different users (e.g. in a TV broadcast, or
     Internet multicast)
   * Can be useful for deciding what to lose in the face of overload or lack of
     disk storage etc.
   * Many of the video encodings (and still picture standards) are well suited
     to this.

JPEG

-------------------------------------------------------------------------------

The JPEG standard`s goal has been to develop a method for continuous-tone image
compression for both color and greyscale images. The standard define four
modes:

   * Sequential In this mode each image is encoded in a single left-to-right,
     top-to-bottom scan. This mode is the simplest and most implemented one in
     both hardware and software implementation.
   * Progressive In this mode the image is encoded in multiple scans. This is
     helpful for applications in which transmission time is too long and the
     viewer prefers to watch the image building in multiple coarse-to-clear
     passes.
   * Lossless The image here is encoded to guarantee exact recovery of every
     source image sample value. This is important to applications where any
     small loss of image data is significant. Some medical applications do need
     that mode.
   * Hierarchical Here the image is encoded at multiple resolutions, so that
     low-resolution versions may be decoded without having to decode the higher
     resolution versions. This mode is beneficial when transmission over packet
     switched networks. Only the data significant for a certain resolution
     determined by the application can be transmitted, thus allowing more
     applications to share the same network resources. In real time
     transmission cases (e.g. an image pulled out of an information server and
     synchronized with a real-time video clip), a congested network can start
     dropping packets containing the highest resolution data resulting in a
     degraded quality of the image instead of delay.

JPEG uses the Discrete Cosine Transform to compress spatial redundancy within
an image in all of its modes apart from the lossless one where a predictive
method issued instead.

As JPEG was essentially designed for the compression of still images, it makes
no use of temporal redundancy which is a very important element in most video
compression schemes. Thus, despite the availability of real-time JPEG video
compression hardware, its use will be quite limit due to its poor video
quality.

JPEG

-------------------------------------------------------------------------------

   * JPEG has 4 modes

  1. Sequential scanned left to right, top to bottom
  2. Progressive - coarse to clear
  3. Lossless
  4. Hierarchical

   * Uses the Discrete Cosine Transform to encode and compress blocks

H261

-------------------------------------------------------------------------------

H261 is the most widely used international video compression standard for video
conferencing. The standard describes the video coding and decoding methods for
the moving picture component of a audiovisual service at the rates of p * 64
kbps where p is in the range of 1 to 30. The standard targets and is really
suitable for applications using circuit switched networks as their transmission
channels. This is understandable as ISDN with both basic and primary rate
access was the communication channel considered within the framework of the
standard.

H.261 is usually used in conjunction with other control and framing standards
such as H221, H230 H242 and H320, of which more later.

H.261

-------------------------------------------------------------------------------

   * ITU (was CCITT) standard for video telephony
   * Very commonly implemented now in hardware and software
   * aimed at ISDN, anything from 64Kbps to 2Mbps
   * PC cards to do video, audio and ISDN exist
   * Used with other standards for communications and conference control.

H.261 SOURCE IMAGES format

The source coder operates on only non-interlaced pictures. Pictures are coded
as luminance and two color difference components(Y, Cb, Cr). The Cb and Cr
matrices are half the size of the Y matrix.

H261 supports two image resolutions, QCIF which is (144x176 pixels)and ,
optionally, CIF which is(288x352).

H.261 SOURCE IMAGES format

   * [Image]
   * The diagram shows the sampling of Chrominance and Luminance.
   * H.261 supports two resolutions:

  1. CIF = 288*352 pixels
  2. QCIF = 144*176 pixels

H.261 SOURCE CODER

   * There main elements in an H.261 encoder are

  1. Prediction
  2. Block Transformation
  3. Quantization

H.261 SOURCE CODER

[Image] Encoder

H.261 Prediction

H261 defines two types of coding. INTRA coding where blocks of 8x8 pixels each
are encoded only with reference to themselves and are sent directly to the
block transformation process. On the other hand INTER coding frames are encoded
with respect to another reference frame.

A prediction error is calculated between a 16x16 pixel region (macroblock) and
the (recovered) correspondent macroblock in the previous frame. Prediction
error of transmitted blocks (criteria of transmission is not standardized) are
then sent to the block transformation process.

H.261 prediction

   * Blocks are inter or intra coded
   * Intra-coded blocks stand alone
   * Inter-coded blocks are based on predicted error between the previous frame
     and this one
   * Intra-coded frames must be sent with a minimum frequency to avoid loss of
     synchronisation of sender and receiver.


H.261 Block transformation

H261 supports motion compensation in the encoder as an option. In motion
compensation a search area is constructed in the previous (recovered) frame to
determine the best reference macroblock . Both the prediction error as well as
the motion vectors specifying the value and direction of displacement between
the encoded macroblock and the chosen reference are sent. The search area as
well as how to compute the motion vectors are not subject to standardization.
Both horizontal and vertical components of the vectors must have integer values
in the range + 15 and 15 though

In block transformation, INTRA coded frames as well as prediction errors will
be composed into 8x8 blocks. Each block will be processed by a two-dimensional
FDCT function.

H.261 Block Transformation

   * Each Block (and prediction error) is an 8*8 pixel square
   * It is coded as a forward discrete cosine transform
   * If this sounds expensive, there are fast table driven algorithms
   * Can be done in s/w quite easily, as well as very easily in h/w

H.261 Quantization & Entropy Coding

The purpose of this step is to achieve further compression by representing the
DCT coefficients with no greater precision than is necessary to achieve the
required quality. The number of quantizers are 1 for the INTRA dc coefficients
and 31 for all others.

Entropy coding involves extra compression (non-lossy) is done by assigning
shorter code-words to frequent events and longer code-words to less frequent
events. Huffman coding is usually used to implement this step.

H.261 Quantization

   * For a given quality, we can lose coefficients of the transform by using
     less bits than would be needed for all the values
   * Leads to a "coarser" picture
   * Can then entropy code the final set of values by using shorter words for
     the most common values and longer ones for rarer ones (like using 8 bits
     for three letter words in English:-)

H.261 Multiplexing

The video multiplexer structures the compressed data into a hierarchical
bitstream that can be universally interpreted.

The hierarchy has four layers :

   * Picture layer : corresponds to one video picture (frame)
   * Group of blocks: corresponds to 1/12 of CIF pictures or 1/3 of QCIF

   * Macroblocks : corresponds to 16x16 pixels of luminance and the two
     spatially corresponding 8x8 chrominance components.
   * Blocks: corresponds to 8x8 pixels

H.261 Multiplexing.

   * Bitstream made up of 4 things:

  1. Pictures (A video frame)
  2. Groups of Blocks (1/3 of QCIF picture)
  3. Macroblocks (16*16 luminence and 2 8*8 Chrominence)
  4. Blocks (8*8 pixels)

H.261 Error Correction Framing

An error correction framing structure is described in the H261 standard. The
frame structure is shown in the figure. The BCH(511,493) parity is used to
protect the bit stream transmitted over ISDN and is optional to the decoder.
The fill bit indicator allows data padding thus ensuring the transmission on
every valid clock cycle

H.261 Error Correction and Framing

   * The framing structure for H.261 is H.221, which includes a FEC scheme, as
     shown in the 3 diagrams below.

[Image] H.261 FEC

[Image] H221 Structure

[Image] H221 Framing

H.261 Summary

Though H261 as mentioned before can be considered the most widely video
compression standard used in the field of multimedia conferencing, it has its
limitations as far as its suitability for transmission over PSDN. H261 does not
map naturally onto hierarchical coding. A few suggestions has been made as to
how this can happen but as a standard there is no support of that. H261
resolution is fine for conferencing applications. Once more quality critical
video data need to be compressed, the upper limit optional CIF resolution can
start showing inadequate.

H.261 Summary

   * H.261 is good for Videotelephony and conferencing
   * Currently mainly used over ISDN, but could be used over packet nets.
   * Hierarchical use not part of the standard (yet)
   * At 2Mbps, it approximates to entertainment quality (VHS) video.

MPEG

-------------------------------------------------------------------------------

The aim of the MPEG-II video compression standard is to cater for the growing
need of generic coding methods for moving images for various applications such
as digital storage and communication. So unlike the H261 standard who was
specifically designed for the compression of moving images for video
conferencing systems at p * 64kbps , MPEG is considering a wider scope of
applications.

MPEG

   * Aimed at storage as well as transmission
   * Higher cost and quality than H.261
   * Higher minimum bandwidth
   * Decoder is just about implementable in software
   * Target 2Mbps to 8Mbps really.
   * The "CD" of Video?

MPEG SOURCE IMAGES format

The source pictures consist of three rectangular matrices of integers: a
luminance matrix (Y) and two chrominance matrices (Cb and Cr).

The MPEG supports three format :

   * 4:2:0 format

  1. In this format the Cb and Cr matrices shall be one half the size of the Y
     matrix in both horizontal and vertical dimensions.

   * 4:2:2 format

  1. In this format the Cb and Cr matrices shall be one half the size of the Y
     matrix in horizontal dimension and the same size in the vertical
     dimension.

   * 4:4:4 format

  1. In this format the Cb and Cr matrices will be of the same size as the Y
     matrix in both vertical and horizontal dimensions.

MPEG Source Images Format

   * YUC sampling in 4 forms
   * 4:2:0, 4:2:2, 4:4:4
   * Looking at some video capture cards (e.g. Intel's PC one) it may be hard
     to convert to this
   * But then this is targeted at digital video tape and video on demand
     really.

MPEG frames

The output of the decoding process, for interlaced sequences, consists of a
series of fields that are separated in time by a field period. The two fields
of a frame may be coded independently (field-pictures) or can be coded together
as a frame (frame pictures).

MPEG Frames

The diagram shows the intra, predictive and bi-directional frames that MPEG
supports:

[Image] MPEG

MPEG source coder

An MPEG source encoder will consist of the following elements:

   * Prediction (3 frame times)
   * Block Transformation
   * Quantization and Variable Length Encoding

MPEG Prediction

  1. MPEG defines three types of pictures:

1. Intrapictures (I-pictures)

These pictures are encoded only with respect to themselves. Here each picture
is composed onto blocks of 8x8 pixels each that are encoded only with respect
to themselves and are sent directly to the block transformation process.

2. Predictive pictures (P-pictures)

These are pictures encoded using motion compensated prediction from a past
I-picture or P-picture. A prediction error is calculated between a 16x16 pixels
region (macroblock) in the current picture and the past reference I or P
picture. A motion vector is also calculated to determine the value and
direction of the prediction. For progressive sequences and interlaced sequences
with frame-coding only one motion vector will be calculated for the P-pictures.
For interlace sequences with field-coding two motion vectors will be
calculated. The prediction error is then composed to 8x8 pixels blocks and sent
to the block transformation

3. Bi-directional pictures (B-pictures)

These are pictures encoded using motion compensates predictions from a past
and/or future I-picture or P-picture. A prediction error is calculated between
a 16x16 pixels region in the current picture and the past as well as future
reference I-picture or P-picture. Two motion vectors are calculated. One to
determine the value and direction of the forward prediction the other to
determine the value and direction of the backward prediction. For field-coding
pictures in interlaced sequences four motion vectors will thus be calculated.

It must be noted that a B-picture can never be used as a prediction picture.

The method of calculating the motion vectors as well as the search area for the
best predictor is left to be determined by the encoder.

MPEG prediction

   * I pictures are encoded as intra- w.r.t themselves only
   * P-pictures are coded w.r.t the last I-Picture (including any motion
     compensation)
   * B-Pictures use forward and backward predictions to encode w.r.t other I or
     P Pictures

MPEG Block Transformation

  1. In block transformation, INTRA coded blocks as well as prediction errors
     are processed by a two-dimensional DCT function.
        o Quantization
            1. The purpose of this step is to achieve further compression by
               representing the DCT coefficients with no greater precision than
               is necessary to achieve the required quality.
        o Variable length encoding
            1. Here extra compression (non-lossy) is done by assigning shorter
               code-words to frequent events and longer code-words to less
               frequent events. Huffman coding is usually used to implement
               this step.

MPEG Block Transformation

   * As with H.261, frames are compressed using discrete cosine transforms
   * These are (again) quantized and the resulting values Huffman coded
   * There are, however, a few more things to MPEG
       1.
       2.

MPEG Multiplexing

  1. The video multiplexer structures the compressed data into a hierarchical
     bitstream that can be universally interpreted.
  2. The hierarchy has four layers :
        o Videosequence
  3. This is the highest syntactic structure of the coded bitstream. It can be
     looked at as a random access unit.
        o Group of pictures

This is optional in MPEG II. This corresponds to a series of pictures. The
first picture in the coded bitstream has to be an I picture. Group of pictures
does assist random access. They can also be used at scenes cuts or other cases
where motion compensation is ineffective. Applications requiring random access,
fast-forwarder fast-reverse playback may use relatively short group of
pictures.

   * Picture

This would correspond to one picture in the video sequence. For field pictures
in interlaced sequences, the interlaced picture will be represented by two
separate pictures in the coded stream. They will be encoded in the same order
that shall occur at the output of the decoder.

   * Slice

This corresponds to a group of Macroblocks. The actual number of Macroblocks
within a slice is not subject to standardization. Slices do not have to cover
the whole picture. Its a requirement that if the picture was used subsequently
for predictions, then predictions shall only be made from those regions of the
picture that were enclosed in slices.

   * Macroblock

  1. A macro block contains a section of the luminance component and the
     spatially corresponding chrominance components. A 4:2:0 macroblock
     consists of 6 blocks (4Y, 1 Cb, 1Cr) A 4:2:2 Macroblock consists of 8
     blocks (4Y, 2 Cb, 2 Cr) A4:4:4 Macroblock consists of 12 blocks (4Y,4Cb,
     4Cr)
        o Block
  2. Corresponds to 8x8 pixels.

MPEG multiplexing

The structure of the MPEG bitstream is a tad more complex than that of H.261:

   * Video Sequence
   * Group of Pictures
   * Picture
   * Slice
   * Macroblock
   * Block

  1.

MPEG Picture Order

It must be noted that in MPEG the order of the picture in the coded stream is
the order in which the decoder process them. The reconstructed frames are not
necessarily in the correct form of display. The following example shows such a
case

   * At the encoder input,

12 3 4 5 6 78 9 10 11 12 13

IB B P B B PB B I B B P

   * At the encoder output, in the coded bitstream and at the decoder input,

14 2 3 7 5 610 8 9 13 11 12

IP B B P B BI B B P B B

At the decoder output:

12 3 4 5 6 78 9 10 11 12 13

MPEG Picture order

   * The order of pictures at the decoder is not the display order, always
   * This leads to potential for delays in the encoder/decoder loop
   * This is also true of H.261 - at its highest compression ratio, it may
     incur as much as 0.5 seconds delay - not very pleasant for interactive
     use!

SCALEABLE EXTENSIONS

The scalability tools specified by MPEG II are designed to support applications
beyond that supported by single layer video. In a scaleable video coding, it is
assumed that given an encoded bitstream, decoders of various complexities can
decode and display appropriate reproductions of coded video. The basic
scalability tools offered are: data partitioning, SNR scalability, spatial
scalability and temporal scalability. Combinations of these basic scalability
tools are also supported and are referred to as hybrid scalability. In the case
of basic scalability, two layers of video referred to as the lower layer and
the enhancement layer are allowed. Whereas in hybrid scalability up to three
layers are supported.

MPEG Extensions

   * Spatial scalable extension

  1. This involves generating two spatial resolution video layers from a single
     video source such that the lower layer is coded by itself to provide the
     basic spatial resolution and the enhancement layer employs the spatially
     interpolated lower layer and carries the full spatial resolution of the
     input video source.

   * SNR scalable extension

  1. This involves generating two video layers of same spatial resolution but
     different video qualities from a single video source. The lower layer is
     coded by itself to provide the basic video quality and the enhancement
     layer is coded to enhance the lower layer. The enhancement layer when
     added back to the lower layer regenerates a higher quality reproduction of
     the input video.

   * Temporal scalable extension

  1. This involves generating two video layers whereas the lower one is encoded
     by itself to provide the basic temporal rate and the enhancement layer is
     coded with temporal prediction with respect to the lower layer. These
     layers when decoded and temporally multiplexed yield full temporal
     resolution of the video source.

   * Data partitioning extension

  1. This involves the partitioning of the video coded bitstream into two
     parts. One part will carry the more critical parts of the bitstream such
     as headers , motion vectors and DC coefficients). The other part will
     carry less critical data such as the higher DCT coefficients.

   * Profiles and levels

Profiles and levels provide a means of defining subsets of the syntax and
semantics and thereby the decoder capabilities to decode a certain stream. A
profile is a defined sub-set of the entire bitstream syntax that is defined by
MPEG II. A level is a defined set of constraints imposed on parameters in the
bit stream.

MPEG Extensions

   * Can encode different levels of spatial or temporal quality
   * Can partition the bitstream appropriately
   * Can profile an MPEG encoder.

MPEG II Profiles

Five profiles are defined :

  1. Simple
  2. Main
  3. SNR scalable
  4. Spatially scalable
  5. High

Along with four levels

  1. Low
  2. Main
  3. High 1440
  4. High

MPEG Profiles

   * Important to realize specification is of encoded stream
   * Leaves lots of options open to the implementor
   * Profiles allow us to scope these choices (as in other standards, e.g. in
     telecommuncations)
   * This is important, as the hard work (expensive end) is the encoder, while
     the stream as specified, is generally easy however it Is implemented, to
     decode.
   * For information, the diagram shows a comparison of the data rate out of an
     H.261 and an MPEG coder

[Image] h261 v mpeg

MPEG II

MPEG II is now an ISO standard. Due to the forward and backward temporal
compression used by MPEG, a better compression and better quality can be
produced. As MPEG does not limit the picture resolution, high resolution data
can still be compressed using MPEG. The scaleable extensions defined by MPEG
can map neatly on the hierarchical scheme explained in 2.1. The out-of- order
processing which occurs in both encoding and decoding side can introduce
considerable latencies. This is undesirable in video telephony and video
conferencing.

Prices for hardware MPEG encoders are quite expensive at the moment though this
should change over the near future. The new SunVideo board (see below) does
support MPEG I encoding. Software implementation of MPEG I DECoders are already
available.

MPEG II

   * MPEG II now an ISO standard
   * Slightly better than MPEG I
   * CODECs very very pricey right now
   * Software for decoders exists (in the public domain) and performs
     reasonably well for small pictures.

MPEG III and IV

MPEG III was going to be a higher quality encoding for HDTV. It transpired
after some studies that MPEG II at higher rates is pretty good, and so MPEG III
has been dropped.

MPEG IV is aimed at the opposite extreme - that of low bandwidth or low storage
capacity environments (e.g. PDAs). It is based around model-based image coding
schemes (i.e. knowing what is in the picture!). It is aimed at UP TO 64kbps.

MPEG III and IV

   * MPEG III was going to be High Definition MPEGII
   * Turns out MPEG II at higher rates is good enough
   * MPEG IV is for lower rates, such as a few 10s kbps

SUBBAND CODING

-------------------------------------------------------------------------------

Subband coding is given as an example of an encoding algorithm that can neatly
map onto hierarchical coding. There are other examples of hierarchical encoding
none of them is a standard or widely used as the international standards such
as H261 and MPEG.

Subband coding is based on the fact that the low spatial frequencies components
of a picture do carry most of the information within the picture. The picture
can thus be divided into its spatial frequencies components and then the
coefficients are quantized describing the image band according to their
importance; lower frequencies being more important. The most obvious mapping is
to allocate each subband (frequency) to one of the hierarchy layers. If
inter-frame coding is used, it has to be adjusted as not to create any upward
dependencies.

Subband Coding

   * Layered or subband coding uses a repeated application of the coder to
     different spatial frequencies in the picture
   * Similar to the ideas in H.261 and MPEG but applied more directly
   * Have to take care with inter-frame coding interactions with a subband
     coding scheme (areas change in detail...)

DVI

-------------------------------------------------------------------------------

Intel's Digital Video Interactive compression scheme is based on the region
encoding technique. Each picture is divided into regions which in turn is split
into subregions and so on, until the regions can be mapped onto basic shapes to
fit the required bandwidth and quality. The chosen shapes can be reproduced
well at the decoder. The data sent is a description of the region tree and of
the shapes at the leaves. This is an asymmetric coding, which requires large
amount of processing for the encoding and less for the decoding.

DVI ,though not a standard, started to play an important role in the market.
SUN prototype DIME board used DVI compression and it was planned to be
incorporated in the new generation of SUN videopix cards.

This turned out to be untrue. Intel canceled the development of the V3 DVI
chips. SUN next generation of VideoPix, the SunVideo card does not support DVI.
The future of DVI is all in doubt.

DVI

-------------------------------------------------------------------------------

   * Region based coding scheme
   * Good compression
   * No loss tolerance
   * Chipset was developed by Intel
   * Not popular anymore

CELLB COMPRESSION

-------------------------------------------------------------------------------

CellB image compression is introduced by SUN and is supported by its new
SunVideo cards. CellB is based on the techniques of block truncation and vector
quantization.

In vector quantization, the picture is divided into blocks and the coefficients
describing the blocks are used as vectors. As the vector space in which the
block vectors exist would not be evenly populated by the blocks, the vector
space can be divided into subspaces selected to provide equal probability of a
random vector being in any of the subspaces. A prototype vector will be then
used to represent all blocks whose vectors fall into a certain subspace.

The most processor intensive part of vector quantization is the generation of
the codebook, that is the division of the vector space into subspaces. Then a
copy of the codebook is sent to the other end. The image is then divided into
blocks which is represented by the vector in the codebook that is closest to it
and the label is sent. Decoding is done by looking up the labels in the code
book and use the correspondent vector to represent the block.

CellB uses two fixed codebooks. It takes 3-band YUV images as input, the width
and height must be dividable by 4. The video is broken into cells of 16 pixels
each arranged in 4x4 group. The 16 pixels in a cell are represented by a 16-bit
mask and two intensities or colors. These values specify which intensity to
place at each of the pixel positions. The mask and intensities

can be chosen to maintain certain statistics of the cell, or they can be chosen
to reduce contouring in a manner similar to ordered dither. This method is
called Block Truncation Coding. It takes advantage of the primitives already
implemented in graphics accelerators to provide video decoding.

CELLB

-------------------------------------------------------------------------------

   * Proprietary Sun Microsystems
   * Implemented on their video cards
   * Good loss tolerance
   * based on vector quantization

See the diagram

[Image] VQ

QUICKTIME AND VIDEO FOR WINDOWS

-------------------------------------------------------------------------------

Apple and Microsoft have both defined standards for their respective systems to
accommodate video. However, in both cases, they are more concerned with
defining a usable API so that program developers can generate applications that
interwork quickly and effectively. Thus, Video for Windows and QuickTIme both
specify the ways that video can be displayed and processed within the framework
of the GUI systems on MS-Windows and Apple systems. However, neither specifies
a specific video encoding. Rather, they assume that all kinds of encodings will
be available through hardware CODECs or through software and thus they provide
meta-systems that allow the programmer to name the encoding, and provide
translations.

QUICKTIME & VIDEO FOR WINDOWS

-------------------------------------------------------------------------------

   * Apple and Microsoft rely on hardware manufacturers for processors
   * Neither specify a particular video format
   * Rather, specify a framework for accommodating many video formats
   * Also specify an API for manipulating and displaying video widgets

AUDIO Compression standards

THE CCITT AUDIO FAMILY

-------------------------------------------------------------------------------

The fundamental standard upon which all videoconferencing applications are
based is G.711 , which defines Pulse Code Modulation(PCM). In PCM, a sample
representing the instantaneous amplitude of the input waveform is taken
regularly, the recommended rate being 8000 samples/s (50 ppm). At this sampling
rate frequencies up to 3400--4000Hz are encodable. Empirically, this has been
demonstrated to be adequate for voice communication, and, indeed, even seems to
provide a music quality acceptable in the noisy environment around computers
(or perhaps my hearing is failing). The samples taken are assigned one of 212
values, the range being necessary in order to minimize signal-to-noise ratio
(SNR) at low volumes. These samples are then compressed to 8 bits using a
logarithmic encoding according to either of two laws (A-law and =-law). In
telecommunications, A-law encoding tends to be more widely used in Europe,
whilst =-law predominates in the US However, since most workstations originate
outside Europe, the sound chips within them tend to obey =-law. In either case,
the reason that a logarithmic compression technique is preferred to a linear
one is that it more readily represents the way humans perceive audio. We are
more sensitive to small changes at low volume than the same changes at high
volume; consequently, lower volumes are represented with greater accuracy than
high volumes.

CCITT AUDIO FAMILY

-------------------------------------------------------------------------------

   * Based on G.711, Pulse Code Modulation
   * 8000 samples/second
   * Assigned one of 212 values (8 bits) either A or Mu law

ADPCM

-------------------------------------------------------------------------------

ADPCM (G.721) allows for the compression of PCM encoded input whose power
varies with time. Feedback of a reconstructed version of the input signal is
subtracted from the actual input signal, which is then quantised to give a 4
bit output value. This compression gives a 32 kbit/s output rate. This standard
was recently extended in G.726 , which replaces both G.721 and G.723 , to allow
conversion between 64 kbit/s PCM and 40, 32, 24, or 16 kbit/s channels. G.727
is an extension of G.726 and issued for embedded ADPCM on 40, 32, 24, or 16
kbit/s channels, with the specific intention of being used in packetised speech
systems utilizing the Packetized Voice Protocol (PVP), defined in G.764.

The encoding of higher quality speech (50Hz--7kHz) is covered in G.722 and
G.725 , and is achieved by utilizing sub-band ADPCM coding on two frequency
sub-bands; the output rate is 64 kbit/s.

ADPCM

-------------------------------------------------------------------------------

   * Adaptive Differential Pulse Code Modulation
   * G.721 - compresses down to 16Kbps
   * Can be good quality

LPC AND CELP

-------------------------------------------------------------------------------

LPC (Linear Predictive Coding) is used to compress audio at 16 Kbit/s and
below. In this method the encoder fits speech to a simple, analytic model of
the vocal tract. Only the parameters describing the best-fit model is
transmitted to the decoder. An LPC decoder uses those parameters to generate
synthetic speech that is usually very similar to the original. The result is
intelligible but machine-sound like talking.

CELP (Code Excited Linear Predictor) is quite similar to LPC. CELP encoder does
the same LPC modeling but then computes the errors between the original speech
and the synthetic model and transmits both model parameters and a very
compressed representation of the errors. The compressed representation is an
index into a 'code book' shared between encoders and decoders. The result of
CELP is a much higher quality speech at low data rate.

LPC AND CELP

-------------------------------------------------------------------------------

   * Linear Predictive Coding
   * Code Excited Linear Prediction
   * Both achieve massive compression at expense of Dalek sounds
   * Lossy schemes - only use if desperate!

MPEG AUDIO

-------------------------------------------------------------------------------

High quality audio compression is supported by MPEG. MPEG I defines sample
rates of 48 KHz, 44.1 KHz and 32 KHz. MPEG II adds three other frequencies , 16
KHz, 22,05 and 24 KHz. MPEG I allows for two audio channels where as MPEG II
allows five audio channels plus an additional low frequency enhancement
channel.

MPEG defines three compression levels that is Audio Layer I, II and III. Layer
I is the simplest, a sub-band coder with a psycho-acoustic model. Layer II adds
more advanced bit allocation techniques and greater accuracy. Layer III adds a
hybrid filterbank and non-uniform quantization. Layer I, II and III gives
increasing quality/compression ratios with increasing complexity and demands on
processing power.

MPEG AUDIO

-------------------------------------------------------------------------------

   * High quality
   * 32Khz - 48khz
   * Based on psycho-acoustic model
   * Costly to encode (again!)

Video Conference CONTROL standards

H221

-------------------------------------------------------------------------------

H221 is the most important control standard when considered in the context of
equipment designed for ISDN specially current hardware video CODECs. It defines
the frame structure for audiovisual services in one or multiple B or H0
channels or single H11 or H12 channel at rates of between 64 and 1920 Kbit/s.
It allows the synchronization of multiple 64 or 384 Kbit/s connections and
dynamic control over the subdivision of a transmission channel of 64 to 1920
kbit/s into smaller subchannels suitable for voice, video, data and control
signals. It is mainly designed for use within synchronized multiway multimedia
connections, such as video conferencing.

H221 was designed specifically for usage over ISDN. A lot of problems arise
when trying to transmit H221 frames over PSDN.

H.221

-------------------------------------------------------------------------------

   * Used for framing H.261 video & audio
   * Targeted at low delay (ISDN) scenarios
   * Very cramped encoding
   * Bad for software and packet switched nets

H242

-------------------------------------------------------------------------------

Due to the increasing number of applications utilizing narrow (3KHz) and
wideband (7KHz) speech together with video and data at different rates, a
scheme is recommended by this standard to allow a channel accommodates speech
and optionally video and/or data at several rates and in a number of different
modes. Signaling procedures for establishing a compatible mode upon call
set-up, to switch between modes during a call and to allow for a call transfer,
is explained in this standard.

Each terminal would transfer its capabilities to the other remote terminal(s)
at call set-up. The terminals will then proceed to establish a common mode of
operation. A terminal capabilities consist of : Audio capabilities, Video

capabilities, Transfer rate capabilities , data capabilities, terminals on
restricted networks capabilities and encryption and extension-BAS capabilities.
-------------------------------------------------------------------------------

H.242

-------------------------------------------------------------------------------

   * A multiplexing protocol for carrying several lots of narrow band speech
     and video
   * Has a protocol for negotiation of capabilities between "terminals"

H230

-------------------------------------------------------------------------------

This standard is mainly concerned with the control and indication signals
needed for the transmission of frame-synchronous or requiring rapid response.
Four categories of control and indication signals have been defined, first one
related to video, second one related to audio, third one related to maintenance
purposes and the last one is related to simple multipoint conferences control
(signals transmitted between terminals and MCU's).

H.230

-------------------------------------------------------------------------------

   * Protocol for controlling and mixing and muxing video and audio
   * Aimed at simple multipoint extensions for point to point and ISDN
     videotelephony
   * Used by Multi-point Control Units with H.261 for n-way conferencing
   * More later...

H320

-------------------------------------------------------------------------------

H.320 covers the technical requirements for narrow-band telephone services
defined in H.200/AV.120-Series recommendations, where channel rates do not
exceed 1920 kbit/s.

                                               Communication
    modes      of      visual      telephone

                                       Channels             ISDN

                    Visual
                                                     Coding

              telephone mode              rate          channel
           ISDN interface

                                      (kbit/s)       (Note   2)

Primary

Basic                            Audio                Video

rate

                a         ao           6     4            B
                                         Rec.    G.711
   Not


                                                             applicable

                          al
                                         Rec.   H.200/


                                           AV.254

                 b        b,           128                2B
                                         Rec.    G.711

                          b2
                                         Rec.    G.722

                          b3
                                         Rec.H.200/


                                           AV.254,


                                           AV.253


                                           (Note 1)

                         c             198                3B

                         d             256                4B

                         c             320                5B

Rec. H.261

                         f             384                6B

                         9             384                ik
                        Applicable         Rec. G.722

                         h             768                2HO

                         i             1152               3HD
             Not applicable

1536               4HO

                         k             1536               HI,

1920
m             1920               H12
    Normative references

The following CCITT Recommendations and  International  Standards
 contain  provisions  which,  through  reference  in  this

text, constitute provisions of this Recommendation. At the time
of publication, the editions  indicated  were  valid.  All

Recommendations and Standards are subject to revision,  and  parties
 to  agreements  based  on  this  Recommendation  are

encouraged to investigate the possibility of applying the most
recent edition of the Recommendations and  Standards  listed

below. Members of IEC and ISO maintain  registers  of  currently
 valid  International  Standards.  The  CCITT  Secretariat

maintains a list of the currently valid CCITT-F Recommendations.

   * CCITT Recommendation F.710 (19??), General Principles for
     Audiogiaphic Conference Services.
   * CCITT Recommendation T.35 (1988), Procedure for the Allocation
     of CCITT Member Codes
   * CCITT Recommendation T.50 (1988), International Alphabet
     No. 5
   * ITU-T Recommendation T.1 20 (199x), Introduced to Audiographics
     and Audiovisual Conferencing
   * ITU-T Recommendation T.121 (199x), Audiographic Conferencing
     - in development
   * ITU-T Recommendation T.1 22 (1993), Multipoint Communications
     Service Audiographic

   * ITU-T Recommendation T. 1 23 (1993), Protocol Stack- Audiographics
     and Audiovisual Teleconferencing Applications

   * ITU-T Recommendation T. 1 25 (1994), Multipoint Comunications
     Service Protocol Specification
   * CCITT Recommendation H.22 1, Frame Structure for- a 64 to
     1920 Kbps Channel ill AudioVisual Teleservices

   * CCITT Recommendation X.208 (1988), Specification of abstract
     Syntax: Notation One (ASN.1)
   * CCITT Recommendation X.209 (1988), Specification of Basic
     Encoding Rules for Abstract Syntax  Notation One (ASN.1)

T.GCC

         Within  the  context  of  the  CCITT-T  Audio-visual
 Conferencing  Service  (AVCS),  a  conference  refers  to  a
 group   of
geographically dispersed nodes that are joined together and that
are capable of  exchanging  audiographic  and  audiovisual
information across various communication networks. Participants
taking part in a conference may have  access  to  various
types of media handling capabilities such as audio only (telephony),
 audio  and  data  (audiographics),  audio  and  video (audiovisual),
and audio, video, and data (multimedia).

The F, H, and T Series Recommendations provide a framework for
the interworking of audio,  video,  and  graphics terminals
on a point-to-point basis through existing, telecommunication
networks. They also provide the capability for three or  more
terminals in the same conference to be interconnected by means
of an MCU.

This Recommendation  provides  a  high-level  framework  for
conference  management  and  control  of  audiographics  and
audiovisual  terminals,  and  MCUS.  It  coexist-;  with  companion
 Recommendations  T.122  and  T.125  (MCS)  and   T.123
(AVPS) to provide a mechanism for conference  establishment  and
 control.  T.GCC  also  provides  access  to  certain  MCS
functions and primitives, including tokens for  conference  conductorship.
 T.GCC,  T.122,  T.123,  and  T.125  form  the
minimum set of Recommendations to develop a fully functional terminal
or MCU.

This  Recommendation  includes  the  followin@-  generic  conference
 control  (GCC)  functional components:   conference
establishment and termination. maintenance,  the  conference
roster,  managing  the  application  roster,  remote  actuation,
conference conductorship, bandwidth control, and application registry
services. The service definitions for the  primitives
associated with these functional components are  contained  later,
as are the  corresponding  protocol  definitions  are

The figure below shows an example  of  how  GCC  components  are  distributed
 throughout  an  MCS  domain.  The  GCC  components are shown
 in  white.  Each  terminal  or  MCU  contains  a  GCC  Agent
 which  provides  GCC  services  to  local  Client
Applications.

[Image]

The Top GCC Server conlains Application Registry
information for the conference
Example of GCC components distributed throughout an MCS Domain

Each Node participating in a GCC conference  consists  of  an
 MCS  layer,  a  GCC  layer,  a  Node  Controller  and  may  also
include one or more Client Applications. The relationship between
these components within  a  single  node  is  illustrated  in
the figure below.

[Image]

The Node Controller is the controlling entity at a
node, dealing with the aspects of a conference  which  apply
to
the entire node. The Node Controller interacts with GCC, but may
not interact  directly  with  MCS.  Client  Applications  also
interact with GCC, and may or may not interact with  MCS  directly.
 The  services  provided  by  GCC  to  Client  Applications
are  primarily  to  enable  peer  Client  Applications  to  communicate
 directly,  via  MCS.  Communication   between   Client

Applications or between Client Applications and the Node Controller
may take  place,  but  is  a  local  implementation  matter
not covered by this Recommendation. The practical distinction
 between  these  Node  Controller  and  the  Client  Applications
is also a local matter not covered by this Recommendation.

The service primitives as described in Recommendation
apply to the GCC  Service  Interface  as  indicated  in
Node User Interface.

An example is illustrated below:

[Image]

GCC Service

[Image]
System model showing GCC Service Interface and relationship with MCS

Generic Conference Control Service

  1. GCC abstract services Conference Establishment and Termination

        o GCC-Coriferencc-Join
        o GCC-Conference-Query
        o GCC-Conference-Create
        o GCC-Conference-Add
        o GCC-Conference-invite
        o GCC-Conference-Lock
        o GCC-Conference-Unlock
        o GCC-Conference-Disconnect
        o GCC-Conference-Terminate
        o GCC-Conference-Eject-User
        o GCC-Conference-Transfer
        o GCC-Conference-Time-Remaining
        o GCC-Conference-Time-Inquire
        o GCC-Conference-Extend
        o GCC-Conference-Ping

  2. The Conference Roster:

        o GCC-conference-Announce-Presence
        o GCC-Conference-Roster-Inquire

  3. The Application Roster:

        o GCC-Application-Enrol
        o GCC-Application-Attach
        o GCC-Application-User-ID
        o GCC-Application-Roster-Report
        o GCC-Application-Roster-Inquire

  4. Remote Actuation

        o GCC-Action-List-Announce
        o GCC-Action-List-Inquire
        o GCC-Action-Actuate

  5. Conference Conductorship

        o GCC-Conductor-Assign-@
        o GCC-Conductor-Release
        o GCC-Conductor-Please
        o GCC-Conductor-Give
        o GCC-Conductor-Inquire

Table GCC-Conference-Query - Types of primitives and their
parameters

Primitive/                Request       Indication      Response
       confirm
Figure -Model of the MCS layer

Services provided by the MCS layer

The MCS protocol supports the services defined in ITU-T Rec. T.122.
Information is transferred to and from the MCS.
Table 5 - MCS primitives

                        Functional Unit                   Primitives
                Associated   MCSPDUs

                     Domain Management     MCS-CONNECT-F-PROVIDER
request              Connect-Initial

                                           MCS-CONNECT'-PROVII)ER
indication          Connect-Initial

                                           MCS-CONNECT-F-PROVIDER
response             Connect-Response

                                           MCS-CONNECT-PROVIDER
confirm              Connect-Response

                                             (side effects)
                      Connect-Additional


                              Connect-Result


                            PDin    EDrq


                         MCrq   MCcf    PCin


                         MTrq   MTcf    PTin

                                           MCS-DISCONNECT-PROVIDER
request                  DPum

                                           MCS-DISCONNECF-PROVIDER
iiidication            DPum    RJum

                                           MCS-Attach-USER request
                          AUrq

                                           MCS-Attach-USER confirm
                          AUcf

                                           MCS-DETACH-USER request
                          DUrq

                                           MCS-DETACH-USER indication
                       DUin


                            MCcf    PCin


                            MTcf    PTin

                     Channel Management    MCS-CHANNEL-JOIN request
                         CJrq

                                           MCS-CHANNEL-JOIN conf-=
                          Cjcf

                                           MCS-CHANNEL-LEAVE request
                        CLrq

                                           MCS-CHANNEL-LEAVE
   indication               MCcf    PCin

                                           MCS-CHANNEL-CONVENE
request                       CCrq

                                           MCS-CHANNEL-CONVENE
conf=                         cccf

                                           MCS-CHANNEL-DISBAND
     request                  CDrq

                                           MCS-CHANNEL-DISBAND
indication                 MCcf    PCin

                                           MCS-CHANNEL,ADNUT request
                        CArq

                                           MCS-CHANNEL-ADMTr
   indication                  CAin

                                           MCS-CHANNEL-EXPEL request
                        CErq

                                           MCS-CHANNEL-EXPEL
   indication               CEin    CDin


                            MCcf    PCin

Functional Unit                  Primitives                 Associated
MCSPDUS

                     Data Transfer          MCS-SEND-DATA request
                           SDrq

SD3'-n

MCS-SEND-DATA indication

                                            MCS-UNEFORM-SEND-DATA
request                    USrq

                                            MCS-UNIFORM-SEND-DATA
indication                 USin

                      Token Management      MCS-TOKEN-GRAB request
                          TGrq

                                            MCS-TOKEN-GRAB confimi
                          TGcf

                                            MCS-TOKEN-INHIBFF
request                        TIrq

                                            mcs-TOKEN-INHIBr-F
confirm                       TIcf

                                            MCS-TOKEN-GIVE request
                          TVrq

                                            MCS-TOKEN-GIVE indication
                       TVin

                                            MCS-TOKEN-GIVE response
                         TVrs

                                            MCS-TOKEN-GIVE confirni
                         TVcf

                                            MCS-TOKEN-PLEASE request
                        TPrq

TPin

MCS-TOKEN-PLEASE indication

                                            MCS-TOKEN-RELEASE
request                        TRrq

                                            MCS-TOKEN-RELEASE
conf=                          TRcf

                                            MCS-TOKEN-TEST request
                          TTrq

                                            MCS-TOKEN-TEST confirm
                         TTcf

Services assumed from the transport layer

The MCS protocol assumes the use of a subset of the connection-oriented
transport service defined in CCITT Rec.X.214, information is transferred to and
from a TS provider as in the table above.

MPEG SYSTEMS

-------------------------------------------------------------------------------

MPEG systems part is the control part of the MPEG standard. It addresses the
combining of one or more streams of video and audio as well as other data, into
a single or multiple streams which are suitable for storage or transmission.
The figure below shows a simplified view of the MPEG control system.

Packetised Elementary Stream (PES)

PES stream consists of a continuous sequence of PES packets of one elementary
stream. The PES packets would include information regarding the Elementary
clock reference and the Elementary stream rate. The PES stream is not defined
for interchange and interoperability though. Both fixed length and variable
length PES packets are allowed.

MPEG SYSTEMS

-------------------------------------------------------------------------------

The diagram illustrates the components of the MPEG Systems module:

[Image] MPEG Sys

TRANSPORT AND Program Streams

There are two data stream formats defined: the Transport Stream, which can
carry multiple programs simultaneously, and which is optimized for use in
applications where data loss may be likely (e.g. transmission on a lossy
network), and the Program stream, which is optimized for multimedia
applications, for performing systems processing in software, and for MPEG-1
compatibility.

Synchronization

The basic principle of MPEG System coding is the use of time stamps which
specify the decoding and display time of audio and video and the time of
reception of the multiplexed coded data at the decoder, all in terms of a
single 90kHz system clock. This method allows a great deal of flexibility in
such areas as decoder design, the number of streams, multiplex packet lengths,
video picture rates, audio sample rates, coded data rates, digital storage
medium or network performance. It also provides flexibility in selecting which
entity is the master time base, while guaranteeing that synchronization and
buffer management are maintained. Variable data rate operation is supported. A
reference model of a decoder system is specified which provides limits for the
ranges of parameters available to encoders and provides requirements for
decoders.

Putting this on the Desktop, on the Internet

DESKTOP SYSTEMS MODEL

-------------------------------------------------------------------------------

So what happens when we want to put all of this onto our desktop system? There
are impacts on the whole architecture, for processor, bus, I/O, storage devices
and so on. At the time of writing this course, even with the massive advances
in processor and bus speed (e.g. Pentium and PCI), we are still right on the
limits of what can be handled for video.

Desktop Systems Model - ISDN style

[Image] Conferencing

Desktop Performance Regime

However, with judicious optimisation of the implementation of some of the
compression schemes described above, it is now possible to encode, compress and
transmit a single CIF video stream at 25 frames per second on a workstation
with about 50MIPS processing power. The key is to look at the DCT transforms
and realize that large chunks of them can be done in table lookup form, at the
expense of memory utilization (but then if you are using a lot of memory for
video anyhow, this is not that significant).

Desktop Performance Regime

   * H.261 or MPEG in software take a lot of CPU
   * 50MIPS (fast 486) can code H.261 stream
   * 100MIPS (fast Pentium) can code a QCIF MPEG stream
   * But (big but) 1/10th of this to decode/display

Encoding/Compression versus Decoding/Decompression

The most expensive part of the transform in the encoder/transmitter side is the
frame differencing (differencing of the DCT coded blocks), since this involves
a complete pass over the data (frame) every frame time (say 25 times per second
over nearly a Megabyte). It turns out that this, and motion prediction if
employed, are really I/O intensive rather than strictly being CPU/Instruction
intensive and are currently the main bottleneck.

In the meantime, the receiver/decoder/decompression task is a lot easier,
possibly as much as 10-25 times less work. This is simply because if there is
no change in the video image, no data arrives, and if there is a change, data
arrives, so the only work is in the inverse DCT (or other transform) plus
copying the data fro the network to the framebuffer. Basically, a modest PC can
sustain this task for several video streams simultaneously.

Encoding/Compression versus Decoding/Decompression

   * Expensive part of compression is frame/block differencing
   * DCT can (both forward and back) be done largely by table lookup
   * Costs in memory
   * Decoder has no frame differencing to do
   * Irony - more the scene changes, less decoder has to do

NETWORKED SYSTEMS MODELS

-------------------------------------------------------------------------------

When we want to network our audio and video, again we are up against the limits
of what can be done under software control now. There are implications for
source, link, switch and sink processing, in terms of throughput, although for
compressed video, most modest machines are now pretty capable of what's
required. But in terms of reconstructing the timing of a multimedia stream,
there are a few tricky problems. These can be solved as we'll see later, but
there are basically two approaches:

  1. Use a synchronous circuit switched network (e.g. ISDN or a leased line).
  2. Use a packet network, but put in adaption to delay and loss (perhaps
     through redundancy in the encoding, or retransmission, or interpolation or
     extrapolation of a signal at the receiver).

We will compare these approaches below.

NETWORK SYSTEMS MODELS

-------------------------------------------------------------------------------

Here we illustrate the two basic approaches - use of a constant bit rate CODEC
and circuit based network:

[Image] CODEC Usage

And use of software and packetizers and a Packet Switched network:

[Image] Packet Switch Conferencing

HARDWARE

-------------------------------------------------------------------------------

There is no doubt that special purpose hardware is needed for some multimedia
tasks. The shear volume of data that must be dealt with, and the CPU intensive
nature of much audio and video processing means that some special purpose
devices are needed. Some of these are purely in the digital domain, some sit
between the analog and the digital, and others are most cost effective in the
analog realm.

Digital Signal Processors and Graphics Co-processors

DSPs are specially designed chips that are basically miniature vector
processors good at the set of tasks that audio, and video, signal processing
involve - typically, these involve a repetitive sequence of instructions
carried out over an array of data - e.g. fast Fourier or other transform, or
matrix multiplication (to rotate or carry out other POV transforms), or even to
render a scene with a given light source.

DSPs and other Co-processors

   * The serious graphics house will have these anyhow
   * Can help a lot with basics of video
   * Worth noting that a lot of video processing is similar to the compression
     task
   * Audio is less worth concerning oneself with special hardware
   * except if very heavy compression required

Video/audio CODEC operation

Coder/Decoder cards in workstations vary enormously in their interface to the
a/v world, as well as their interface to the computer.

Some CODECs do on card compression, some don't. Some replace a framebuffer,
while others expect the CPU to copy video data to the framebuffer (or network).

Some include a network interface (e.g. ISDN card in PC video cards). Some
include audio with the ISDN network interface (the chipsets are often related
or the same).

Most that carry out some extra function like this are good for their alloted
task, but poor as general purpose video or audio i/o devices.

Nowadays, most UNIX and Apple workstations have good audio i/o, at least at
64kbps PCM, and sometimes even at 1.4 Mbps CD quality. Most PC cards are still
poor (e.g. the soundblaster card is half duplex - not much use for interactive
PC based network telephoning).

CODEC Operation

   * Video and audio devices vary a lot
   * Some have onboard compression
   * Some even have on board ISDN
   * The more on the card, the less flexibility
   * The more on the card, the less CPU burden

Frame Grabbers

There are low price framegrabbers available, that often operate as low frame
rate video cards.

Mixers, Multiplexors

It is often useful to be able to choose or mix audio (or video) input to a
framegrabber or CODEC. However, by far the cheapest and most effective way to
do this is by getting an analogue mixer. To mix n digital streams requires n
codecs. Sometimes, within a building, one wishes to carry multiple streams
(even of analog and digital) between different points. Again, appropriate
broadband multiplexors may be cheaper than going to the digital domain and
using general purpose networking - the current cost of the bandwidth you need
is still quite high. If you want 4 pictures on a screen, an analog video
multiplexor is an inexpensive way of achieving this, although this is the sort
of transformation that might be feasible digitally very soon for reasonable
cost.

Mixers, Multiplexors

   * Software mixing of video is a way off yet at any reasonable price
   * Even captioning video in s/w is tricky
   * Use analog devices for this - they are cheap and effective
   * and available
   * Future work will result eventually in good transform domain video
     processing

Mikes, Cameras

Currently, most mikes and cameras are pure analog. Mikes are inexpensive and
audio codecs becoming commonplace in any case. But cameras could easily be
constructed that are pure digital, by simply extracting the signal from the
scan across the CCD area in a video camera. There are a couple of such devices
coming on to the market this year.

Digital Mikes & Cameras

   * Are starting to appear
   * Still Cameras already around
   * Digital Video camera should be cheaper!
   * Mikes will cost more though
   * May make automatic calibration a lot easier

Echo Cancellation

Interactive audio is nigh on impossible if a user can hear their own voice more
than a few 10s of milliseconds after they speak. Thus if you are speaking to
someone over a long haul net, and your voice traverses it, turns around at the
far end, and comes back, then you may have this problem.

In fact, echo cancellors can be got which go between the Audio out and in, and
sense the delay in the room between the output signal on speakers and the input
on a mike. If they then simply introduce the same signal but with its phase
reversed, with that delay, to the input, then the echo is (largely ) canceled.

Unfortunately, it isn't quite that simple!! The signal arriving at the speaker
is transformed by the room, and may not be easily recognized as the same as
that picked up by the mike. However, this might not matter if a calibration
signal can be used to set up the delay line.

Failing this, many systems fall back on a conference control technology, using
either a master floor control person who determines who may speak when (see
below) or a simple manual "click to talk" interface which disables speakers in
the users room.

Echo Cancellors

   * A requirement if you want to avoid using headsets or click-to-talk
   * Analog devices limited in range (echo delay)
   * digital echo cancellors not widely integrated into voice capture systems
     yet
   * Generally a painful area!

Multimedia conferencing

CONFERENCING MODELS-CENTRALISED, DISTRIBUTED, ETC.

-------------------------------------------------------------------------------

There are two fundamentally different approaches to video teleconferencing and
multimedia conferencing that spring from two fundamentally different
philosophies:

  1. The Public Network Operators and ITU model of circuit based, resource
     reservation videoconferencing, with its incumbent complexity for multisite
     operations.
  2. The Internet and Packet Switched adaptive approach using multicast
     (many-to-many packet distribution) facilities to achieve multisite
     operations.

An overview of Internet Based Approach

Conferencing models - centralised and distributed

   * PNO/ITU approach is circuit based
   * Resource reservation, and expensive
   * Internet approach is packet based
   * Unreliable, but cheap
   * There are emerging middle ways...

ITU Model H.320/T.gcc

This is based around the starting point of person to person video telephony,
across the POTS (Plain Old Telephone System) or its digital successor, ISDN
(Integrated Services Digital Networking). The Public Network Operators (PNOs,
or telcos or PTTs), have a network already, and its based on a circuit model -
you place a call using a signaling protocol with several stages - call request,
call indication, call proceeding, call complete and so on. Once the call has
been made, the resources are in place for the duration of the call. You are
guaranteed (through expensive engineering, and you pay!), that your bits will
get to the destination with:

  1. Constant Rate
  2. Constant Delay (plus or minus a few bit times in a million)

  1. To achieve this, the talc has a complex arrangement of global clocks and
     an over resourced backbone network.
  2. To match video traffic to such a service, the output from a video
     compression algorithm has to be padded out to a constant bit rate (i.e.
     its constant rate, not constant quality). The assumption is that you have
     a special purpose box that you plug cameras and mikes into, (a CODEC) and
     it plugs into the phone or ISDN line or leased line, and you conference
     with your equivalent at the far end of the call.
       1. How is multisite conferencing achieved?

ITU Model

   * Based on ISDN or leased lines
   * Constant Rate
   * Video padded out to fit
   * Access for Video "terminals"
   * Access from computers inconvenient

Multisite Circuit Based Conferencing - MCUs

There are two ways you could set up a multisite conference:

  1. Have multiple CODECs at each site, and multiple circuits, one from each
     site to all the others. This would involve n*(n-1) circuits in all, and n
     CODECs at each site to decode the incoming video and audio.
  2. Use a special purpose Multi-point Control Unit, which mixes audio signals,
     and chooses which video signal fro which site is propagated to all the
     others.

With this latter approach, each site has a single CODEC, and makes a call to
the MCU site. The MCU has a limit on the number of inbound calls that it can
take, and in any case, needs at least n circuits, one per site. Typically, MCUs
operate 4-6 CODECs/calls. To build a conference with more than this many sites,
you have multiple MCUs, and there is a protocol between the MCUs, so that one
build a hierarchy of them (a tree).

Which site's video is seen at all the others (remember it can be only one, as
CODECs for circuit based video can only decode one signal), is chosen through
floor control, which may be based on who is speaking or on a chairman approach
(human intervention).

Multisite Circuit Based Conferencing

The diagram illustrates the use of an MCU to link up 3 sites for a circuit
based conference:

[Image] H/W

Should greater than basic rate ISDN be needed, it can be combined via a BONDing
box as shown:

[Image] Bonding

Multicast Packet Based Multisite Conferencing

In a packet switched network, all is very different from the ITU model.

Firstly, on a Local Area Network (LAN), a packet sent can be received by
multiple machines (multicast) at no additional cost. Secondly, as we pointed
out earlier when looking at the performance of compression algorithms, it is
possible for the same power machine to decode many more streams than it
encodes. Hence we can send video and audio from each site to all the others
whenever we like. A receiver can select which (possibly several or all) of the
senders to view.

Thirdly, an audio compression algorithm may well use silence detection and
suppression. This can be used for rate adaption (as we will see later), but
primarily, it means that, so long as only one person is speaking at any one
time usually, the cost in terms of network utilization for audio at least is
hardly any more if we send everything all the time. In many uses of such
systems where there are a lot of participants it is common that a lot of them
are audio senders only (e.g. a class, a seminar) so this can work very well.

Multicast Packet based multisite conferencing

   * A packet net might support multicast - we look at this later
   * Then a more general interconnection strategy can be used
   * The architecture might look a bit like that in the figure:

[Image] Protocols and so on

Internet Based Multimedia Conferencing

There is one remaining non-trivial difference between circuit based networks
and packet based networks, currently, and that concerns resource reservation:

  1. Most packet switched networks have no guarantees of throughput or delay.
  2. Many packet switched networks (notably the Internet) have relatively high
     losses when they are busy.

Provided that a network is not actually overcommitted, this is not necessarily
a problem. We can still run packet based video and audio over the Internet
quite easily.

The key observations are:

  1. Compressed audio and video are not fixed rate naturally.
  2. Users may have a minimum acceptable quality (which may be very low) and
     above that may be happy to have free extra quality when available.
     Adapting compression schemes to available bandwidth is close to trivial.
  3. Adapting to delay and loss with a compressed image or sound is not very
     compute intensive.

Internet Based Multimedia Conferencing

   * Jitter will need dealing with - the figure illustrates this:

[Image] Jitter

If the overall use at minimum quality exceeds the capacity of the network, then
this 'best effort' approach will not work. But within this constraint it works
just fine. Even as the delay goes up, the sources and sinks adapt (as we'll see
later) and the system proceeds correctly. At a certain point, either the
throughput will fall below that which can sustain a tolerable quality audio
and/or video, or else the delay will become too high for interactive
applications (or both!). At this stage, we would need some scheme for
establishing who has priority to use the network, and this would then be based
on resource reservation, and potentially, on charging.

FLOOR CONTROL

-------------------------------------------------------------------------------

Floor control is the business of deciding who is allowed to talk when. We are
all familiar with this in the context of meetings or natural face-to-face
scenarios. People use all kinds of subtle clues, some less subtle, to decide
when they or someone else can talk.

In a video conference, the view of the other participants is often limited (or
non-existent) so computer support for helping with floor control is necessary
(just think of talking to someone who you don't know, maybe over a poor
satellite phone call with a � second delay, then you get the idea, then add 5
other people on the same line!).

Floor control systems can be nearly automatic, triggered simply by who speaks,
or they can use the fact that the participants are in front of computers, and
have a user interface to a distributed program (either packet based or MCU
based) to request and grant the floor.

FLOOR CONTROL

-------------------------------------------------------------------------------

   * The picture illustrates a possible protocol for floor control

[Image] Floor Control

ACCESS CONTROL AND PRIVACY

-------------------------------------------------------------------------------

Access control in conferencing and multimedia in general is complex. In a
circuit based system, it can just rely on trust with the phone company, and
perhaps the addition of closed user groups, lists of numbers that are allowed
to call in or out of the conferencing group.

In a packet network , there are a number of other questions:

  1. How do we determine who is in and who is allowed to be in a conference?
  2. How do we stop people simply listening in?
  3. How do we know someone is who they say they are (assuming we don't know
     them personally)?

These are all dealt with by applying the principle of end-to-end security.
Basically, if we encrypt the audio or video, perhaps signing it with some magic
value before encrypting it with keys known only to the sender or receiver (or
else using a public key crypto system more suitable to multipoint
communication), then we can be assured that our communication I private.

It turns out that encrypting compressed video and audio is really very simple
for many compression schemes - in the case of H.261 for example, simply
scrambling the Huffman codes used for carrying around the DCT coefficients
might do!

Public key cryptography is preferred over private key since one has a n easier
key distribution problem.

ACCESS CONTROL AND PRIVACY

-------------------------------------------------------------------------------

   * Who is in or out of a conference?
   * How do we stop eavesdroppers?
   * The basic security techniques apply:
   * End to end encryption (Public Key Cryptography best for n-way)
   * Authentication through passwords or Digital Signatures
   * PGP or RSA both viable

PLAYOUT BUFFER ADAPTION FOR PACKET NETS

-------------------------------------------------------------------------------

It has been asserted that you cannot run audio (or video) over the Internet due
to

   * Delay variation due to other traffic through routers
   * Loss due to congestion

In fact, both are tolerable up to a point. The delay budget for bearable
interaction is often cited as around 200ms. However, for a lecture or broadcast
of a seminar, any amount of delay might not matter. The key requirement is to
adapt to delay variation, rather than the transit delay.

Given that a sender and receiver are matched at the audio i/o rates, or even if
they are slightly askew, a combination of an adaption buffer and silence
suppression at the send side can accommodate this.

The receiver estimates the interpacket arrival time variance, using exactly the
same technique as TCP uses to estimate the RTT, an exponential weighted moving
average calculated from:

  1. The current packet arrival time and media sample timestamp
  2. The previous packet arrival time and media timestamp.
       1. This is rolled into a running mean variance:

mi = mi-1 + g(vi - mi-1)

Then depending whether interaction or a lecture mode are in use, the receiver
buffers sound before playing out for 1 or more of these variances. When needing
to adapt, silence is added or deleted (rather than actual sound) at the
beginning of a talkspurt.

A similar inter-arrival pattern can be used by a video receiver to adapt to a
sender that is too fast, or by a decoder of compressed audio or video where the
CPU times vary depending on the audio contents!

Playout Buffer Adaption for Packet Nets

   * Delay and loss mean that some form of adaption must run at the receiver.
     The diagram shows this:

[Image] Txmit

MMCC - THE CENTRAL INTERNET MODEL

-------------------------------------------------------------------------------

It has been argued that the problem with the Internet model of multimedia
conferencing is that it doesn't support simple phone calls, or secure closed
("tightly managed") conferences.

However, it is easy to add this functionality after one has built a scalable
system such as the Mbone provides, rather than limiting the system in the first
place. For example, the management of keys can provide a closed group very
simply. If one is concerned about traffic analysis, then the use of secure
management of IP group address usage would achieve the effect of limiting where
multicast traffic propagated. Finally, a telephone style signaling protocol can
be provided easily to "launch" the applications using the appropriate keys and
addresses, simply by giving the users a nice GUI to a distributed calling
protocol system.

MMCC AND CMMC - CENTRAL INTERNET MODEL

-------------------------------------------------------------------------------

   * Can have it both ways - can interwork and have tightly coupled/controlled
     conferences
   * The diagram illustrates a Conference Management and multiplexing center
   * This would provide interworking of data (video and audio) as well as
     control (e.g. H.230 to MMCC)
   * It could also provide software mixing so that sites with only one decoder
     could still see multiple senders

[Image] CMMC

Picture: cmmc_sw_mix.ps

CCCP - THE DISTRIBUTED INTERNET MODEL

-------------------------------------------------------------------------------

   * 1. The conference architecture should be flexible enough so that any mode
     of operation of the conference can be used and any application can be
     brought into use. The architecture should impose the minimum constraints
     on how an application is designed and implemented.
   * 2. The architecture should be scaleable, so that ``reasonable''
     performance is achieved across conferences involving people in the same
     room, through to conferences spanning continents with different degrees of
     connectivity, and large numbers of participants. To support this aim, it
     is necessary explicitly to recognize the failure modes that can occur, and
     examine how they will affect the conference, and to design the
     architecture to minimise their impact.

Currently, the IETF working group on Conference Control is liasing with the
T.120 standards work in the ITU and have made some statements about partial
progress.

CCCP - DISTRIBUTED INTERNET MODEL

-------------------------------------------------------------------------------

   * Based on multicast
   * Based on packets
   * scalable
   * not yet standard, but basic idea the way forward

CCCP Model

We model a conference as composed of an unknown number of people at
geographically separated sites, using a variety of applications. These
applications can be at a single site, and have no communication with other
applications or instantiations of the same application across multiple sites.
If an application shares information across remote sites, we distinguish
between the cases when the participating processes are tightly coupled- the
application cannot run unless all processes are available and connectable - and
when the participating processes are loosely coupled, in that the processes can
run when some of the sites become unavailable. A tightly coupled application is
considered to be a single instantiation spread over a number of sites, whilst
loosely coupled and independent applications have a number of unique
instantiations, although possibly using the same application specific
information (such as which multicast address to use...).

The tasks of conference control break down in the following way:

   * Application control - Applications as defined above need to be started
     with the correct initial state, and the knowledge of their existence must
     be propagated across all participating sites. Control over the starting
     and stopping can either be local or remote.
   * Membership control - Who is currently in the conference and has access to
     what applications.
   * Floor management - Who or what has control over the input to particular
     applications.
   * Network management - Requests to set up and tear down media connections
     between end-points (no matter whether they be analogue through a video
     switch, a request to set up an ATM virtual circuit, or using RSVP over the
     Internet), and requets from the network to change bandwidth usage because
     of congestion.
   * Meta-conference management - How to initiate and finish conferences, how
     to advertise their availability, and how to invite people to join.

CCCP Model

The diagram illustrates the CCCP Model

[Image] CCCP

CCCP Class Hierarchy

We then take these tasks as the basis for defining a set of simple protocols
that work over a communication channel. We define a simple class hierarchy,
with an application type as the parent class and subclasses of network manager,
member and floor manager, and define generic protocols that are used to talk
between these classes and the application class, and an inter-application
announcement protocol. We derive the necessary characteristics of the protocol
messages as reliable/unreliable and confirmed/unconfirmed (where `unconfirmed'
indicates whether responses saying ``I heard you'' come back, rather than
indications of reliability).

Its easily seen that both closed and open models of conferencing can be
encompassed, if the communication channel is secure.

To implement the above, we have abstracted a messaging channel, using a
distributed inter-process communication system, providing confirmed/unconfirmed
and reliable/unreliable semantics. The naming of sources and destinations is
based upon application level naming, allowing wildcarding of fields such as
instantiations (thus allowing messages to be sent to all instantiations of a
particular type of application). The final section of paper briefly describes
the design of the high level components of the messaging channel (named
variously the CCC or the triple-C). Mapping of the application level names to
network level entities is performed using a distributed naming service, based
upon multicast once again, and drawing upon the extensive experience already
gained in the distributed operating systems field in designing highly available
name services.

REQUIREMENTS on CCCP from tools

Multimedia Integrated Conferencing has a slightly unusual set of requirements.
For the most part we are concerned with workstation based multimedia
conferencing applications. These applications include vat (LBL's Visual Audio
Tool), IVS (INRIA Videoconferencing System), NV (Xerox's Network Video tool)
and WB (LBL's shared whiteboard) amongst others. These applications have a
number of things in common:

   * They are all based on IP Multicast.
   * They all report who is present in a conference by occasional multicasting
     of session information.
   * The different media are represented by separate applications (1)
   * There is no conference control, other than each site deciding when and at
     what rate they send.

These applications are designed so that conferencing will scale effectively to
large numbers of conferees. At the time of writing, they have been used to
provide audio, video and shared whiteboard to conference with about 500
participants. Without multicast, this is clearly not possible. It is also clear
that these applications cannot achieve complete consistency between all
participants, and so they do not attempt to do so- the conference control they
support usually consists of:

   * Periodic(unreliable) multicast reports of receivers.
   * The ability to locally mute a sender if you do not wish to hear or see
     them.

  1. However, in some cases stopping the transmission at the sender is actually
     what is required.

Requirements from tools

  1. Common multicast channel used for control messages
  2. Different media from different applications
  3. Need session participant and other information to be added
  4. Need control

Common Control for Conferencing

Thus any form of conference control that is to work with these applications
should at least provide these basic facilities, and should also have scaling
properties that are no worse that the media applications themselves.

It is also clear that the domains these applications are applied to vary
immensely. The same tools are used for small (say 20 participants), highly
interactive conferences as for large (500 participants) disseminations of
seminars, and the application developers are working towards being able to use
these applications for ``broadcasts" that scale towards millions of receivers.

It should be clear that any proposed conference control scheme should not
restrict the applicability of the applications it controls, and therefore
should not impose any single conference control policy. For example we would
like to be able to use the same audio encoding engine (such as vat),
irrespective of the size of the conference or the conference control scheme
imposed. This leads us to the conclusion that the media applications (audio,
video, whiteboard, etc.) should not provide any conference control facilities
themselves, but should provide the handles for external conference control and
whatever policy is suitable for the conference in question.

Conferencing - special needs?

We often have the slightly special needs of being able to support:

   * Multicast based applications running on workstations where possible.
   * Hardware codecs at rates up to 2Mb/sand the need to multiplex their
     output.
   * Sites connecting into conferences from ISDN.
   * Interconnecting all the above.

These requirements have dictated that we build a number of Conference
Management and Multiplexing Centres to provide the necessary format conversion
and multiplexing to interwork between the multicast workstation based domain
and unicast(whether IP or ISDN) hardware based domain.].

What we need for packet based conferencing:

   * Multicast based for scaling
   * Software codecs
   * interworking with circuit based
   * interworking with hardware codecs

WHERE CURRENT CONFERENCE CONTROL SYSTEMS FAIL

-------------------------------------------------------------------------------

The sort of conference control system we are addressing here cannot be:

   * CENTRALISED. This will not scale.
   * Fixed Policy. This would restrict the applicability. The important point
     here is that only the users can know what the appropriate policies a
     meeting may need.
   * Application Based. It is very likely that separate applications will be
     used for different media for the foreseeable future. We need to be able to
     switch media applications where appropriate. Basing the conference control
     in the applications prevents us changing policy simply for all
     applications.

So what is wrong with current videoconferencing systems?

   * Don't scale
   * Fixed policies
   * Application based

Specific requirements - Modularity

Conference Control mechanisms and Conference Control applications should be
separated. The mechanism to control applications (mute, unmute, change video
quality, start sending, stop sending, etc.) should not be tied to any one
conference control application in order to allow different conference control
policies to be chosen depending on the conference domain. This suggests that a
modular approach be taken, with for example, a specific floor control modules
being added when required (or possibly choosing a conference manager tool from
a selection of them according to the conference).

Special Requirements: A single conference ctl user interface

A general requirement of conferencing systems, at least for relatively small
conferences, is that the participants need to know who is in the conference and
who is active. Vat is a significant improvement over telephone audio
conferences, in part because participants can see who is (potentially)
listening and who is speaking. Similarly if the whiteboard program WB is being
used effectively, the participants can see who is drawing at any time from the
activity window. However, a participant in a conference using, say, vat
(audio),IVS (video) and WB (whiteboard) has three separate sets of session
information, and three places to look to see who is active.

Clearly any conference interface should provide a single set of session and
activity information. A useful features of these applications is the ability to
``mute" (or hide or whatever) the local playout of a remote participant. Again,
this should be possible from a single interface. Thus the conference control
scheme should provide local inter-application communication, allowing the
display of session information, and the selective muting of participants.

Taking this to its logical conclusion, the applications should only provide
media specific features (such as volume or brightness controls), and all the
rest of the conference control features should be provided through a conference
control application.

Special Requirements: flexible floor control policies

Conferences come in all shapes and sizes. For some, no floor control, with
everyone sending audio when they wish, and sending video continuously is fine.
For others, this is not satisfactory due to insufficient available bandwidth
for a number of other reasons. it should be possible to provide floor control
functionality, but the providers of audio, video and workspace applications
should not specify which policy is to be used. Many different floor control
policies can be envisaged. A few example scenarios are:

   * Explicit chaired conference, with a chairperson deciding when someone can
     send audio and video. Some mechanism equivalent to hand raising to request
     to speak. Granting the floor starts video transmission, and enables the
     audio device. Essentially this is a schoolroom type scenario, requiring no
     expertise from end users.
   * Audio triggered conferencing. No chairperson, no explicit floor control.
     When someone wants to speak, they do so using ``push to talk". Their video
     application automatically increases its data rate from, for example,
     10Kb/s to 256Kb/s as they start to talk. 20 seconds after they stop
     speaking it returns to 10Kb/s.
   * Audio triggered conferencing with a CMMC (3). The CMMC can mix four
     streams for decoding by participants with hardware CODECs. The four
     streams are those of the last four people to speak, with only the current
     speaker transmitting at a high data rate. Everyone else stops sending
     video automatically.
   * A background Mbone engineering conference that's been idle for 3 hours.
     All the applications are ionized, as the participant is doing something
     else. Someone starts drawing on the whiteboard, and the audio application
     plays an audio icon to notify the participant.

Scaling from tightly to loosely coupled conferences

CCCP originates in part as a result of experience gained from the CAR
Multimedia Conference Control system. The CAR system was a tightly coupled
centralised system intended for use over ISDN. The functionality it provided
can be summarized up by listing its basic primitives:

   * Create conference
   * Join/Leave Conference
   * List members of conference
   * Include/exclude application in conference
   * Take floor

In addition, there were a number of asynchronous notification events:

   * Floor change

  1. Conference Management and Multiplexing Centre - essentially one or more
     points where multiple streams are multiplexed together for the benefit of
     people on unicast links, ISDN, hardware CODECs and the like
        o Participant joining/leaving
        o Application included/excluded

Packet Conferencing Requirements

   * Modularity - separation of conference control and applications
   * Single user interface (API) for conference control
   * Flexibility (e.g. for floor control)

The Conference Control Channel (CCC)

To bind the conference constituents together, a common communication channel is
required, which offers facilities and services for the applications to send to
each other. This is akin to the inter process communication facilities offered
by the operating system. The conference communication channel should offer the
necessary primitives upon which heterogeneous applications can talk to each
other.

The first cut would appear to be a messaging service, which can support
1-to-many communication, and with various levels of confirmation and
reliability. We can then build the appropriate application protocols on top of
this abstraction to allow the common functionality of conferences.

We need an abstraction to manage a loosely coupled distributed system, which
can scale to as many parties as we want. In order to scale we need the
underlying communication to use multicast. Many people have suggested that one
way of thinking about multicast is as a multifrequency radio, in which one
tunes into particular channels in which we are interested in. We take this one
step further and use it as a handle on which to hang the Inter Process
Communications model we offer to the protocols used to manage the conference.
Thus we define an application control channel.

Conference Control Channel, continued

CCCP originates in the observation that in a reliable network, conference
control would behave like an Ethernet or bus - addressed messages would be put
on the bus, and the relevant applications will receive the message, and if
necessary respond. In the Internet, this model maps directly onto IP multicast.
In fact the IP multicast groups concept is extremely close to what is required.
In CCCP, applications have a tuple as their address: (instantiation,
application type, address). We shall discuss exactly what goes into these
fields in more detail later. In actual fact, an application can have a number
of tuples as its address, depending on its multiple functions.

CCC Model

   * Network is a bus for control messages
   * Messages are directed to groups, but these are class based
   * Classes bind() to the appropriate groups to receive all the messages for
     that function

Examples of CCC use of this would be:

DESTINATION TUPLE Message

   * (1,audio, localhost) <start sending>
   * (*,activity`management, localhost) <receiving audio fromhost:> ADDRESS
   * (*,session`management, *) <I am:> NAME
   * (*,session`management, *) <I have media:>{application list}
   * (*,session`management, *) <Participant list:> {participantlist}
   * (*,floor`control, *) <REQUEST FLOOR>
   * (*,floor`control, *) <I HAVE FLOOR>

and so on. The actual messages carried depend on the application type, and thus
the protocol is easily extended by adding new application types.

Unreliability

CCCP would be of very little use if it were merely the simple protocol
described above due to the inherent unreliable nature of the Internet.
Techniques for increasing the end-to-end reliability are well known and varied,
and so will not be discussed here. However, it should be stressed that most
(but not all) of the CCCP messages will be addressed to groups. Thus a number
of enhanced reliability modes may be desired:

   * None. Send and forget. (an example is session management messages in a
     loosely coupled system)
   * At least one. (an example is a request floor message which would not be
     ACKed by anyone except the current floor folder).
   * n out of m. (an example may be joining of a semi-tightly coupled
     conference)
   * all. (an example may be ``join conference" in a very tightly coupled
     conference)

It makes little sense for applications requiring conference control to
re-implement the schemes they require. As there are a limited number of these
messages, it makes sense to implement CCCP in a library, so an application can
send a CCCP message with a requested reliability, without the application
writer having to concern themselves with how CCCP sends the message(s). The
underlying mechanism can then be optimized later for conditions that were not
initially foreseen, without requiring a re-write of the application software.

Reliable Multicast

There are a number of ``reliable" multicast schemes available. It may be
desirable to incorporate such a scheme into the CCC library, to aid support of
small tightly coupled conferences.

We believe that sending a message with reliability all to an unknown group is
undesirable. Even if CCCP can track or obtain the group membership through it's
distributed nameserver, which requires explicit application messages to the
nameserver, we believe that the application should explicitly know who it was
addressing the message to. It does not appear to be meaningful to need a
message to get to all the members of a group if we can't find out who all those
members are, as if the message fails to get to some members, the application
can't sensible cope with the failure. Thus we intend to only support the all
reliability mode to an explicit list of fully qualified (i.e. no
wildcards)destinations. Applications such as joining a secure(and therefore
externally anonymous) conference which requires voting can always send a
message to the group with "at least one" reliability, and then an existing
group member initiates a reliable vote, and returns the result to the new
member.

Ordering

Of course loss is not the only reliability issue. Messages from a single source
may be reorderedor duplicated and due to differing delays, messages from
different sources may arrive in ``incorrect'' order.

SINGLE SOURCE Reordering

Addressing reordering of messages from a single source first; there are a few
possible schemes, almost all of which require a sequence number or a timestamp.
A few examples are:

   * 1. Ignore the problem. A suitable example is for session messages
     reporting presence in a conference.
   * 2. Deal with messages immediately. Discard any packets that are older than
     the latest seen. Quite a number of applications may be able to operate
     effectively in the manner. However, some networks can cause very severe
     reordering, and it is questionableas to whether this is desirable.
   * 3. Using the timestamp in a message and the local clock, estimate the
     perceived delay from the packet being sourced that allows (say) 90% of
     packets to arrive. When a packet arrives out of order, buffer it for this
     delay minus the perceived trip time to give the missing packet(s) time to
     arrive. If a packet arrives after this timeout, discard it. A similar
     adaptive playout buffer is used in vat for removal of audio jitter. This
     is useful where ordering of requests is necessary and where packet loss
     can be tolerated, but where delay should be bounded.
   * 4. Similar to above, specify a fixed maximum delay above minimum perceived
     trip time, before deciding that a packet really has been lost. If a packet
     arrives after this time, discard it.
   * 5. A combination of both of the above. Some delay patterns may be so odd
     that they upset the running estimate in [3]. Many conference control
     functions fall into this category, i.e. time bounded, but tolerant of
     loss.
        o 6. Use a sliding window protocol with retransmissions as used in TCP.
          Only useful where loss cannot be tolerated, and where delay can be
          unbounded. Very tightly coupled conferences may fall into this
          category, but will be very intolerant to failure. Should probably
          only be used along with application level timeouts in the
          transmitting application.

  1. It should be noted that all except [1] require state to be held in a
     receiver for every source. As not every message from a particular source
     will be received at a particular receiver due to CCCP's multiple
     destination group model, receiver based mechanisms requiring knowing
     whether a packet has been lost will not work unless the source and
     receivers use a different sequence space for every (source, destination
     group) pair. If we wish to avoid this (and I think we usually do!), we
     must use mechanisms that do not require knowing whether a packet has been
     lost.
  2.

Reliability and ordering of multicast control messages

   * 1. Have CCCP ignore the problem. Let the application sort it out.
   * 2. Have CCCP deal pass messages to the application immediately. Discard
     any packets that are older than the latest seen.
   * 3. As above, estimate the perceived delay within which (say) 90% of
     packets a particular source arrive, but delay all packets from this source
     by the perceived delay minus the perceived trip time.
   * 4. As above, calculate the minimum perceived trip time. Add a fixed delay
     to this, and buffer all packets for this time minus their perceived trip
     time.
   * 5. A combination of [3] and [4], buffering all packets by the smaller of
     the two amounts.
   * 6. Explicitly ack every packet. Do not use a sliding window.

MULTIPLE SOURCE Ordering

In general we do not believe that CCCP can or should attempt to provide
ordering of messages to the application that originate at different sites. CCCP
cannot predict that a message will be sent by, and therefore arrive from, a
particular source, so it cannot know that it should delay another message that
was sent at a later time. The only full synchronization mechanism that can work
is an adaptation of [3]..[5] above, which delays all packets by a fixed amount
depending on the trip time, and discards them if they arrive after this time if
another packet has been passed to the user in the meantime. However, unlike the
single source reordering case, this requires that clocks are synchronised as
each site.

CCCP does not intend to provide clock synchronization and global ordering
facilities. If applications require this, they must do so themselves. However,
for most applications, a better bet is to design the application protocol to
tolerate temporary inconsistencies, and to ensure that these inconsistencies
are resolved in a finite number of exchanges. An example is the algorithm for
managing shared teleconferencing state proposed by Scott Shenker, Abel Weinrib
and Eve Schooler [she].

For algorithms that do require global ordering and clock synchronization, CCCP
will pass the sequence numbers and timestamps of messages through to the
application. It is then up to the application to implement the desired global
ordering algorithm and/or clock synchronization scheme using one of the
available protocols and algorithms such as NTP [lam],[fel],[bir].

CCC Addresses

As already mentioned, a CCC destination is a tuple of the following form:

(instantiation, type, address)

An application registers itself with its CCC library (and possibly with a
distributed nameserver - more of that is a later version of this paper),
specifying one or more tuples that it considers describe itself. Note that
there is no conference identifier specified - it is presumed that a control
group address or control host address or address list are specified at startup,
and that Meta-conferencing (i.e., allocation and discovery of conference
addresses) is outside the scope the CCC itself. Is this too restrictive? May
not if we let allow CCC lib to open multiple CCC's simultaneously, but this may
complicate the applications The parts of the tuple are:

   * Address
   * Type
   * Instantiation

CCCP Address

The address field will normally be registered as one of the following:

   * hostname
   * username@hostname

When other applications wish to send a message to a destination group (a single
application is a group of size 1), they can specify the address field as one of
the following:

username@hostname

   * hostname
   * username@*.domain
   * username@*

The CCC library is responsible for ensuring a suitable multicast group (or
other means) is chosen to ensure that all possible matching applications are
potentially reachable (though depending on the reliability mode, it does not
necessarily ensure the message got to them all).

It should be noted that in any tuple containing a wildcard (*) in the address,
specifying the instantiation (as described below) does not guarantee a unique
receiver, and so normally the instantiation should be wildcarded too.

CCCP type

The type field is a class hierarchy that can be literally anything. However,
some guidelines are needed to ensure that common applications can communication
with each other.

Normally an application would register itself under the name of the application
to ensure that an message specific to that application can be delivered - for
example vat would register itself under the type vat.

An application will also register itself under any types it wishes to receive
messages on. As a first pass, the following types have been suggested:

   * audio. send- the application is interested in messages about sending audio
   * audio. recv- the application is interested in messages about receiving
     audio
   * video. send- the application is interested in messages about sending video
   * video. recv- the application is interested in messages about receiving
     video
   * workspace- the application is a shared workspace application, such as a
     whiteboard
   * session. remote - the application is interested in knowing the existence
     of remote applications(exactly which ones depends on the conference, and
     the session manager)
   * session.l ocal - the application is interested in knowing of the existence
     of local applications
   * media-ctrl - the application is interested in being informed of any change
     in conference media state (such as unmuting of a microphone).
   * floor.manager- the application is a floormanager
   * floor.slave - the application is interested in being notified of any
     change in floor, but not (necessarily) in the negotiation process.

It should be noted that types can be hierarchical, so (for example) any message
addressed

to audio would address both audio.send and audio.recv applications. It should
also be

noted that an application expressing an interest in a type does not necessarily
mean that the application has to be able to respond to all the functions that
can be addresses to that type, although (if required) the CCC library will
acknowledge receipt on behalf of the application.

Examples of the types existing applications would register under are:

   * vat- vat, audio.send, audio.recv
   * IVS- IVS, video.send, video.recv
   * NV- NV, video.send, video.recv
   * WB- WB, workspace
   * a conference manager - confman, session.local, session.remote, media-ctrl,
     floor.slave
   * a floor ctrl agent - floor agent, floor.manager,floor.slave

CCCP instantiation

The instantiation field is purely to enable a message to be addressed to a
unique application. When an application registers, it does not specify the
instantiation - rather this is returned by the CCC library such that it is
unique for the specified type at the specified address. It is not guaranteed to
be globally unique - global uniqueness is only guaranteed by the triple of
(instantiation, type, address) with no wildcards in any field. When an
application sends a message, it uses one of its unique triples as the source
address. Which one it chooses should depend on to whom the message was
addressed.

A few examples

Before we describe what should comprise CCCP, we will present a few simple
examples of CCCP in action. There are a number of ways each of these could be
done- this section is not meant to imply these are the best ways of
implementing the examples over CCCP.

Unifying user interfaces - session messages in a ``small'' conference

Applications:

   * An Audio Tool (at), registers as types: at, audio.send, audio.recv
   * A Video Tool (vt), registers as types: vt, video.send, video.recv
   * A Whiteboard (wb), registers as types: wb, workspace
   * A Session Manager (sm), registers as types: sm, session.local,
     session.remote

The local hostname is x. There are a number of remote hosts, one of which is
called y.

A typical exchange of messages may be as follows:

   * The following will be sent periodically:

(1,audio.recv,x) (*,sm.local,x) KEEPALIVE

(1,video.recv,x) (*,sm.local,x) KEEPALIVE

(1,wb,x) (*,sm.local,x) KEEPALIVE

   * The following will be sent periodically with interval

(1,sm,x) (*,sm.remote,*) I_HAVE_MEDIA text_user_name audio.recv video.recv wb

   * An audio speech burst arrives at the audio application from y

(1,audio.recv,x) (*,sm.local,x) MEDIA_STARTED audio y

   * session manager highlights the name of the person who is speaking
   * speech burst finishes

(1,audio.recv,x) (*,sm.local,x) MEDIA_STOPPED audio y

   * session manager de-highlights the name of the person who was speaking
   * video starts from z

(1,video.recv,x) (*,sm.local,x) MEDIA_STARTED video z

   * periodical reports:

(1,audio.recv,x) (*,sm.local,x) KEEPALIVE

(1,video.recv,x) (*,sm.local,x) MEDIA_ACTIVE video z

(1,wb,x) (*,sm.local,x) KEEPALIVE

   * someone restarts the session manager:

(1,sm,x) (*,*,x) WHOS_THERE

(1,audio.recv,x) (*,sm.local,x) KEEPALIVE

(1,video.recv,x) (*,sm.local,x) MEDIA_ACTIVE video z

(1,wb,x) (*,sm.local,x) KEEPALIVE

   * and so on...this is illustrated in the diagram below

[Image] Unification

A voice controlled video conference

In this example, the desired behavior for participants to be able to speak when
they wish. A user's video application should start sending video when their
audio application starts sending audio. No two video applications should aim to
be sending at the same time, although some transient overlap can be tolerated.

Applications:

   * An Audio Tool (at), registers as types: at, audio.send, audio.recv
   * A VIdeo Tool (vt), registers as types: vt, video.send, video.recv
   * A Session Manager (sm), registers as types: sm, session.local,
     session.remote
   * A Floor Manager (fm), registers as types: fm, floor.master

There are hosts x and y, amongst others.

Its assumed that session control messages are being sent, as in the example
above.

   * The user at x starts speaking. Silence suppression cuts out, and the audio
     tool starts sending audio data:

(1,audio.send, x) (*,sm.local,x),(*,floor.master,x) MEDIA_STARTED audio x

   * ...this causes the sm to highlight the ``you are sending audio'' icon it
     also causes the floor manager to report to the other floor managers:

(1,floor.master,x) (*,floor.master, *) MEDIA_STARTED audio x

   * and also it requests the local video tool to send video:

(1,floor.master,x) (*,video.send, x) START_SENDING video

   * ...this causes the video tool to start sending

(1,video.send, x) (*, sm.local, x),(*.floor.master, x)MEDIA_STARTED video x

   * ...which, in turn, causes the sm to highlight the ``you are sending video
     ''icon
   * the user at x stops speaking. Silence suppression cuts in, , and the audio
     tool stops sending audio data

(1,audio.send, x) (*,sm.local,x),(*,floor.master,x) MEDIA_STOPPED audio x

   * ...this causes the sm to de-highlight the ``you are sending audio'' icon
   * ...the session manager starts a timeout procedure before it will stop
     sending video

...

   * a user at y starts sending audio and video data.
   * The local audio and video tools report this to the session manager:

(1,audio.recv,x) (*,sm.local,x) MEDIA_STARTED audio y

(1,video.recv,x) (*,sm.local,x) MEDIA_STARTED video y

   * ...as in previous example, sm highlights sender's name and the floor
     manager reports what's happening:

(1,floor.master, y) (*,floor.master,*) MEDIA_STARTED audio y

(1,floor.master, y) (*,floor.master,*) MEDIA_STARTED video y

the local floor manager tells the local video tool to stop sending

(1,floor.master,x) (*,video.send, x) STOP_SENDING video

   * ...this causes the video tool at x to stop sending

(1,audio.send, x) (*,sm.local,x),(*,floor.master,x) MEDIA_STOPPED video x

...

More complex needs

Dynamic style-group membership

Many potential applications require to be able to contact a server or a token
holder reliably without necessarily knowing the location of that server. An
example may be a request for the floor in a conference with one roaming floor
holder. The application requires that the message gets to the floor holder if
it is at all possible, which may require retransmission and will require
acknowledgement from the remote server, but the application writer should not
have to write the re-transmission code for each new application. CCCP supports
"at least one" reliabilty, but to address such a REQUEST_FLOOR message to all
floor managers is meaningless. By supporting dynamic type-groups CCCP can let
the application writer address a message to a group which is expected to have
only one(or a very small number) of members, but whose membership is changing
constantly.

In the example described, the application requiring the floor sends:

(1,floor.master, x) (*, floor.master.holder, *) REQUEST`FLOOR

with "at least one" reliability. retransmissions continue until the message is
acknowledged or a timeout occurs.

When the floor holder receives this message, it can then either send a grant
floor or a deny floor message:

(1,floor.master, y) (1, floor.master, x) GRANT`FLOOR

This message is sent reliably (i.e., retransmitted by CCCP until an ACK is
received).

On receiving the GRANT_FLOOR message, the floor manager at x expresses an
interest in the type-group floor.master.holder. On sending the GRANT_FLOOR
message, the floor manager at y also removes it's interest in the type-group
floor.master.holder to prevent spurious acking of other REQUEST_FLOOR messages.
However, if the GRANT_FLOOR message retransmissions time out, it should
re-express an interest. This is illustrated in the diagram below:

[Image] Floor Ctl Eg

Conference Membership Discovery

CCCP will support conference membership discovery by providing the necessary
functions and types. However, the choice of discovery algorithm, loose or tight
control of the conference membership and so forth, are not within the scope of
CCCP itself. Instead these algorithms should be implemented in a Session
Manager on top of the CCC.

Network support and protocols

MULTIMEDIA COMMUNICATION

-------------------------------------------------------------------------------

The Information Superhighway needs network protocols that can carry multimedia
around with the sorts of guarantees. At least that is what the communications
companies say. In fact, if a network is provisioned at the right level (its
lines are fast enough), it may not be necessary to impose any special
protocols. Even with networks running close to capacity, the requirement is
really merely for a way for routers and switches to distinguish different
traffic types, and give them the appropriate forwarding priority.

The difference between different networks comes down (as do many in computer
science) to when the binding is done between the flow of traffic and the state
instantiated (together with resources) in a router to support the traffic class
that this flow needs.

Two extreme examples of multiservice network architectures illustrate this:

Two network approaches to multiservice

   * IP + RSVP + Flows
   * ATM + Q.2931
   * Both classify packets either one at a time, or per call
   * Illutrated below

[Image] CBQ

IP, RSVP, Flow Ids

In the Internet, the RSVP protocol can be used by a recipient to request a
specified a quality of call for a flow that they require. This request is
periodically resent, so there is no binding necessarily between the call and
the route (i.e. rerouting can happen between 1 packet and the next). If no
special quality is required, or else the routers already know about this
traffic class and have capacity, then no RSVP is needed.

A source chooses a unique flow id for the traffic, which can be used by routers
as a fast lookup for route and quality requirement for the traffic. In the
absence of an entry for this flow in a router, the rest of the packet's IP
header can be consulted and the packet forwarded with some default quality
anyhow. If the required quality varies, then it can simply be latched by the
next RSVP refresh.

RSVP carries a flow specification and a filter specification. The flow
specification is a list of parameters to do with throughput, delay and errors
that will be needed to meet the flow's requirements for reasonable delivery.
The filter specification is a patter that is used to match the flow when it
arrives at a router. A filter can be turned on and off without removing the
flow specification, so that intermittent flows (e.g. video or voice in a floor
controlled video conference) can be quickly turned on and off within the net.
This is important for multicast.

RSVP interacts with the Routing protocol by possibly locking routes while any
reservation is in progress to avoid looping. Filters can be wildcard, which is
shared amongst all senders to a group, fixed or dynamic. Dynamic filters are
sets of fixed flowspecs that can be chosen between on demand.

IP and RSVP and Flows

   * The IP model is being enhanced to add soft state
   * This refers to use of RSVP to establish traffic classes
   * Can do this for 1 or a group or a type
   * IP6 carries a flow id, like a connection id for fast lookup
   * RSVP permits specification of leaky bucket parameters for this
   * For non-delay bounded traffic, just don't this!

ATM, Q.2931 and VCI/VPIs.

With an ATM network, before any packet can be sent, the call setup protocol
(recently standardised as Q.2931 by the ITU), is invoked to setup path, call
and resources needed. The binding of all these is needed up front. There isn't
yet any way of aggregating calls, as there is in RSVP through clever filters.
Long term calls might be configured through network management to use PVCs, to
deal with the intermittent bandwidth problem above (a PVC allows a receiver to
control specification of a flow, counter-intuitively).

ATM and Q.2931

   * ATM is the telco/PNOs approach
   * Packet (cell) oriented
   * Resources reserved by sender
   * Allows fine grain allocation
   * Application must know its needs

Quality of Service Parameters: How Many

The ATM and Q.2931 specifications list a huge number of QoS parameters
including:

   * Mean, Sustainable and Peak Cell (packet) Rate
   * Cell Loss Tolerance
   * Burst Tolerance
   * Cell Delay Variance

  1. The Internet Community is working on the basis of a much simpler
     formulation of quality for an application. Basically, there is a minimum
     throughput, and a delay tolerance. The delay variation is only necessary
     to specify for either tightly bounded conferences in overloaded networks,
     or else to support legacy eequipment (CODECs that don't tolerate time
     variation beyond some bound).

QOS - how many parameters

   * Q.2931 permits a plethora of parameters
   * How many are really needed? Depends on application
   * Mean throughput and delay tolerance are probably about it

Real Time Protocol and Real Time Control Protocol

The Internet Community has developed a standard protocol for audio and video
and other image distribution applications to use to carry their data around and
provide a common platform to express some of the timing and session information
needed by real time applications.

This is RTP and its associated control protocol, RTCP. RTP is simply a framing
protocol. It contains no comprex exchanges of messages (handshaking), but
rather leaves any conference control matters to higher levels.

RTP packets contain media types and media specific timestamps. These are used
in adapative playout buffer schemes. RTCP packets carry source and receiver
reports that describe the users, and the reception quality.

RTP uis usually multicast (even when there is only one sender and recipient)
using the User Datagram Protocol (UDP) over IP multicast.

RTP and RTCP

   * Internet packet format/protocol for carrying audio and video
   * Used now for several years
   * carries media time stamp and not a lot else
   * RTCP performs some of the CCC functions

-------------------------------------------------------------------------------

RTP Packet Format

0 1 2 3

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

|V=2|P|X| CC |M| PT | sequence number |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| timestamp |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| synchronization source (SSRC) identifier |

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

| contributing source (CSRC) identifiers |

| .... |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The first twelve octets are present in every RTP packet, while the list

of CSRC identifiers is present only when inserted by a mixer.

version (V): 2 bits

This field identifies the version of RTP. The version defined by this
specification is two (2).

padding (P): 1 bit

If the padding bit is set, the packet contains one or more additional padding
octets at the end which are not part of the payload.

extension (X): 1 bit

If the extension bit is set, the fixed header is followed by exactly one header
extension, with a format defined in Section 5.2.1.

CSRC count (CC): 4 bits

The CSRC count contains the number of CSRC identifiers that follow the fixed
header.

marker (M): 1 bit

The interpretation of the marker is defined by a profile. It is intended to
allow significant events such as frame boundaries to be marked in the packet
stream.

payload type (PT): 7 bits

This field identifies the format of the RTP payload and determines its
interpretation by the application.

sequence number: 16 bits

The sequence number increments by one for each RTP data packet sent, and may be
used by the receiver to detect packet loss and to restore packet sequence.

timestamp: 32 bits

The timestamp reflects the sampling instant of the first octet in

the RTP data packet. The sampling instant must be derived from a

clock that increments monotonically and linearly in time to allow

synchronization and jitter calculations

SSRC: 32 bits

The SSRC field identifies the synchronization source.

CSRC list: 0 to 15 items, 32 bits each

The CSRC list identifies the contributing sources for the payload contained in
this packet. The number of identifiers is given by the CC field. If there are
more than 15 contributing sources, only 15 may be identified. CSRC identifiers
are inserted by mixers, using the SSRC identifiers of contributing sources.

RTCP Packet Format

0 1 2 3

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

|V=2|P| RC | PT=SR=200 | length | header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| SSRC of sender |

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

| NTP timestamp, most significant word | sender

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ info

| NTP timestamp, least significant word |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| RTP timestamp |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| sender's packet count |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| sender's octet count |

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

| SSRC_1 (SSRC of first source) | report

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ block

| fraction lost | cumulative number of packets lost | 1

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| extended highest sequence number received |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| interarrival jitter |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| last SR (LSR) |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| delay since last SR (DLSR) |

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

| SSRC_2 (SSRC of second source) | report

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ block

: ... : 2

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

| profile-specific extensions |

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

SR or RR: The first RTCP packet in the compound packet must always be a

report packet to facilitate header validation as described in Appendix A.2.
This is true even if no data has been sent nor received, in which case an empty
RR is sent, and even if the only other RTCP packet in the compound packet is a
BYE.

Additional RRs: If the number of sources for which reception statistics

are being reported exceeds 31, the number that will fit into one SR or RR
packet, then additional RR packets should follow the initial report packet.

SDES: An SDES packet containing a CNAME item must be included in each

compound RTCP packet. Other source description items may optionally be included
if required by a particular application, subject to bandwidth constraints (see
Section 6.2.2).

BYE or APP: Other RTCP packet types, including those yet to be defined,

may follow in any order, except that BYE should be the last packet sent with a
given SSRC/CSRC. Packet types may appear more than once.

Middle Layers

ST-II/PVP/etc

An alternative to IP+RSVP, and under some active research by IBM amongst
others, is the ST-II protocol. ST is like IP, but has two added functions, one
subsumes RSVP (its called the ST Control Message Protocol, SCMP) and the other
is t support multidestination calls.

ST is not yet widely available, but it is used on one very large network called
the DSINet (Defense Simulation Internet), which runs all around the world and
allows NATO countries to use videoconferncing and access remote computer
warfare simulations, hence it is quite important.

ST and PVP and so on...

   * ST was an experimental version of IP for flows
   * Still in use on some research nets
   * may still come onstream...

MULTICAST

-------------------------------------------------------------------------------

   * So why is IP Style Multicast is Better than Unicast?
   * The internet has provided multicast for about 7 years now.
   * Only recently have products included this facility
   * No more so than with routers!
   * The graph illustrates past growth in Internet multicast reacaility

-------------------------------------------------------------------------------

Multicast Routing Protocols

   * Reverse Path Unicast routes:
   * DVMRP, MOSPF, PIM
   * Own Routes:
   * CBT
   * Single Tree:
   * CBT and PIM Sparse mode
   * Source Tree: DVMRP, Dense Mode ...

Reliable Multicast Transport

   * RMP - from Berkeley
   * Uses virtual Ring to circulate token
   * Scales well for small numbers of sources
   * Bad for video/audio with lots of sources and sinks
   * Better to distribute relaibility and ordering

Internet MM Applications

   * Video: IVS, Nv, Vic, CuSeeMe
   * Audio: IVS, Vat, Bat, Maven, Nevot
   * Whiteboards: ShowMe, Wb, MScrawl
   * Other: Imm, etc
   * Reliability and Ordering, distributed
   * Not a problem for human-to-human -
   * Consistency only needed for "data"

Multicast Coordination

   * With CCCP or MMCC, also need Session
   * Session Directory Tools allocate addresses for groups to avoid clashes
   * SD Tool provides user with "bboard" style navigation of topics
   * Protocol is draft for now....

VR: General Tools

   * Few - probably only VRML (Virtual Reality Modelling Language, Web like,
     from SGI)
   * Also Vidl (Video Extensions to Tcl)
   * See also MUDs, MOOs and:
   * Jupiter - Xerox multimedia MUD.

[Image] Mbone Growth

Multicast Lesson 1.

   * If S sends to D unicast, but D is a replicated service, S sends n packets.

   * If S multicasts, and all the D's recognize multicast destination as
     themselves (and dont lose it), S sends 1 packet

   * N-fold decrease in bandwidth!
   * The figure shows a multicast stream
   *  [Image] Mcast Model

Multicast Lesson 2.

   * R recognizes multicast from S, knows there are members of D on "other
     side", N-fold decrease in bandwidth.
   * Especially useful since path through R probably slower line!
   * D's group members may come and go.
   * R has to now keep track of this - in unicast case, it did something
     similar through ARP ....now, group memberships changes made by D's sending
     (IGMP) better approach.
   * [note similarity to mobile host problem]
   * Count cost in terms of number of times packet traverses each link.

Multicast Lesson 3.

   * Now routers must either exchange D location information, or forward all
     messages to all D's, or exchange D non-location information.
   * Key tradeoffs;
        o a number of groups at site and not at others (sparseness of group
          distribution).
        o b rate of groups appearing and disappearing, members appearing and
          disappearing affects size and frequency of updates.
        o c sources appearing and disappearing affects tradeoff in choosing
          distribution by default with "pruning" versus distribution of joins

Scaling 1.

  1. Simple optimization for join or prune - "aggregate"different Ds...

   * if possible. Note: Aggregation implies

   * a routes for different Ds are same and
   * b at the same time...

   * include cost of router exchange information, and unnecessary visits to
     links without members....

Multicast Lesson 4.

   * Single tree (centered = steiner [NP-Hard] problem) versus
   * Tree per source on reverse path tree from unicast SPF from D to S. versus
   * SPF tree from each source to D leaves

Scaling 2.

   * Alternative approach altogether -
   * Group = list of unicast addresses (or 'site' router' address).

   * Count an optimization in paths in terms of two (usually conflicting)
     factors -

a number of times a link is visited by different sources

for same destinations D.

b minimizing delay (or other metric)

Multicast IP (DVMRP, CBT, PIM, MOSPF)

For datagram networks, (be they IP, Novell or CLNP), there are two

basic approaches to calculating multicast routes.

   * 1. Calculate a Reverse Path Multicast tree from each source - this is
     essentially just the tree made up of the unicast routes from the
     destination to the source, and can be built on demand when sources start,
     and removed when they stop, by extracting it from the unicast routing
     tables.

   * There are two variants of this, based on which unicast routing paradigm is
     in use
        o i. If the underlying scheme is a Link-state one
        o (c.f. OSPF etc), then a link state multicast tree can be built from
          this (in the case of MOSFP, as specified in RFC1584, as supported in
          Proteon Routers, this is made more scaleable by using aggregation).
        o ii. If the underlying system is a distance vector one, then
        o RFC 1075 describes how to build source based trees. It uses pruning
          to achieve better scaling.
   * 2. Find the Center of the group, and build a tree from it to them (and
     them to it, and thence to each other). This scheme is used by CBT, and is
     also part of the basis of the new Protocol Independent Multicast routing
     protocol (Cisco support), which switches between mode 1/ and mode 2/
     depending whether a group is sparse or dense in terms of how its
     membership is distributed over the Internet.

  1. The only problem with this latter approach is that "finding the center" is
     a well known non-computable problem (the Steiner Tree NP-Hard problem).
     Luckily, there are quite a few heuristics emerging.
  2. Finally, a single tree is good for minimizing the delay amongst ALL the
     participants, whilst a source specific tree may be better in terms of
     optimal use of links and may result in better source specific delay.

Different multicast schemes differ in their delay and cost tradeoffs

This is illustrated in the picture:

[Image] Delay Versus Cost

Multicast ATM

In a virtual circuit based network such as X.25 or ATM, it is still possible to
build point-multicast calls (albeit a lot less efficient than the IP many to
many model, if you have a large number of sources in the group). To route the
call from a caller to a number of callees is easy - it would simply rely on the
standard call routing in ATM switches. However, if we want the same support for
"receiver join" or "leaf join", then we need a rendezvous point. This means
that CBT or PIM would make good candidates for multicast call routing in
circuit networks.

ATM Multipoint/Multicast Call Routing

This may well look like this:

[Image] ATM Mcast

Multi-point Control Units for ISDN

In networks built out of physical digital circuits such as ISDN, (in the
absence of multipoint physical circuits!) we need some other mechanism for
multiparty calls - this ends up being a question of building a higher layer
entity to do the fanout. For ISDN, and for ISDN Video, this has been defined as
an application layer unit, called a Multipoint Control Unit, and we'll talk
about its protocols more below.

QoS Based Routing.

Multimedia communication can entail multiple, heterogeneous networking
requirements. This can interact with unicast, just as it did for multicast
routing. For example, if I transfer a large document including video from one
machine t another while I am talking to someone over the same network, it maybe
that my voice call is best routed over a modest bandwidth, but low delay path,
while the file transfer better moved over a high throughput, but relatively
high delay path (e.g. satellite, for example). In general, multi-metric routing
is a very hard problem. However, it is usually easier if first you separate
metrics into traffic independent and dependent ones - for example, the
linespeeds and propagation delays are not subject to other traffic, while the
packet store and forward times and rates are subject to queuing.

ADSL (SUBSCRIBER LOOP VIDEO DELIVERY)

-------------------------------------------------------------------------------

An exciting new development in the transmission systems world has been that of
high bandwidth transmission over existing copper into the home via a digital
subscriber loop protocol. It has been discoverd that the POTS cable plant is
good enough to get as much as 8Mbps into a home over the distance fro mthe
phone exchange. This may be used by cable tv or Video rental companies to
deliver video in to the home, and use a low bandwidth output path to allow the
user access to other services and to control this one - it remains to be seen
if there really is going to be a demand for video in this form.

WHAT WILL IT COST AND WHO WILL SELL IT TO US?

-------------------------------------------------------------------------------

The current tarrifing of networks and multimedia is a thorny question:

ISDN based networks are relatively inexpensive in terms of perfermance versus
call charges in Europe, but relatively non-existent in the US. However, for
multiparty international calls, even I neurope, the tarrifs rapdily become less
attractive than leased lines - 5 minutes a day between 4 different coutries at
basic rate will cost more than the same bandwidth leased.

This means that in terms of multiparty conferencing, it is likely that packet
based networks built on top of leased lines will be more attractive, esprcially
since they can be used for data when no in use for conferencing.

The fact that the Internet does not currently have the capacity for much of
these types of use is simply because it is early days for desk top distrbiuted
conferencing yet.

The end system capabilities are rapidly becoming marginal - the cost of a video
and audio card adds around 10% to the price of a mid-range PC now.

It seems likely that multiservice packet networks, and in particular those with
good multicast support, will eventually be the way forward. However, it is also
likely that the last mile ("subscriber loop" may well (as with BT's Internet
Service) be provided through basic rate ISDN. However, with IP, or ATM based
end systems at the end of such a hop, there is no reason not to take full
advantage of distrubuted conferencing.

As the take up gets larger, even just for data access, the backbone bandwidths
will have to increase. It may well be that the bottlenecks we see today are
just a figment of the current tariff structures that are required to fund the
grwoth in the superhighway. Once the capacity is in place, the prices will
tumble

However, thjere will always be usedrs who can overload the backbone - if not
video, then HDTV, or 3D motioj holography, or multi-player VR r something. So
reservation will be needed, together with some enforcement, whether through
priority or charging or both, is needed. Note however that it is only needed
for these heavy duty customers. There may come a time when line rental will be
all you pay even if you spend 2 hours a day in 5 way videoconferences with
prople in 5 different countries.

What will it cost and who will sell it?

   * The telcos and entertainment cos would love to own it
   * Truth Is that the net is broader and more radical than that
   * Anyone can be an 'author' or 'performer'
   * Leads to a different model of

  1. billing
  2. security
  3. dimensioning

   * The overcapacity needed to permit busniess to work will mean that idle
     time capacity will be very large...
   * Likely to see strange traffic patterns!

Operating system

DEVICE DRIVERS

-------------------------------------------------------------------------------

Just as within the network, within a host computer you need to control delay
and avoid ignoring or starving a multimedia device. To some extent, this might
be the job of the system scheduler (see below), but this can be saved a lot of
work by device drivers providing adaquate buffering and timeing support.

Device drivers operate out of hardware interrupt levels, so priorities can be
set approprirately for the appropriate input or output urgency, combined with
buffers appropriate to the next task in hand.

For example, an audio device might sample its input 8 bits at a time (e.g.
8mhz, 8-bit mu or a-law samples). But if we are going to process these in an
application, or perhaps send them over a packet net, it may make more sense to
ask for 40msec samples at a time (i.e. 40 *8000/1000, or 320 bytes), since this
is a unit that can more easily be processed or packetised. This will depend on
the exact deveice hardware whether one can program it in a driver to deliver
audio/video dma to some particular buffer (or from) and only interrupt at
finish. Then the driver has to turn around the buffer and give the device an
new buffer before it runs out (actually, this is easily done using a circular
buffer that is sized to be twice the size of the application reads, plus the
amount of storage necessary for the arrival rate for the time the read itself
takes).

Video devices vary a lot more than this, but ideally would lok like VRAM to the
application.

OPERATING SYSTEMS

-------------------------------------------------------------------------------

   * Need same timeliness and throughput control in system as in network
   * devicedrivers may take most the strain
   * If, and only if, deveices and systems have good clock access

REAL TIME SCHEDULING

-------------------------------------------------------------------------------

Realtime scheduling is not neessary in hybrid systems, nor is it necessary in
the operating system for networked applications that use adaptive playout
schemes. However, it may be necessary to support multiple priorities (or
hierarcahil round robin scheduling) in a system that supported multiple
multimedia applciations simulataneously, otherwise one might starve out the
others. One concrete exampel would be a general purpose computer used to
support video on demand.

In a System V Unix like system (e.g. Solaris, HP-UX, AIX and NT), you might
implement the data and control paths from a multimedia I/O device to the
network i/o devie entirely within the operating system rather than developing a
special applciation and relying o nthe operating system (or having to put up
with the kernel to user level scheduling overheads!), or you mighjt be able to
use one of the more modern programming facilities such as kernel threads to
program such an application more flexibly.

REAL TIME SCHEDULING

-------------------------------------------------------------------------------

   * Advent of continuous media may need real time scheduling
   * May not, though - can overprovision system
   * Note that priorities would then be a sufficient system

SYNCHRONIZATION

-------------------------------------------------------------------------------

There are three plaes where synchronisation is important:

  1. Within a stream, we need to make sure that transmitter and receiver are in
     synchronisation. This entails encoding a clock in the data, or else using
     a network that conveys the clock, or both.
  2. Between separate streams, e.g. the video from two people in a
     videoconference, we might want to make sure that the relative timeing
     perceived by one view at one site of the two streams is the same as that
     perceived by a different viewer at a different site - for example, a
     videoconference with 4 people, A,a, B, b, where A and B are sending, and a
     and b are watching, and a is near A and b is near B. A delay needs to be
     added to the stream from B to b, and to the stream from A to a to create a
     level playing field. In a multicast situation, this delay is incorporated
     in to the playout buffer as a baseline.
  3. Between different media - e.g. lip synch

Synchronisation

   * Intra-stream synch - inside a stream, need to know where in the "time
     structure" a bit goes
   * Inter-stream - e.g. we are watching two different people and want to see
     their reactions to what they see of a third
   * Inter-media - this is just lip-synch!

Intra-stream Synch

Intra-stream synchronisation is a base part of the H.261 and MPEG coding
systems. H.221 and MPEG systems, specify an encapsulation of multiple streams,
but also how to carry timing information in the stream.

In the Internet, the RTP media specific timestamp provides a general purpose
way of carrying out the same function.

Intra-stream Synch

   * Part of H.261 and MPEG and so on
   * Also in the RTP Internet Profocol spec

Inter-Stream Synch

The easiest way of synchrinisaing between streams at different sites is based
on providing a globally synchronised clock. There are two ways this might be
done:

  1. Have the network provide a clock. This is used in H.261/ISDN based
     systems. A single clock is propogated aroudn a set of CODECs and MCUs.
  2. Have a clock synchrnisation protocol, such as NTP (the network Time
     Protocol) or DTS (Digital Time Service). This operates between all the
     computers in a data network, and continually exchanges messages between
     the computers to monitor:
       1. Clock Offsets
       2. Network Delays

Alternatively, the media streams between sites could carry clock offset
information, and the media timestamps with arrival tiems could be used to
measure network delays, and clocks adjusted accordingly, and then used to
insert a baseline delay into the adaptive playout algorithms so that the
different streams are all synchronised.

Inter-media Synch

   * Could have global clock fro mnetwork
   * Culd use clock synch between computers
   * Could carry clock in all packets and use for clock synch calculation a la
     NTP/DTS

Inter-Media Synch

There are two basic ways of synchronising different media:

  1. Encapsualte the media in the same transmission stream. This is very
     effective but may entail computationally expensive labour at the
     recipieitn unravelling the streams - for example, H.221 works like this,
     but since it is designed to introduce only minimal delay in doing so, is a
     bit level framing protcool and is very hard to decode rapidly.
  2. Use much the same scheme as is used to synchronise different sources from
     different places. However, since media from the same source are
     timestampeld by the same clock, the offset calculation is a lot simpler,
     and can be done I nthe receiver only - basically messages between an audio
     decoder and a video decoder can be ecxchanged inside the receiver and used
     to synchronise the playout points.

This latter approach assumes that the media are timestamped at the "real"
source (i.e. at the point of sampling, not at the point of transmission) to be
accurate.

Inter-media synch

   * Can multiplex different media in a single data stream, or
   * Can carry media timestamps, same as for inter-stream synchronisation
   * Not difficuly, but may not be necessary either - depends on quality and
     delay bound requirements!

Storage Media

COMPACT DISK FORMATS (CD, CD-I, CD-I VIDEO ETC)

-------------------------------------------------------------------------------

CD was developed by Phillips as a digital replacement for the old Vinyl
long-playing album, which was expensive, error prone and highly variable in
quality.

CD-ROM stands for "Compact Disc Read Only. It is physically the same as a Music
CD (In fact , just about all CD ROM computer drives will play music CDs if only
through the headphone output, but sometimes even by retrieving the music as if
it were data, and then directing it to a digital audio output device (e.g.
soundblaster card on a PC!).

A CD-ROM can hold about 650 megabytes of data (i.e. a few thousand floppies
worth) , and is impervious to magnets and xrays and even modest physical
impact. However, CDs are a lot slower than most magnetic storage technology,
and what is more, you cannot write a CD-ROM no matter how hard you try
(although a machine for mastering them is not that expensive - typically around
10k, and most shops that have one will take your data and produce it on CD-ROM
for around 1k for 1, and 1 dollar a disk thereafter!)

CD-ROMs are exactly as good as CDs for reading sequential data (i.e. sustained
1.4Mbps) but for any random access, the heads have to be moved. Unfortunately,
so does the disk speed, since it is designed to generate a constant rate at a
constant physical recording density, so the disk moves faster when you are
reading near the middle than at the outside (i.e. linear velocity is constant,
so angular velocity is in inverse proportion to radius). So far, it has
defeated technical design to make seeking on such a device at all reasonable.

CD-ROM File formats are usually based on the old IBM High- Sierra design, now
ratified as ISO 9660. This is fine for DOS machines, but is a bit limiting for
UNIX systems, so people tend to use the Rock Ridge extensions.

CD is not particularly flexible or high performance for anything but it is a
base piece of technology for the multimedia world.

CD-I stands for Compact Disc Interactive. It was designed to provide a single
format for all multimedia(especially educational) packages. However, it tries
to capitalize on CD, but adds an entire system (CPU, etc) to provide the
interactive side to access. Unfortunately, this means it is very tied to media,
performance, and understanding of structure of such systems as and when it was
designed, and it is also tied very much to one manufacturer, Philips, and is
unlikely to be picked up by that many others.

Storage Media

   * Conventional media are catching up (mag disks are 1 K per Gigabyte)
   * CD based technology a useful stop gap
   * CD very poor for random access, but fine for sequential
   * DAT and Video 8 also useful stopgaps

-------------------------------------------------------------------------------

CD

   * "Color Books":
   * Red:CD DA (Digital Audio)
   * Yellow: CD ROM (Read Only Memory)
   * Green: CD-i (interactive)
   * Orange: CD-R (Recordable)
   * White: CD-v (video): MPEG-1

CD-i

   * Multiple Media: Audio - multilevel
   * Video - CD-v based
   * Text&Graphics: Up to applciation
   * Player: Specification Includes CPU
   * Includes ADPCM Audio and separate video coder/decoder...

Digital Video Interactive: DVI

   * DVI includes video, audio, image, text.
   * Two level: Real Time and Production Quality
   * Media organised into streams, interleaved in single file
   * Products based on Intel chipsets (i750/ActionMedia boards)

DVI Operations

   * AVSS: Audio Video Support System
   * supports interaction between playback and display and host OS
   * Very PC Specific
   * No multi-stream operations, so:
   * not good for conferencing.

Quicktime

   * Media includes MPEG1 video and lower quality specs such as Road Pizza.
   * Photos are JPEG compressed
   * Organisation is into: {data, media, track, movie}
   * Hierarchy of types
   * Also interfaces to MIDI

QuickTime Orgaisation

   * Data = file
   * Media = media type+start time + duration
   * Track = ordering of a media item - like an edit
   * Movie = group of tracks

Multimedia PC:- MPC

   * Media as per other schemes:
   * Audio - WAVE (Waveform Audio File format)
   * Music via MIDI
   * Image: based on DIB
   * Text+Graphics: RTF
   * Video: VfW - multiple codecs - e.g. Indeo

MME

   * Multimedia Extenstions for MPC:
   * RIFF: Resource Interchange File Format
   * Metaformat for describing contents in terms of media types.
   * Operations are a bit richer than in Quicktime or DVI

MME Operations

   * Capability, Open/Close, Info, Pause, play, resume, seek, set, status, stop
   * Capaiblity: e..g Can Play, Can Eject
   * or Has Audio, Has Video etc
   * Architecture permits separate intelligence in Controllers and Device
     Drivers

Director

   * Macromind director is typical authoring tool
   * scores include channels
   * channel includes tempo, palette, transisions, sounds etc and scripts
   * Scripts are like edit sequences

Use of World Wide Web HTTP, HTML and MIME

WWW - HYPERMEDIA

-------------------------------------------------------------------------------

The World Wide Web makes all previous network services look like stone tablets
and smoke signals. In fact, the Web is better than that! It can read stone
tablets and send smoke signals too!

The World Wide Web service is made up of several components. Client programs
(e.g. Mosaic, Lynx etc) access servers (e.g. HTTP Daemons) using the protocol
HTTP. Servers hold data, written in a language called HTML. HTML is the
HyperText Markup Language. As indicated by its name, it is a language (in other
words it consists of keywords and grammar for using them) for marking up text
that is hyper!(9)

The pages in the World Wide Web are held in HTML format, and delivered from WWW
servers to clients in this form, albeit wrapped in MIME (Multipurpose Internet
Mail Extensions) and conveyed by HTTP. HTTP is the HyperText Transfer Protocol.

What is WWW?

   * Distributed Hypermedia Database
   * Contents are described in MIME, Multipurpose Internet Mail Extensions
   * Servers hold data in HTML - HyperText Markup Language
   * Links are Universal Resource Locators
   * Acess protocol is HTTP - HyperText Transfer Protocol

A Note on Stateless Servers

Almost all Information Servers above are described as stateless.

State is what networking people call memory. One of the important design
principles in the Internet has always been to minimize the number of places
that need to keep track of who is doing what.

In the case of stateless information servers this means that they do not keep
track of which clients are accessing them. In other words, between one access
and the next, the server and protocol are constructed in such a way that they
do not care who, why, how, when or where the next access comes from.

This is essential to the reliability of the server, and to making such systems
work in very large scale networks such as the Internet with potentially huge
numbers of clients: if the server did depend on a client, then any client
failure, or network failure would leave the server in the lurch, possibly not
able to continue, or else serving other clients with reduced resources.

Having said this, the idea of being stateless does not necessarily mean that
the servers do not keep information about clients. For example:

   * Logging how many clients and from where they access

  1. This can be useful even for sites that do not recoup funds for serving
     information, but so that they can point at the effectiveness of their
     information service.

   * Keeping track of most frequently accessed material

This can be useful to age and remove unaccessed information.

It can also be used to decide to put frequently accessed information onto
faster servers, or even move the information to servers nearest the most
frequent clients (called load balancing).

   * Using Access Control Lists to limit who can retrieve which information

  1. Some servers allow the configuration of lists of Internet Addresses, or
     even client users who are (or are not) permitted access to all or
     particular information.

   * Using authentication stages before permitting access, and also to allow
     billing.

  1. While we would not recommend using the Internet to actually carry out
     billing yet, you can certainly employ secure authentication techniques
     that would identify a user beyond doubt. This can then be used with each
     access log, to calculate a bill which can then be sent out-of-band, e.g.
     by post.

   * Sharing out information on heavily loaded servers or networks,
     differentially, depending where clients are.

  1. Some sites offer a wealth of information, but have less good long-haul
     Internet access. They will then distribute data more frequently in favor
     of local, site or national clients, above non-local , or international
     ones.

Stateless Servers

   * Do not track clients
   * Essential to scaling -
   * Can log clients (but in persistent store)
   * Can authenticate clients
   * Can load balance if no memory between one client access and another

Caching

Another use of the term stateless is to describe whether or not the server
keeps note of the actual data from each access by a client (irrespective of
whether it notes who the client was). This is called server caching.

[Cache is usually, but not always pronounced the same way as cash. It is
nothing to do with money, or even ATM, whether ATM stands for Automatic Teller
Machine, or Asynchronous Transfer Mode, or even Another Terrible Mistake]

Server Caching is a way of improving the response time of a server. Usually,
servers keep data on disk. If they keep a copy of all the most frequently or
most recently accessed data in memory, they may be able to respond to new (or
repeating) clients more quickly. Such caching is usually configurable, and
depends largely on measuring a whole lot of system parameters:

   * Disk speed and capacity versus Memory speed and capacity.

  1. Obviously, if there isn't much memory in a system then a cache say of one
     item would have little effect.

   * Network speed versus disk speed

  1. A memory cache is pointless if the network is always slower than the worst
     disk search!.

   * Client access patterns.

  1. Clients may repeatedly access the same information.

Different clients may tend to access the same

information. Even if clients access different

information over time, it may be that at one time, most people tend to access
the same information (this is especially true of news servers or share
information servers for example)

Caching is also employed in client programs. In other words, a client program
may well not only hand each piece of information to the user - it may also
squirrel away a copy of recently accessed items to avoid having to bother the
server again for subsequent repeat requests for the same items.

In both server and client caching, the system should make sure that the actual
master copy hasn't changed since the cache copy was taken. This can be quite
complex!

Caching

   * Trade off network, disk and memory speeds
   * Can optimise servers for client access patterns
   * Can cache in clients as welll as servers

So, what is the World Wide Web?

From the user point of view, the World Wide Web is information, a great tangled
web of information. The user doesn't care anything (well, almost anything)
about where the information is stored, about how it's stored, or about how it
gets to her screen - she just says ``Oh, that looks interesting'', clicks the
mouse, and (after a short time, or a long time if your link is slow and the
file is large), the information arrives.

Here's a short example:

Example of using WWW

A researcher is coming to London for a conference, and she needs information on
hotels to stay in. Starting with the ``Internet Starting Points'', which is
available directly from the ``Navigate'' menu on the screen, she might follow
the a sequence like this:

   * Selecting ``Internet Starting Points'' fetches a
   * list of possible sources of information.

  1. She sees that there's a highlighted phrase which says ``Web Servers
     Directory'', and she thinks ``aha maybe there's a WWW server in London''.
     She clicks on ``Web Servers Directory'', and after short delay the page
     arrives...

...On the Web Servers Directory, she searches down the list of countries until
she finds the entry for the United Kingdom. One entry listed is ``Country
Info'', and she wonders what info is provided. She clicks on it and...

   * ...``Country Info'' turns out to be an active map of
   * the UK. She clicks on London ...
   * ...and gets a guide to London, including an entry
   * labeled ``Hotels in central London''. She clicks

on this and finds the information she was looking for.

She didn't need to know very much information, other than where to start, and
on most browsers there are a few suggested starting points built in. There are
hundreds of other paths she could have followed to get to the same eventual
destination.

Example of WWW

[Image] WWW

Beneath the Surf

Mosaic has a few well know places to look for data built in.

One of these is specified by the URL:

   * http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/-
   * ,!StartingPoints/NetworkStartingPoints.html

A URL is a Uniform Resource Locator. This specifies what a piece of information
is called

(``/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html''),

where to find it (in this case the machine called www.ncsa.uiuc.edu), and which
protocol to use to get the information (in this case http, or HyperText
Transfer Protocol).

When our researcher selects ``Internet Starting Points'', her Mosaic makes a
TCP connection to the World Wide Web server running on www.ncsa.uiuc.edu. It
then uses this connection to send a request for the data called
``NetworkStartingPoints.html''. The WWW server at NCSA uses the connection to
send back the requested data, and then closes down the connection.

Next, Mosaic reads various embedded commands in the data that was retrieved,
and creates a nicely laid out page of text which it presents to our researcher.

Some parts of the text she sees are highlighted (on Mosaic for X, they are
underlined and coloured blue). One entry she sees is:

Web_Servers_Directory_: The central listing of known World Wide Web servers.

She simply clicks on the highlighted text, and the associated page of
information is fetched ``as if by magic''. Of course, what actually happened
was that the text she saw on screen was not the whole story. The page of data
that was retrieved from NCSA was actually in a language called HTML or Hyper
Text Markup Language. Before her copy of Mosaic laid out the text nicely, it
actually looked something like:

<A HREF="http://info.cern.ch/hypertext/DataSources/WWW/Servers.html"> Web
Servers Directory </A>: The central listing of known World Wide Web servers.

Thus the highlighted text she clicked on was associated with the URL:

http://info.cern.ch/hypertext/DataSources/WWW/Servers.html

and clicking on this text causes her Mosaic to make

a connection to info.cern.ch to request the page called

``/hypertext/DataSources/WWW/Servers.html''

Our researcher may be sitting in Melbourne, Australia. The NCSA server is in
Illinois, USA, and the CERN server is near Geneva in Switzerland, but none of
this concerns our researcher - she just clicks on the highlighted items, and
the hyper-links(4) behind them take her from server to server around the world.
Unless she pays close attention to the URLs being requested, she will not know
or care where the data is actually stored (except that some places have slower
links than others).

Another Example

On the list of places she retrieved from the CERN server, she sees the entry:

United Kingdom (sensitive_map_, country_info_)

The HTML behind(5) this entry is actually:

United Kingdom

(<A HREF="http://scitsc.wlv.ac.uk/ukinfo/uk.map.html"> sensitive map</A>, <A
HREF="http://www.cs.ucl.ac.uk/misc/uk/intro.html"> country info</A>)

She clicks on country info, thus requesting the HTML text with the URL:

  1. http://www.cs.ucl.ac.uk/misc/uk/intro.html

As before, her Mosaic sets up a connection, this

time to www.cs.ucl.ac.uk, and retrieves the page called

``/misc/uk/intro.html''.

However, this time the HTML her Mosaic gets back contains

the command:

<img src=uk_map_lbl.gif ISMAP>

Ignoring the ``ISMAP'' bit for a second, this says that the

page should contain a GIF image at this point, and that

the GIF image is called ``uk_map_lbl.gif''. Actually it's full

URL is:

http://www.cs.ucl.ac.uk/misc/uk/uk_map_lbl.gif

which Mosaic can figure out from the URL of the page the

image is to be contained in. Mosaic now sets up another

connection to www.cs.ucl.ac.uk to request the image called
``/misc/uk/uk_map_lbl.gif'', and when it has retrieved the image, it displays
it in the correct place in the text.

Now, if it wasn't for the ISMAP part of this HTML, that's all that would happen
- the image would be displayed, and our researcher could look at it. However,
in this case, the image is a map of the UK, and we put some intelligence behind
the map. The ISMAP part of the HTML tells our researcher's Mosaic that this
image is special, and it will allow her to click on the map to get more
information.

MAP Example

In actual fact, the full piece of HTML we used in this particular case was:

   * <a href=/cgi-bin/imagemap/uk_map>
   * <img src=uk_map_lbl.gif ISMAP>

  1. </a>
  2.
  3. [Image] WWW Maps

So, when our researcher sees London marked on the map, and she clicks on it,
her Mosaic does something a little different. It sets up a connection to
www.cs.ucl.ac.uk (that's where the map came from(7), and sends a request for
the URL:

http://www.cs.ucl.ac.uk/cgi-bin/imagemap/uk_map?404,451

Here 404,451 are the coordinates of the point she clicked within the map. The
ISMAP command associated with the image tells Mosaic to work out where the user
clicked, and send that information too.

At the server on www.cs.ucl.ac.uk, there are a number of data files for maps.
This special URL asks the server to look in its map data for ``uk_map'', and
find what the point 404,451 corresponds to .

The WWW server running on www.cs.ucl.ac.uk responds with the URL of the page
corresponding to London on this map - in this case the URL is:

http://www.cs.ucl.ac.uk/misc/uk/london.html

which happens to be on the same server as the map, though it didn't have to be.
Our researcher's Mosaic then sets up another connection to www.cs.ucl.ac.uk,
and requests the page ``/misc/uk/london.html''.

When this page is received, Mosaic parses the HTML text it gets back, and
discovers the following line in the retrieved text:

<img src=/uk/london/tower_bridge.gif>

  1. and so it then also requests

http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif

which is just a little picture of Tower Bridge here in London, which doesn't
have any special significance other than decorating the London page.

Uniform Resource Locators (URLs)

The above example presents quite a number of URLs. For instance the URL:

http://www.cs.ucl.ac.uk/misc/uk/intro.html

As we stated above, this says that the data called

``/misc/uk/intro.html'' can be retrieved from the server

running on a computer called ``www.cs.ucl.ac.uk'' using http

which is the HyperText Transfer Protocol.

This could equally well say:

http://www.cs.ucl.ac.uk:80/misc/uk/intro.html

The number 80 here is the TCP port on the machine

www.cs.ucl.ac.uk that the WWW server is listening on. TCP

ports are a way that several different kinds of server can

all listen on the same machine without getting confused about

which server the connection is being made to (think about

lots of letter boxes in an apartment block). Port 80 is the

default port for the HyperText Transfer Protocol, so if you

don't say which port to connect to, Mosaic and the other WWW

browsers will all assume you mean port 80. See chapter 5 for

more details about server ports and why you might sometimes

run a server on a different port.

URLs

   * http://www.cs.ucl.ac.uk/misc/uk/intro.html
   * data called``/misc/uk/intro.html'' can be retrieved

   * server running on a computer called ``www.cs.ucl.ac.uk''

   * using http which is the HyperText Transfer Protocol.

More about URLs

URLs don't just have to specify that you use HTTP. For

instance the URL:

ftp://cs.ucl.ac.uk/mice/index

   * says that to get this information, contact the ftp server
   * running on
   * cs.ucl.ac.uk

Most WWW browsers know how to talk to ftp servers

too, so they can set up an ftp connection, and request

``/mice/index'' using the much older File Transfer Protocol.

One of the biggest plus points for Mosaic and other WWW

browsers is that they are multiprotocol clients - that is

they know about quite a number of different protocols, and

so they can contact a number of different types of servers

for information. If the information is out there on the

Internet, no matter what type of server it's on, there is almost certainly a
way for a WWW browser to get it. The URL tells the browser what type of server
the data resides on, and thus how to go about getting it.

More About URLs

Protocols that WWW browsers know about include:

   * http: HyperText Transfer Protocol
   * ftp: File Transfer Protocol
   * gopher: the menu based information system predating WWW
   * wais: Wide Area Information System - an information system

allowing complex searching of databases

   * telnet: the protocol that allows you to log in to remote

systems.

   * archie: the indexing system that allows you to find out what

information is stored where on ftp servers.

An Introduction to HTML

HTML is the HyperText Markup Language. As indicated by

its name, it is a language (in other words it consists of

keywords and a grammar for using them) for marking up text that is hyper! HTML
is an extension to the fairly commonly used Standardized Generalized Markup
Language, SGML (8)

The pages in the World Wide Web are held in HTML format, and delivered from WWW
servers to clients in this form, albeit wrapped in MIME and conveyed by HTTP,
of which more below.

Marking up is an ancient skill developed in the Dark Ages of publishing by
guilds of printers, keen on presenting the written word in a pleasant and
effective way on the printed page. Typically, in recent years, the skill has
diminished with the advent of WYSIWYG (What You See Is What You Get, so called
whizzy wig) word processing packages and desktop publishing systems. This need
not daunt you, since you do not have to author or prepare material for the
World Wide Web in HTML directly, unless you really want to. Typically, an
author will write material using whatever word processor they are used to, and
then use a filter to translate the output into HTML. We will discuss some of
the various filters that are available in later chapters.

Getting Started with HTML

A simple example of HTML is:

   * <HTML>
   * <HEAD>
   * <TITLE>This is the Title</TITLE>
   * </HEAD>
   * <BODY>
   * <H1>This is the Page Heading</H1>
   * This is the first paragraph.
   * <P>
   * This is another paragraph,
   * with a sentence
   * that is split over several lines in HTML.
   * </BODY>
   * </HTML>

  1. When this is displayed by Mosaic, it will look like:
  2. As you can probably guess, commands are enclosed in angle-brackets <>, so
     that the HTML command <TITLE> means that the following text is part of the
     title.
  3. Commands beginning </ are the end of the equivalent command. For example
     to say that the text ``This is the Page Heading'' should be a level one
     heading (the largest type of heading), the complete sequence is:

<H1>This is the Page Heading</H1>

A break between paragraphs is denoted <P>. There is no need for a </P>
afterwards because the end of a paragraph is obvious from the start the next
paragraph, list, heading or whatever.

<HTML>

Strictly speaking a page should start <HTML> and should end </HTML>, but the
HTML specification also says that clients should perform correctly without
them, and so many people omit them. Similarly the header of a document (the bit
containing the title) should begin with <HEAD> and end with </HEAD> and the
body of a document should begin with <BODY> and end with </BODY>, but in
practice this isn't essential. The HEAD and BODY commands are newer additions
to HTML, which allow some of the fancier features to be used, but if you're not
using these features, you can safely omit both.

Documents written in HTML are not WYSIWYG - Mosaic and other WWW clients will
re-arrange the layout of your text so it fits properly on whatever size display
you try and display it on. So if you really want to break a line at a specific
place, you should use <P>, rather than a carriage return, as Mosaic will remove
the carriage return and replace it with a space, and then break your line of
text at a point that is convenient for the current page width. Hence the text:

   * <P>
   * This is another paragraph,
   * with a sentence
   * that is split over several lines.

Will get formatted as:

This is another paragraph, with a sentence that is split over several lines.

Headings and Typefaces

We've already seen one type of heading, a top level heading denoted by the
<H1>....</H1> pair. As you would expect, HTML supports many different levels of
headers, with H1 being the largest, getting progressively smaller with H2, H3
and so on down to H6. Exactly which font and size a particular heading will be
displayed with depends on which browser you use to view the text - some text
based browsers won't do anything, but more fancy graphical browsers such as
Mosaic will choose a sensible set of fonts. (9)

HTML also lets you specify that a piece of text should be in a bold typeface
using the <B> ... </B> combination, or in an italic typeface using the <I> ...
</I> combination.

Thus the HTML:

<I>The</I> <B>Guardian</B> newspaper titles look like this

results in:

The Guardian newspaper titles look like this

Lists of things

Lists of things are pretty useful in ordinary text, but in HTML, where you'll
often have lists of links to other places, they're even more useful. However
WWW servers just consisting of lists are pretty boring too, and with some
imagination, you'll find more interesting ways to present many things.

The simplest list is the bullet or unordered list, which is denoted by <UL>,
and the list items in it are denoted using <LI>. An example is:

Oxymorons:

   * <UL>
   * <LI>Military Intelligence
   * <LI>Plastic Glasses
   * <LI>Moral Majority
   * </UL>

This would be displayed as:

Oxymorons:

   * Military Intelligence
   * Plastic glasses
   * Moral majority

Another form of list is the numbered or ordered list denoted by <OL>. Ordered
lists have the same syntax as unordered lists except that OL replaces UL in the
list delimiters:

Oxymorons:

   * <OL>
   * <LI>Business ethics
   * <LI>Chilli
   * </OL>

This gets displayed as:

Oxymorons:

   * 1. Business ethics
   * 2. Chilli

Definition Lists

A more complex type of list is the definition list, denoted by <DL>. definition
terms are denoted using <DT> and actual definition data is denoted using <DD>,
so a typical list may be:

Population Statistics:

   * <DL>
   * <DT>Ireland
   * <DD>population 3 million
   * <DT>Scotland
   * <DD>population 5 million
   * <DT>England
   * <PP>population too many
   * </DL>

which would be presented as:

Population Statistics:

Ireland

population 3 million

Scotland

population 5 million

England

population too many

If you wish to have several paragraphs of definition data associated with one
definition term, simply use several <DD> entries.

Note that although the <DL> list must be finished with a </DL>, each <DT> or
<DD> list item is simply ended by the next definition.

Making it all look pretty

Horizontal Rules

HTML provides the <HR> command to create a horizontal line across the page -
judicious use of <HR> to split a page into sections can aid readability.

Pictures

However, when it comes to attractive layout, a picture is worth a thousand
words, which is fine, except for the fact that pictures generally also require
a thousand times as many bytes to be transferred.

A picture can be included using an HTML command such as:

<img src=a_thousand_words.gif>

In this case, this tells Mosaic that there is a picture called
``a_thousand_words.gif'' on the remote server in the same directory (or folder)
that this page of HTML was found in.

A more complex example is:

<img src=http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif>

In this case, the image is specified with a complete URL, which tells Mosaic
exactly where to go to fetch the picture. Note that the data for the picture
does not need to reside on the same server as the document that it is embedded
in.

Also note that we've omitted the quotes from around this URL - although it's
not a bad idea to add them for the sake of clarity, or for URLs containing odd
characters such as spaces, they're not strictly necessary in most
circumstances.

Displaying Images - Launching Applications

In order for an image to be displayed in a page of a document, it must be in
one of a small number of formats. However, not all formats are displayable on
all browsers.

   * gif - a compressed 8 bit image format. Viewable on most browsers that
     support images.
   * xbm - X Bitmap - two color uncompressed format.

  1. Viewable on most browsers that support images. The background and
     foreground colors on the image are typically displayed in the background
     and foreground colors of your browser.

   * xpm - X Pixmap - multicolour X format. Not viewable on all browsers - some
     versions of MacMosaic can't view

  1. this for example. The background color is displayed in the background
     color of your browser, which enables the image to merge into your document
     nicely.

Although many other image formats are viewable using an external viewer
program, they are not necessarily viewable as embedded images on your browser.

Linking it all together

We gave an example above of an image that can be stored on a different server
from the text page that it is to be embedded in - this is an example of a
hyper-link. Hyper links are what turn the Web from a not terribly good text
formatting system to the tangled Web of information that make the World Wide
Web interesting. They're both the mechanism by which you find things, and the
way of tying multiple media or data from multiple sources together.

The example we gave above was for an embedded image, and will be downloaded
automatically (10). However, in most cases you only want the hyper link to be
followed when the user clicks on it.

An example is:

Pictures of <A HREF=http://www.cs.ucl.ac.uk/people/mhandley.html> Mark</A> and
<A HREF=http://www.cs.ucl.ac.uk/people/jon.html>Jon</A> are available for those
with a strong stomach.

This will be displayed as:

   * Pictures of Mark_and Jon_are available for those
   * with a strong stomach.

  1. If now click on Mark, or on Jon you will be presented with a glorious full
     color picture of one of the authors.
  2. The <A>..</A> in the text above denote an anchor - in other words some
     additional information that has been associated with the text. In the case
     the anchor has a hypertext reference denoted by the keyword HREF and the
     URL corresponding to that reference. Other information can also be
     associated with an anchor - see later.

Hotlists

Users can construct indexes by creating lists of URLs. Most client programs
allow people to do this easily. Many users then advertise these hotlists by
adding them to their own pages in their own web servers. Some sites keep
hotlists or bookmarks organized by subject or by research interest. Some sites
even let users submit new entries for their indexes. This allows navigation
(although it doesn't really help searching) in the Web. Each hotlist or list of
bookmarks represents another tour or view of the places of interest to the
author of that hotlist. As more and more sites and users construct such lists,
the density or value of referenced information increases.

More Pretty Pictures

A picture is worth a thousand words. Unfortunately this is an understatement,
and it is often actually more like the equivalent of 50,000 words, or 250
KBytes. Thus embedding large pictures in pages of text is usually not a good
idea. More typical is to include a small copy of the image in the document,
with a hyper link to the larger version of the image. An example would be:

<A HREF=big_ben.gif><IMG SRC=little_ben.gif></A>

  1. In this case, it is an image little_ben.gif that has been given an anchor
     with a hyperlink to big_ben.gif. Mosaic will display the small image
     embedded in the page of text, and will only retrieve and externally
     display the large image big_ben.gif if the user should click on the small
     image.
  2. Images such as the one described are called external images to distinguish
     them from embedded or inline images. Most WWW browsers use a separate
     viewer program to display external images. On UNIX systems, the most
     common external viewer program is XV. On Apple Mac's the external viewer
     is called JPEG View. On Windows PC's it is called LVIEW. Generally
     external viewer programs do not come bundled with the WWW browser, and
     you'll have to obtain one separately. Usually external viewers can display
     a larger range of images than the WWW browser itself can, though this is
     changing as WWW browsers become more sophisticated.

Links Within a Page

The hyper links we've shown so far all take you to the top of the page at the
end of the link. However, it's useful to be able to jump to specific places
within a page too. For instance, where a page is quite long, it is useful to be
able to have a summary of the page at the top, with hyper links directly to the
summarized sections. This can be done by associating names with anchors as
follows.

If this course was called example.html and we wanted make it available online,
we might put a list of contents at the top:

<UL>

...

<LI> <A HREF=example.html"#links>Go to Section 1</A>

<LI> <A HREF=example.html"#more_pics>More Pretty Pictures</A>

<LI> <A HREF=example.html"#page_links>Links Within a Page</A>

...

</UL>

...

...

<A NAME="page_links"><H2>Links Within a Page</H2></A> The hyper links we've
shown so far.....

Now if you click in the ``Links Within a Page'' entry in the contents list,
your browser will jump to the document with the partial URL
example.html#page_links. As we're already viewing the document called
example.html, it doesn't bother to fetch the page again, but merely jumps
directly to the anchor named page_links.

Pre-Formatted Text

Often you'll come across some text that you wish to put on a WWW server that is
pre-formatted plain text. You could of course go through the text and insert
all the necessary HTML formatting commands, but often all you want to do is
stop a WWW browser re-formatting it for you. HTML provides the command pair
<PRE>...</PRE> to delimit text you don't want re-formatting.

   * this text will
   * be reformatted
   * by the browser
   * <PRE>
   * and this text
   * will not be
   * reformatted
   * </PRE>

would look like:

   * this text will be reformatted by the browser

and this text

will not be

reformatted

Note that the preformatted text will be displayed in a fixed width typewriter
style font. Typewriter style fonts are fixed width - i.e. all that characters
are the same width. Book fonts and the default fonts used by WWW clients such
as Mosaic are variable width

You should avoid overuse of <PRE>, as it doesn't allow WWW browsers any leeway
in doing anything clever about line wrapping, and because typewriter style
fonts are pretty ugly.

A note on links

In the examples above, we've shown two forms of links - an absolute URL such as
is used in this image link:

<img src=http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif> and relative links
such as:

<img src=tower_bridge.gif>

If this relative link were in a page of HTML with the URL

http://www.cs.ucl.ac.uk/uk/london/index.html

then the client assumes that the protocol (http), the remote computer
(www.cs.ucl.ac.uk) and the directory (/uk/london) are all the same as those in
the page containing the link, and so it actually requests the data with the
absolute URL

http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif

Another possibility is to specify relative URLs with the full directory and
filename - the client knows that you mean this because the directory name
begins with a slash (``/''). For example, the relative link above could have
also been written:

<img src=/uk/london/tower_bridge.gif>

You can even use relative directory names using UNIX style relative pathnames.
For example, an HTML page with the URL http://www.cs.ucl.ac.uk/uk/intro.html
could use the following link the same picture of Tower Bridge:

   * <img src=london/tower_bridge.gif>
   * and an HTML page with the URL:
   * http://www.cs.ucl.ac.uk/uk/london/east_end/docks.html
   * could use a link such as:
   * <img src=../tower_bridge.html>
   * Note that the ``../'' here refers to the parent(13) directory
   * of the current directory in the directory tree.

A WWW Server listens on a TCP port(1) for incoming connections from clients. It
expects a connecting client to speak a protocol called HTTP or HyperText
Transfer Protocol. The connecting client is usually a browser such as Mosaic,
which will request some information from the server, and the server will then
return the requested information to the client(2) .

HTTP is a pretty simple protocol. If you want to see what actually happens, you
can telnet to a WWW server and talk to it yourself(3) . The simplest HTTP
request is GET. An example of telnetting to a server and issuing a GET request
is:

   * telnet> open macpb1.cs.ucl.ac.uk 80
   * Connected to macpb1.cs.ucl.ac.uk
   * Escape character is '^]'.
   * telnet> GET /index.html HTTP/1.0

HTTP/1.0 200 OK

MIME-Version: 1.0

Server: MacHTTP

Content-type: text/html

<title>Mark's Powerbook on the Web</title>

<h1>Welcome to Mark's WWW server</h1>

This temporary server is running on an Apple Macintosh

Powerbook 180 using MacHTTP 1.3.

There's not much here right now, except for the

<a href=Default.html>HTTP documentation</a>.

The request I made was ``GET /index.html'' and additionally

told the server I spoke ``HTTP/1.0''. The server responded with the document
index.html, and with additional information. The first line of the response
says that the server is also speaking ``HTTP/1.0'', that the status code my
request returned

was ``200'', which in human terms means ``OK''. The next line

gives information about the version of MIME. Then there's a line

that says what type of server this was. And finally there's

a line that says the ``Content-Type'' is ``text/html''. This

last line is actually giving the MIME content type, which is how the server
tells the client what to do with the information that follows. In this case it
says that what follows is actually ``text'' (as opposed to an image, video,
audio or a whole host of other possibilities), and that this particular text is
in ``html'' format. If we'd asked for this information using a WWW client
instead of telnet, the client would have read the Content-Type line, and known
to feed the data following into it's HTML interpreter.

MIME

MIME stands for Multipurpose Internet Mail Extensions, and was originally
designed for sending multimedia electronic mail. The two main things it does
are specify in a standard way what type of media the contents of a message
actually are, and what form they've been encoded in for transmission.

When Tim Berners-Lee was originally designing what would go on to become the
world wide web, he had exactly this same requirement - he needed a server to be
able to specify to a client what a response contained and how it had been
encoded. The email people had got there first, and had already specified MIME,
so there was no need to re-invent the wheel.

MIME Content Types

MIME Content Types consist of a type (such as ``text'') and a subtype (such as
``html''). The most common MIME types relevant to the WWW are:

``text'' Content-Type, which is used to represent textual

information in a number of character sets and formatted text description
languages in a standardized manner. The two most likely subtypes are:

   * text/plain - text with no special formatting require-ments.
   * text/html - text with embedded HTML commands

``application'' Content-Type, which is used to transmit application data or
binary data. Two frequently used subtypes are:

   * application/binary - the data is in some unknown binary format, such as
     the results of a file transfer.
   * application/postscript - the data is in the postscript language, and
     should be feed to a postscript interpreter.

``image'' Content-Type, for transmitting still image

(picture) data. There are many possible subtypes, but the

ones used most often on the web are:

   * image/gif - an image in the GIF format.
   * image/xbm - an image in the X Bitmap format.
   * image/jpeg - an image in the JPEG format.

``audio'' Content-Type, for transmitting audio or voice data.

   * audio/basic - the data consists of 8KHz 8 bit mu-law audio samples.

``video'' Content-Type, for transmitting video or moving image data, possibly
with audio as part of the composite video data format.

   * video/mpeg - the data is MPEG format video
   * video/quicktime - the data is QuickTime formet video

Suffixes, Servers and MIME types

Now we know how a server tells a client what type of information is being
returned, but how does the server figure out this information?

In the UNIX and DOS world, files are usually identified using file name
suffixes. A file called london_zoo.gif is likely to be an image in the GIF
format. Servers typically have a set of built in suffixes that they assume
denote particular content types. They also let you specify the content types of
your own suffixes in case you have any local oddities, or something new that
the server designer hadn't thought of.

URLs and Server File Systems

WWW servers generally reside on machines with a file system(4) . The server's
job is to make part of that file system publicly available by responding to
HTTP requests. Its job is also to prevent the private parts of that file system
from becoming public.Most file systems can be thought of as a form of tree, and
the URLs used in the World Wide Web also use this model. Thus the

URL: http://www.cs.ucl.ac.uk/misc/uk/london.html

specifies the file called london.html which is in a directory called uk, which
in turn is in a directory called misc. misc is also a directory, and it resides
in the top level directory of the tree, which is sometimes simply called ``/''
(pronounced ``slash'').

When the URL above specifies /misc/uk/london.html, this does not usually mean
that the misc directory is really situated in the root directory of the entire
file system. Instead it is situated in the root directory of the subtree that
the WWW server makes public. Any documents situated in this subtree are
accessible to the server, and directories that are not in thissubtree are not
accessible .

However, most servers also allow you to provide some form of access control to
files and subdirectories of the visible subtree. This protection can take the
form of restrictions on which machines or networks a client can access a file
from, or it may take the form of password protection. Which mechanisms a server
provides depend on which server you choose, and we'll discuss a few of the
better servers later.

Multiuser sites

Another issue is raised where a server is running on a

Machine in a large multi-user environment such as a university.For instance,
each student in a university can write files to their own fluster, but not
anywhere else. However, we'd like our students to be able to create their own
WWW pages, despite not having access to the WWW server's default public tree.
UNIX servers usually make available files placed in a special directory in the
user's home directory. On NCSA and

CERN servers, this directory is called ``public_html'' by default.

Thus accesses to the URL

http://www.euphoric-state-uni.edu/"janet/research/index.html

would map onto the file:

/usr/home/janet/public_html/research/index.html

in the filesystem.

Once we start to allow the WWW server access to areas of our filesystem which
can be modified by users that we don't necessarily trust, a whole set of
security issues are raised.

For instance, Unix allows symbolic links from one place in the directory tree
to another to give the impression that files or directories are someplace else
(Mac's call symbolic links ``Aliases''). Letting the server follow links can be
useful, but it also can create problems. Just because a file is readable by
other users on your own system does not necessarily mean it should be readable
by users in other sites or countries!

Server Scripts

The ability to define new programs to be run in the server when a request is
made that really makes the Web flexible and Fun. An example is an active map,
where a user clicks on a map, and the place they clicked is sent to the server
along with their request. The server then runs a program or script which
figures out where those coordinates apply to, and, depending on where the user
clicked, it sends them the relevant next page of information. Another example
is Cambridge University's coffee machine - they have a video camera pointed at
the coffee pot, and a server script captures a picture of it using a video
framegrabber, and sends the image to you so that you can see whether there's
any coffee ready.

A standard called CGI or Common Gateway Interface has emerged for the writing
of server scripts, and is supported by most servers. This means that scripts
written for one server should be easily ported to another server.

Available Servers

There are many WWW servers available, and more seem to be

released each month.

At the time of writing CERN's ``list of available servers''

(6) lists the following servers. We don't give the individual

URLs here, as some of them would become out of date too quickly - instead we
encourage you to look at CERN's list.

CERN HTTPD Version 3.0

The CERN HTTPD server is probably the most fully featured WWW server. It
supports much the same range of features as NCSA's server, with the addition of
acting as a caching proxy server.

If you have used a WWW client such as Mosaic, you have probablyalready used a
proxy client. Mosaic and other clients built upon LibWWW can contact servers
for protocols such as ftp and gopher, and then convert the output of such
servers into HTML for formatting and display on your screen.

Proxy servers take this one step further - instead of your client contacting
remote servers directly, your client makes an HTTP request to a proxy server.
The proxy server then contacts the relevant FTP or GOPHER server, and converts
the results to HTML, before transferring them back to your client .

Proxy Cache Servers

A proxy server can also make connections to remote HTTP

servers. At first glance, this wouldn't appear to benefit you,

as the proxy then performs no conversion functionality, but it provides a way
to provide network services to machines on a secure subnet without those
machines having to have direct access to the outside world. Thus secure sites
can run a proxy server on their firewall machine, or SOCKSify only their proxy
server without needing to modify the WWW client programs for all their
different architectures.

Even if you do not need this level of security, CERN's

HTTPD can also provide caching facilities for clients using the

server as a proxy. Caching facilities in the World Wide Web are currently in
their infancy, as many servers do not return expiry date information with
documents, so deciding how long data should be cached before going back to look
at the original is not a clear cut issue. However, CERN's server uses whatever
information is available to it to make a decision about cache timeouts, and
although it doesn't always do the right thing, it does substantially improve
performance for frequently accessed pages, and most of the time it gets it
right.

A Proxy Server on a Firewall


   * 3.2 CERN HTTPD Configuration

The CERN HTTPD requires a single configuration file to

function. By default, CERN HTTPD looks for this file as

``/etc/httpd.conf'', but it can be held elsewhere and the server told where it
is using the -r command line flag.The list of configuration options that CERN
HTTPD supports is very extensive, and we encourage you to read the document
CERN HTTPD Reference Manual. Most of the default options are fine to get you
started.

Enabling Security on the CERN server

The CERN HTTPD server has a fairly sophisticated set of security

features that can be enabled. Basically, they fall into three

categories:

Restricting hosts that can access areas of the server.

Restricting users that can access areas of the server

Restricting access to individual files

Common Gateway Interface (CGI)

Before CGI, each server passed the query information into a script in its own
way. Unfortunately this made it difficult to write gateways that would work on
more than one type of server, so a few of the server developers got together
and CGI was the result. Some servers don't yet support CGI, but most of the
popular ones now do.

Writing CGI scripts

CGI passes the information a script needs into the script in environment
variables. The most important two are:

   * QUERY_STRING

The server will put the part of the URL after the first

``?'' in QUERY_STRING

   * PATH_INFO

The server will put the part of the path name after the

script name in PATH_INFO

For instance, if we sent a request to the server with the URL:

http://www.cs.ucl.ac.uk/cgi-bin/htimage/usr/www/img/uk_map?404,451 and we had
cgi-bin configured as a scripts directory, then the server would run the script
called htimage. It would then pass the remaining path information
``/usr/www/img/uk_map'' to htimage in the PATH_INFO environment variable, and
pass ``404,451'' in the QUERY_STRING variable. In this case, htimage is a
script for implementing active maps supplied with the CERN HTTPD.

The server expects the script program to produce some output

on its standard output. It first expects to see a short MIME

Header, followed by a blank line, and then any other output the script wants
returned to the client. The MIME header must have one or more of the following
directives:

   * Content-Type: type/subtype

This specifies the form of any output that follows.

   * Location: URL

This specifies that the client should request the given URL rather than display
the output. This is a redirection. Some servers may allow the URL to be a short
URL specifying only the file name and path - in this case the server will
usually return the relevant file directly to the client, rather than sending a
redirection.

The short MIME header can optionally contain a number of other MIME header
fields, which will also be checked by the server which will add any missing
fields before passing the combined reply to the client.

Under some circumstances, the script may want to create the entire MIME header
itself. For instance, you may want to do this if you want to specify expiry
dates or status codes yourself, and don't need the server to parse your header
and insert any missing fields. In this case, both the CERN and NCSA servers
recognize scripts whose name begins ``nph-'' as having a ``no parse header'',
and will not modify the reply at all. Under these circumstances your script
will need access to extra information to be able to fill out all the header
fields correctly, and so this information is also available via CGI environment
variables.

Handling Active Maps

One nice feature which is now supported by most graphical WWW clients is the
ISMAP active map command, which can be associated with an HTML inline image.
This tells the WWW client to supply the x and y coordinates of the point the
user clicks on within the image.

For example, this HTML tells the client this image is an active map:

   * <a href=/cgi-bin/imagemap/uk_map>
   * <img src=uk_map_lbl.gif ISMAP>
   * </a>

When the user clicks on the map at, say, point (404,451), her client will
submit a GET request to the server:

GET /cgi-bin/imagemap/uk_map?404,451

For this to do anything interesting, the server must interpret
``/cgi-bin/imagemap/uk_map'' as something special - a command to be executed
rather than a file to be retrieved. How the server decides this is a command
depends on the type of server, but whichever server you run, the ``404,451''
part will then be passed to the command as parameters.

When the command is executed, it could generate output that is to be returned
directly to the client - for instance the command could generate HTML directly
as output. However the usual way imagemaps are used is to access other existing
pages of HTML using HTTP redirection. This is where the server first returns to
the client the URL of the place to look for the page corresponding to the place
they clicked on the map, and then the client goes and requests this new URL
(usually without bothering to ask the user).

Handling Forms

Forms are one way the World Wide Web allows users to submit

information to servers. All the mechanisms described so far

allow users to choose from a set of available options. Forms

let the user type information into their web browser and then get the server to
run a program with their submission as input. Examples of things you might type
are keys to search a database (e.g. what films was Zazu Pits in?).

Laying Out Forms

HTML provides a number of commands for telling the client to do something
special. The first command is FORM which tells the client that everything
between one <FORM> command and the next </FORM> terminator is part of the same
form. The form command can take a number of attributes:

ACTION=http://www.host.name/cgi-bin/query

This gives the URL of the script to run when the form is submitted. You must
supply an ACTION attribute with the FORM command.

   * HMETHOD=GET

This is the default method for submitting a form. The contents of the form will
be added to the end of the URL that is sent to the server.

   * METHOD=POST

The post method causes the information contained in the form to be sent to the
server in the body of the request.

   * ENCTYPE=application/x-www-form-urlencoded

This specifies how the information the user typed into the form should be
encoded. Currently only the default, ``application/x-www-form-urlencoded'', is
allowed.

If your server supports the POST method, it is advisable to use it, as if you
use the GET method, it's possible that long forms will be truncated when
they're passed from the server to the script.

The INPUT command

Now you have an empty form, you probably want to provide some boxes and buttons
that the user can set. These are created using the INPUT tag. This is used in a
similar way to the IMG tag for images - there's no need for a terminating tag
as it doesn't surround anything. There are several types of INPUT tag, denoted
by the TYPE attribute:

<INPUT TYPE=text NAME=users_name>

This is a simple text entry field that we've called

``users_name''. The user never sees this NAME attribute

displayed on her client - it is purely so we can keep track of which field is
which when we come to process the form.

Text entries also allow you to specify:

   * VALUE="enter your name here"

This lets you specify the default text to appear in the entry box.

   * SIZE=60,3

This lets you specify the size of the entry box is

characters. For example the above says the entry box

should be 60 characters wide and three lines high.

   * MAXLENGTH=8

This lets you specify the maximum number of characters you'll allow to be
entered in a single line text entry box. For instance, you might only allow a
user to enter eight characters as their user name.

<INPUT TYPE=password NAME=users_passwd>

This is also a text entry field, but the characters the user types are
displayed as stars so that other people can't read the password from their
screen. Password fields also support the VALUE, SIZE and MAXLENGTH attributes.

<INPUT TYPE=checkbox NAME=veggie>

This is a single button which is either on or off.

Checkboxes also support the following attributes:

VALUE="true"

This is the value to return if the checkbox is set to

``on''. If it's set to ``off'', no value is returned.

CHECKED

This says that the checkbox is ``on'' by default.

<INPUT TYPE=radio NAME=food_style VALUE=indian>

<INPUT TYPE=radio NAME=food_style VALUE=chinese>

These are a collection of buttons. Radio buttons with the

same name are grouped together so that selecting one of them turns the others
off like the channel tuning buttons on some radios. Radio buttons also support
the VALUE and CHECKED attributes, but only one radio button can be specified as
CHECKED.

<INPUT TYPE=submit VALUE="Press Me to Submit">

This is a button that submits the contents of the form to the server using the
method in the surrounding FORM. Submit buttons don't have a NAME attribute, but
you can specify the label for the button using a VALUE attribut

H<INPUT TYPE=reset VALUE="Press Me to Start Again">

This is a button that causes the various boxes and buttons in the form to reset
to their default values. Reset buttons also don't have a NAME attribute, and
allow a VALUE attribute to label the button.

The SELECT Command

If you want to provide the user with a long list of items to choose from, it's
not very natural to use radio buttons, so HTML provides another command -
SELECT. Unlike INPUT, this does have a closing </SELECT> tag. Each option
within the list is denoted using the <OPTION> tag. Options must be plain text -
no embedded HTML commands are allowed:

   * <SELECT NAME="food style">
   * <OPTION> Chinese
   * <OPTION> South Indian
   * <OPTION> North Indian
   * <OPTION> Greek
   * </SELECT>

SELECT must have a NAME attribute, and also allows the following
attributes:SIZE=3 This says how many of the options are visible at once, in
this case, three.

MULTIPLE This allows the user to select more than one item from the

list. The default is that only one item can be selected at

once.

OPTION tags can also have a SELECTED attribute which says that

this option is selected by default. If the SELECT command has

a MULTIPLE attribute, then several OPTION tags are allowed to be pre-selected
in this way.

   * The TEXTAREA Command

If you want to allow the user to enter a large amount of text,

you could use an <INPUT TYPE=text> tag, but HTML also provides

another command - TEXTAREA. TEXTAREA fields automatically have scroll bars on
Mosaic, and any amount of text can be entered into them.

TEXTAREA fields must have a NAME attribute, and also must have ROWS and COLS
attributes specifying how large the visible area of the TEXTAREA is in
characters. TEXTAREA fields, like SELECT fields must have a closing tag:

   * <TEXTAREA NAME="address" ROWS=4 COLS=60>
   * Any default contents go here
   * </TEXTAREA>

The default contents must be ordinary text with no HTML

formatting commands.

An Example Form

An example of form that demonstrates many of these features is:

<HEAD><TITLE>Pub List Feedback</TITLE></HEAD>

<BODY>

<H1>Pub List Feedback</H1>

Please use this form to let us know about any good pubs you come

across in London.

<FORM ACTION=http://www.cs.ucl.ac.uk/cgi-bin/pubform METHOD=POST>

<HR>

<B>Pub Name:</B>

<INPUT TYPE=text NAME=pubname SIZE=40>

<P>

<B>Pub Address:</B>

<INPUT TYPE=text NAME=pubaddress SIZE=40,4>

<P>

<B>Area of London:</B>

<SELECT NAME=area SIZE=4>

<OPTION SELECTED>Bloomsbury

<OPTION>Theatreland

<OPTION>The City

<OPTION>Kensington and Chelsea

<OPTION>Out of the Centre

<OPTION>Further afield

</SELECT>

<HR>

<TEXTAREA NAME=description ROWS=6 COLS=80>Describe your pub here!

H</TEXTAREA>

<p>

<INPUT TYPE=radio NAME=grade VALUE=1>Average.

<INPUT TYPE=radio NAME=grade VALUE=2 CHECKED>Worth going to.

<INPUT TYPE=radio NAME=grade VALUE=3>Worth a detour.

<INPUT TYPE=radio NAME=grade VALUE=4>Worth a long detour!

<HR>

<B>Your Name:</B>

<INPUT TYPE=text NAME=username SIZE=40>

<P>

<B>Your email address:</B>

<INPUT TYPE=text NAME=useremail SIZE=40>

<P>

<I>(You can leave these blank if you don't want to be credited)</I>

<hr>

<INPUT TYPE=submit VALUE="I've finished now">

<INPUT TYPE=reset VALUE="Ooops, can I start again?">

</FORM>

</BODY>

Submitting Forms to the Server

As we mentioned above, there are two ways you can submit a form to a server -
the GET method and the POST method. These methods refer to the type of request
the browser makes to the client.

GET is the normal way a browser requests a page or an image from a server - the
URL gives the location of the file we way or the name of the script that will
produce the data we want. We saw with imagemaps that we could also send a small
amount of additional data to a script in the URL by putting it at the end of
the URL after a question mark. You can also send all the information contained
in your form to the server in the same way. However, it's possible that some
servers may have size limits in the amount of data they can pass to the script
in this way.

POST is an entirely different method from those we've seen so

far. A POST request consists of a URL which refers to the script

we wish to run, and then a URL encoded(7) body which contains the

If your server supports the POST method, it is recommended to use it - using
the GET method may work, but if your forms are large, it is really stretching
the intended purpose of GET somewhat.

The cgiparse approach on the CERN server

The CERN HTTPD server comes with a useful script called cgiparse, which does
most of the hard work. It will work with either the GET or POST methods, though
again we recommend using POST for forms of any length.

cgiparse reads the QUERY_STRING environment variable (as set if you use the GET
method) or, if QUERY_STRING is not set, reads CONTENT_LENGTH characters from
standard input (as set if you use the POST method). What it does next depends
on which flag you give it, but for now we're only interested in the ``-form''
flag.

``cgiparse -form'' outputs a string which, when evaluated by the Bourne
shell(8) , sets environment variables (with ``FORM_'' prepended to their names)
for each of the form elements. It also URL decodes the variables for you.

Thus a script to do the same task as above (now written in Bourne shell!) would
be:

   * #!/bin/sh
   * eval `cgiparse -form`
   * $filename=$FORM_pubname
   * $doc_root="/cs/research/www/www/htdocs"
   * $fullfilename=$doc_root"/misc/uk/london/pubs/auto-"$filename".html"

#Write the entry to a file in HTML

echo "<TITLE>"$FORM_pubname"</TITLE>" > $fullfilename

echo "<H1>"$FORM_pubname"</H1><HR>" >> $fullfilename

echo "<I>"$FORM_pubaddress"</I><P>" >> $fullfilename

echo "<h2>Area:</h2> "$FORM_area""n" >> $fullfilename

echo -n "<h2>Description:</H2>" >> $fullfilename

.....and so on...

cgiparse can take many other command line options to modify

it's behavior, and can be used for tasks other than form

processing - we recommend the CERN httpd Reference Manual which

is available on the web for a more detailed explanation.

Server Performance

Whilst this section should give you enough information to know

what the main issues are with setting up and running a WWW

server, it is still possible to end up with a server that doesn't

perform too well. If your server is lightly loaded, then this

is probably not a problem, so ignore this section. However, if you provide
interesting information, your server may be inundated with requests, and you
may need to consider it's performance. If you're used to systems
administration, then the following tips may be obvious, but many servers are
run by people with little systems administration experience, and as this group
of people is increasing rapidly, we've added a few pointers:

How does your server start?

WWW servers running on Unix (or similar) systems have two ways to start -
either from inetd (the Internet Daemon) when a connection arrives or at system
boot time and the application listens continuously for incoming connections.
Starting from inetd is much slower than starting a single copy at system boot
time.

Where are the files you're serving actually stored?

Although you may consider you LAN to be fast and your WAN to be slow, if your
server has to get your files from a fileserver on a different machine, this
will load your LAN unnecessarily, and slow down your server's response time.

On some systems, it may be possible to use a caching

filesystem to do this if you can't put all the files on the

WWW server machine.

How does your server resolve it's UID and GID?

Both the CERN and NCSA servers let you start the server as

root (so it can use port 80), and then set the user ID and

group ID to a none privileged account. This is a very good

idea from a security point of view. However, if your server

used NIS (Network Information System) to obtain the user and

group IDs (particularly the supplementary groups), it will

Have to pull your user and group database files across your

LAN from the NIS server for every WWW access. It may be a

good idea to run a slave NIS server on the same machine as

the WWW server to improve this performance.

Does your server have enough memory?

It's pretty obvious, but if a machine is thrashing (the

active processes use more memory than is available, so

you're paging from disc continuously), then it will perform badly.

On systems such as the Macintosh, there's only one running server, but the
memory it uses increases as the number of simultaneous connections increases.
Check the server at busy times to ensure its memory allocation is sufficient.

How efficient are your CGI scripts?

If your server is seeing a lot of accesses to CGI scripts, those scripts may
dominate the performance of your server. Scripts that do a lot of database
searching, or use a lot of files from another fileserver may cause you
problems. If this appears to be a problem, consider running another WWW server
on the database machine itself, moving the files to the WWW server, or using a
caching filesystem to improve the situation.

How stable are your CGI scripts?

It's not uncommon for badly written CGI scripts to contain bugs that cause them
to loop indefinitely. Whilst you would probably notice this immediately with a
normal program, a WWW user not getting a response from a server is quite likely
to try again, and hit the same bug again!

Some servers can be configured to kill CGI scripts after

a certain amount of time. Check how long you thing

your scripts should take to run, and set this value

accordingly. However, if you do this, you may never

discover malfunctioning scripts unless you set them up write a log entry when
they're started and remove the log when they finish correctly.

Have you turned off RFC1633 authentication?

Some servers (such as CERN HTTPD) permit you to enable

RFC1633 authentication, which attempts to query the client

Machine about the username of the person making the request. This information
is unreliable at best, and if your server is busy, it is best to disable this
feature.

Who are your users?

You can analyze who your users are (or rather where they come from). If many of
your users come from one area that is not local to you, consider setting up
copies of your data elsewhere.

Have you considered setting up a caching server?

If your server is busy with requests from elsewhere, your

local users will get poor performance to the rest of the

world. Consider setting up a caching server (such as CERN

HTTPD), and configuring your local user's clients to use

it. This will help them, and will also reduce the load

your own users cause elsewhere. If you can't set up your

own caching server (because there's not one available for your hardware), then
consider using someone else's. Enquire whether your network provider is running
a caching server, and if they're not then encourage them to do so.

Most of this is common sense if you know a little about how the WWW server
actually works. If you have access to any network monitoring facilities, it can
be enlightening to look at how much local network access your server is doing
when responding to a request. If it's more than you expect, then the server is
probably using resources from another local server (such as a fileserver or NIS
server), and if it is excessive, this will be detrimental to performance.

Where else to look

All the popular WWW Client programs have self-contained help systems, which are
usually just hotlists or bookmarks that point to the distribution sites.

Another place to look if you prefer the printed page, is a forthcoming book by
the authors of this article, entitled "World Wide Web: Beneath the Surf", to
appear in March 199

And what of the Users?

ERGONOMICS

-------------------------------------------------------------------------------

The primary goal of multimedia should be to increase the communications
bandwidth between people, and between computers and people. Hence it is a very
weighty decision to embark on adding multimedia input and output, even before
considering the cost.

Ergonomics is the science of human perception and cognition. Hence it has a lot
to offer in this area.

User interface design

User interfaces for multimedia systems are troublesome. For example, the amount
of text on a screen, and the ways that humans interact with text through
reading and typing are quite limited, even when you add a mouse and a
scrollbar.

An interface that includes graphics, video and audio can convey an enourmous
amount more, but only if the content is so authored.

However, such an interface also has a much higher potential event rate than a
text based one. Events such as

   * Start and end of audio
   * Start and end of video
   * Change in characteristic of video (e.g. resize)
   * Change in membership of a conference

A lot of the principles from good Graphical User Interface design apply here.
The problem with video in multi-way conferencing systems is simply using up
screen "Real Estate". There is no real answer for thios one, except perhaps
using screen projection technology - this tends to be expensive, and also
changes the nature of a conference from personal to impersonal.

User Interface Design

   * Problems commonly encountered include:

  1. Clutter
  2. Excess Events
  3. Interference

These are illustrated in the next 3 pictures

[Image] Notification

[Image] Interference

Placement of peripherals

Analog audio is far the most difficult aspect of multimedia - getting mikes and
speakers setup so that levels are compatible, and that there is no feedback,
echo

and so on...

Video is relatively easy nowadays as most cameras have automatic iris setups.
Having said that, it can be problematic for a compression scheme if the picture
changes in focus or contrast, as this is indistinguisable from motion. Thus for
heavily compressed video, we usually disable autofocus and auto-irising, and
have then to get the lighting levels and distances right.

If possible, cameras should be placed so that users attention naturally causes
them to look approximately straight into the camera - very high end systems use
half-silvered mirrors in front of the display so that the camer can be
effectively 'behind' the screen.

Placement of Peripherals

   * Needs some care
   * audio for levels, avoiding feedback, echo etc
   * Video, lighting, POV, Eye Contact etc

EVALUATION

-------------------------------------------------------------------------------

Since the 60s, there has been a lot of hype about multimedia. In fact, it is
not at all clear how useful it is. Currently, the investment has not paid off
outside of the professional grpahics and enterntainment industries. This may
simply be that it is early days yet.

However, another aspect of multimedia is also that there is user resistence to
using it,m just as there was, and still is, user resistence to using computers
at all.

It is important to evaluate your systems usage and compare it to legacy systems
in your organisation.

Psychological models

There are a variety of psychological models available to allow analysis of use
of multimedia systems:

   * Behavourist
   * Cognitive Modelling
   * Kleinian
   * etc

Psychological models of Tasks - Cognition & perception

The following two pictures show the results of a task based analysis of a video
compression scheme commonly used

[Image] Tasks 1

[Image]

Measuring user perception/acceptance.

Do your users use the system to replace other technology, to do more, or to d
othe same work more producitively, or with better quality?

If the system entails communicaiton, do they feel that it enhances or decreases
privacy?

Is the system hard to learn? Does it ressemble your previous systems? Does it
have built in self description?

How can you evaluate if tasks are carried out better or worse?

Sociological effects.

Do your users collaborate more, or more effecitively? Do they spend more time
playing, and less working? Is cohesion amongst the, increased or decreased? Is
the identity of your organisation more focused? Do the users have more fun, and
do they want to learn more about the systems? Are they used for extra-curricula
tasks?