Distributed Multimedia
Jon Crowcroft, UCL CS
with non-trivial assistance from Mark Handley, Steve Hailes, Nermeen Ismail, Angela Sasse and Ian Wakeman

MULTIMEDIA - WHAT IS IT?
-------------------------------------------------------------------------------
Throughout the 1960s, 1970s and 1980s, computers were restricted to dealing with two main types of data - words and numbers: text and arithmetic processing, word processing, spreadsheets and so on. Codes for numbers (binary, BCD, fixed point, IEEE floating point) are fairly well standardized. Codes for text (ASCII, EBCDIC, but also fonts, Kanji, ppt, etc.) are also reasonably well understood. Higher-level "codes" - links, indexes, references, and so on - are the subject of standards such as HTML, HyTime and so forth.

Now computers, disks and networks are fast enough to process, store and transmit audio, video and computer-generated visualization material as well as text, graphics and data: hence the multimedia revolution.

One thing about multimedia cannot be overstated: it is big. Like space in The Hitchhiker's Guide to the Galaxy, it is much bigger than you can imagine. Of course, I am not talking about the hype here; I am talking about the storage, transmission and processing requirements!

MULTIMEDIA - WHAT IS IT?
-------------------------------------------------------------------------------
* Anything beyond letters and numbers, text and arithmetic
* Graphics, Still Photos, Audio, Video, Animation
* VR, Hypertext, Hypermedia
* Some has time structure for the user
* Some has non-linear sequences, or choice
* All costs more to create and to use
* If used well, has greater value than traditional "mono"-media

Multimedia Source Characteristics
-------------------------------------------------------------------------------
* Spatially self-similar
* Temporally self-similar
* Amenable to compression
* Large amounts of redundancy
* Similarity is structure - i.e. compression can be used to aid searching

Multimedia Access Patterns
-------------------------------------------------------------------------------
* Traditional data access patterns have strong temporal and spatial correlation
* i.e. if you look at the first page of a document, you will probably look at the rest
* Multimedia access is not necessarily like that
* Zapping, searching, rewinding etc.!
* Hyper links all contradict this model

"EVERY ENCODING IS A DECODING"
-------------------------------------------------------------------------------
The word "encoding" is often used as a noun as well as a verb when talking about multimedia. The first thing to understand about multimedia is the vast range of encodings currently in use or development. There are a variety of reasons for this. Codes for audio and video depend on the quality of audio or video required; a very simple example of this is the difference between digital audio for ISDN telephones (64 Kbps PCM, see later) and for CD (1.4 Mbps, 16-bit, oversampled, etc.). Another reason for the range of encodings is that some encodings include linkages to other media for reasons of synchronization (e.g. between voice and lips). Yet another reason is to provide future-proofing against any new media (holograms?). Finally, because of the range of performance of different computers, it may be necessary to have a "meta-protocol" to negotiate what is used between encoder and decoder.
This permits programs to encode a stream of media according to whatever is convenient to them, while a decoder can then decode it according to its capabilities. For example, some HDTV (High Definition Television) standards are actually a superset of current standard TV encoding, so that a "rougher" picture can be extracted by existing TV receivers from new HDTV transmissions (or from playing back new HDTV videotapes). This principle is quite general.

"EVERY ENCODING IS A DECODING"
-------------------------------------------------------------------------------
* Even numbers and letters have an encoding: ASCII and IEEE Floating Point
* Each new medium needs to be coded
* The codings now involve possible relationships between different media
* Compression and hierarchical encoding are also needed
* Meta-languages (codes for codings) are required
* First, let's look at some audio and video input forms and digital encodings.

ANALOG AND DIGITAL
-------------------------------------------------------------------------------
Digital audio and video all start life in the "analog domain". ("Domain" is used in this context just to mean before or after some particular conversion.) It is important to understand the basic requirements of the media in time and space. The analog domain is usually best understood in terms of the range of frequencies in use for a particular quality. For sound, this means how low and high a note/sound is allowed. For video, this translates into the number of distinguishable colors. For video, we also have to consider the frame rate. Video is similar to film in that it consists of a number of discrete frames. You may recall seeing old films which were shot at a lower frame rate than is used nowadays, in which flicker is visible.

Both sound and image can be broken down at any instant into a set of basic frequencies. This is the so-called "waveform". We can record all of the frequencies present at any one time, or we can choose to record only the "important" ones. If we choose to record less than all frequencies, we get less "fidelity" in our recording, so that the playback is less like the original. However, the less we record, the less tape/recording media we need.

ANALOG AND DIGITAL
-------------------------------------------------------------------------------
* Audio and Video start as waves
* Waves need to be sampled digitally
* We can do this "perfectly" by sampling twice as often digitally as the highest analog frequency
* Or we can take advantage of human frailty and reduce the quality

What we have to work with - Input and Output

ANALOG BANDWIDTH
-------------------------------------------------------------------------------
Analog audio is in the range 50 Hz to 20 kHz. Human speech is typically in the range 1-3 kHz, and the telephone networks have taken advantage of this since very early days by providing only limited quality lines. This has meant that they can use low quality speakers and microphones in the handset - the quality is similar to AM radio. The copper wires used for transmission were, luckily, over-engineered in most systems. They are capable of carrying a signal at up to 16 times the 'bandwidth' of that used by pure analog phones from the home to the exchange over a kilometer, and 300 times this bandwidth up to 100 meters. For the moment, though, the "last mile" or customer subscriber-loop circuits have boxes at the ends that limit this to what is guaranteed for ordinary audio telephony, while the rest of the frequencies are used for engineering work.
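To put some numbers on the sampling and bandwidth economics just described, here is a minimal sketch (mine, not from the original notes - the function names are purely illustrative). It quantizes a sine wave to n-bit PCM samples and prints the raw bit rates for telephone-style (8 kHz, 8-bit, mono) and CD-style (44.1 kHz, 16-bit, stereo) parameters, which come out at 64 Kbps and roughly 1.4 Mbps, the two figures quoted earlier.

    import math

    def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
        """Raw (uncompressed) PCM bit rate in bits per second."""
        return sample_rate_hz * bits_per_sample * channels

    def sample_sine(freq_hz, sample_rate_hz, bits, duration_s=0.001):
        """Crudely quantize a sine wave into 'bits'-bit unsigned samples."""
        levels = 2 ** bits
        count = int(sample_rate_hz * duration_s)
        return [int((math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) + 1) / 2
                    * (levels - 1)) for i in range(count)]

    if __name__ == "__main__":
        print("telephone:", pcm_bitrate(8000, 8, 1), "bit/s")    # 64000
        print("CD audio :", pcm_bitrate(44100, 16, 2), "bit/s")  # 1411200
        print("one millisecond of a 1 kHz tone:", sample_sine(1000, 8000, 8))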
ANALOG BANDWIDTH
-------------------------------------------------------------------------------
* Transmission and storage of analog audio (and video) is reasonably familiar to all
* Note though that we tolerate much lower quality audio transmission for the phone (3000 Hz) than for entertainment (30 kHz)
* This brings home the economics of bandwidth
* Only recently have transmission techniques got to the point where we might consider video down a telephone copper wire, and even then, only over a few hundred metres

TRANSFORMS
-------------------------------------------------------------------------------
An analog signal can be broken down into component signal frequencies. A mathematical theorem due to Fourier shows that there are in fact lots of ways of doing this, but that one particular basis - a set of frequencies made up of sine and cosine waves - is sufficient to represent just about any real waveform. There are others based just on cosines, and so on. If you transform a signal into base frequencies, then you can remove detail simply by removing high-frequency components. For audio, this results in a lower quality sound, where the tone of the notes may have less timbre. For video, this results in loss of fine granularity in a picture. The discrete cosine transform is based on this idea and is fundamental to many video compression schemes.

TRANSFORMS
-------------------------------------------------------------------------------
* Fourier showed you could represent any signal as a sum of a base set of frequencies at given strengths
* Leaving out components (zeroing coefficients for those frequencies) doesn't necessarily degrade the result much
* This is the basis of many compression schemes
* There are others - run length encoding and Huffman coding are two very simple ones.

DIGITAL SAMPLING
-------------------------------------------------------------------------------
You can take snapshots of a waveform as it changes in time, and represent what you see as a number (or set of numbers). The sequence of numbers is now something that a computer can store, process, transmit and receive. Such a sequence is really what we call "multimedia data".

DIGITAL SAMPLING
-------------------------------------------------------------------------------
* Snapshot of the input in time => sequence of values
* If the snapshot is sufficiently short, the range of values can be small
* Can be stored as a word or byte digitally
* Au fond, this is digital multimedia - just more bits and bytes!

AUDIO SAMPLING
-------------------------------------------------------------------------------
Analog sound is made by creating waves of compressed and rarefied air. When such waves in the audible frequency range (roughly 20 Hz to 20 kHz) hit the human ear, we hear notes with a particular timbre. By imposing other, complex modulations on sound, we can form all kinds of neat sounds like speech. Most speech is made up of sounds between 1 and 3 kHz. There is a simple law due to Shannon that tells us how often we need to sample, and hence how many bits we need to store or send per second, to represent such a wave - and if we sample the analog signal that often, we have the simplest possible representation of sound: this is Pulse Code Modulation (PCM). Other techniques are possible - we could actually store a snapshot of the frequencies present at every instant, and their strengths (i.e. do a spectrum analysis of the incoming signal!). Other things we might want to store about sound are positions (e.g.
stereo or quad image information for each source), and we might want some information about the resonance and reverberation of the room/space the sound was originally made in, so that we can reproduce this for people in a different space, relative to different listeners' positions, at playback time. This can all take quite a bit of data - the best standard for audio recording now in use, CD Digital Audio, takes 1.4 Mbps.

AUDIO SAMPLING
-------------------------------------------------------------------------------
* Audio quality ranges from a few Kbps to 1.4 Mbps (CD)
* Spatial information can be costly (stereo could require twice the bandwidth) but can in some cases be stored more simply
* Source room resonance and qualities are usually abandoned, but may prove important in the future (VR, games, telepresence, etc.)

COLOUR (OR COLOR)
-------------------------------------------------------------------------------
There are several approaches to color processing:
1. Full color
2. Pseudo color
3. Grayscale

Color is very complex. Basically, light is from a spectrum (a continuum), but we typically manipulate colors by manipulating discrete things like pens, or the colored dots of phosphor on a CRT, which emit light of a given intensity at a single color when hit by an electron of a given energy. There are several ways of mixing discrete colors to get a new color that has the right appearance to the human eye. The human eye does not perceive a spectrum, but rather perceives all colors as combinations of 3 so-called primary colors: Red (700 nm), Green (546 nm) and Blue (435 nm).

These primaries can be added to produce the secondaries: magenta, cyan and yellow. [The roles of primary and secondary are reversed in pigments, compared with light, since a dyemaker is concerned with which color is absorbed rather than which is transmitted.]

COLOUR (OR COLOR)
-------------------------------------------------------------------------------
* Colour is tricky stuff
* Most MM users use it too much
* In natural situations, it is very rich
* Human perception is not of a spectrum, but of approximately RGB
* Most cameras now work the same way
* Human mental perception is of a spectrum, though...

COLOR INPUT BY HUMANS
-------------------------------------------------------------------------------
The human eye can perceive a very wide range of colors compared with grayscales. It actually has different sensors for color than for monochrome. Color is detected by "cones", cells in the retina that distinguish a range of different signals, while black and white (monochrome) is dealt with by rods. Rods are actually sensitive to much lower light levels (intensity/power), and are particularly good at handling motion. Cones are specialized to higher light levels (hence color vision doesn't work in dim light, such as at dawn, dusk or twilight).

COLOR INPUT BY HUMANS
-------------------------------------------------------------------------------
* Eye/retina has Rods and Cones
* Rods see greys and motion
* Cones see color
* Respond to 3 wavelengths, and perceive a mix

COLOR INPUT BY COMPUTERS
-------------------------------------------------------------------------------
A color input device such as a video camera has a similar set of sensors to cones. These respond to different wavelengths with different strengths.
Essentially, a video camera is a digital device, based around an array of such sensors, and a clock that sweeps across them in the same way that the electron gun in the back of a TV or computer display is scanned back and forth, and up and down, to refresh the light emission from the dots on the screen. So, for a single still frame, a scan produces an array of reports of intensity, one element for each point in the back of the camera. For a system with 3 color sensor types, you get an array of triples - the intensity of light at each of the sensors, each value being a real number. This is then converted into an analog signal for normal analog recording. Some devices are emerging where the values can be input directly to a computer, rather than being converted to analog and then having to be converted back to digital by an expensive frame grabber or video card.

Given that the range of intensities the human eye can perceive isn't huge, they are usually stored digitally in a small number of bits - most usually 8 per color - hence a "true" color display has 24 bits, 8 bits each for R, G and B. RGB is the most commonly used computing color model. CMY is just [1] - [RGB], and vice versa; [0,0,0] is black, and [255,255,255] is white. (A small sketch of this packing and of the RGB/CMY complement appears after the video frames discussion below.)

COLOR INPUT BY COMPUTERS
-------------------------------------------------------------------------------
* Input is usually a 2D array of triples
* RGB = Red, Green, Blue
* YUV = Luminance (Y) plus two Chrominance components (U, V)
* Similar to HSV = Hue/Saturation/Value (or Intensity)
* CMY = Cyan, Magenta, Yellow

COLOR OUTPUT BY COMPUTERS AND OTHER DEVICES
-------------------------------------------------------------------------------
Image or video output is just the reverse of input. Thus an area of memory is set aside for the "framebuffer". Data written here will be read by the video controller, and used to control the signal to the display's electron gun - the intensity of each of the colors for the corresponding pixel. By changing what is in the framebuffer once per scan time, you get motion/animation etc. So to play back digital video from disk, you typically read it from disk to the framebuffer at the appropriate rate, and you have a digital VCR! "Video RAM" is not usually quite the same as other memory, since it is targeted at fast row-then-column scans rather than true random access.

COLOR OUTPUT BY COMPUTERS
-------------------------------------------------------------------------------
* Output to Framebuffer = VRAM, is n bits of each of RGB
* If n=8, "True Color"
* n < 8, can have color maps - values are indexes
* Color maps lead to flicker or false color
* n=1, monochrome
* Greyscale displays can be hi-quality

VIDEO FRAMES
-------------------------------------------------------------------------------
An image received by the retina in the eye persists for a short while. A sequence of images or frames, with small changes, that impinge on the eye sufficiently close together will give the illusion of a moving picture. How much of the picture changes between one image and the next affects how smooth or how jerky the movement will appear. Frame rates of 10 per second and above are enough to give a reasonably realistic rendition of natural scenes. In fact, the way that motion is perceived by the human brain means that less detail is required in fast-moving segments of a picture. [Interlacing is a scan technique used to try to get the persistence of the image higher without increasing the scan rate - basically, in each alternate frame time, odd or even lines are refreshed.]
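As promised above, here is a minimal sketch (mine, not from the notes; the function names are illustrative) of how a 24-bit "true colour" pixel is packed from its 8-bit R, G and B channels, and of the CMY complement described under COLOR INPUT BY COMPUTERS.

    def pack_rgb24(r, g, b):
        """Pack three 8-bit channels into one 24-bit 'true colour' pixel word."""
        return (r << 16) | (g << 8) | b

    def rgb_to_cmy(r, g, b):
        """CMY is just the complement of RGB on a 0..255 scale (and vice versa)."""
        return (255 - r, 255 - g, 255 - b)

    if __name__ == "__main__":
        print(hex(pack_rgb24(255, 255, 255)))  # 0xffffff: white
        print(rgb_to_cmy(0, 0, 0))             # (255, 255, 255): black needs full ink in CMY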
VIDEO FRAMES
-------------------------------------------------------------------------------
* Eye and screen have persistence - image lasts a while
* Screen is refreshed from framebuffer, can last "for ever"
* Frame rates > 10 fps generally look 'smooth'
* Frame rates > 20 fps capture fast motion
* Eye perceives motion with less resolution than still images

OTHER COLOR SCHEMES
-------------------------------------------------------------------------------
There are other ways of storing color. Rather than a set of discrete values that are "added" by the eye, the Hue/Saturation/Value (a.k.a. Hue/Saturation/Intensity) scheme stores three different values:
1. frequency (hue/true color)
2. saturation - the amount a color is "diluted" by all the other colors or white
3. intensity

This is useful, since we can process intensity separately. Conversion from RGB to HSV is pretty straightforward (a small sketch appears a little further down, after the discussion of hybrid analog video).

OTHER COLOR SCHEMES
-------------------------------------------------------------------------------
* Can store real values if input is from a spectrum analyzer
* E.g. HSV
* Hue = frequency
* Saturation = dilution
* Value = intensity

HYBRID ANALOG VIDEO SYSTEMS
-------------------------------------------------------------------------------
Early video on computers was (and still is, in some cases) provided by a hybrid approach. Basically, any computer with a bitmap display could have a dual port into the video controller: the signal used to drive the display is intercepted for portions of the scan of the CRT, and an external video signal used instead. This results in perfect video in a sub-area of the display. The only problem is that the video is at no stage digitised, and is therefore not amenable to capture and processing.

A later version of this trick (dare one say hack) is to digitise the external video, and write it into a dual-ported framebuffer (the video memory that the controller scans to update the display). However, if the video card was a replacement for the computer's standard video RAM, access to read the part of the framebuffer holding the video was often significantly slower than full video speed (so much so that even a single still-frame grab might not be feasible from the CPU).

Another hybrid approach to multimedia is where storage devices are used that have hybrid recording tracks - this (contrary to the hacks above) is, or was, genuinely useful. Where a high quality film or sound track might be put onto very high density mag tape, a separate index track might be put alongside it, digitally. This could be used by editing systems to create edit sequences, so that a mix-down of the analog track could be performed many times, without making any generational copies, until the editor is satisfied. Then the final actual mix-down from the master tape to a master copy (e.g. CD, or analog vinyl!) could be done automatically. If the density/quality of the master material is very high and precludes the use of compression (say due to lack of technology or money), this is a very useful technique.
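The RGB-to-HSV conversion mentioned under OTHER COLOR SCHEMES above really is straightforward; here is a minimal sketch (mine, not from the notes) using Python's standard colorsys module, with hue returned in degrees and saturation/value in the range 0..1.

    import colorsys

    def rgb_to_hsv(r, g, b):
        """r, g, b are 8-bit channel values (0..255)."""
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        return h * 360.0, s, v

    if __name__ == "__main__":
        print(rgb_to_hsv(255, 0, 0))      # pure red: hue 0, fully saturated, full value
        print(rgb_to_hsv(128, 128, 128))  # a grey: saturation 0, so hue carries no information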
HYBRID ANALOG VIDEO SYSTEMS
-------------------------------------------------------------------------------
* Hybrid systems combine analog with digital
* Often used to "dual port" a screen for external input
* Sometimes use dual-ported VRAM/Framebuffer
* Most useful for digital indexing of stored high quality analog material
* Can provide no-copy editing facilities

STILL IMAGES
-------------------------------------------------------------------------------
There are many, many bitmap and other image formats. Since many are functionally equivalent, this has led to a plethora of tools to convert betwixt and between them - GIF, TIF, WMF, PPM, PBM, etc. The main compressed still image form for quality multimedia is based on the JPEG standard, but since this is also used for video ("motion JPEG"), we discuss it below.

STILL IMAGES
-------------------------------------------------------------------------------
* Still digital image formats are many
* GIF is currently a proprietary, compressed form
* TIF, TIFF, PPM and PBM are all Public Domain
* WMF is commonly used
* JPEG is a standard, very good for photos/natural scenes, high quality compression, though lossy

INPUT MEDIA FORMATS
-------------------------------------------------------------------------------
There are two main audio encodings in common use in the world: CD (Compact Disc) and PCM (Pulse Code Modulation). PCM is from the telephony world, and is described with other audio encodings later. CD is from the entertainment business. Most common video encodings are based on those from the TV industry and standards world.

INPUT MEDIA FORMATS
-------------------------------------------------------------------------------
* Audio typically arrives as 64 Kbps PCM or 1.4 Mbps CD
* Video typically in Common Intermediate Format (CIF), but...
* Differs in aspect ratio (height x width) for:
* NTSC, SECAM, PAL, etc.

PAL/NTSC/SECAM
Before you can digitize a moving image, you need to know what the analog form is, in terms of resolution and frame rate. Unfortunately, there are 3 main standards in use. PAL is used in the UK, NTSC is used in the US and in a modified form in Japan, and SECAM is used in France and Russia. The differences are in number of lines, frame rate, scan order and so forth.
* PAL
* NTSC
* SECAM

PAL/NTSC/SECAM
* PAL used in UK
* NTSC in USA and Japan
* SECAM in France and Russia
* Differ in lines, frame rate, interlace order and so on

HDTV
High Definition TV has yet to make it into standards. One problem is that the technology has moved quite quickly, so although the Japanese and Americans were ready to roll with a double-resolution standard a few years back, no one would accept this, as they foresaw a short lifetime for an inferior technology.

HDTV
* No widespread standard yet
* Too high a data rate for current computer storage, processing or transmission
* Standard TV as a sub-sample would have been nice (DMAC etc.)

DATA COMPRESSION
-------------------------------------------------------------------------------
Devices that encode and decode, as well as compress and decompress, are called CODECs, or COder/DECoders. Sometimes these terms are used for audio, but mainly they are for video devices. A video CODEC can be anything from the simplest A2D device through to something that does picture pre-processing and even has network adapters built into it (i.e. a videophone!). A CODEC usually does most of its work in hardware, but there is no reason not to implement everything (except the A2D capture :-) in software on a reasonably fast processor.
The most expensive and complex component of a CODEC is the compression/decompression part. There are a number of international standards, as well as any number of proprietary compression techniques, for video.

DATA COMPRESSION
-------------------------------------------------------------------------------
* Data (files etc.) typically compressed using Huffman codes or run length encoding, or clever statistical rules such as Lempel-Ziv
* Audio and video are loss tolerant, so can use cleverer compression that discards some information
* Compression of 400 times is possible on video - useful given the base uncompressed data rate of a 25 fps CIF image is 140 Mbps
* A lot of standards for this now
* Some good proprietary techniques
* Note that lossy compression of video is not acceptable to some classes of user (e.g. radiologists or air traffic controllers).

Video compression

VIDEO COMPRESSION
-------------------------------------------------------------------------------
Video compression can take away the requirement for very high data rates and move video transmission and storage into a very similar regime to that for audio. In fact, in terms of tolerance for poor quality, it seems humans are better at adapting to poor visual information than to poor audio information. A simple-minded calculation shows:
    1024 x 1024 pixels
  x 3 bytes per pixel (24-bit RGB)
  x 25 frames per second
yields 75 Mbytes/second, or 600 Mbps - right on the limit of modern transmission capacity. Even in this age of deregulation and cheaper telecoms, and larger, faster disks, this is profligate. On the other hand, for a scene with a human face in it, as few as 64 pixels square and 10 frames per second might suffice for a meaningful image:
    64 x 64 pixels
  x 3 bytes per pixel (24-bit RGB)
  x 10 frames per second
yields 122 KBytes/second, or just under 1 Mbps - achievable on modern LANs and high-speed WANs, but still not friendly!

Notice that in the last simple example, we did two things to the picture:
1. We used less "space" for each frame by sending less "detail".
2. We sent frames less frequently, since little is moving.

This is a clue as to how to go about improving things. Basically, if there isn't much information to send, we avoid sending it. Spatial and temporal domain compression are both used in many of the standards.

VIDEO COMPRESSION
-------------------------------------------------------------------------------
    1024 x 1024 pixels
  x 3 bytes per pixel (24-bit RGB)
  x 25 frames per second
yields 75 Mbytes/second, or 600 Mbps!!!
* 1. We could use less "space" for each frame by sending less "detail".
* 2. We could send frames less frequently, since little is moving.

LOSSY VERSUS LOSSLESS COMPRESSION
-------------------------------------------------------------------------------
If a frame contains a lot of image that is the same, maybe we can encode this with fewer bits without losing any information (run length encode, use logically larger pixels, etc.). On the other hand, we can take advantage of other features of natural scenes to reduce the number of bits - for example, nature is very fractal, or self-similar: there are lots of features - sky, grass, lines on a face - that are repetitive at any level of detail. If we leave out some levels of detail, the eye (and the human visual cortex processing) end up being fooled a lot of the time.
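The run length encoding just mentioned is the simplest example of lossless compression: runs of identical values are replaced by (value, count) pairs, and decoding recovers the input exactly. A minimal sketch (mine, not from the notes) follows; it only wins when the data really does contain long runs, e.g. flat areas of an image.

    def rle_encode(pixels):
        """Collapse runs of identical values into [value, count] pairs."""
        runs = []
        for p in pixels:
            if runs and runs[-1][0] == p:
                runs[-1][1] += 1
            else:
                runs.append([p, 1])
        return runs

    def rle_decode(runs):
        """Expand [value, count] pairs back into the original sequence."""
        return [value for value, count in runs for _ in range(count)]

    if __name__ == "__main__":
        row = [0] * 20 + [255] * 5 + [0] * 20
        runs = rle_encode(row)
        print(runs)                      # [[0, 20], [255, 5], [0, 20]]
        assert rle_decode(runs) == row   # lossless: the input is recovered exactly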
LOSSY VERSUS LOSSLESS COMPRESSION
-------------------------------------------------------------------------------
* If an area of the input doesn't change, don't send it
* If an area of the input doesn't change much, don't send it
* If a moving area is detailed, could send a "fuzzy" version
* If a still area has detail, could send this slower than large features
* All depends on human frailty!

HIERARCHICAL CODING
-------------------------------------------------------------------------------
Hierarchical coding is based on the idea that the coding takes the form of a quality hierarchy, where the lowest layer of the hierarchy contains the minimum information for intelligibility and succeeding layers add increasing quality to the scheme. This compression mechanism is ideal for transmission over packet-switched networks, where the network resources are shared between many traffic streams and delays, losses and errors are expected. Each packet carries data from only one layer, so packets can be marked according to their importance to intelligibility for the end user. The network can use this information to decide which packets should be dropped or delayed and which should take priority. It should be noted that priority bits already exist in some protocols, such as the IP protocol.

Hierarchical coding is also ideal for dealing with multicast transmission over links with different bandwidths. With a non-hierarchical encoding scheme, either the whole multicast traffic adapts to the capabilities of the lowest-bandwidth link, degrading the video/audio quality where it could have been better, or the slow link suffers congestion and the sites behind it lose some of the intelligibility of their received video/audio. With hierarchical coding, lower-priority (enhancement) packets can be filtered out whenever a low-bandwidth link is encountered, preserving the intelligibility of the video/audio for the sites affected by these links while still delivering better quality to sites with higher bandwidth.

Schemes that are now in relatively commonplace use include H.261 for videotelephony, MPEG for digital TV and VCRs, and JPEG for still images. Most current standards are based on one simple technique, so first let's look at that.

HIERARCHICAL CODING
-------------------------------------------------------------------------------
* The last idea was that levels of detail can be sent at different rates or priorities
* Can be useful if there are different users (e.g. in a TV broadcast, or Internet multicast)
* Can be useful for deciding what to lose in the face of overload or lack of disk storage etc.
* Many of the video encodings (and still picture standards) are well suited to this.

JPEG
-------------------------------------------------------------------------------
The JPEG standard's goal has been to develop a method for continuous-tone image compression for both color and greyscale images. The standard defines four modes:
* Sequential: each image is encoded in a single left-to-right, top-to-bottom scan. This mode is the simplest and the most widely implemented, in both hardware and software.
* Progressive: the image is encoded in multiple scans. This is helpful for applications in which transmission time is long and the viewer prefers to watch the image build up in multiple coarse-to-clear passes.
* Lossless: the image is encoded to guarantee exact recovery of every source image sample value.
This is important for applications where any small loss of image data is significant; some medical applications do need this mode.
* Hierarchical: the image is encoded at multiple resolutions, so that low-resolution versions may be decoded without having to decode the higher-resolution versions. This mode is beneficial for transmission over packet-switched networks: only the data significant for a certain resolution, determined by the application, need be transmitted, allowing more applications to share the same network resources. In real-time transmission cases (e.g. an image pulled out of an information server and synchronized with a real-time video clip), a congested network can start dropping packets containing the highest-resolution data, resulting in degraded image quality instead of delay.

JPEG uses the Discrete Cosine Transform to compress spatial redundancy within an image in all of its modes apart from the lossless one, where a predictive method is used instead. As JPEG was essentially designed for the compression of still images, it makes no use of temporal redundancy, which is a very important element in most video compression schemes. Thus, despite the availability of real-time JPEG video compression hardware, its use for video will be quite limited due to its poorer quality.

JPEG
-------------------------------------------------------------------------------
* JPEG has 4 modes
1. Sequential - scanned left to right, top to bottom
2. Progressive - coarse to clear
3. Lossless
4. Hierarchical
* Uses the Discrete Cosine Transform to encode and compress blocks

H.261
-------------------------------------------------------------------------------
H.261 is the most widely used international video compression standard for video conferencing. The standard describes the video coding and decoding methods for the moving picture component of an audiovisual service at rates of p x 64 kbps, where p is in the range 1 to 30. The standard targets, and is really only suitable for, applications using circuit-switched networks as their transmission channels. This is understandable, as ISDN - with both basic and primary rate access - was the communication channel considered within the framework of the standard. H.261 is usually used in conjunction with other control and framing standards such as H.221, H.230, H.242 and H.320, of which more later.

H.261
-------------------------------------------------------------------------------
* ITU (was CCITT) standard for video telephony
* Very commonly implemented now in hardware and software
* Aimed at ISDN, anything from 64 Kbps to 2 Mbps
* PC cards to do video, audio and ISDN exist
* Used with other standards for communications and conference control.

H.261 SOURCE IMAGES FORMAT
The source coder operates only on non-interlaced pictures. Pictures are coded as a luminance component and two color difference components (Y, Cb, Cr). The Cb and Cr matrices are half the size of the Y matrix. H.261 supports two image resolutions: QCIF, which is 144x176 pixels, and, optionally, CIF, which is 288x352.

H.261 SOURCE IMAGES FORMAT
* [Image]
* The diagram shows the sampling of Chrominance and Luminance.
* H.261 supports two resolutions:
1. CIF = 288*352 pixels
2. QCIF = 144*176 pixels

H.261 SOURCE CODER
* The main elements in an H.261 encoder are:
1. Prediction
2. Block Transformation
3. Quantization

H.261 SOURCE CODER
[Image] Encoder

H.261 Prediction
H.261 defines two types of coding.
In INTRA coding, blocks of 8x8 pixels are encoded only with reference to themselves and are sent directly to the block transformation process. In INTER coding, on the other hand, frames are encoded with respect to another reference frame: a prediction error is calculated between a 16x16 pixel region (macroblock) and the corresponding (recovered) macroblock in the previous frame. The prediction errors of transmitted blocks (the criterion for transmission is not standardized) are then sent to the block transformation process.

H.261 Prediction
* Blocks are inter- or intra-coded
* Intra-coded blocks stand alone
* Inter-coded blocks are based on the predicted error between the previous frame and this one
* Intra-coded frames must be sent with a minimum frequency to avoid loss of synchronisation of sender and receiver.

H.261 Block Transformation
H.261 supports motion compensation in the encoder as an option. In motion compensation, a search area is constructed in the previous (recovered) frame to determine the best reference macroblock. Both the prediction error and the motion vectors specifying the value and direction of displacement between the encoded macroblock and the chosen reference are sent. The search area, as well as how to compute the motion vectors, is not subject to standardization; both horizontal and vertical components of the vectors must have integer values in the range -15 to +15, though. In block transformation, INTRA-coded frames as well as prediction errors are composed into 8x8 blocks. Each block is processed by a two-dimensional forward DCT (FDCT) function.

H.261 Block Transformation
* Each block (and prediction error) is an 8*8 pixel square
* It is coded as a forward discrete cosine transform
* If this sounds expensive, there are fast table-driven algorithms
* Can be done in s/w quite easily, as well as very easily in h/w

H.261 Quantization & Entropy Coding
The purpose of this step is to achieve further compression by representing the DCT coefficients with no greater precision than is necessary to achieve the required quality. The number of quantizers is 1 for the INTRA DC coefficient and 31 for all others. Entropy coding provides extra (lossless) compression by assigning shorter code-words to frequent events and longer code-words to less frequent events; Huffman coding is usually used to implement this step.

H.261 Quantization
* For a given quality, we can lose coefficients of the transform by using fewer bits than would be needed for all the values
* Leads to a "coarser" picture
* Can then entropy code the final set of values by using shorter words for the most common values and longer ones for rarer ones (like using 8 bits for three-letter words in English :-)

H.261 Multiplexing
The video multiplexer structures the compressed data into a hierarchical bitstream that can be universally interpreted. The hierarchy has four layers:
* Picture layer: corresponds to one video picture (frame)
* Group of blocks: corresponds to 1/12 of a CIF picture or 1/3 of a QCIF picture
* Macroblock: corresponds to 16x16 pixels of luminance and the two spatially corresponding 8x8 chrominance components
* Block: corresponds to 8x8 pixels

H.261 Multiplexing
* Bitstream made up of 4 things:
1. Pictures (a video frame)
2. Groups of Blocks (1/3 of a QCIF picture)
3. Macroblocks (16*16 luminance and two 8*8 chrominance components)
4. Blocks (8*8 pixels)

H.261 Error Correction Framing
An error correction framing structure is described in the H.261 standard. The frame structure is shown in the figure.
A BCH (511,493) parity code is used to protect the bit stream transmitted over ISDN; its use is optional at the decoder. The fill bit indicator allows data padding, thus ensuring transmission on every valid clock cycle.

H.261 Error Correction and Framing
* The framing structure for H.261 is H.221, which includes an FEC scheme, as shown in the 3 diagrams below.
[Image] H.261 FEC
[Image] H.221 Structure
[Image] H.221 Framing

H.261 Summary
Though H.261, as mentioned before, can be considered the most widely used video compression standard in the field of multimedia conferencing, it has its limitations as far as suitability for transmission over packet-switched data networks is concerned. H.261 does not map naturally onto hierarchical coding: a few suggestions have been made as to how this could be done, but there is no support for it in the standard. H.261 resolution is fine for conferencing applications; once more quality-critical video data needs to be compressed, the optional upper limit of CIF resolution can start to prove inadequate.

H.261 Summary
* H.261 is good for videotelephony and conferencing
* Currently mainly used over ISDN, but could be used over packet nets.
* Hierarchical use not part of the standard (yet)
* At 2 Mbps, it approximates to entertainment quality (VHS) video.

MPEG
-------------------------------------------------------------------------------
The aim of the MPEG-II video compression standard is to cater for the growing need for generic coding methods for moving images for various applications, such as digital storage and communication. So, unlike the H.261 standard, which was specifically designed for the compression of moving images for video conferencing systems at p x 64 kbps, MPEG addresses a wider scope of applications.

MPEG
* Aimed at storage as well as transmission
* Higher cost and quality than H.261
* Higher minimum bandwidth
* Decoder is just about implementable in software
* Target 2 Mbps to 8 Mbps really.
* The "CD" of Video?

MPEG SOURCE IMAGES FORMAT
The source pictures consist of three rectangular matrices of integers: a luminance matrix (Y) and two chrominance matrices (Cb and Cr). MPEG supports three formats:
* 4:2:0 format - the Cb and Cr matrices shall be one half the size of the Y matrix in both horizontal and vertical dimensions.
* 4:2:2 format - the Cb and Cr matrices shall be one half the size of the Y matrix in the horizontal dimension and the same size in the vertical dimension.
* 4:4:4 format - the Cb and Cr matrices shall be the same size as the Y matrix in both vertical and horizontal dimensions.

MPEG Source Images Format
* YUV sampling in three forms
* 4:2:0, 4:2:2, 4:4:4
* Looking at some video capture cards (e.g. Intel's PC one) it may be hard to convert to this
* But then this is targeted at digital video tape and video on demand really.

MPEG Frames
The output of the decoding process, for interlaced sequences, consists of a series of fields that are separated in time by a field period. The two fields of a frame may be coded independently (field pictures) or together as a frame (frame pictures).

MPEG Frames
The diagram shows the intra, predictive and bi-directional frames that MPEG supports:
[Image] MPEG

MPEG Source Coder
An MPEG source encoder consists of the following elements:
* Prediction (3 frame times)
* Block Transformation
* Quantization and Variable Length Encoding

MPEG Prediction
MPEG defines three types of pictures:
1. Intra pictures (I-pictures): these pictures are encoded only with respect to themselves.
Each picture is composed into blocks of 8x8 pixels, which are encoded only with respect to themselves and are sent directly to the block transformation process.
2. Predictive pictures (P-pictures): these are pictures encoded using motion-compensated prediction from a past I-picture or P-picture. A prediction error is calculated between a 16x16 pixel region (macroblock) in the current picture and the past reference I- or P-picture. A motion vector is also calculated to determine the value and direction of the prediction. For progressive sequences, and interlaced sequences with frame-coding, only one motion vector is calculated for P-pictures; for interlaced sequences with field-coding, two motion vectors are calculated. The prediction error is then composed into 8x8 pixel blocks and sent to the block transformation.
3. Bi-directional pictures (B-pictures): these are pictures encoded using motion-compensated predictions from a past and/or future I-picture or P-picture. A prediction error is calculated between a 16x16 pixel region in the current picture and the past as well as future reference I-picture or P-picture. Two motion vectors are calculated: one to determine the value and direction of the forward prediction, the other to determine the value and direction of the backward prediction. For field-coded pictures in interlaced sequences, four motion vectors are thus calculated. It must be noted that a B-picture can never itself be used as a reference for prediction. The method of calculating the motion vectors, as well as the search area for the best predictor, is left to be determined by the encoder.

MPEG Prediction
* I-pictures are encoded as intra-, w.r.t. themselves only
* P-pictures are coded w.r.t. the last I- or P-picture (including any motion compensation)
* B-pictures use forward and backward predictions to encode w.r.t. other I- or P-pictures

MPEG Block Transformation
In block transformation, INTRA-coded blocks as well as prediction errors are processed by a two-dimensional DCT function.
* Quantization: the purpose of this step is to achieve further compression by representing the DCT coefficients with no greater precision than is necessary to achieve the required quality.
* Variable length encoding: here extra (lossless) compression is achieved by assigning shorter code-words to frequent events and longer code-words to less frequent events; Huffman coding is usually used to implement this step.

MPEG Block Transformation
* As with H.261, frames are compressed using discrete cosine transforms
* These are (again) quantized and the resulting values Huffman coded
* There are, however, a few more things to MPEG

MPEG Multiplexing
The video multiplexer structures the compressed data into a hierarchical bitstream that can be universally interpreted. The hierarchy has the following layers:
* Video sequence: this is the highest syntactic structure of the coded bitstream. It can be looked at as a random access unit.
* Group of pictures: this is optional in MPEG II and corresponds to a series of pictures. The first picture in the coded bitstream has to be an I-picture. Groups of pictures assist random access; they can also be used at scene cuts or in other cases where motion compensation is ineffective. Applications requiring random access, fast-forward or fast-reverse playback may use relatively short groups of pictures.
* Picture: this corresponds to one picture in the video sequence.
For field pictures in interlaced sequences, the interlaced picture is represented by two separate pictures in the coded stream; they are encoded in the same order in which they occur at the output of the decoder.
* Slice: this corresponds to a group of macroblocks. The actual number of macroblocks within a slice is not subject to standardization. Slices do not have to cover the whole picture, but it is a requirement that if the picture is used subsequently for prediction, then predictions shall only be made from those regions of the picture that were enclosed in slices.
* Macroblock: a macroblock contains a section of the luminance component and the spatially corresponding chrominance components. A 4:2:0 macroblock consists of 6 blocks (4 Y, 1 Cb, 1 Cr); a 4:2:2 macroblock consists of 8 blocks (4 Y, 2 Cb, 2 Cr); a 4:4:4 macroblock consists of 12 blocks (4 Y, 4 Cb, 4 Cr).
* Block: corresponds to 8x8 pixels.

MPEG Multiplexing
The structure of the MPEG bitstream is a tad more complex than that of H.261:
* Video Sequence
* Group of Pictures
* Picture
* Slice
* Macroblock
* Block

MPEG Picture Order
It must be noted that in MPEG the order of the pictures in the coded stream is the order in which the decoder processes them; the reconstructed frames are not necessarily in the correct display order. The following example shows such a case:
* At the encoder input (display order):
    frame: 1 2 3 4 5 6 7 8 9 10 11 12 13
    type:  I B B P B B P B B I  B  B  P
* At the encoder output, in the coded bitstream, and at the decoder input:
    frame: 1 4 2 3 7 5 6 10 8 9 13 11 12
    type:  I P B B P B B I  B B P  B  B
* At the decoder output (display order):
    frame: 1 2 3 4 5 6 7 8 9 10 11 12 13

MPEG Picture Order
* The order of pictures at the decoder is not always the display order
* This leads to potential for delays in the encoder/decoder loop
* This is also true of H.261 - at its highest compression ratio, it may incur as much as 0.5 seconds' delay - not very pleasant for interactive use!

SCALEABLE EXTENSIONS
The scalability tools specified by MPEG II are designed to support applications beyond those supported by single-layer video. In scalable video coding, it is assumed that, given an encoded bitstream, decoders of various complexities can decode and display appropriate reproductions of the coded video. The basic scalability tools offered are data partitioning, SNR scalability, spatial scalability and temporal scalability. Combinations of these basic scalability tools are also supported and are referred to as hybrid scalability. In the case of basic scalability, two layers of video, referred to as the lower layer and the enhancement layer, are allowed, whereas in hybrid scalability up to three layers are supported.

MPEG Extensions
* Spatial scalable extension: this involves generating two spatial resolution video layers from a single video source, such that the lower layer is coded by itself to provide the basic spatial resolution and the enhancement layer employs the spatially interpolated lower layer and carries the full spatial resolution of the input video source.
* SNR scalable extension: this involves generating two video layers of the same spatial resolution but different video qualities from a single video source. The lower layer is coded by itself to provide the basic video quality and the enhancement layer is coded to enhance the lower layer; the enhancement layer, when added back to the lower layer, regenerates a higher quality reproduction of the input video.
* Temporal scalable extension:
This involves generating two video layers, where the lower one is coded by itself to provide the basic temporal rate and the enhancement layer is coded with temporal prediction with respect to the lower layer. These layers, when decoded and temporally multiplexed, yield the full temporal resolution of the video source.
* Data partitioning extension: this involves partitioning the video coded bitstream into two parts. One part carries the more critical parts of the bitstream, such as headers, motion vectors and DC coefficients; the other part carries less critical data, such as the higher DCT coefficients.
* Profiles and levels: profiles and levels provide a means of defining subsets of the syntax and semantics, and thereby the decoder capabilities required to decode a certain stream. A profile is a defined sub-set of the entire bitstream syntax defined by MPEG II. A level is a defined set of constraints imposed on parameters in the bit stream.

MPEG Extensions
* Can encode different levels of spatial or temporal quality
* Can partition the bitstream appropriately
* Can profile an MPEG encoder.

MPEG II Profiles
Five profiles are defined:
1. Simple
2. Main
3. SNR scalable
4. Spatially scalable
5. High
Along with four levels:
1. Low
2. Main
3. High 1440
4. High

MPEG Profiles
* Important to realize the specification is of the encoded stream
* Leaves lots of options open to the implementor
* Profiles allow us to scope these choices (as in other standards, e.g. in telecommunications)
* This is important, as the hard work (expensive end) is the encoder, while the stream, as specified, is generally easy to decode however it is implemented.
* For information, the diagram shows a comparison of the data rate out of an H.261 and an MPEG coder
[Image] H.261 v MPEG

MPEG II
MPEG II is now an ISO standard. Due to the forward and backward temporal compression used by MPEG, better compression and better quality can be produced. As MPEG does not limit the picture resolution, high resolution data can still be compressed using MPEG. The scaleable extensions defined by MPEG map neatly onto the hierarchical scheme explained earlier. The out-of-order processing which occurs on both the encoding and decoding sides can introduce considerable latencies; this is undesirable in video telephony and video conferencing. Hardware MPEG encoders are quite expensive at the moment, though this should change in the near future. The new SunVideo board (see below) does support MPEG I encoding, and software implementations of MPEG I decoders are already available.

MPEG II
* MPEG II now an ISO standard
* Slightly better than MPEG I
* CODECs very, very pricey right now
* Software for decoders exists (in the public domain) and performs reasonably well for small pictures.

MPEG III and IV
MPEG III was going to be a higher quality encoding for HDTV. It transpired after some studies that MPEG II at higher rates is pretty good, and so MPEG III has been dropped. MPEG IV is aimed at the opposite extreme - that of low bandwidth or low storage capacity environments (e.g. PDAs). It is based around model-based image coding schemes (i.e. knowing what is in the picture!). It is aimed at rates of up to 64 kbps.
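Both H.261 and MPEG rest on the same basic step described above: an 8x8 forward DCT whose coefficients are then coarsely quantized. The following is a minimal, deliberately naive sketch of that step (mine, not from the notes, and O(N^4) - real codecs use fast, table-driven, fixed-point transforms, so treat this only as an illustration). For a flat block, only the DC coefficient survives quantization, which is why uniform areas compress so well.

    import math

    N = 8  # block size used by JPEG, H.261 and MPEG

    def fdct_8x8(block):
        """Orthonormal forward 2-D DCT-II of an 8x8 block of pixel values."""
        def c(k):
            return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out = [[0.0] * N for _ in range(N)]
        for u in range(N):
            for v in range(N):
                s = 0.0
                for x in range(N):
                    for y in range(N):
                        s += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
                out[u][v] = c(u) * c(v) * s
        return out

    def quantize(coeffs, step=16):
        """Coarse uniform quantization: small high-frequency terms collapse to zero."""
        return [[round(value / step) for value in row] for row in coeffs]

    if __name__ == "__main__":
        flat = [[128] * N for _ in range(N)]          # a uniform grey block
        q = quantize(fdct_8x8(flat))
        print(q[0][0])                                # 64: the DC term
        print(sum(abs(v) for row in q for v in row))  # 64: nothing else survives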
MPEG III and IV
* MPEG III was going to be High Definition MPEG II
* Turns out MPEG II at higher rates is good enough
* MPEG IV is for lower rates, such as a few tens of kbps

SUBBAND CODING
-------------------------------------------------------------------------------
Subband coding is given here as an example of an encoding algorithm that maps neatly onto hierarchical coding. There are other examples of hierarchical encoding, but none of them is a standard or as widely used as international standards such as H.261 and MPEG. Subband coding is based on the fact that the low spatial frequency components of a picture carry most of the information within it. The picture can thus be divided into its spatial frequency components, and the coefficients describing each band quantized according to their importance, lower frequencies being more important. The most obvious mapping is to allocate each subband (frequency) to one of the hierarchy layers. If inter-frame coding is used, it has to be adjusted so as not to create any upward dependencies.

Subband Coding
* Layered or subband coding uses repeated application of the coder to different spatial frequencies in the picture
* Similar to the ideas in H.261 and MPEG, but applied more directly
* Have to take care with inter-frame coding interactions with a subband coding scheme (areas change in detail...)

DVI
-------------------------------------------------------------------------------
Intel's Digital Video Interactive compression scheme is based on the region encoding technique. Each picture is divided into regions, which are in turn split into subregions, and so on, until the regions can be mapped onto basic shapes that fit the required bandwidth and quality. The chosen shapes can be reproduced well at the decoder. The data sent is a description of the region tree and of the shapes at the leaves. This is an asymmetric coding, which requires a large amount of processing for encoding and less for decoding. DVI, though not a standard, started to play an important role in the market. Sun's prototype DIME board used DVI compression, and it was planned to incorporate it in the new generation of Sun VideoPix cards. This turned out not to happen: Intel canceled the development of the V3 DVI chips, and Sun's next generation of VideoPix, the SunVideo card, does not support DVI. The future of DVI is very much in doubt.

DVI
-------------------------------------------------------------------------------
* Region-based coding scheme
* Good compression
* No loss tolerance
* Chipset was developed by Intel
* Not popular anymore

CELLB COMPRESSION
-------------------------------------------------------------------------------
CellB image compression was introduced by Sun and is supported by its new SunVideo cards. CellB is based on the techniques of block truncation and vector quantization. In vector quantization, the picture is divided into blocks and the coefficients describing the blocks are used as vectors. As the vector space in which the block vectors exist is not evenly populated by the blocks, the vector space can be divided into subspaces selected to provide equal probability of a random vector falling in any of the subspaces. A prototype vector is then used to represent all blocks whose vectors fall into a given subspace. The most processor-intensive part of vector quantization is the generation of the codebook, that is, the division of the vector space into subspaces. A copy of the codebook is then sent to the other end.
The image is then divided into blocks, each of which is represented by the vector in the codebook that is closest to it, and only that label is sent. Decoding is done by looking up the labels in the codebook and using the corresponding vector to represent the block. CellB uses two fixed codebooks. It takes 3-band YUV images as input; the width and height must be divisible by 4. The video is broken into cells of 16 pixels each, arranged as a 4x4 group. The 16 pixels in a cell are represented by a 16-bit mask and two intensities or colors; these values specify which intensity to place at each of the pixel positions. The mask and intensities can be chosen to maintain certain statistics of the cell, or they can be chosen to reduce contouring in a manner similar to ordered dither. This method is called Block Truncation Coding. It takes advantage of the primitives already implemented in graphics accelerators to provide video decoding.

CELLB
-------------------------------------------------------------------------------
* Proprietary to Sun Microsystems
* Implemented on their video cards
* Good loss tolerance
* Based on vector quantization - see the diagram
[Image] VQ

QUICKTIME AND VIDEO FOR WINDOWS
-------------------------------------------------------------------------------
Apple and Microsoft have both defined standards for their respective systems to accommodate video. However, in both cases they are more concerned with defining a usable API, so that program developers can generate applications that interwork quickly and effectively. Thus, Video for Windows and QuickTime both specify the ways that video can be displayed and processed within the framework of the GUI systems on MS-Windows and Apple systems. However, neither specifies a particular video encoding. Rather, they assume that all kinds of encodings will be available, through hardware CODECs or through software, and thus they provide meta-systems that allow the programmer to name the encoding, and provide translations.

QUICKTIME & VIDEO FOR WINDOWS
-------------------------------------------------------------------------------
* Apple and Microsoft rely on hardware manufacturers for processors
* Neither specifies a particular video format
* Rather, they specify a framework for accommodating many video formats
* Also specify an API for manipulating and displaying video widgets

AUDIO Compression standards

THE CCITT AUDIO FAMILY
-------------------------------------------------------------------------------
The fundamental standard upon which all videoconferencing applications are based is G.711, which defines Pulse Code Modulation (PCM). In PCM, a sample representing the instantaneous amplitude of the input waveform is taken regularly, the recommended rate being 8000 samples/s (within 50 ppm). At this sampling rate, frequencies up to 3400-4000 Hz are encodable. Empirically, this has been demonstrated to be adequate for voice communication, and, indeed, even seems to provide a music quality acceptable in the noisy environment around computers (or perhaps my hearing is failing). The samples taken are assigned one of 2^12 values, the range being necessary in order to minimize the signal-to-noise ratio (SNR) at low volumes. These samples are then compressed to 8 bits using a logarithmic encoding according to either of two laws (A-law and mu-law). In telecommunications, A-law encoding tends to be more widely used in Europe, whilst mu-law predominates in the US. However, since most workstations originate outside Europe, the sound chips within them tend to obey mu-law.
In either case, the reason that a logarithmic compression technique is preferred to a linear one is that it more readily represents the way humans perceive audio. We are more sensitive to small changes at low volume than to the same changes at high volume; consequently, lower volumes are represented with greater accuracy than high volumes.
CCITT AUDIO FAMILY
-------------------------------------------------------------------------------
* Based on G.711, Pulse Code Modulation
* 8000 samples/second
* Samples assigned one of 2^12 values, then compressed to 8 bits using A-law or mu-law
ADPCM
-------------------------------------------------------------------------------
ADPCM (G.721) allows for the compression of PCM encoded input whose power varies with time. Feedback of a reconstructed version of the input signal is subtracted from the actual input signal, which is then quantised to give a 4 bit output value. This compression gives a 32 kbit/s output rate. This standard was recently extended in G.726, which replaces both G.721 and G.723, to allow conversion between 64 kbit/s PCM and 40, 32, 24, or 16 kbit/s channels. G.727 is an extension of G.726 and is used for embedded ADPCM on 40, 32, 24, or 16 kbit/s channels, with the specific intention of being used in packetised speech systems utilizing the Packetized Voice Protocol (PVP), defined in G.764. The encoding of higher quality speech (50Hz--7kHz) is covered in G.722 and G.725, and is achieved by utilizing sub-band ADPCM coding on two frequency sub-bands; the output rate is 64 kbit/s.
ADPCM
-------------------------------------------------------------------------------
* Adaptive Differential Pulse Code Modulation
* G.721 compresses to 32 Kbps; G.726 extends this down to 40, 32, 24 or 16 Kbps
* Can be good quality
LPC AND CELP
-------------------------------------------------------------------------------
LPC (Linear Predictive Coding) is used to compress audio at 16 Kbit/s and below. In this method the encoder fits speech to a simple, analytic model of the vocal tract. Only the parameters describing the best-fit model are transmitted to the decoder. An LPC decoder uses those parameters to generate synthetic speech that is usually very similar to the original. The result is intelligible but machine-like talking. CELP (Code Excited Linear Predictor) is quite similar to LPC. A CELP encoder does the same LPC modeling but then computes the errors between the original speech and the synthetic model, and transmits both the model parameters and a very compressed representation of the errors. The compressed representation is an index into a 'code book' shared between encoders and decoders. The result of CELP is much higher quality speech at a low data rate.
LPC AND CELP
-------------------------------------------------------------------------------
* Linear Predictive Coding
* Code Excited Linear Prediction
* Both achieve massive compression at expense of Dalek sounds
* Lossy schemes - only use if desperate!
MPEG AUDIO
-------------------------------------------------------------------------------
High quality audio compression is supported by MPEG. MPEG I defines sample rates of 48 KHz, 44.1 KHz and 32 KHz. MPEG II adds three other frequencies: 16 KHz, 22.05 KHz and 24 KHz. MPEG I allows for two audio channels, whereas MPEG II allows five audio channels plus an additional low frequency enhancement channel. MPEG defines three compression layers: Audio Layer I, II and III. Layer I is the simplest, a sub-band coder with a psycho-acoustic model. Layer II adds more advanced bit allocation techniques and greater accuracy.
Layer III adds a hybrid filterbank and non-uniform quantization. Layers I, II and III give increasing quality/compression ratios, with increasing complexity and demands on processing power.
MPEG AUDIO
-------------------------------------------------------------------------------
* High quality
* 32 KHz - 48 KHz
* Based on psycho-acoustic model
* Costly to encode (again!)
Video Conference CONTROL standards
H221
-------------------------------------------------------------------------------
H221 is the most important control standard when considered in the context of equipment designed for ISDN, especially current hardware video CODECs. It defines the frame structure for audiovisual services in one or multiple B or H0 channels, or a single H11 or H12 channel, at rates between 64 and 1920 Kbit/s. It allows the synchronization of multiple 64 or 384 Kbit/s connections, and dynamic control over the subdivision of a transmission channel of 64 to 1920 kbit/s into smaller subchannels suitable for voice, video, data and control signals. It is mainly designed for use within synchronized multiway multimedia connections, such as video conferencing. H221 was designed specifically for use over ISDN; a lot of problems arise when trying to transmit H221 frames over packet-switched data networks.
H.221
-------------------------------------------------------------------------------
* Used for framing H.261 video & audio
* Targeted at low delay (ISDN) scenarios
* Very cramped encoding
* Bad for software and packet switched nets
H242
-------------------------------------------------------------------------------
Due to the increasing number of applications utilizing narrow (3KHz) and wideband (7KHz) speech together with video and data at different rates, this standard recommends a scheme to allow a channel to accommodate speech and, optionally, video and/or data at several rates and in a number of different modes. Signaling procedures for establishing a compatible mode at call set-up, for switching between modes during a call, and for call transfer are defined in this standard. Each terminal transfers its capabilities to the remote terminal(s) at call set-up; the terminals then proceed to establish a common mode of operation. A terminal's capabilities consist of: audio capabilities, video capabilities, transfer rate capabilities, data capabilities, capabilities of terminals on restricted networks, and encryption and extension-BAS capabilities.
H.242
-------------------------------------------------------------------------------
* A multiplexing protocol for carrying several lots of narrow band speech and video
* Has a protocol for negotiation of capabilities between "terminals"
H230
-------------------------------------------------------------------------------
This standard is mainly concerned with control and indication signals that must be transmitted frame-synchronously or that require a rapid response. Four categories of control and indication signals have been defined: the first related to video, the second to audio, the third to maintenance, and the last to simple multipoint conference control (signals transmitted between terminals and MCUs).
H.230
-------------------------------------------------------------------------------
* Protocol for controlling the mixing and muxing of video and audio
* Aimed at simple multipoint extensions for point to point and ISDN videotelephony
* Used by Multi-point Control Units with H.261 for n-way conferencing
* More later...
H320
-------------------------------------------------------------------------------
H.320 covers the technical requirements for narrow-band visual telephone services defined in the H.200/AV.120-Series recommendations, where channel rates do not exceed 1920 kbit/s.
Communication modes of visual telephones (transfer rate and ISDN channel):
  mode   rate (kbit/s)   ISDN channel
  a            64         B
  b           128         2B
  c           192         3B
  d           256         4B
  e           320         5B
  f           384         6B
  g           384         H0
  h           768         2H0
  i          1152         3H0
  j          1536         4H0
  k          1536         H11
  m          1920         H12
Audio coding is Rec. G.711 in the narrow-band modes and Rec. G.722 where wideband audio applies; video coding, where applicable, is Rec. H.261.
Normative references
The following CCITT Recommendations and International Standards contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and Standards are subject to revision, and parties to agreements based on this Recommendation are encouraged to investigate the possibility of applying the most recent edition of the Recommendations and Standards listed below. Members of IEC and ISO maintain registers of currently valid International Standards. The CCITT Secretariat maintains a list of the currently valid CCITT Recommendations.
* CCITT Recommendation F.710 (19??), General Principles for Audiographic Conference Services
* CCITT Recommendation T.35 (1988), Procedure for the Allocation of CCITT Member Codes
* CCITT Recommendation T.50 (1988), International Alphabet No. 5
* ITU-T Recommendation T.120 (199x), Introduction to Audiographics and Audiovisual Conferencing
* ITU-T Recommendation T.121 (199x), Audiographic Conferencing - in development
* ITU-T Recommendation T.122 (1993), Multipoint Communications Service - Audiographic
* ITU-T Recommendation T.123 (1993), Protocol Stack - Audiographics and Audiovisual Teleconferencing Applications
* ITU-T Recommendation T.125 (1994), Multipoint Communications Service Protocol Specification
* CCITT Recommendation H.221, Frame Structure for a 64 to 1920 Kbps Channel in AudioVisual Teleservices
* CCITT Recommendation X.208 (1988), Specification of Abstract Syntax Notation One (ASN.1)
* CCITT Recommendation X.209 (1988), Specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1)
T.GCC
Within the context of the CCITT Audio-Visual Conferencing Service (AVCS), a conference refers to a group of geographically dispersed nodes that are joined together and that are capable of exchanging audiographic and audiovisual information across various communication networks. Participants taking part in a conference may have access to various types of media handling capabilities such as audio only (telephony), audio and data (audiographics), audio and video (audiovisual), and audio, video, and data (multimedia). The F, H, and T Series Recommendations provide a framework for the interworking of audio, video, and graphics terminals on a point-to-point basis through existing telecommunication networks.
They also provide the capability for three or more terminals in the same conference to be interconnected by means of an MCU. This Recommendation provides a high-level framework for conference management and control of audiographic and audiovisual terminals, and MCUs. It coexists with companion Recommendations T.122 and T.125 (MCS) and T.123 (AVPS) to provide a mechanism for conference establishment and control. T.GCC also provides access to certain MCS functions and primitives, including tokens for conference conductorship. T.GCC, T.122, T.123, and T.125 form the minimum set of Recommendations needed to develop a fully functional terminal or MCU.
This Recommendation includes the following generic conference control (GCC) functional components: conference establishment and termination, maintenance, the conference roster, managing the application roster, remote actuation, conference conductorship, bandwidth control, and application registry services. The service definitions for the primitives associated with these functional components are contained later, as are the corresponding protocol definitions.
The figure below shows an example of how GCC components are distributed throughout an MCS domain. The GCC components are shown in white. Each terminal or MCU contains a GCC Agent which provides GCC services to local Client Applications.
[Image] The Top GCC Server contains Application Registry information for the conference - Example of GCC components distributed throughout an MCS Domain
Each Node participating in a GCC conference consists of an MCS layer, a GCC layer and a Node Controller, and may also include one or more Client Applications. The relationship between these components within a single node is illustrated in the figure below.
[Image]
The Node Controller is the controlling entity at a node, dealing with the aspects of a conference which apply to the entire node. The Node Controller interacts with GCC, but may not interact directly with MCS. Client Applications also interact with GCC, and may or may not interact with MCS directly. The services provided by GCC to Client Applications are primarily to enable peer Client Applications to communicate directly, via MCS. Communication between Client Applications, or between Client Applications and the Node Controller, may take place, but is a local implementation matter not covered by this Recommendation. The practical distinction between the Node Controller and the Client Applications is also a local matter not covered by this Recommendation. The service primitives described in this Recommendation apply to the GCC Service Interface, as indicated at the Node User Interface. An example is illustrated below:
[Image] GCC Service
[Image] System model showing GCC Service Interface and relationship with MCS
Generic Conference Control Service
1. GCC abstract services - Conference Establishment and Termination:
o GCC-Conference-Join
o GCC-Conference-Query
o GCC-Conference-Create
o GCC-Conference-Add
o GCC-Conference-Invite
o GCC-Conference-Lock
o GCC-Conference-Unlock
o GCC-Conference-Disconnect
o GCC-Conference-Terminate
o GCC-Conference-Eject-User
o GCC-Conference-Transfer
o GCC-Conference-Time-Remaining
o GCC-Conference-Time-Inquire
o GCC-Conference-Extend
o GCC-Conference-Ping
2. The Conference Roster:
o GCC-Conference-Announce-Presence
o GCC-Conference-Roster-Inquire
3. The Application Roster:
o GCC-Application-Enrol
o GCC-Application-Attach
o GCC-Application-User-ID
o GCC-Application-Roster-Report
o GCC-Application-Roster-Inquire
4. Remote Actuation:
o GCC-Action-List-Announce
o GCC-Action-List-Inquire
o GCC-Action-Actuate
5. Conference Conductorship:
o GCC-Conductor-Assign
o GCC-Conductor-Release
o GCC-Conductor-Please
o GCC-Conductor-Give
o GCC-Conductor-Inquire
[Table: GCC-Conference-Query - types of primitives (request, indication, response, confirm) and their parameters]
[Image] Model of the MCS layer
Services provided by the MCS layer
The MCS protocol supports the services defined in ITU-T Rec. T.122. Information is transferred to and from the MCS as shown below.
Table 5 - MCS primitives and associated MCSPDUs
Domain Management:
  MCS-CONNECT-PROVIDER request        Connect-Initial
  MCS-CONNECT-PROVIDER indication     Connect-Initial
  MCS-CONNECT-PROVIDER response       Connect-Response
  MCS-CONNECT-PROVIDER confirm        Connect-Response
  (side effects)                      Connect-Additional, Connect-Result
  MCS-DISCONNECT-PROVIDER request     DPum
  MCS-DISCONNECT-PROVIDER indication  DPum, RJum
  MCS-ATTACH-USER request             AUrq
  MCS-ATTACH-USER confirm             AUcf
  MCS-DETACH-USER request             DUrq
  MCS-DETACH-USER indication          DUin
Channel Management:
  MCS-CHANNEL-JOIN request            CJrq
  MCS-CHANNEL-JOIN confirm            CJcf
  MCS-CHANNEL-LEAVE request           CLrq
  MCS-CHANNEL-LEAVE indication        -
  MCS-CHANNEL-CONVENE request         CCrq
  MCS-CHANNEL-CONVENE confirm         CCcf
  MCS-CHANNEL-DISBAND request         CDrq
  MCS-CHANNEL-DISBAND indication      CDin
  MCS-CHANNEL-ADMIT request           CArq
  MCS-CHANNEL-ADMIT indication        CAin
  MCS-CHANNEL-EXPEL request           CErq
  MCS-CHANNEL-EXPEL indication        CEin
Data Transfer:
  MCS-SEND-DATA request               SDrq
  MCS-SEND-DATA indication            SDin
  MCS-UNIFORM-SEND-DATA request       USrq
  MCS-UNIFORM-SEND-DATA indication    USin
Token Management:
  MCS-TOKEN-GRAB request              TGrq
  MCS-TOKEN-GRAB confirm              TGcf
  MCS-TOKEN-INHIBIT request           TIrq
  MCS-TOKEN-INHIBIT confirm           TIcf
  MCS-TOKEN-GIVE request              TVrq
  MCS-TOKEN-GIVE indication           TVin
  MCS-TOKEN-GIVE response             TVrs
  MCS-TOKEN-GIVE confirm              TVcf
  MCS-TOKEN-PLEASE request            TPrq
  MCS-TOKEN-PLEASE indication         TPin
  MCS-TOKEN-RELEASE request           TRrq
  MCS-TOKEN-RELEASE confirm           TRcf
  MCS-TOKEN-TEST request              TTrq
  MCS-TOKEN-TEST confirm              TTcf
Services assumed from the transport layer
The MCS protocol assumes the use of a subset of the connection-oriented transport service defined in CCITT Rec. X.214; information is transferred to and from a TS provider as in the table above.
MPEG SYSTEMS
-------------------------------------------------------------------------------
The MPEG Systems part is the control part of the MPEG standard. It addresses the combining of one or more streams of video and audio, as well as other data, into a single stream or multiple streams suitable for storage or transmission. The figure below shows a simplified view of the MPEG control system.
Packetised Elementary Stream (PES)
A PES stream consists of a continuous sequence of PES packets of one elementary stream. The PES packets include information regarding the elementary clock reference and the elementary stream rate. The PES stream is not defined for interchange and interoperability, though. Both fixed length and variable length PES packets are allowed.
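To make the idea of timestamped packetization concrete, here is a toy "PES-like" packetizer in Python. This is NOT the real MPEG PES syntax: the field names, widths and layout are invented purely to show an access unit being carried with a presentation timestamp expressed in 90 kHz clock ticks (the clock discussed in the Synchronization section below).

# Illustrative packetizer: 1-byte stream id, 8-byte PTS in 90 kHz ticks,
# 2-byte payload length, then the payload. Invented layout, not MPEG's.
import struct

CLOCK_HZ = 90_000

def make_packet(stream_id, pts_seconds, payload):
    """Pack a toy timestamped packet for one elementary-stream access unit."""
    pts_ticks = int(round(pts_seconds * CLOCK_HZ))
    header = struct.pack("!BQH", stream_id, pts_ticks, len(payload))
    return header + payload

def parse_packet(data):
    """Recover (stream id, presentation time in seconds, payload)."""
    stream_id, pts_ticks, length = struct.unpack("!BQH", data[:11])
    return stream_id, pts_ticks / CLOCK_HZ, data[11:11 + length]

pkt = make_packet(0xC0, 1.04, b"audio frame bytes")
print(parse_packet(pkt))   # (192, 1.04, b'audio frame bytes')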
MPEG SYSTEMS
-------------------------------------------------------------------------------
The diagram illustrates the components of the MPEG Systems module:
[Image] MPEG Sys
Transport and Program Streams
There are two data stream formats defined: the Transport Stream, which can carry multiple programs simultaneously and which is optimized for use in applications where data loss may be likely (e.g. transmission on a lossy network), and the Program Stream, which is optimized for multimedia applications, for performing systems processing in software, and for MPEG-1 compatibility.
Synchronization
The basic principle of MPEG System coding is the use of time stamps which specify the decoding and display time of audio and video, and the time of reception of the multiplexed coded data at the decoder, all in terms of a single 90kHz system clock. This method allows a great deal of flexibility in such areas as decoder design, the number of streams, multiplex packet lengths, video picture rates, audio sample rates, coded data rates, and digital storage medium or network performance. It also provides flexibility in selecting which entity is the master time base, while guaranteeing that synchronization and buffer management are maintained. Variable data rate operation is supported. A reference model of a decoder system is specified which provides limits for the ranges of parameters available to encoders and provides requirements for decoders.
Putting this on the Desktop, on the Internet
DESKTOP SYSTEMS MODEL
-------------------------------------------------------------------------------
So what happens when we want to put all of this onto our desktop system? There are impacts on the whole architecture: processor, bus, I/O, storage devices and so on. At the time of writing this course, even with the massive advances in processor and bus speed (e.g. Pentium and PCI), we are still right on the limits of what can be handled for video.
Desktop Systems Model - ISDN style
[Image] Conferencing
Desktop Performance Regime
However, with judicious optimisation of the implementation of some of the compression schemes described above, it is now possible to encode, compress and transmit a single CIF video stream at 25 frames per second on a workstation with about 50MIPS processing power. The key is to look at the DCT transforms and realize that large chunks of them can be done in table lookup form, at the expense of memory utilization (but then if you are using a lot of memory for video anyhow, this is not that significant).
Desktop Performance Regime
* H.261 or MPEG in software take a lot of CPU
* 50MIPS (fast 486) can code an H.261 stream
* 100MIPS (fast Pentium) can code a QCIF MPEG stream
* But (big but) 1/10th of this to decode/display
Encoding/Compression versus Decoding/Decompression
The most expensive part of the transform on the encoder/transmitter side is the frame differencing (differencing of the DCT coded blocks), since this involves a complete pass over the data (frame) every frame time (say 25 times per second over nearly a Megabyte). It turns out that this, and motion prediction if employed, are really I/O intensive rather than strictly CPU/instruction intensive, and are currently the main bottleneck. In the meantime, the receiver/decoder/decompression task is a lot easier, possibly as much as 10-25 times less work.
This is simply because if there is no change in the video image, no data arrives, and if there is a change, data arrives, so the only work is in the inverse DCT (or other transform) plus copying the data from the network to the framebuffer. Basically, a modest PC can sustain this task for several video streams simultaneously.
Encoding/Compression versus Decoding/Decompression
* Expensive part of compression is frame/block differencing
* DCT can (both forward and back) be done largely by table lookup
* Costs in memory
* Decoder has no frame differencing to do
* Irony - the less the scene changes, the less the decoder has to do (the encoder must scan every frame regardless)
NETWORKED SYSTEMS MODELS
-------------------------------------------------------------------------------
When we want to network our audio and video, again we are up against the limits of what can be done under software control now. There are implications for source, link, switch and sink processing in terms of throughput, although for compressed video, most modest machines are now pretty capable of what's required. But in terms of reconstructing the timing of a multimedia stream, there are a few tricky problems. These can be solved, as we'll see later, but there are basically two approaches:
1. Use a synchronous circuit switched network (e.g. ISDN or a leased line).
2. Use a packet network, but put in adaption to delay and loss (perhaps through redundancy in the encoding, or retransmission, or interpolation or extrapolation of a signal at the receiver).
We will compare these approaches below.
NETWORK SYSTEMS MODELS
-------------------------------------------------------------------------------
Here we illustrate the two basic approaches - use of a constant bit rate CODEC and circuit based network:
[Image] CODEC Usage
And use of software and packetizers and a Packet Switched network:
[Image] Packet Switch Conferencing
HARDWARE
-------------------------------------------------------------------------------
There is no doubt that special purpose hardware is needed for some multimedia tasks. The sheer volume of data that must be dealt with, and the CPU intensive nature of much audio and video processing, mean that some special purpose devices are needed. Some of these are purely in the digital domain, some sit between the analog and the digital, and others are most cost effective in the analog realm.
Digital Signal Processors and Graphics Co-processors
DSPs are specially designed chips that are basically miniature vector processors, good at the set of tasks that audio and video signal processing involve - typically a repetitive sequence of instructions carried out over an array of data, e.g. a fast Fourier or other transform, matrix multiplication (to rotate or carry out other POV transforms), or even rendering a scene with a given light source.
DSPs and other Co-processors
* The serious graphics house will have these anyhow
* Can help a lot with basics of video
* Worth noting that a lot of video processing is similar to the compression task
* Audio is less worth concerning oneself with special hardware
* except if very heavy compression required
Video/audio CODEC operation
Coder/Decoder cards in workstations vary enormously in their interface to the a/v world, as well as in their interface to the computer. Some CODECs do on-card compression, some don't. Some replace a framebuffer, while others expect the CPU to copy video data to the framebuffer (or network). Some include a network interface (e.g. an ISDN card in PC video cards).
Some include audio with the ISDN network interface (the chipsets are often related or the same). Most that carry out some extra function like this are good for their allotted task, but poor as general purpose video or audio i/o devices. Nowadays, most UNIX and Apple workstations have good audio i/o, at least at 64kbps PCM, and sometimes even at 1.4 Mbps CD quality. Most PC cards are still poor (e.g. the SoundBlaster card is half duplex - not much use for interactive PC based network telephony).
CODEC Operation
* Video and audio devices vary a lot
* Some have onboard compression
* Some even have on board ISDN
* The more on the card, the less flexibility
* The more on the card, the less CPU burden
Frame Grabbers
There are low price framegrabbers available, which often operate as low frame rate video cards.
Mixers, Multiplexors
It is often useful to be able to choose or mix audio (or video) input to a framegrabber or CODEC. However, by far the cheapest and most effective way to do this is with an analogue mixer: to mix n digital streams requires n codecs. Sometimes, within a building, one wishes to carry multiple streams (even a mix of analog and digital) between different points. Again, appropriate broadband multiplexors may be cheaper than going to the digital domain and using general purpose networking - the current cost of the bandwidth you need is still quite high. If you want 4 pictures on a screen, an analog video multiplexor is an inexpensive way of achieving this, although this is the sort of transformation that might be feasible digitally very soon for reasonable cost.
Mixers, Multiplexors
* Software mixing of video is a way off yet at any reasonable price
* Even captioning video in s/w is tricky
* Use analog devices for this - they are cheap and effective
* and available
* Future work will result eventually in good transform domain video processing
Mikes, Cameras
Currently, most mikes and cameras are pure analog. Mikes are inexpensive, and audio codecs are becoming commonplace in any case. But cameras could easily be constructed that are pure digital, simply by extracting the signal from the scan across the CCD area in a video camera. There are a couple of such devices coming on to the market this year.
Digital Mikes & Cameras
* Are starting to appear
* Still Cameras already around
* Digital Video camera should be cheaper!
* Mikes will cost more though
* May make automatic calibration a lot easier
Echo Cancellation
Interactive audio is nigh on impossible if a user can hear their own voice more than a few tens of milliseconds after they speak. Thus if you are speaking to someone over a long haul net, and your voice traverses it, turns around at the far end, and comes back, then you may have this problem. In fact, echo cancellors can be obtained which sit between the audio output and input, and sense the delay in the room between the output signal on the speakers and the input on a mike. If they then introduce the same signal, but with its phase reversed and with that delay, to the input, then the echo is (largely) canceled. Unfortunately, it isn't quite that simple!! The signal arriving at the speaker is transformed by the room, and may not be easily recognized as the same as that picked up by the mike. However, this might not matter if a calibration signal can be used to set up the delay line. A naive sketch of the basic idea follows.
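Here is a minimal sketch, in Python, of the cancellation idea just described: subtract a delayed, attenuated copy of the loudspeaker signal from the microphone signal. The delay and gain are assumed to come from a calibration step; real cancellers use adaptive filters, because the room response is far more complicated than a single delayed copy.

# Naive echo canceller: mic_out[n] = mic[n] - gain * speaker[n - delay]
def cancel_echo(mic, speaker, delay_samples, gain):
    """Return the mic signal with an estimate of the speaker echo removed."""
    out = []
    for n, m in enumerate(mic):
        echo_estimate = gain * speaker[n - delay_samples] if n >= delay_samples else 0.0
        out.append(m - echo_estimate)
    return out

# Synthetic check: the mic hears the near-end speech plus a delayed echo
speaker = [0.0, 1.0, 0.0, -1.0, 0.0, 0.5]
near_end = [0.1] * 6
mic = [near_end[n] + (0.6 * speaker[n - 2] if n >= 2 else 0.0) for n in range(6)]
print(cancel_echo(mic, speaker, delay_samples=2, gain=0.6))  # ~[0.1, 0.1, ...]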
Failing this, many systems fall back on a conference control technique, using either a master floor control person who determines who may speak when (see below), or a simple manual "click to talk" interface which disables the speakers in the user's room.
Echo Cancellors
* A requirement if you want to avoid using headsets or click-to-talk
* Analog devices limited in range (echo delay)
* digital echo cancellors not widely integrated into voice capture systems yet
* Generally a painful area!
Multimedia conferencing
CONFERENCING MODELS - CENTRALISED, DISTRIBUTED, ETC.
-------------------------------------------------------------------------------
There are two fundamentally different approaches to video teleconferencing and multimedia conferencing that spring from two fundamentally different philosophies:
1. The Public Network Operators and ITU model of circuit based, resource reservation videoconferencing, with its attendant complexity for multisite operations.
2. The Internet and packet switched adaptive approach, using multicast (many-to-many packet distribution) facilities to achieve multisite operations.
An overview of the Internet Based Approach
Conferencing models - centralised and distributed
* PNO/ITU approach is circuit based
* Resource reservation, and expensive
* Internet approach is packet based
* Unreliable, but cheap
* There are emerging middle ways...
ITU Model H.320/T.gcc
This is based around the starting point of person to person video telephony, across the POTS (Plain Old Telephone System) or its digital successor, ISDN (Integrated Services Digital Network). The Public Network Operators (PNOs, or telcos or PTTs) have a network already, and it's based on a circuit model - you place a call using a signaling protocol with several stages: call request, call indication, call proceeding, call complete and so on. Once the call has been made, the resources are in place for the duration of the call. You are guaranteed (through expensive engineering, and you pay!) that your bits will get to the destination with:
1. Constant Rate
2. Constant Delay (plus or minus a few bit times in a million)
To achieve this, the telco has a complex arrangement of global clocks and an over-resourced backbone network. To match video traffic to such a service, the output from a video compression algorithm has to be padded out to a constant bit rate (i.e. it's constant rate, not constant quality). The assumption is that you have a special purpose box that you plug cameras and mikes into (a CODEC), and it plugs into the phone or ISDN line or leased line, and you conference with your equivalent at the far end of the call. How is multisite conferencing achieved?
ITU Model
* Based on ISDN or leased lines
* Constant Rate
* Video padded out to fit
* Access for Video "terminals"
* Access from computers inconvenient
Multisite Circuit Based Conferencing - MCUs
There are two ways you could set up a multisite conference:
1. Have multiple CODECs at each site, and multiple circuits, one from each site to all the others. This would involve n*(n-1) circuits in all, and a CODEC at each site for each of the other sites, to decode the incoming video and audio.
2. Use a special purpose Multi-point Control Unit, which mixes audio signals and chooses which video signal from which site is propagated to all the others.
With this latter approach, each site has a single CODEC, and makes a call to the MCU site. The MCU has a limit on the number of inbound calls that it can take, and in any case needs at least n circuits, one per site. A quick back-of-the-envelope comparison follows.
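To make the trade-off concrete, a tiny worked calculation (using the n*(n-1) full-mesh figure from the text):

# Cost of multisite circuit conferencing: full mesh versus a single MCU.
# A full mesh needs a circuit between every ordered pair of sites; an MCU
# needs just one circuit (and one CODEC) per site.

def full_mesh_circuits(n):
    return n * (n - 1)

def mcu_circuits(n):
    return n

for n in (3, 4, 6, 10):
    print(n, "sites:", full_mesh_circuits(n), "mesh circuits vs", mcu_circuits(n), "via MCU")
# e.g. 10 sites: 90 circuits for a mesh, only 10 into an MCU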
Typically, MCUs operate 4-6 CODECs/calls. To build a conference with more than this many sites, you have multiple MCUs, and there is a protocol between the MCUs so that one can build a hierarchy of them (a tree). Which site's video is seen at all the others (remember it can be only one, as CODECs for circuit based video can only decode one signal) is chosen through floor control, which may be based on who is speaking or on a chairman approach (human intervention).
Multisite Circuit Based Conferencing
The diagram illustrates the use of an MCU to link up 3 sites for a circuit based conference:
[Image] H/W
Should more than basic rate ISDN be needed, multiple channels can be combined via a BONDing box as shown:
[Image] Bonding
Multicast Packet Based Multisite Conferencing
In a packet switched network, all is very different from the ITU model. Firstly, on a Local Area Network (LAN), a packet sent can be received by multiple machines (multicast) at no additional cost. Secondly, as we pointed out earlier when looking at the performance of compression algorithms, it is possible for a machine of the same power to decode many more streams than it encodes. Hence we can send video and audio from each site to all the others whenever we like. A receiver can select which (possibly several or all) of the senders to view. Thirdly, an audio compression algorithm may well use silence detection and suppression. This can be used for rate adaption (as we will see later), but primarily it means that, since usually only one person is speaking at any one time, the network utilization for audio is hardly any greater even if every site sends all the time. In many uses of such systems with a lot of participants, it is common that many of them send only audio (e.g. a class, a seminar), so this can work very well.
Multicast Packet based multisite conferencing
* A packet net might support multicast - we look at this later
* Then a more general interconnection strategy can be used
* The architecture might look a bit like that in the figure:
[Image] Protocols and so on
Internet Based Multimedia Conferencing
There is one remaining non-trivial difference between circuit based networks and packet based networks currently, and that concerns resource reservation:
1. Most packet switched networks have no guarantees of throughput or delay.
2. Many packet switched networks (notably the Internet) have relatively high losses when they are busy.
Provided that a network is not actually overcommitted, this is not necessarily a problem. We can still run packet based video and audio over the Internet quite easily. The key observations are:
1. Compressed audio and video are not naturally fixed rate.
2. Users may have a minimum acceptable quality (which may be very low), and above that may be happy to have free extra quality when available. Adapting compression schemes to the available bandwidth is close to trivial.
3. Adapting to delay and loss with a compressed image or sound is not very compute intensive.
Internet Based Multimedia Conferencing
* Jitter will need dealing with - the figure illustrates this:
[Image] Jitter
If the overall use at minimum quality exceeds the capacity of the network, then this 'best effort' approach will not work. But within this constraint it works just fine. Even as the delay goes up, the sources and sinks adapt (as we'll see later) and the system proceeds correctly.
At a certain point, either the throughput will fall below that which can sustain tolerable quality audio and/or video, or else the delay will become too high for interactive applications (or both!). At this stage, we would need some scheme for establishing who has priority to use the network, and this would then be based on resource reservation and, potentially, on charging.
FLOOR CONTROL
-------------------------------------------------------------------------------
Floor control is the business of deciding who is allowed to talk when. We are all familiar with this in the context of meetings or natural face-to-face scenarios. People use all kinds of cues, some subtle and some less so, to decide when they or someone else can talk. In a video conference, the view of the other participants is often limited (or non-existent), so computer support for floor control is necessary (just think of talking to someone you don't know, maybe over a poor satellite phone call with a half-second delay, and you get the idea - then add 5 other people on the same line!). Floor control systems can be nearly automatic, triggered simply by who speaks, or they can use the fact that the participants are in front of computers and have a user interface to a distributed program (either packet based or MCU based) to request and grant the floor.
FLOOR CONTROL
-------------------------------------------------------------------------------
* The picture illustrates a possible protocol for floor control
[Image] Floor Control
ACCESS CONTROL AND PRIVACY
-------------------------------------------------------------------------------
Access control in conferencing, and in multimedia in general, is complex. In a circuit based system, it can just rely on trust in the phone company, perhaps with the addition of closed user groups - lists of numbers that are allowed to call in or out of the conferencing group. In a packet network, there are a number of other questions:
1. How do we determine who is in, and who is allowed to be in, a conference?
2. How do we stop people simply listening in?
3. How do we know someone is who they say they are (assuming we don't know them personally)?
These are all dealt with by applying the principle of end-to-end security. Basically, if we encrypt the audio or video, perhaps signing it with some magic value before encrypting it with keys known only to the sender or receiver (or else using a public key crypto system more suitable to multipoint communication), then we can be assured that our communication is private. It turns out that encrypting compressed video and audio is really very simple for many compression schemes - in the case of H.261, for example, simply scrambling the Huffman codes used for carrying around the DCT coefficients might do! Public key cryptography is preferred over private key since it has an easier key distribution problem.
ACCESS CONTROL AND PRIVACY
-------------------------------------------------------------------------------
* Who is in or out of a conference?
* How do we stop eavesdroppers?
* The basic security techniques apply:
* End to end encryption (Public Key Cryptography best for n-way)
* Authentication through passwords or Digital Signatures
* PGP or RSA both viable
PLAYOUT BUFFER ADAPTION FOR PACKET NETS
-------------------------------------------------------------------------------
It has been asserted that you cannot run audio (or video) over the Internet due to
* Delay variation due to other traffic through routers
* Loss due to congestion
In fact, both are tolerable up to a point. The delay budget for bearable interaction is often cited as around 200ms. However, for a lecture or broadcast of a seminar, any amount of delay might not matter. The key requirement is to adapt to the delay variation, rather than to the transit delay. Given that a sender and receiver are matched at the audio i/o rates, or even if they are slightly askew, a combination of an adaption buffer and silence suppression at the send side can accommodate this. The receiver estimates the interpacket arrival time variance, using exactly the same technique as TCP uses to estimate the RTT - an exponentially weighted moving average calculated from:
1. The current packet arrival time and media sample timestamp
2. The previous packet arrival time and media timestamp
This is rolled into a running mean variance: m_i = m_{i-1} + g(v_i - m_{i-1}). Then, depending on whether interactive or lecture mode is in use, the receiver buffers sound for one or more of these variance estimates before playing it out. When adaption is needed, silence is added or deleted (rather than actual sound) at the beginning of a talkspurt. A similar inter-arrival pattern can be used by a video receiver to adapt to a sender that is too fast, or by a decoder of compressed audio or video where the CPU times vary depending on the content!
Playout Buffer Adaption for Packet Nets
* Delay and loss mean that some form of adaption must run at the receiver. The diagram shows this, and a sketch of the estimator follows:
[Image] Txmit
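A minimal sketch of the playout estimator just described, in Python. The gain g and the talkspurt multiplier are illustrative choices, not values mandated by any standard.

# Playout delay estimation: keep an exponentially weighted moving average of
# the variation in (arrival time - media timestamp),
#   m_i = m_{i-1} + g * (v_i - m_{i-1}),
# and buffer each talkspurt by a small multiple of that estimate.

class PlayoutEstimator:
    def __init__(self, gain=0.125, multiplier=2.0):
        self.gain = gain
        self.multiplier = multiplier
        self.last_offset = None   # previous (arrival - timestamp)
        self.mean_var = 0.0       # running estimate m_i

    def update(self, arrival_time, media_timestamp):
        offset = arrival_time - media_timestamp
        if self.last_offset is not None:
            v = abs(offset - self.last_offset)            # jitter sample v_i
            self.mean_var += self.gain * (v - self.mean_var)
        self.last_offset = offset
        return self.mean_var

    def playout_delay(self):
        """Extra buffering to add at the start of a talkspurt."""
        return self.multiplier * self.mean_var

est = PlayoutEstimator()
for arrival, stamp in [(0.020, 0.0), (0.055, 0.020), (0.075, 0.040), (0.110, 0.060)]:
    est.update(arrival, stamp)
print(est.playout_delay())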
MMCC - THE CENTRAL INTERNET MODEL
-------------------------------------------------------------------------------
It has been argued that the problem with the Internet model of multimedia conferencing is that it doesn't support simple phone calls, or secure, closed ("tightly managed") conferences. However, it is easy to add this functionality after one has built a scalable system such as the Mbone provides, rather than limiting the system in the first place. For example, the management of keys can provide a closed group very simply. If one is concerned about traffic analysis, then secure management of IP group address usage would achieve the effect of limiting where multicast traffic propagated. Finally, a telephone style signaling protocol can easily be provided to "launch" the applications using the appropriate keys and addresses, simply by giving the users a nice GUI to a distributed calling protocol system.
MMCC AND CMMC - CENTRAL INTERNET MODEL
-------------------------------------------------------------------------------
* Can have it both ways - can interwork and have tightly coupled/controlled conferences
* The diagram illustrates a Conference Management and Multiplexing Centre
* This would provide interworking of data (video and audio) as well as control (e.g. H.230 to MMCC)
* It could also provide software mixing so that sites with only one decoder could still see multiple senders
[Image] CMMC Picture: cmmc_sw_mix.ps
CCCP - THE DISTRIBUTED INTERNET MODEL
-------------------------------------------------------------------------------
* 1. The conference architecture should be flexible enough so that any mode of operation of the conference can be used and any application can be brought into use. The architecture should impose the minimum constraints on how an application is designed and implemented.
* 2. The architecture should be scalable, so that ``reasonable'' performance is achieved across conferences involving people in the same room, through to conferences spanning continents, with different degrees of connectivity and large numbers of participants. To support this aim, it is necessary to explicitly recognize the failure modes that can occur, examine how they will affect the conference, and design the architecture to minimise their impact.
Currently, the IETF working group on Conference Control is liaising with the T.120 standards work in the ITU and has made some statements about partial progress.
CCCP - DISTRIBUTED INTERNET MODEL
-------------------------------------------------------------------------------
* Based on multicast
* Based on packets
* scalable
* not yet standard, but basic idea the way forward
CCCP Model
We model a conference as composed of an unknown number of people at geographically separated sites, using a variety of applications. These applications can be at a single site, and have no communication with other applications or instantiations of the same application across multiple sites. If an application shares information across remote sites, we distinguish between the case where the participating processes are tightly coupled - the application cannot run unless all processes are available and connectable - and the case where the participating processes are loosely coupled, in that the processes can run when some of the sites become unavailable. A tightly coupled application is considered to be a single instantiation spread over a number of sites, whilst loosely coupled and independent applications have a number of unique instantiations, although possibly using the same application specific information (such as which multicast address to use...).
The tasks of conference control break down in the following way:
* Application control - Applications as defined above need to be started with the correct initial state, and the knowledge of their existence must be propagated across all participating sites. Control over starting and stopping can be either local or remote.
* Membership control - Who is currently in the conference and has access to what applications.
* Floor management - Who or what has control over the input to particular applications.
* Network management - Requests to set up and tear down media connections between end-points (no matter whether they be analogue through a video switch, a request to set up an ATM virtual circuit, or using RSVP over the Internet), and requests from the network to change bandwidth usage because of congestion.
* Meta-conference management - How to initiate and finish conferences, how to advertise their availability, and how to invite people to join.
CCCP Model
The diagram illustrates the CCCP Model
[Image] CCCP
CCCP Class Hierarchy
We then take these tasks as the basis for defining a set of simple protocols that work over a communication channel. We define a simple class hierarchy, with an application type as the parent class and subclasses of network manager, member and floor manager, and define generic protocols that are used to talk between these classes and the application class, as well as an inter-application announcement protocol. A sketch of this hierarchy follows.
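To make the class hierarchy concrete, here is a toy sketch in Python. The method names and messages are invented for illustration; CCCP defines message types over a channel, not a programming-language API.

# Toy CCCP-style hierarchy: a generic Application parent class, with
# NetworkManager, Member and FloorManager subclasses that each speak their
# own protocol over the shared conference control channel.

class Application:
    def __init__(self, instantiation, app_type, address):
        # CCCP addresses applications by (instantiation, application type, address)
        self.tuple = (instantiation, app_type, address)

    def send(self, destination_tuple, message):
        print(f"{self.tuple} -> {destination_tuple}: {message}")

class NetworkManager(Application):
    def request_bandwidth_change(self, kbps):
        self.send(("*", "network_management", "*"), f"set-bandwidth {kbps}")

class Member(Application):
    def announce_presence(self, name):
        self.send(("*", "session_management", "*"), f"present {name}")

class FloorManager(Application):
    def grant_floor(self, member_tuple):
        self.send(member_tuple, "floor-granted")

m = Member(1, "audio", "localhost")
m.announce_presence("jon@cs.ucl.ac.uk")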
We derive the necessary characteristics of the protocol messages as reliable/unreliable and confirmed/unconfirmed (where `unconfirmed' indicates whether responses saying ``I heard you'' come back, rather than being an indication of reliability). It is easily seen that both closed and open models of conferencing can be encompassed, provided the communication channel is secure. To implement the above, we have abstracted a messaging channel, using a distributed inter-process communication system, providing confirmed/unconfirmed and reliable/unreliable semantics. The naming of sources and destinations is based upon application level naming, allowing wildcarding of fields such as instantiations (thus allowing messages to be sent to all instantiations of a particular type of application). The final section of the paper briefly describes the design of the high level components of the messaging channel (named variously the CCC or the triple-C). Mapping of the application level names to network level entities is performed using a distributed naming service, based once again upon multicast, and drawing upon the extensive experience already gained in the distributed operating systems field in designing highly available name services.
REQUIREMENTS on CCCP from tools
Multimedia Integrated Conferencing has a slightly unusual set of requirements. For the most part we are concerned with workstation based multimedia conferencing applications. These applications include vat (LBL's Visual Audio Tool), IVS (INRIA Videoconferencing System), NV (Xerox's Network Video tool) and WB (LBL's shared whiteboard), amongst others. These applications have a number of things in common:
* They are all based on IP Multicast.
* They all report who is present in a conference by occasional multicasting of session information (a sketch of such an announcement appears at the end of this subsection).
* The different media are represented by separate applications (1)
* There is no conference control, other than each site deciding when and at what rate they send.
These applications are designed so that conferencing will scale effectively to large numbers of conferees. At the time of writing, they have been used to provide audio, video and shared whiteboard to conferences with about 500 participants. Without multicast, this is clearly not possible. It is also clear that these applications cannot achieve complete consistency between all participants, and so they do not attempt to do so - the conference control they support usually consists of:
* Periodic (unreliable) multicast reports of receivers.
* The ability to locally mute a sender if you do not wish to hear or see them.
1. However, in some cases stopping the transmission at the sender is actually what is required.
Requirements from tools
1. Common multicast channel used for control messages
2. Different media from different applications
3. Need session participant and other information to be added
4. Need control
Common Control for Conferencing
Thus any form of conference control that is to work with these applications should at least provide these basic facilities, and should also have scaling properties that are no worse than those of the media applications themselves. It is also clear that the domains these applications are applied to vary immensely. The same tools are used for small (say 20 participants), highly interactive conferences as for large (500 participants) disseminations of seminars, and the application developers are working towards being able to use these applications for ``broadcasts'' that scale towards millions of receivers.
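As a flavour of the session messages mentioned in the list above, here is a toy periodic announcement sender over IP multicast. The group address, port, TTL, interval and message format are all invented for illustration; the real tools (vat, nv, wb) each have their own formats.

# Periodically multicast a small "I am here" session report so that
# receivers can build a participant list. Illustrative only.
import socket
import time

GROUP, PORT = "224.2.0.1", 5004      # hypothetical conference group/port
TTL = 16

def announce(name, interval=5.0, count=3):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)
    for _ in range(count):
        sock.sendto(f"SESSION {name}".encode(), (GROUP, PORT))
        time.sleep(interval)
    sock.close()

announce("jon@cs.ucl.ac.uk", interval=1.0, count=2)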
It should be clear that any proposed conference control scheme should not restrict the applicability of the applications it controls, and therefore should not impose any single conference control policy. For example, we would like to be able to use the same audio encoding engine (such as vat) irrespective of the size of the conference or the conference control scheme imposed. This leads us to the conclusion that the media applications (audio, video, whiteboard, etc.) should not provide any conference control facilities themselves, but should provide the handles for external conference control, and for whatever policy is suitable for the conference in question.
Conferencing - special needs?
We often have the slightly special needs of being able to support:
* Multicast based applications running on workstations where possible.
* Hardware codecs at rates up to 2 Mb/s, and the need to multiplex their output.
* Sites connecting into conferences from ISDN.
* Interconnecting all the above.
These requirements have dictated that we build a number of Conference Management and Multiplexing Centres to provide the necessary format conversion and multiplexing to interwork between the multicast workstation based domain and the unicast (whether IP or ISDN) hardware based domain.
What we need for packet based conferencing:
* Multicast based for scaling
* Software codecs
* interworking with circuit based
* interworking with hardware codecs
WHERE CURRENT CONFERENCE CONTROL SYSTEMS FAIL
-------------------------------------------------------------------------------
The sort of conference control system we are addressing here cannot be:
* CENTRALISED. This will not scale.
* Fixed Policy. This would restrict the applicability. The important point here is that only the users can know what policies a meeting may need.
* Application Based. It is very likely that separate applications will be used for different media for the foreseeable future. We need to be able to switch media applications where appropriate. Basing the conference control in the applications prevents us from simply changing policy across all applications.
So what is wrong with current videoconferencing systems?
* Don't scale
* Fixed policies
* Application based
Specific requirements - Modularity
Conference Control mechanisms and Conference Control applications should be separated. The mechanism to control applications (mute, unmute, change video quality, start sending, stop sending, etc.) should not be tied to any one conference control application, in order to allow different conference control policies to be chosen depending on the conference domain. This suggests that a modular approach be taken, with, for example, a specific floor control module being added when required (or possibly choosing a conference manager tool from a selection of them according to the conference).
Special Requirements: A single conference ctl user interface
A general requirement of conferencing systems, at least for relatively small conferences, is that the participants need to know who is in the conference and who is active. Vat is a significant improvement over telephone audio conferences, in part because participants can see who is (potentially) listening and who is speaking. Similarly, if the whiteboard program WB is being used effectively, the participants can see who is drawing at any time from the activity window.
However, a participant in a conference using, say, vat (audio), IVS (video) and WB (whiteboard) has three separate sets of session information, and three places to look to see who is active. Clearly any conference interface should provide a single set of session and activity information. A useful feature of these applications is the ability to ``mute'' (or hide, or whatever) the local playout of a remote participant. Again, this should be possible from a single interface. Thus the conference control scheme should provide local inter-application communication, allowing the display of session information and the selective muting of participants. Taking this to its logical conclusion, the applications should only provide media specific features (such as volume or brightness controls), and all the rest of the conference control features should be provided through a conference control application.
Special Requirements: flexible floor control policies
Conferences come in all shapes and sizes. For some, no floor control, with everyone sending audio when they wish and sending video continuously, is fine. For others, this is not satisfactory, due to insufficient available bandwidth or for a number of other reasons. It should be possible to provide floor control functionality, but the providers of audio, video and workspace applications should not specify which policy is to be used. Many different floor control policies can be envisaged. A few example scenarios are:
* Explicit chaired conference, with a chairperson deciding when someone can send audio and video. Some mechanism equivalent to hand raising to request to speak. Granting the floor starts video transmission and enables the audio device. Essentially this is a schoolroom type scenario, requiring no expertise from end users.
* Audio triggered conferencing. No chairperson, no explicit floor control. When someone wants to speak, they do so using ``push to talk''. Their video application automatically increases its data rate from, for example, 10Kb/s to 256Kb/s as they start to talk. 20 seconds after they stop speaking it returns to 10Kb/s.
* Audio triggered conferencing with a CMMC (3). The CMMC can mix four streams for decoding by participants with hardware CODECs. The four streams are those of the last four people to speak, with only the current speaker transmitting at a high data rate. Everyone else stops sending video automatically.
* A background Mbone engineering conference that's been idle for 3 hours. All the applications are iconized, as the participant is doing something else. Someone starts drawing on the whiteboard, and the audio application plays an audio icon to notify the participant.
Scaling from tightly to loosely coupled conferences
CCCP originates in part as a result of experience gained from the CAR Multimedia Conference Control system. The CAR system was a tightly coupled, centralised system intended for use over ISDN. The functionality it provided can be summarized by listing its basic primitives:
* Create conference
* Join/Leave Conference
* List members of conference
* Include/exclude application in conference
* Take floor
In addition, there were a number of asynchronous notification events:
* Floor change
o Participant joining/leaving
o Application included/excluded
(3) Conference Management and Multiplexing Centre - essentially one or more points where multiple streams are multiplexed together for the benefit of people on unicast links, ISDN, hardware CODECs and the like.
Packet Conferencing Requirements
* Modularity - separation of conference control and applications
* Single user interface (API) for conference control
* Flexibility (e.g. for floor control)
The Conference Control Channel (CCC)
To bind the conference constituents together, a common communication channel is required, which offers facilities and services for the applications to talk to each other. This is akin to the inter-process communication facilities offered by an operating system. The conference communication channel should offer the necessary primitives upon which heterogeneous applications can talk to each other. The first cut would appear to be a messaging service which can support 1-to-many communication, with various levels of confirmation and reliability. We can then build the appropriate application protocols on top of this abstraction to provide the common functionality of conferences. We need an abstraction to manage a loosely coupled distributed system, which can scale to as many parties as we want. In order to scale, we need the underlying communication to use multicast. Many people have suggested that one way of thinking about multicast is as a multifrequency radio, in which one tunes into the particular channels one is interested in. We take this one step further and use it as a handle on which to hang the inter-process communications model we offer to the protocols used to manage the conference. Thus we define an application control channel.
Conference Control Channel, continued
CCCP originates in the observation that in a reliable network, conference control would behave like an Ethernet or bus - addressed messages would be put on the bus, and the relevant applications would receive the message and, if necessary, respond. In the Internet, this model maps directly onto IP multicast. In fact the IP multicast group concept is extremely close to what is required. In CCCP, applications have a tuple as their address: (instantiation, application type, address). We shall discuss exactly what goes into these fields in more detail later. In actual fact, an application can have a number of tuples as its address, depending on its multiple functions.
CCC Model
* Network is a bus for control messages
* Messages are directed to groups, but these are class based
* Classes bind() to the appropriate groups to receive all the messages for that function
Examples of CCC use of this would be:
DESTINATION TUPLE                        Message
* (1, audio, localhost)
* (*, activity_management, localhost)    ADDRESS
* (*, session_management, *)             NAME
* (*, session_management, *)             {application list}
* (*, session_management, *)             {participant list}
* (*, floor_control, *)
* (*, floor_control, *)
and so on. The actual messages carried depend on the application type, and thus the protocol is easily extended by adding new application types.
Unreliability
CCCP would be of very little use if it were merely the simple protocol described above, due to the inherently unreliable nature of the Internet. Techniques for increasing end-to-end reliability are well known and varied, and so will not be discussed here. However, it should be stressed that most (but not all) of the CCCP messages will be addressed to groups; a small sketch of the wildcard matching this addressing implies follows.
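A minimal sketch of the wildcard matching implied by the destination tuples above. The matching rule is the obvious one (a "*" field matches anything); the real mapping of tuples onto multicast groups is done by the CCC library and is not shown here.

# CCCP-style destination matching: an application's address is a tuple
# (instantiation, application type, address); a destination may wildcard
# any field with "*".

def matches(destination, address):
    """True if 'address' is selected by 'destination' (with '*' wildcards)."""
    return all(d == "*" or d == a for d, a in zip(destination, address))

apps = [
    (1, "audio", "host-a"),
    (1, "session_management", "host-a"),
    (2, "session_management", "host-b"),
    (1, "floor_control", "host-b"),
]

dest = ("*", "session_management", "*")
print([a for a in apps if matches(dest, a)])
# -> both session_management instantiations receive the message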
Thus a number of enhanced reliability modes may be desired: * None. Send and forget. (An example is session management messages in a loosely coupled system.) * At least one. (An example is a request floor message, which would not be ACKed by anyone except the current floor holder.) * n out of m. (An example may be joining of a semi-tightly coupled conference.) * All. (An example may be ``join conference'' in a very tightly coupled conference.) It makes little sense for applications requiring conference control to re-implement the schemes they require. As there are a limited number of these messages, it makes sense to implement CCCP in a library, so an application can send a CCCP message with a requested reliability, without the application writer having to concern themselves with how CCCP sends the message(s). The underlying mechanism can then be optimized later for conditions that were not initially foreseen, without requiring a re-write of the application software. Reliable Multicast There are a number of ``reliable'' multicast schemes available. It may be desirable to incorporate such a scheme into the CCC library, to aid support of small tightly coupled conferences. We believe that sending a message with reliability all to an unknown group is undesirable. Even if CCCP can track or obtain the group membership through its distributed nameserver, which requires explicit application messages to the nameserver, we believe that the application should explicitly know who it was addressing the message to. It does not appear to be meaningful to require a message to get to all the members of a group if we can't find out who all those members are; if the message fails to get to some members, the application can't sensibly cope with the failure. Thus we intend to support the all reliability mode only to an explicit list of fully qualified (i.e. no wildcards) destinations. Applications such as joining a secure (and therefore externally anonymous) conference which requires voting can always send a message to the group with ``at least one'' reliability, and then an existing group member initiates a reliable vote, and returns the result to the new member. Ordering Of course loss is not the only reliability issue. Messages from a single source may be reordered or duplicated, and due to differing delays, messages from different sources may arrive in ``incorrect'' order. SINGLE SOURCE Reordering Addressing reordering of messages from a single source first, there are a few possible schemes, almost all of which require a sequence number or a timestamp. A few examples are: * 1. Ignore the problem. A suitable example is session messages reporting presence in a conference. * 2. Deal with messages immediately. Discard any packets that are older than the latest seen. Quite a number of applications may be able to operate effectively in this manner. However, some networks can cause very severe reordering, and it is questionable whether this is desirable. * 3. Using the timestamp in a message and the local clock, estimate the perceived delay from the packet being sourced that allows (say) 90% of packets to arrive. When a packet arrives out of order, buffer it for this delay minus the perceived trip time to give the missing packet(s) time to arrive. If a packet arrives after this timeout, discard it. A similar adaptive playout buffer is used in vat for removal of audio jitter. This is useful where ordering of requests is necessary and where packet loss can be tolerated, but where delay should be bounded. * 4.
Similar to the above, specify a fixed maximum delay above the minimum perceived trip time before deciding that a packet really has been lost. If a packet arrives after this time, discard it. * 5. A combination of both of the above. Some delay patterns may be so odd that they upset the running estimate in [3]. Many conference control functions fall into this category, i.e. time bounded, but tolerant of loss. * 6. Use a sliding window protocol with retransmissions, as used in TCP. Only useful where loss cannot be tolerated, and where delay can be unbounded. Very tightly coupled conferences may fall into this category, but will be very intolerant of failure. Should probably only be used along with application level timeouts in the transmitting application. It should be noted that all except [1] require state to be held in a receiver for every source. As not every message from a particular source will be received at a particular receiver, due to CCCP's multiple destination group model, receiver based mechanisms requiring knowing whether a packet has been lost will not work unless the source and receivers use a different sequence space for every (source, destination group) pair. If we wish to avoid this (and I think we usually do!), we must use mechanisms that do not require knowing whether a packet has been lost. Reliability and ordering of multicast control messages * 1. Have CCCP ignore the problem. Let the application sort it out. * 2. Have CCCP pass messages to the application immediately. Discard any packets that are older than the latest seen. * 3. As above, estimate the perceived delay within which (say) 90% of packets from a particular source arrive, but delay all packets from this source by the perceived delay minus the perceived trip time. * 4. As above, calculate the minimum perceived trip time. Add a fixed delay to this, and buffer all packets for this time minus their perceived trip time. * 5. A combination of [3] and [4], buffering all packets by the smaller of the two amounts. * 6. Explicitly ACK every packet. Do not use a sliding window. MULTIPLE SOURCE Ordering In general we do not believe that CCCP can or should attempt to provide ordering of messages to the application that originate at different sites. CCCP cannot predict that a message will be sent by, and therefore arrive from, a particular source, so it cannot know that it should delay another message that was sent at a later time. The only full synchronization mechanism that can work is an adaptation of [3]..[5] above, which delays all packets by a fixed amount depending on the trip time, and discards them if they arrive after this time if another packet has been passed to the user in the meantime. However, unlike the single source reordering case, this requires that clocks are synchronised at each site. CCCP does not intend to provide clock synchronization and global ordering facilities. If applications require this, they must do so themselves. However, for most applications, a better bet is to design the application protocol to tolerate temporary inconsistencies, and to ensure that these inconsistencies are resolved in a finite number of exchanges. An example is the algorithm for managing shared teleconferencing state proposed by Scott Shenker, Abel Weinrib and Eve Schooler [she]. For algorithms that do require global ordering and clock synchronization, CCCP will pass the sequence numbers and timestamps of messages through to the application.
It is then up to the application to implement the desired global ordering algorithm and/or clock synchronization scheme using one of the available protocols and algorithms such as NTP [lam],[fel],[bir]. CCC Addresses As already mentioned, a CCC destination is a tuple of the following form: (instantiation, type, address) An application registers itself with its CCC library (and possibly with a distributed nameserver - more of that in a later version of this paper), specifying one or more tuples that it considers describe itself. Note that there is no conference identifier specified - it is presumed that a control group address or control host address or address list are specified at startup, and that meta-conferencing (i.e., allocation and discovery of conference addresses) is outside the scope of the CCC itself. Is this too restrictive? Maybe not, if we allow the CCC library to open multiple CCCs simultaneously, but this may complicate the applications. The parts of the tuple are: * Address * Type * Instantiation CCCP Address The address field will normally be registered as one of the following: * hostname * username@hostname When other applications wish to send a message to a destination group (a single application is a group of size 1), they can specify the address field as one of the following: * username@hostname * hostname * username@*.domain * username@* The CCC library is responsible for ensuring a suitable multicast group (or other means) is chosen to ensure that all possible matching applications are potentially reachable (though depending on the reliability mode, it does not necessarily ensure the message got to them all). It should be noted that in any tuple containing a wildcard (*) in the address, specifying the instantiation (as described below) does not guarantee a unique receiver, and so normally the instantiation should be wildcarded too. CCCP type The type field is a class hierarchy that can be literally anything. However, some guidelines are needed to ensure that common applications can communicate with each other. Normally an application would register itself under the name of the application, to ensure that a message specific to that application can be delivered - for example vat would register itself under the type vat. An application will also register itself under any types it wishes to receive messages on. As a first pass, the following types have been suggested: * audio.send - the application is interested in messages about sending audio * audio.recv - the application is interested in messages about receiving audio * video.send - the application is interested in messages about sending video * video.recv - the application is interested in messages about receiving video * workspace - the application is a shared workspace application, such as a whiteboard * session.remote - the application is interested in knowing of the existence of remote applications (exactly which ones depends on the conference, and the session manager) * session.local - the application is interested in knowing of the existence of local applications * media-ctrl - the application is interested in being informed of any change in conference media state (such as unmuting of a microphone) * floor.manager - the application is a floor manager * floor.slave - the application is interested in being notified of any change in floor, but not (necessarily) in the negotiation process.
It should be noted that types can be hierarchical, so (for example) any message addressed to audio would address both audio.send and audio.recv applications. It should also be noted that an application expressing an interest in a type does not necessarily mean that the application has to be able to respond to all the functions that can be addressed to that type, although (if required) the CCC library will acknowledge receipt on behalf of the application. Examples of the types existing applications would register under are: * vat - vat, audio.send, audio.recv * IVS - IVS, video.send, video.recv * NV - NV, video.send, video.recv * WB - WB, workspace * a conference manager - confman, session.local, session.remote, media-ctrl, floor.slave * a floor ctrl agent - floor agent, floor.manager, floor.slave CCCP instantiation The instantiation field is purely to enable a message to be addressed to a unique application. When an application registers, it does not specify the instantiation - rather this is returned by the CCC library such that it is unique for the specified type at the specified address. It is not guaranteed to be globally unique - global uniqueness is only guaranteed by the triple of (instantiation, type, address) with no wildcards in any field. When an application sends a message, it uses one of its unique triples as the source address. Which one it chooses should depend on to whom the message was addressed. A few examples Before we describe what should comprise CCCP, we will present a few simple examples of CCCP in action. There are a number of ways each of these could be done - this section is not meant to imply these are the best ways of implementing the examples over CCCP. Unifying user interfaces - session messages in a ``small'' conference Applications: * An Audio Tool (at), registers as types: at, audio.send, audio.recv * A Video Tool (vt), registers as types: vt, video.send, video.recv * A Whiteboard (wb), registers as types: wb, workspace * A Session Manager (sm), registers as types: sm, session.local, session.remote The local hostname is x. There are a number of remote hosts, one of which is called y. A typical exchange of messages may be as follows: * The following will be sent periodically: (1,audio.recv,x) (*,sm.local,x) KEEPALIVE (1,video.recv,x) (*,sm.local,x) KEEPALIVE (1,wb,x) (*,sm.local,x) KEEPALIVE * The following will also be sent periodically: (1,sm,x) (*,sm.remote,*) I_HAVE_MEDIA text_user_name audio.recv video.recv wb * An audio speech burst arrives at the audio application from y: (1,audio.recv,x) (*,sm.local,x) MEDIA_STARTED audio y * The session manager highlights the name of the person who is speaking * The speech burst finishes: (1,audio.recv,x) (*,sm.local,x) MEDIA_STOPPED audio y * The session manager de-highlights the name of the person who was speaking * Video starts from z: (1,video.recv,x) (*,sm.local,x) MEDIA_STARTED video z * Periodic reports: (1,audio.recv,x) (*,sm.local,x) KEEPALIVE (1,video.recv,x) (*,sm.local,x) MEDIA_ACTIVE video z (1,wb,x) (*,sm.local,x) KEEPALIVE * Someone restarts the session manager: (1,sm,x) (*,*,x) WHOS_THERE (1,audio.recv,x) (*,sm.local,x) KEEPALIVE (1,video.recv,x) (*,sm.local,x) MEDIA_ACTIVE video z (1,wb,x) (*,sm.local,x) KEEPALIVE * and so on... this is illustrated in the diagram below [Image] Unification
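The session messages above are plain text: a source tuple, one or more destination tuples, a message name and its arguments. As a rough illustration of how thin the CCC library's send path could be, here is a sketch in Python that formats and multicasts such a message. The textual wire format, the group address and the function names are our own assumptions for illustration, not part of CCCP.

  import socket

  CCC_GROUP, CCC_PORT = "224.2.0.1", 6000   # illustrative control group only

  def format_tuple(t):
      # a tuple is (instantiation, type, address); any field may be "*"
      return "(%s,%s,%s)" % t

  def cccp_send(src, dsts, verb, *args):
      """Format a CCCP-style message, e.g.
      (1,audio.recv,x) (*,sm.local,x) MEDIA_STARTED audio y,
      and send it, unreliably, to the conference's control group."""
      msg = " ".join([format_tuple(src)] +
                     [format_tuple(d) for d in dsts] +
                     [verb] + list(args))
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
      s.sendto(msg.encode(), (CCC_GROUP, CCC_PORT))
      return msg

  # e.g. the keepalive from the example above:
  # cccp_send(("1", "audio.recv", "x"), [("*", "sm.local", "x")], "KEEPALIVE")

A real library would also map destination tuples onto the right multicast groups and apply the requested reliability mode, as discussed earlier.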
A voice controlled video conference In this example, the desired behavior is for participants to be able to speak when they wish. A user's video application should start sending video when their audio application starts sending audio. No two video applications should aim to be sending at the same time, although some transient overlap can be tolerated. Applications: * An Audio Tool (at), registers as types: at, audio.send, audio.recv * A Video Tool (vt), registers as types: vt, video.send, video.recv * A Session Manager (sm), registers as types: sm, session.local, session.remote * A Floor Manager (fm), registers as types: fm, floor.master There are hosts x and y, amongst others. It is assumed that session control messages are being sent, as in the example above. * The user at x starts speaking. Silence suppression cuts out, and the audio tool starts sending audio data: (1,audio.send,x) (*,sm.local,x),(*,floor.master,x) MEDIA_STARTED audio x * ...this causes the sm to highlight the ``you are sending audio'' icon; it also causes the floor manager to report to the other floor managers: (1,floor.master,x) (*,floor.master,*) MEDIA_STARTED audio x * and also it requests the local video tool to send video: (1,floor.master,x) (*,video.send,x) START_SENDING video * ...this causes the video tool to start sending: (1,video.send,x) (*,sm.local,x),(*,floor.master,x) MEDIA_STARTED video x * ...which, in turn, causes the sm to highlight the ``you are sending video'' icon * The user at x stops speaking. Silence suppression cuts in, and the audio tool stops sending audio data: (1,audio.send,x) (*,sm.local,x),(*,floor.master,x) MEDIA_STOPPED audio x * ...this causes the sm to de-highlight the ``you are sending audio'' icon * ...the session manager starts a timeout procedure before it will stop sending video... * A user at y starts sending audio and video data. * The local audio and video tools report this to the session manager: (1,audio.recv,x) (*,sm.local,x) MEDIA_STARTED audio y (1,video.recv,x) (*,sm.local,x) MEDIA_STARTED video y * ...as in the previous example, the sm highlights the sender's name, and the floor manager reports what's happening: (1,floor.master,y) (*,floor.master,*) MEDIA_STARTED audio y (1,floor.master,y) (*,floor.master,*) MEDIA_STARTED video y * The local floor manager tells the local video tool to stop sending: (1,floor.master,x) (*,video.send,x) STOP_SENDING video * ...this causes the video tool at x to stop sending: (1,video.send,x) (*,sm.local,x),(*,floor.master,x) MEDIA_STOPPED video x ... More complex needs Dynamic type-group membership Many potential applications need to be able to contact a server or a token holder reliably without necessarily knowing the location of that server. An example may be a request for the floor in a conference with one roaming floor holder. The application requires that the message gets to the floor holder if at all possible, which may require retransmission and will require acknowledgement from the remote server, but the application writer should not have to write the retransmission code for each new application. CCCP supports ``at least one'' reliability, but to address such a REQUEST_FLOOR message to all floor managers is meaningless. By supporting dynamic type-groups, CCCP can let the application writer address a message to a group which is expected to have only one (or a very small number) of members, but whose membership is changing constantly. In the example described, the application requiring the floor sends: (1,floor.master,x) (*,floor.master.holder,*) REQUEST_FLOOR with ``at least one'' reliability.
Retransmissions continue until the message is acknowledged or a timeout occurs. When the floor holder receives this message, it can then either send a grant floor or a deny floor message: (1,floor.master,y) (1,floor.master,x) GRANT_FLOOR This message is sent reliably (i.e., retransmitted by CCCP until an ACK is received). On receiving the GRANT_FLOOR message, the floor manager at x expresses an interest in the type-group floor.master.holder. On sending the GRANT_FLOOR message, the floor manager at y also removes its interest in the type-group floor.master.holder, to prevent spurious ACKing of other REQUEST_FLOOR messages. However, if the GRANT_FLOOR message retransmissions time out, it should re-express an interest. This is illustrated in the diagram below: [Image] Floor Ctl Eg Conference Membership Discovery CCCP will support conference membership discovery by providing the necessary functions and types. However, the choice of discovery algorithm, loose or tight control of the conference membership and so forth, are not within the scope of CCCP itself. Instead these algorithms should be implemented in a Session Manager on top of the CCC. Network support and protocols MULTIMEDIA COMMUNICATION
-------------------------------------------------------------------------------
The Information Superhighway needs network protocols that can carry multimedia around with the sorts of guarantees it needs. At least, that is what the communications companies say. In fact, if a network is provisioned at the right level (its lines are fast enough), it may not be necessary to impose any special protocols. Even with networks running close to capacity, the requirement is really merely for a way for routers and switches to distinguish different traffic types, and give them the appropriate forwarding priority. The difference between different networks comes down (as do many in computer science) to when the binding is done between the flow of traffic and the state instantiated (together with resources) in a router to support the traffic class that this flow needs. Two extreme examples of multiservice network architectures illustrate this: Two network approaches to multiservice * IP + RSVP + Flows * ATM + Q.2931 * Both classify packets either one at a time, or per call * Illustrated below [Image] CBQ IP, RSVP, Flow Ids In the Internet, the RSVP protocol can be used by a recipient to request a specified quality of service for a flow that they require. This request is periodically resent, so there is no necessary binding between the call and the route (i.e. rerouting can happen between one packet and the next). If no special quality is required, or the routers already know about this traffic class and have capacity, then no RSVP is needed. A source chooses a unique flow id for the traffic, which can be used by routers as a fast lookup for the route and quality requirement for the traffic. In the absence of an entry for this flow in a router, the rest of the packet's IP header can be consulted and the packet forwarded with some default quality anyhow. If the required quality varies, then it can simply be latched by the next RSVP refresh. RSVP carries a flow specification and a filter specification. The flow specification is a list of parameters to do with throughput, delay and errors that will be needed to meet the flow's requirements for reasonable delivery. The filter specification is a pattern that is used to match the flow when it arrives at a router.
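To make the flow specification a little more concrete: the throughput part of a flowspec is commonly expressed as token (or leaky) bucket parameters - a sustained rate plus a bucket depth that bounds how large a burst is allowed. The sketch below, with invented parameter names, checks whether a trace of packets conforms to such a pair; it illustrates the idea only and is not RSVP's actual flowspec encoding.

  def conforms(packets, rate, bucket_depth):
      """packets: iterable of (arrival_time_secs, size_bytes).
      rate: sustained rate in bytes/sec; bucket_depth: max burst in bytes.
      Returns True if every packet finds enough tokens in the bucket."""
      tokens = bucket_depth          # bucket starts full
      last_time = None
      for t, size in packets:
          if last_time is not None:
              # tokens accumulate at 'rate', capped at the bucket depth
              tokens = min(bucket_depth, tokens + rate * (t - last_time))
          last_time = t
          if size > tokens:
              return False           # burst or rate exceeded
          tokens -= size
      return True

  # e.g. conforms([(0.0, 1500), (0.01, 1500)], rate=125000, bucket_depth=4000)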
A filter can be turned on and off without removing the flow specification, so that intermittent flows (e.g. video or voice in a floor controlled video conference) can be quickly turned on and off within the net. This is important for multicast. RSVP interacts with the routing protocol, possibly locking routes while any reservation is in progress, to avoid looping. Filters can be wildcard (shared amongst all senders to a group), fixed or dynamic. Dynamic filters are sets of fixed flowspecs that can be chosen between on demand. IP and RSVP and Flows * The IP model is being enhanced to add soft state * This refers to use of RSVP to establish traffic classes * Can do this for one sender, a group, or a type * IP6 carries a flow id, like a connection id, for fast lookup * RSVP permits specification of leaky bucket parameters for this * For non-delay-bounded traffic, just don't do this! ATM, Q.2931 and VCI/VPIs With an ATM network, before any packet can be sent, the call setup protocol (recently standardised as Q.2931 by the ITU) is invoked to set up the path, the call and the resources needed. The binding of all these is needed up front. There isn't yet any way of aggregating calls, as there is in RSVP through clever filters. Long term calls might be configured through network management to use PVCs, to deal with the intermittent bandwidth problem above (a PVC allows a receiver to control specification of a flow, counter-intuitively). ATM and Q.2931 * ATM is the telco/PNOs' approach * Packet (cell) oriented * Resources reserved by sender * Allows fine grain allocation * Application must know its needs Quality of Service Parameters: How Many? The ATM and Q.2931 specifications list a huge number of QoS parameters including: * Mean, Sustainable and Peak Cell (packet) Rate * Cell Loss Tolerance * Burst Tolerance * Cell Delay Variance The Internet community is working on the basis of a much simpler formulation of quality for an application. Basically, there is a minimum throughput, and a delay tolerance. The delay variation only needs to be specified either for tightly bounded conferences in overloaded networks, or to support legacy equipment (CODECs that don't tolerate time variation beyond some bound). QoS - how many parameters * Q.2931 permits a plethora of parameters * How many are really needed? Depends on the application * Mean throughput and delay tolerance are probably about it Real Time Protocol and Real Time Control Protocol The Internet community has developed a standard protocol for audio and video and other image distribution applications to use to carry their data around and provide a common platform to express some of the timing and session information needed by real time applications. This is RTP and its associated control protocol, RTCP. RTP is simply a framing protocol. It contains no complex exchanges of messages (handshaking), but rather leaves any conference control matters to higher levels. RTP packets contain media types and media specific timestamps. These are used in adaptive playout buffer schemes, such as the one sketched below. RTCP packets carry source and receiver reports that describe the users, and the reception quality. RTP is usually multicast (even when there is only one sender and recipient) using the User Datagram Protocol (UDP) over IP multicast.
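As a sketch of what an adaptive playout buffer does with those timestamps, the fragment below keeps a smoothed estimate of the (sender clock minus receiver clock) offset and of its variation, and schedules each packet a safety margin beyond the smoothed offset. The constants and names are illustrative; they are not taken from vat or from the RTP specification.

  class PlayoutEstimator:
      """Decide when to play an RTP packet, given its media timestamp and
      its arrival time, both converted to seconds."""
      def __init__(self, gain=1.0 / 16, safety=4.0):
          self.gain = gain        # smoothing constant for the estimates
          self.safety = safety    # how many 'jitters' of margin to allow
          self.offset = None      # smoothed (arrival - timestamp) difference
          self.jitter = 0.0       # smoothed variation of that difference

      def playout_time(self, media_ts, arrival):
          d = arrival - media_ts
          if self.offset is None:
              self.offset = d
          self.jitter += self.gain * (abs(d - self.offset) - self.jitter)
          self.offset += self.gain * (d - self.offset)
          # play at the media timestamp, shifted by the clock offset,
          # plus a margin proportional to the observed jitter
          return media_ts + self.offset + self.safety * self.jitter

Packets that arrive after their computed playout time are simply discarded, which is the time-bounded, loss-tolerant behaviour described in the ordering discussion earlier.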
RTP and RTCP * Internet packet format/protocol for carrying audio and video * Used now for several years * Carries a media timestamp and not a lot else * RTCP performs some of the CCC functions
-------------------------------------------------------------------------------
RTP Packet Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The first twelve octets are present in every RTP packet, while the list of CSRC identifiers is present only when inserted by a mixer.
version (V): 2 bits - This field identifies the version of RTP. The version defined by this specification is two (2).
padding (P): 1 bit - If the padding bit is set, the packet contains one or more additional padding octets at the end which are not part of the payload.
extension (X): 1 bit - If the extension bit is set, the fixed header is followed by exactly one header extension, with a format defined in Section 5.2.1.
CSRC count (CC): 4 bits - The CSRC count contains the number of CSRC identifiers that follow the fixed header.
marker (M): 1 bit - The interpretation of the marker is defined by a profile. It is intended to allow significant events such as frame boundaries to be marked in the packet stream.
payload type (PT): 7 bits - This field identifies the format of the RTP payload and determines its interpretation by the application.
sequence number: 16 bits - The sequence number increments by one for each RTP data packet sent, and may be used by the receiver to detect packet loss and to restore packet sequence.
timestamp: 32 bits - The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant must be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations.
SSRC: 32 bits - The SSRC field identifies the synchronization source.
CSRC list: 0 to 15 items, 32 bits each - The CSRC list identifies the contributing sources for the payload contained in this packet. The number of identifiers is given by the CC field. If there are more than 15 contributing sources, only 15 may be identified. CSRC identifiers are inserted by mixers, using the SSRC identifiers of contributing sources.
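Since the fixed header layout above is completely regular, unpacking it takes only a few lines. The sketch below (the function and field names are ours) pulls the fields out of a received UDP payload using Python's struct module; it does not attempt to interpret padding or header extensions.

  import struct

  def parse_rtp(packet):
      """Parse the 12-octet fixed RTP header, plus any CSRC list."""
      if len(packet) < 12:
          raise ValueError("too short to be an RTP packet")
      b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
      cc = b0 & 0x0F                              # CSRC count
      csrcs = struct.unpack("!%dI" % cc, packet[12:12 + 4 * cc])
      return {
          "version":      b0 >> 6,                # V, should be 2
          "padding":      (b0 >> 5) & 1,          # P
          "extension":    (b0 >> 4) & 1,          # X
          "marker":       b1 >> 7,                # M
          "payload_type": b1 & 0x7F,              # PT
          "sequence":     seq,
          "timestamp":    ts,
          "ssrc":         ssrc,
          "csrc":         list(csrcs),
          "payload":      packet[12 + 4 * cc:],   # padding/extension ignored
      }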
RTCP Packet Format

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
header |V=2|P|    RC   |   PT=SR=200   |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         SSRC of sender                        |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
sender |              NTP timestamp, most significant word             |
info   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |             NTP timestamp, least significant word             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         RTP timestamp                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     sender's packet count                     |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      sender's octet count                     |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
report |                 SSRC_1 (SSRC of first source)                 |
block  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  1    | fraction lost |       cumulative number of packets lost       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |           extended highest sequence number received           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      interarrival jitter                      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         last SR (LSR)                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                   delay since last SR (DLSR)                  |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
report |                SSRC_2 (SSRC of second source)                 |
block  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  2    :                              ...                              :
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
       |                  profile-specific extensions                  |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

SR or RR: The first RTCP packet in the compound packet must always be a report packet to facilitate header validation as described in Appendix A.2. This is true even if no data has been sent or received, in which case an empty RR is sent, and even if the only other RTCP packet in the compound packet is a BYE. Additional RRs: If the number of sources for which reception statistics are being reported exceeds 31, the number that will fit into one SR or RR packet, then additional RR packets should follow the initial report packet. SDES: An SDES packet containing a CNAME item must be included in each compound RTCP packet. Other source description items may optionally be included if required by a particular application, subject to bandwidth constraints (see Section 6.2.2). BYE or APP: Other RTCP packet types, including those yet to be defined, may follow in any order, except that BYE should be the last packet sent with a given SSRC/CSRC. Packet types may appear more than once. Middle Layers ST-II/PVP/etc An alternative to IP+RSVP, and under some active research by IBM amongst others, is the ST-II protocol. ST is like IP, but has two added functions: one subsumes RSVP (it is called the ST Control Message Protocol, SCMP) and the other is to support multidestination calls. ST is not yet widely available, but it is used on one very large network called the DSINet (Defense Simulation Internet), which runs all around the world and allows NATO countries to use videoconferencing and access remote computer warfare simulations, hence it is quite important. ST and PVP and so on... * ST was an experimental version of IP for flows * Still in use on some research nets * May still come onstream...
MULTICAST
-------------------------------------------------------------------------------
* So why is IP-style multicast better than unicast? * The Internet has provided multicast for about 7 years now. * Only recently have products included this facility * No more so than with routers! * The graph illustrates past growth in Internet multicast reachability
-------------------------------------------------------------------------------
Multicast Routing Protocols * Reverse path from unicast routes: * DVMRP, MOSPF, PIM * Own routes: * CBT * Single tree: * CBT and PIM sparse mode * Source tree: DVMRP, dense mode ... Reliable Multicast Transport * RMP - from Berkeley * Uses a virtual ring to circulate a token * Scales well for small numbers of sources * Bad for video/audio with lots of sources and sinks * Better to distribute reliability and ordering Internet MM Applications * Video: IVS, Nv, Vic, CuSeeMe * Audio: IVS, Vat, Bat, Maven, Nevot * Whiteboards: ShowMe, Wb, MScrawl * Other: Imm, etc * Reliability and ordering, distributed * Not a problem for human-to-human * Consistency only needed for "data" Multicast Coordination * With CCCP or MMCC, also need a Session Directory * Session Directory tools allocate addresses for groups to avoid clashes * The SD tool provides the user with "bboard" style navigation of topics * Protocol is a draft for now.... VR: General Tools * Few - probably only VRML (Virtual Reality Modelling Language, Web like, from SGI) * Also Vidl (Video Extensions to Tcl) * See also MUDs, MOOs and: * Jupiter - Xerox multimedia MUD. [Image] Mbone Growth Multicast Lesson 1. * If S sends to D unicast, but D is a replicated service, S sends n packets. * If S multicasts, and all the D's recognize the multicast destination as themselves (and don't lose it), S sends 1 packet * N-fold decrease in bandwidth! * The figure shows a multicast stream * [Image] Mcast Model Multicast Lesson 2. * R recognizes multicast from S, and knows there are members of D on the "other side": N-fold decrease in bandwidth. * Especially useful since the path through R is probably a slower line! * D's group members may come and go. * R now has to keep track of this - in the unicast case, it did something similar through ARP; now, group membership changes are made by the D's sending reports (IGMP) - a better approach. * [Note the similarity to the mobile host problem.] * Count cost in terms of the number of times a packet traverses each link. Multicast Lesson 3. * Now routers must either exchange D location information, or forward all messages to all D's, or exchange D non-location information. * Key tradeoffs: o (a) number of groups at one site and not at others (sparseness of group distribution) o (b) rate of groups appearing and disappearing, and of members appearing and disappearing, affects the size and frequency of updates o (c) sources appearing and disappearing affects the tradeoff in choosing distribution by default with "pruning" versus distribution of joins Scaling 1. * Simple optimization for join or prune - "aggregate" different Ds... if possible. Note: aggregation implies * (a) routes for different Ds are the same and * (b) at the same time... * Include the cost of router exchange information, and unnecessary visits to links without members.... Multicast Lesson 4. * Single tree (centered tree = Steiner [NP-hard] problem) versus * Tree per source, on the reverse path tree from the unicast SPF from D to S, versus * SPF tree from each source to the D leaves Scaling 2. * Alternative approach altogether - * Group = list of unicast addresses (or 'site router' address).
* Count an optimization in paths in terms of two (usually conflicting) factors: (a) the number of times a link is visited by different sources for the same destinations D, and (b) minimizing delay (or some other metric) Multicast IP (DVMRP, CBT, PIM, MOSPF) For datagram networks (be they IP, Novell or CLNP), there are two basic approaches to calculating multicast routes. * 1. Calculate a Reverse Path Multicast tree from each source - this is essentially just the tree made up of the unicast routes from the destination to the source, and can be built on demand when sources start, and removed when they stop, by extracting it from the unicast routing tables. * There are two variants of this, based on which unicast routing paradigm is in use: o i. If the underlying scheme is a link-state one (cf. OSPF etc), then a link-state multicast tree can be built from it (in the case of MOSPF, as specified in RFC 1584 and as supported in Proteon routers, this is made more scalable by using aggregation). o ii. If the underlying system is a distance vector one, then RFC 1075 describes how to build source based trees. It uses pruning to achieve better scaling. * 2. Find the center of the group, and build a tree from it to them (and them to it, and thence to each other). This scheme is used by CBT, and is also part of the basis of the new Protocol Independent Multicast routing protocol (Cisco support), which switches between approach 1 and approach 2 depending on whether a group is sparse or dense in terms of how its membership is distributed over the Internet. The only problem with this latter approach is that "finding the center" is a well known hard problem (the Steiner tree problem, which is NP-hard). Luckily, there are quite a few heuristics emerging. Finally, a single tree is good for minimizing the delay amongst ALL the participants, whilst a source specific tree may be better in terms of optimal use of links and may result in better source specific delay. Different multicast schemes differ in their delay and cost tradeoffs. This is illustrated in the picture: [Image] Delay Versus Cost Multicast ATM In a virtual circuit based network such as X.25 or ATM, it is still possible to build point-to-multipoint calls (albeit a lot less efficiently than the IP many-to-many model, if you have a large number of sources in the group). To route the call from a caller to a number of callees is easy - it would simply rely on the standard call routing in ATM switches. However, if we want the same support for "receiver join" or "leaf join", then we need a rendezvous point. This means that CBT or PIM would make good candidates for multicast call routing in circuit networks. ATM Multipoint/Multicast Call Routing This may well look like this: [Image] ATM Mcast Multi-point Control Units for ISDN In networks built out of physical digital circuits such as ISDN (in the absence of multipoint physical circuits!), we need some other mechanism for multiparty calls - this ends up being a question of building a higher layer entity to do the fanout. For ISDN, and for ISDN video, this has been defined as an application layer unit, called a Multipoint Control Unit, and we'll talk about its protocols more below. QoS Based Routing Multimedia communication can entail multiple, heterogeneous networking requirements. This can interact with unicast routing, just as it did for multicast routing.
For example, if I transfer a large document including video from one machine to another while I am talking to someone over the same network, it may be that my voice call is best routed over a modest bandwidth, but low delay, path, while the file transfer is better moved over a high throughput, but relatively high delay, path (e.g. satellite). In general, multi-metric routing is a very hard problem. However, it is usually easier if you first separate metrics into traffic independent and traffic dependent ones - for example, the line speeds and propagation delays are not subject to other traffic, while the packet store-and-forward times and rates are subject to queuing. ADSL (SUBSCRIBER LOOP VIDEO DELIVERY)
-------------------------------------------------------------------------------
An exciting new development in the transmission systems world has been that of high bandwidth transmission over existing copper into the home via a digital subscriber loop protocol. It has been discovered that the POTS cable plant is good enough to get as much as 8Mbps into a home over the distance from the phone exchange. This may be used by cable TV or video rental companies to deliver video into the home, and to use a low bandwidth return path to allow the user access to other services and to control this one - it remains to be seen if there really is going to be a demand for video in this form. WHAT WILL IT COST AND WHO WILL SELL IT TO US?
-------------------------------------------------------------------------------
The current tariffing of networks and multimedia is a thorny question: ISDN based networks are relatively inexpensive in terms of performance versus call charges in Europe, but relatively non-existent in the US. However, for multiparty international calls, even in Europe, the tariffs rapidly become less attractive than leased lines - 5 minutes a day between 4 different countries at basic rate will cost more than the same bandwidth leased. This means that in terms of multiparty conferencing, it is likely that packet based networks built on top of leased lines will be more attractive, especially since they can be used for data when not in use for conferencing. The fact that the Internet does not currently have the capacity for much of these types of use is simply because it is still early days for desktop distributed conferencing. The extra cost of the necessary end system capabilities is rapidly becoming marginal - a video and audio card adds around 10% to the price of a mid-range PC now. It seems likely that multiservice packet networks, and in particular those with good multicast support, will eventually be the way forward. However, it is also likely that the last mile (the "subscriber loop") may well (as with BT's Internet Service) be provided through basic rate ISDN. However, with IP or ATM based end systems at the end of such a hop, there is no reason not to take full advantage of distributed conferencing. As the take up gets larger, even just for data access, the backbone bandwidths will have to increase. It may well be that the bottlenecks we see today are just a figment of the current tariff structures that are required to fund the growth of the superhighway. Once the capacity is in place, the prices will tumble. However, there will always be users who can overload the backbone - if not video, then HDTV, or 3D motion holography, or multi-player VR or something. So reservation will be needed, together with some enforcement, whether through priority or charging or both.
Note however that it is only needed for these heavy duty customers. There may come a time when line rental will be all you pay, even if you spend 2 hours a day in 5-way videoconferences with people in 5 different countries. What will it cost and who will sell it? * The telcos and entertainment cos would love to own it * Truth is that the net is broader and more radical than that * Anyone can be an 'author' or 'performer' * Leads to a different model of 1. billing 2. security 3. dimensioning * The overcapacity needed to permit business to work will mean that idle time capacity will be very large... * Likely to see strange traffic patterns! Operating system DEVICE DRIVERS
-------------------------------------------------------------------------------
Just as within the network, within a host computer you need to control delay and avoid ignoring or starving a multimedia device. To some extent, this might be the job of the system scheduler (see below), but the scheduler can be saved a lot of work by device drivers providing adequate buffering and timing support. Device drivers operate out of hardware interrupt levels, so priorities can be set appropriately for the input or output urgency, combined with buffers appropriate to the next task in hand. For example, an audio device might sample its input 8 bits at a time (e.g. 8kHz, 8-bit mu-law or A-law samples). But if we are going to process these in an application, or perhaps send them over a packet net, it may make more sense to ask for 40 msec of samples at a time (i.e. 40 * 8000/1000, or 320 bytes), since this is a unit that can more easily be processed or packetised. It will depend on the exact device hardware whether one can program it in a driver to deliver audio/video by DMA to (or from) some particular buffer and only interrupt at the finish. Then the driver has to turn around the buffer and give the device a new buffer before it runs out (actually, this is easily done using a circular buffer that is sized to be twice the size of the application reads, plus the amount of storage necessary for the arrival rate over the time the read itself takes). Video devices vary a lot more than this, but ideally would look like VRAM to the application. OPERATING SYSTEMS
-------------------------------------------------------------------------------
* Need the same timeliness and throughput control in the system as in the network * Device drivers may take most of the strain * If, and only if, devices and systems have good clock access REAL TIME SCHEDULING
-------------------------------------------------------------------------------
Real-time scheduling is not necessary in hybrid systems, nor is it necessary in the operating system for networked applications that use adaptive playout schemes. However, it may be necessary to support multiple priorities (or hierarchical round-robin scheduling) in a system that supports multiple multimedia applications simultaneously, otherwise one might starve out the others. One concrete example would be a general purpose computer used to support video on demand. In a System V Unix like system (e.g.
Solaris, HP-UX, AIX and NT), you might implement the data and control paths from a multimedia I/O device to the network I/O device entirely within the operating system, rather than developing a special application and relying on the operating system (or having to put up with the kernel-to-user-level scheduling overheads!), or you might be able to use one of the more modern programming facilities such as kernel threads to program such an application more flexibly. REAL TIME SCHEDULING
-------------------------------------------------------------------------------
* The advent of continuous media may need real time scheduling * May not, though - can overprovision the system * Note that priorities would then be a sufficient mechanism SYNCHRONIZATION
-------------------------------------------------------------------------------
There are three places where synchronisation is important: 1. Within a stream, we need to make sure that transmitter and receiver are in synchronisation. This entails encoding a clock in the data, or else using a network that conveys the clock, or both. 2. Between separate streams, e.g. the video from two people in a videoconference, we might want to make sure that the relative timing perceived by one viewer at one site of the two streams is the same as that perceived by a different viewer at a different site - for example, a videoconference with 4 people, A, a, B, b, where A and B are sending, and a and b are watching, and a is near A and b is near B. A delay needs to be added to the stream from B to b, and to the stream from A to a, to create a level playing field. In a multicast situation, this delay is incorporated into the playout buffer as a baseline. 3. Between different media - e.g. lip synch. Synchronisation * Intra-stream synch - inside a stream, need to know where in the "time structure" a bit goes * Inter-stream - e.g. we are watching two different people and want to see their reactions to what they see of a third * Inter-media - this is just lip-synch! Intra-stream Synch Intra-stream synchronisation is a base part of the H.261 and MPEG coding systems. H.221 and MPEG Systems specify an encapsulation of multiple streams, and also how to carry timing information in the stream. In the Internet, the RTP media specific timestamp provides a general purpose way of carrying out the same function. Intra-stream Synch * Part of H.261 and MPEG and so on * Also in the RTP Internet Protocol spec Inter-Stream Synch The easiest way of synchronising between streams at different sites is based on providing a globally synchronised clock. There are two ways this might be done: 1. Have the network provide a clock. This is used in H.261/ISDN based systems. A single clock is propagated around a set of CODECs and MCUs. 2. Have a clock synchronisation protocol, such as NTP (the Network Time Protocol) or DTS (the Digital Time Service). This operates between all the computers in a data network, and continually exchanges messages between the computers to monitor: 1. Clock offsets 2. Network delays Alternatively, the media streams between sites could carry clock offset information, and the media timestamps together with arrival times could be used to measure network delays; the clocks could be adjusted accordingly, and then used to insert a baseline delay into the adaptive playout algorithms so that the different streams are all synchronised.
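The offset and delay monitoring mentioned above reduces to a small calculation per exchange. Assuming one request/response probe with four timestamps (client send, server receive, server send, client receive), the standard NTP-style estimate looks like this; the function name is ours, and real NTP does considerably more filtering over many such samples.

  def offset_and_delay(t1, t2, t3, t4):
      """t1: client sends probe, t2: server receives it, t3: server replies,
      t4: client receives reply (t1, t4 on the client clock; t2, t3 on the
      server clock).  Returns (estimated clock offset, round trip delay)."""
      offset = ((t2 - t1) + (t3 - t4)) / 2.0  # how far ahead the server clock is
      delay = (t4 - t1) - (t3 - t2)           # time actually spent in the network
      return offset, delay

  # The offset can then be used to set the baseline delay in the playout
  # buffers, so that streams from different sites are played out in step.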
Inter-stream Synch * Could have a global clock from the network * Could use clock synch between computers * Could carry a clock in all packets and use it for clock synch calculation a la NTP/DTS Inter-Media Synch There are two basic ways of synchronising different media: 1. Encapsulate the media in the same transmission stream. This is very effective but may entail computationally expensive labour at the recipient unravelling the streams - for example, H.221 works like this, but since it is designed to introduce only minimal delay in doing so, it is a bit-level framing protocol and is very hard to decode rapidly. 2. Use much the same scheme as is used to synchronise different sources from different places. However, since media from the same source are timestamped by the same clock, the offset calculation is a lot simpler, and can be done in the receiver only - basically, messages between an audio decoder and a video decoder can be exchanged inside the receiver and used to synchronise the playout points. This latter approach assumes that the media are timestamped at the "real" source (i.e. at the point of sampling, not at the point of transmission) to be accurate. Inter-media synch * Can multiplex different media in a single data stream, or * Can carry media timestamps, same as for inter-stream synchronisation * Not difficult, but may not be necessary either - depends on quality and delay bound requirements! Storage Media COMPACT DISK FORMATS (CD, CD-I, CD-I VIDEO ETC)
-------------------------------------------------------------------------------
CD was developed by Philips as a digital replacement for the old vinyl long-playing album, which was expensive, error prone and highly variable in quality. CD-ROM stands for "Compact Disc Read Only Memory". It is physically the same as a music CD (in fact, just about all CD-ROM computer drives will play music CDs, if only through the headphone output, but sometimes even by retrieving the music as if it were data, and then directing it to a digital audio output device, e.g. a SoundBlaster card on a PC!). A CD-ROM can hold about 650 megabytes of data (i.e. a few thousand floppies' worth), and is impervious to magnets and X-rays and even modest physical impact. However, CDs are a lot slower than most magnetic storage technology, and what is more, you cannot write a CD-ROM no matter how hard you try (although a machine for mastering them is not that expensive - typically around 10k, and most shops that have one will take your data and produce it on CD-ROM for around 1k for the first disc, and 1 dollar a disc thereafter!). CD-ROMs are exactly as good as CDs for reading sequential data (i.e. a sustained 1.4Mbps), but for any random access, the heads have to be moved. Unfortunately, so does the disc speed have to change, since the drive is designed to deliver a constant data rate at a constant physical recording density, so the disc spins faster when you are reading near the middle than at the outside (i.e. linear velocity is constant, so angular velocity is in inverse proportion to radius). So far, it has defeated technical design to make seeking on such a device at all reasonable. CD-ROM file formats are usually based on the old High Sierra design, now ratified as ISO 9660. This is fine for DOS machines, but is a bit limiting for UNIX systems, so people tend to use the Rock Ridge extensions. CD is not particularly flexible or high performance for anything, but it is a base piece of technology for the multimedia world.
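A quick back-of-the-envelope check of those two figures shows why CD is fine for streaming but painful as a random access store: even reading the whole disc sequentially at the sustained rate takes about an hour.

  capacity_bits = 650 * 8 * 10**6      # ~650 megabytes, expressed in bits
  rate_bps = 1.4 * 10**6               # ~1.4 Mbit/s sustained transfer rate
  seconds = capacity_bits / rate_bps   # ~3700 seconds
  print("full disc read takes about %.0f minutes" % (seconds / 60))  # ~62 minutes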
CD-I stands for Compact Disc Interactive. It was designed to provide a single format for all multimedia (especially educational) packages. However, while it tries to capitalize on CD, it adds an entire system (CPU, etc) to provide the interactive side of access. Unfortunately, this means it is very tied to the media, performance, and understanding of the structure of such systems as they were when it was designed, and it is also tied very much to one manufacturer, Philips, and is unlikely to be picked up by that many others. Storage Media * Conventional media are catching up (magnetic disks are 1K per gigabyte) * CD based technology is a useful stop gap * CD is very poor for random access, but fine for sequential access * DAT and Video 8 are also useful stopgaps
-------------------------------------------------------------------------------
CD * "Color Books": * Red: CD-DA (Digital Audio) * Yellow: CD-ROM (Read Only Memory) * Green: CD-i (interactive) * Orange: CD-R (Recordable) * White: CD-V (video): MPEG-1 CD-i * Multiple media: Audio - multilevel * Video - CD-V based * Text & Graphics: up to the application * Player: specification includes a CPU * Includes ADPCM audio and a separate video coder/decoder... Digital Video Interactive: DVI * DVI includes video, audio, image, text. * Two levels: Real Time and Production Quality * Media organised into streams, interleaved in a single file * Products based on Intel chipsets (i750/ActionMedia boards) DVI Operations * AVSS: Audio Video Support System * Supports interaction between playback and display and the host OS * Very PC specific * No multi-stream operations, so: * not good for conferencing. QuickTime * Media includes MPEG-1 video and lower quality codecs such as Road Pizza. * Photos are JPEG compressed * Organisation is into: {data, media, track, movie} * Hierarchy of types * Also interfaces to MIDI QuickTime Organisation * Data = file * Media = media type + start time + duration * Track = ordering of a media item - like an edit * Movie = group of tracks Multimedia PC: MPC * Media as per other schemes: * Audio - WAVE (Waveform Audio File format) * Music via MIDI * Image: based on DIB * Text+Graphics: RTF * Video: VfW - multiple codecs - e.g. Indeo MME * Multimedia Extensions for MPC: * RIFF: Resource Interchange File Format * Metaformat for describing contents in terms of media types. * Operations are a bit richer than in QuickTime or DVI MME Operations * Capability, Open/Close, Info, Pause, Play, Resume, Seek, Set, Status, Stop * Capability: e.g. Can Play, Can Eject * or Has Audio, Has Video etc * Architecture permits separate intelligence in Controllers and Device Drivers Director * Macromind Director is a typical authoring tool * Scores include channels * A channel includes tempo, palette, transitions, sounds etc and scripts * Scripts are like edit sequences Use of World Wide Web HTTP, HTML and MIME WWW - HYPERMEDIA
-------------------------------------------------------------------------------
The World Wide Web makes all previous network services look like stone tablets and smoke signals. In fact, the Web is better than that! It can read stone tablets and send smoke signals too! The World Wide Web service is made up of several components. Client programs (e.g. Mosaic, Lynx etc) access servers (e.g. HTTP daemons) using the protocol HTTP. Servers hold data, written in a language called HTML. HTML is the HyperText Markup Language.
As indicated by its name, it is a language (in other words it consists of keywords and a grammar for using them) for marking up text that is hyper! The pages in the World Wide Web are held in HTML format, and delivered from WWW servers to clients in this form, albeit wrapped in MIME (Multipurpose Internet Mail Extensions) and conveyed by HTTP. HTTP is the HyperText Transfer Protocol. What is WWW? * Distributed hypermedia database * Contents are described in MIME, Multipurpose Internet Mail Extensions * Servers hold data in HTML - HyperText Markup Language * Links are Universal Resource Locators * Access protocol is HTTP - HyperText Transfer Protocol A Note on Stateless Servers Almost all of the information servers above are described as stateless. State is what networking people call memory. One of the important design principles in the Internet has always been to minimize the number of places that need to keep track of who is doing what. In the case of stateless information servers this means that they do not keep track of which clients are accessing them. In other words, between one access and the next, the server and protocol are constructed in such a way that they do not care who, why, how, when or where the next access comes from. This is essential to the reliability of the server, and to making such systems work in very large scale networks such as the Internet with potentially huge numbers of clients: if the server did depend on a client, then any client failure, or network failure, would leave the server in the lurch, possibly not able to continue, or else serving other clients with reduced resources. Having said this, the idea of being stateless does not necessarily mean that the servers do not keep information about clients. For example: * Logging how many clients there are and from where they access. This can be useful even for sites that do not recoup funds for serving information, so that they can point at the effectiveness of their information service. * Keeping track of the most frequently accessed material. This can be useful to age and remove unaccessed information. It can also be used to decide to put frequently accessed information onto faster servers, or even move the information to the servers nearest the most frequent clients (called load balancing). * Using Access Control Lists to limit who can retrieve which information. Some servers allow the configuration of lists of Internet addresses, or even client users, who are (or are not) permitted access to all or particular information. * Using authentication stages before permitting access, and also to allow billing. While we would not recommend using the Internet to actually carry out billing yet, you can certainly employ secure authentication techniques that would identify a user beyond doubt. This can then be used with each access log, to calculate a bill which can then be sent out-of-band, e.g. by post. * Sharing out information on heavily loaded servers or networks, differentially, depending on where clients are. Some sites offer a wealth of information, but have less good long-haul Internet access. They will then distribute data more frequently in favor of local, site or national clients, above non-local or international ones.
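As a toy illustration of the stateless idea (in present-day Python, not anything a 1990s httpd actually looked like), the handler below serves each GET independently: the only thing that outlives a request is an append-only access log on disk, which is bookkeeping about clients rather than state the protocol depends on. The filename and port are arbitrary choices for the example.

  from http.server import BaseHTTPRequestHandler, HTTPServer

  class StatelessHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          # bookkeeping, kept in persistent store, not per-client state
          with open("access.log", "a") as log:
              log.write("%s %s\n" % (self.client_address[0], self.path))
          body = b"<HTML><BODY>Hello from a stateless server</BODY></HTML>"
          self.send_response(200)
          self.send_header("Content-Type", "text/html")
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  # HTTPServer(("", 8080), StatelessHandler).serve_forever()

If the process crashes between two requests, nothing a client depends on is lost - which is exactly the property argued for above.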
Stateless Servers * Do not track clients * Essential to scaling - * Can log clients (but in persistent store) * Can authenticate clients * Can load balance if no memory between one client access and another Caching Another use of the term stateless is to describe whether or not the server keeps note of the actual data from each access by a client (irrespective of whether it notes who the client was). This is called server caching. [Cache is usually, but not always pronounced the same way as cash. It is nothing to do with money, or even ATM, whether ATM stands for Automatic Teller Machine, or Asynchronous Transfer Mode, or even Another Terrible Mistake] Server Caching is a way of improving the response time of a server. Usually, servers keep data on disk. If they keep a copy of all the most frequently or most recently accessed data in memory, they may be able to respond to new (or repeating) clients more quickly. Such caching is usually configurable, and depends largely on measuring a whole lot of system parameters: * Disk speed and capacity versus Memory speed and capacity. 1. Obviously, if there isn't much memory in a system then a cache say of one item would have little effect. * Network speed versus disk speed 1. A memory cache is pointless if the network is always slower than the worst disk search!. * Client access patterns. 1. Clients may repeatedly access the same information. Different clients may tend to access the same information. Even if clients access different information over time, it may be that at one time, most people tend to access the same information (this is especially true of news servers or share information servers for example) Caching is also employed in client programs. In other words, a client program may well not only hand each piece of information to the user - it may also squirrel away a copy of recently accessed items to avoid having to bother the server again for subsequent repeat requests for the same items. In both server and client caching, the system should make sure that the actual master copy hasn't changed since the cache copy was taken. This can be quite complex! Caching * Trade off network, disk and memory speeds * Can optimise servers for client access patterns * Can cache in clients as welll as servers So, what is the World Wide Web? From the user point of view, the World Wide Web is information, a great tangled web of information. The user doesn't care anything (well, almost anything) about where the information is stored, about how it's stored, or about how it gets to her screen - she just says ``Oh, that looks interesting'', clicks the mouse, and (after a short time, or a long time if your link is slow and the file is large), the information arrives. Here's a short example: Example of using WWW A researcher is coming to London for a conference, and she needs information on hotels to stay in. Starting with the ``Internet Starting Points'', which is available directly from the ``Navigate'' menu on the screen, she might follow the a sequence like this: * Selecting ``Internet Starting Points'' fetches a * list of possible sources of information. 1. She sees that there's a highlighted phrase which says ``Web Servers Directory'', and she thinks ``aha maybe there's a WWW server in London''. She clicks on ``Web Servers Directory'', and after short delay the page arrives... ...On the Web Servers Directory, she searches down the list of countries until she finds the entry for the United Kingdom. 
One entry listed is ``Country Info'', and she wonders what info is provided. She clicks on it and... * ...``Country Info'' turns out to be an active map of * the UK. She clicks on London ... * ...and gets a guide to London, including an entry * labeled ``Hotels in central London''. She clicks on this and finds the information she was looking for. She didn't need to know very much information, other than where to start, and on most browsers there are a few suggested starting points built in. There are hundreds of other paths she could have followed to get to the same eventual destination. Example of WWW [Image] WWW Beneath the Surf Mosaic has a few well know places to look for data built in. One of these is specified by the URL: * http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/- * ,!StartingPoints/NetworkStartingPoints.html A URL is a Uniform Resource Locator. This specifies what a piece of information is called (``/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html''), where to find it (in this case the machine called www.ncsa.uiuc.edu), and which protocol to use to get the information (in this case http, or HyperText Transfer Protocol). When our researcher selects ``Internet Starting Points'', her Mosaic makes a TCP connection to the World Wide Web server running on www.ncsa.uiuc.edu. It then uses this connection to send a request for the data called ``NetworkStartingPoints.html''. The WWW server at NCSA uses the connection to send back the requested data, and then closes down the connection. Next, Mosaic reads various embedded commands in the data that was retrieved, and creates a nicely laid out page of text which it presents to our researcher. Some parts of the text she sees are highlighted (on Mosaic for X, they are underlined and coloured blue). One entry she sees is: Web_Servers_Directory_: The central listing of known World Wide Web servers. She simply clicks on the highlighted text, and the associated page of information is fetched ``as if by magic''. Of course, what actually happened was that the text she saw on screen was not the whole story. The page of data that was retrieved from NCSA was actually in a language called HTML or Hyper Text Markup Language. Before her copy of Mosaic laid out the text nicely, it actually looked something like: Web Servers Directory : The central listing of known World Wide Web servers. Thus the highlighted text she clicked on was associated with the URL: http://info.cern.ch/hypertext/DataSources/WWW/Servers.html and clicking on this text causes her Mosaic to make a connection to info.cern.ch to request the page called ``/hypertext/DataSources/WWW/Servers.html'' Our researcher may be sitting in Melbourne, Australia. The NCSA server is in Illinois, USA, and the CERN server is near Geneva in Switzerland, but none of this concerns our researcher - she just clicks on the highlighted items, and the hyper-links(4) behind them take her from server to server around the world. Unless she pays close attention to the URLs being requested, she will not know or care where the data is actually stored (except that some places have slower links than others). Another Example On the list of places she retrieved from the CERN server, she sees the entry: United Kingdom (sensitive_map_, country_info_) The HTML behind(5) this entry is actually: United Kingdom ( sensitive map, country info) She clicks on country info, thus requesting the HTML text with the URL: 1. 
http://www.cs.ucl.ac.uk/misc/uk/intro.html As before, her Mosaic sets up a connection, this time to www.cs.ucl.ac.uk, and retrieves the page called ``/misc/uk/intro.html''. However, this time the HTML her Mosaic gets back contains the command: Ignoring the ``ISMAP'' bit for a second, this says that the page should contain a GIF image at this point, and that the GIF image is called ``uk_map_lbl.gif''. Actually it's full URL is: http://www.cs.ucl.ac.uk/misc/uk/uk_map_lbl.gif which Mosaic can figure out from the URL of the page the image is to be contained in. Mosaic now sets up another connection to www.cs.ucl.ac.uk to request the image called ``/misc/uk/uk_map_lbl.gif'', and when it has retrieved the image, it displays it in the correct place in the text. Now, if it wasn't for the ISMAP part of this HTML, that's all that would happen - the image would be displayed, and our researcher could look at it. However, in this case, the image is a map of the UK, and we put some intelligence behind the map. The ISMAP part of the HTML tells our researcher's Mosaic that this image is special, and it will allow her to click on the map to get more information. MAP Example In actual fact, the full piece of HTML we used in this particular case was: * * 1. 2. 3. [Image] WWW Maps So, when our researcher sees London marked on the map, and she clicks on it, her Mosaic does something a little different. It sets up a connection to www.cs.ucl.ac.uk (that's where the map came from(7), and sends a request for the URL: http://www.cs.ucl.ac.uk/cgi-bin/imagemap/uk_map?404,451 Here 404,451 are the coordinates of the point she clicked within the map. The ISMAP command associated with the image tells Mosaic to work out where the user clicked, and send that information too. At the server on www.cs.ucl.ac.uk, there are a number of data files for maps. This special URL asks the server to look in its map data for ``uk_map'', and find what the point 404,451 corresponds to . The WWW server running on www.cs.ucl.ac.uk responds with the URL of the page corresponding to London on this map - in this case the URL is: http://www.cs.ucl.ac.uk/misc/uk/london.html which happens to be on the same server as the map, though it didn't have to be. Our researcher's Mosaic then sets up another connection to www.cs.ucl.ac.uk, and requests the page ``/misc/uk/london.html''. When this page is received, Mosaic parses the HTML text it gets back, and discovers the following line in the retrieved text: 1. and so it then also requests http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif which is just a little picture of Tower Bridge here in London, which doesn't have any special significance other than decorating the London page. Uniform Resource Locators (URLs) The above example presents quite a number of URLs. For instance the URL: http://www.cs.ucl.ac.uk/misc/uk/intro.html As we stated above, this says that the data called ``/misc/uk/intro.html'' can be retrieved from the server running on a computer called ``www.cs.ucl.ac.uk'' using http which is the HyperText Transfer Protocol. This could equally well say: http://www.cs.ucl.ac.uk:80/misc/uk/intro.html The number 80 here is the TCP port on the machine www.cs.ucl.ac.uk that the WWW server is listening on. TCP ports are a way that several different kinds of server can all listen on the same machine without getting confused about which server the connection is being made to (think about lots of letter boxes in an apartment block). 
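If you like, here is the same point made in a few lines of Python - purely an illustration, not anything a browser of the time would have run. The URL names the machine, optionally the letter box (port) number, and the data:

# Sketch: pulling the machine name, port and path out of a URL.
# urlsplit is from the Python standard library; the URLs are the ones above.
from urllib.parse import urlsplit

for url in ("http://www.cs.ucl.ac.uk/misc/uk/intro.html",
            "http://www.cs.ucl.ac.uk:80/misc/uk/intro.html"):
    parts = urlsplit(url)
    print(parts.hostname, parts.port or 80, parts.path)

Both URLs come out identically, because the second form just writes the default port out explicitly.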
Port 80 is the default port for the HyperText Transfer Protocol, so if you don't say which port to connect to, Mosaic and the other WWW browsers will all assume you mean port 80. See chapter 5 for more details about server ports and why you might sometimes run a server on a different port.
URLs
* http://www.cs.ucl.ac.uk/misc/uk/intro.html
* data called ``/misc/uk/intro.html'' can be retrieved
* from a server running on a computer called ``www.cs.ucl.ac.uk''
* using http, which is the HyperText Transfer Protocol.
More about URLs
URLs don't just have to specify that you use HTTP. For instance the URL:
ftp://cs.ucl.ac.uk/mice/index
says that to get this information, you contact the ftp server running on cs.ucl.ac.uk. Most WWW browsers know how to talk to ftp servers too, so they can set up an ftp connection, and request ``/mice/index'' using the much older File Transfer Protocol. One of the biggest plus points for Mosaic and other WWW browsers is that they are multiprotocol clients - that is, they know about quite a number of different protocols, and so they can contact a number of different types of servers for information. If the information is out there on the Internet, no matter what type of server it's on, there is almost certainly a way for a WWW browser to get it. The URL tells the browser what type of server the data resides on, and thus how to go about getting it.
More About URLs
Protocols that WWW browsers know about include:
* http: HyperText Transfer Protocol
* ftp: File Transfer Protocol
* gopher: the menu based information system predating WWW
* wais: Wide Area Information System - an information system allowing complex searching of databases
* telnet: the protocol that allows you to log in to remote systems.
* archie: the indexing system that allows you to find out what information is stored where on ftp servers.
An Introduction to HTML
HTML is the HyperText Markup Language. As indicated by its name, it is a language (in other words it consists of keywords and a grammar for using them) for marking up text that is hyper! HTML is an application of the fairly commonly used Standard Generalized Markup Language, SGML(8). The pages in the World Wide Web are held in HTML format, and delivered from WWW servers to clients in this form, albeit wrapped in MIME and conveyed by HTTP, of which more below. Marking up is an ancient skill developed in the Dark Ages of publishing by guilds of printers, keen on presenting the written word in a pleasant and effective way on the printed page. Typically, in recent years, the skill has diminished with the advent of WYSIWYG (What You See Is What You Get, so called whizzy wig) word processing packages and desktop publishing systems. This need not daunt you, since you do not have to author or prepare material for the World Wide Web in HTML directly, unless you really want to. Typically, an author will write material using whatever word processor they are used to, and then use a filter to translate the output into HTML. We will discuss some of the various filters that are available in later chapters.
Getting Started with HTML
A simple example of HTML is:
<HTML>
<HEAD>
<TITLE>
This is the Title
</TITLE>
</HEAD>
<BODY>
<H1>This is the Page Heading</H1>
<P>
This is the first paragraph.
<P>
This is another paragraph,
with a sentence
that is split over several lines in HTML.
</BODY>
</HTML>
1. When this is displayed by Mosaic, it will look like:
[Image]
2. As you can probably guess, commands are enclosed in angle-brackets <>, so that the HTML command <TITLE> means that the following text is part of the title.
3. Commands beginning </ are the end of the equivalent command. For example, to say that the text ``This is the Page Heading'' should be a level one heading (the largest type of heading), the complete sequence is:
<H1>This is the Page Heading</H1>
A break between paragraphs is denoted <P>. There is no need for a </P> afterwards because the end of a paragraph is obvious from the start of the next paragraph, list, heading or whatever.
<HTML>
Strictly speaking a page should start <HTML> and should end </HTML>, but the HTML specification also says that clients should perform correctly without them, and so many people omit them. Similarly the header of a document (the bit containing the title) should begin with <HEAD> and end with </HEAD> and the body of a document should begin with <BODY> and end with </BODY>, but in practice this isn't essential. The HEAD and BODY commands are newer additions to HTML, which allow some of the fancier features to be used, but if you're not using these features, you can safely omit both.
Documents written in HTML are not WYSIWYG - Mosaic and other WWW clients will re-arrange the layout of your text so it fits properly on whatever size display you try to display it on. So if you really want to break a line at a specific place, you should use <P>, rather than a carriage return, as Mosaic will remove the carriage return and replace it with a space, and then break your line of text at a point that is convenient for the current page width. Hence the text:
<P>
This is another paragraph,
with a sentence
that is split over several lines.
Will get formatted as:
This is another paragraph, with a sentence that is split over several lines.
Headings and Typefaces
We've already seen one type of heading, a top level heading denoted by the <H1>....</H1> pair. As you would expect, HTML supports many different levels of headers, with H1 being the largest, getting progressively smaller with H2, H3 and so on down to H6. Exactly which font and size a particular heading will be displayed with depends on which browser you use to view the text - some text based browsers won't do anything, but more fancy graphical browsers such as Mosaic will choose a sensible set of fonts. (9)
HTML also lets you specify that a piece of text should be in a bold typeface using the <B> ... </B> combination, or in an italic typeface using the <I> ... </I> combination. Thus the HTML:
<I>The</I> <B>Guardian</B> newspaper titles look like this
results in:
The Guardian newspaper titles look like this
Lists of things
Lists of things are pretty useful in ordinary text, but in HTML, where you'll often have lists of links to other places, they're even more useful. However, WWW servers just consisting of lists are pretty boring too, and with some imagination, you'll find more interesting ways to present many things. The simplest list is the bullet or unordered list, which is denoted by <UL>, and the list items in it are denoted using <LI>. An example is:
Oxymorons:
<UL>
<LI>Military Intelligence
<LI>Plastic Glasses
<LI>Moral Majority
</UL>
This would be displayed as:
Oxymorons:
* Military Intelligence
* Plastic glasses
* Moral majority
Another form of list is the numbered or ordered list denoted by <OL>.
Ordered lists have the same syntax as unordered lists except that OL replaces UL in the list delimiters:
Oxymorons:
<OL>
<LI>Business ethics
<LI>Chilli
</OL>
This gets displayed as:
Oxymorons:
1. Business ethics
2. Chilli
Definition Lists
A more complex type of list is the definition list, denoted by <DL>. Definition terms are denoted using <DT> and actual definition data is denoted using <DD>, so a typical list may be:
Population Statistics:
<DL>
<DT>Ireland
<DD>population 3 million
<DT>Scotland
<DD>population 5 million
<DT>England
<DD>population too many
</DL>
which would be presented as:
Population Statistics:
Ireland
population 3 million
Scotland
population 5 million
England
population too many
If you wish to have several paragraphs of definition data associated with one definition term, simply use several <DD> entries. Note that although the <DL> list must be finished with a </DL>, each <DT> or <DD> list item is simply ended by the next definition.
Making it all look pretty
Horizontal Rules
HTML provides the <HR> command to create a horizontal line across the page - judicious use of <HR> to split a page into sections can aid readability.
Pictures
However, when it comes to attractive layout, a picture is worth a thousand words, which is fine, except for the fact that pictures generally also require a thousand times as many bytes to be transferred. A picture can be included using an HTML command such as:
<img src=a_thousand_words.gif>
In this case, this tells Mosaic that there is a picture called ``a_thousand_words.gif'' on the remote server in the same directory (or folder) that this page of HTML was found in. A more complex example is:
<img src=http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif>
In this case, the image is specified with a complete URL, which tells Mosaic exactly where to go to fetch the picture. Note that the data for the picture does not need to reside on the same server as the document that it is embedded in. Also note that we've omitted the quotes from around this URL - although it's not a bad idea to add them for the sake of clarity, or for URLs containing odd characters such as spaces, they're not strictly necessary in most circumstances.
Displaying Images - Launching Applications
In order for an image to be displayed in a page of a document, it must be in one of a small number of formats. However, not all formats are displayable on all browsers.
* gif - a compressed 8 bit image format. Viewable on most browsers that support images.
* xbm - X Bitmap - two color uncompressed format. Viewable on most browsers that support images. The background and foreground colors of the image are typically displayed in the background and foreground colors of your browser.
* xpm - X Pixmap - multicolour X format. Not viewable on all browsers - some versions of MacMosaic can't view this, for example. The background color is displayed in the background color of your browser, which enables the image to merge into your document nicely.
Although many other image formats are viewable using an external viewer program, they are not necessarily viewable as embedded images on your browser.
Linking it all together
We gave an example above of an image that can be stored on a different server from the text page that it is to be embedded in - this is an example of a hyper-link. Hyper links are what turn the Web from a not terribly good text formatting system into the tangled Web of information that makes the World Wide Web interesting.
They're both the mechanism by which you find things, and the way of tying multiple media or data from multiple sources together. The example we gave above was for an embedded image, and will be downloaded automatically (10). However, in most cases you only want the hyper link to be followed when the user clicks on it. An example is: Pictures of <A HREF=http://www.cs.ucl.ac.uk/people/mhandley.html> Mark</A> and <A HREF=http://www.cs.ucl.ac.uk/people/jon.html>Jon</A> are available for those with a strong stomach. This will be displayed as: * Pictures of Mark_and Jon_are available for those * with a strong stomach. 1. If now click on Mark, or on Jon you will be presented with a glorious full color picture of one of the authors. 2. The <A>..</A> in the text above denote an anchor - in other words some additional information that has been associated with the text. In the case the anchor has a hypertext reference denoted by the keyword HREF and the URL corresponding to that reference. Other information can also be associated with an anchor - see later. Hotlists Users can construct indexes by creating lists of URLs. Most client programs allow people to do this easily. Many users then advertise these hotlists by adding them to their own pages in their own web servers. Some sites keep hotlists or bookmarks organized by subject or by research interest. Some sites even let users submit new entries for their indexes. This allows navigation (although it doesn't really help searching) in the Web. Each hotlist or list of bookmarks represents another tour or view of the places of interest to the author of that hotlist. As more and more sites and users construct such lists, the density or value of referenced information increases. More Pretty Pictures A picture is worth a thousand words. Unfortunately this is an understatement, and it is often actually more like the equivalent of 50,000 words, or 250 KBytes. Thus embedding large pictures in pages of text is usually not a good idea. More typical is to include a small copy of the image in the document, with a hyper link to the larger version of the image. An example would be: <A HREF=big_ben.gif><IMG SRC=little_ben.gif></A> 1. In this case, it is an image little_ben.gif that has been given an anchor with a hyperlink to big_ben.gif. Mosaic will display the small image embedded in the page of text, and will only retrieve and externally display the large image big_ben.gif if the user should click on the small image. 2. Images such as the one described are called external images to distinguish them from embedded or inline images. Most WWW browsers use a separate viewer program to display external images. On UNIX systems, the most common external viewer program is XV. On Apple Mac's the external viewer is called JPEG View. On Windows PC's it is called LVIEW. Generally external viewer programs do not come bundled with the WWW browser, and you'll have to obtain one separately. Usually external viewers can display a larger range of images than the WWW browser itself can, though this is changing as WWW browsers become more sophisticated. Links Within a Page The hyper links we've shown so far all take you to the top of the page at the end of the link. However, it's useful to be able to jump to specific places within a page too. For instance, where a page is quite long, it is useful to be able to have a summary of the page at the top, with hyper links directly to the summarized sections. This can be done by associating names with anchors as follows. 
If this course was called example.html and we wanted to make it available online, we might put a list of contents at the top:
<UL>
...
<LI> <A HREF="example.html#links">Go to Section 1</A>
<LI> <A HREF="example.html#more_pics">More Pretty Pictures</A>
<LI> <A HREF="example.html#page_links">Links Within a Page</A>
...
</UL>
...
...
<A NAME="page_links"><H2>Links Within a Page</H2></A>
The hyper links we've shown so far.....
Now if you click on the ``Links Within a Page'' entry in the contents list, your browser will jump to the document with the partial URL example.html#page_links. As we're already viewing the document called example.html, it doesn't bother to fetch the page again, but merely jumps directly to the anchor named page_links.
Pre-Formatted Text
Often you'll come across some text that you wish to put on a WWW server that is pre-formatted plain text. You could of course go through the text and insert all the necessary HTML formatting commands, but often all you want to do is stop a WWW browser re-formatting it for you. HTML provides the command pair <PRE>...</PRE> to delimit text you don't want re-formatted.
this text will
be reformatted
by the browser
<PRE>
and this text
will not be
reformatted
</PRE>
would look like:
this text will be reformatted by the browser
and this text
will not be
reformatted
Note that the preformatted text will be displayed in a fixed width typewriter style font. Typewriter style fonts are fixed width - i.e. all the characters are the same width. Book fonts and the default fonts used by WWW clients such as Mosaic are variable width. You should avoid overuse of <PRE>, as it doesn't allow WWW browsers any leeway in doing anything clever about line wrapping, and because typewriter style fonts are pretty ugly.
A note on links
In the examples above, we've shown two forms of links - an absolute URL such as is used in this image link:
<img src=http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif>
and relative links such as:
<img src=tower_bridge.gif>
If this relative link were in a page of HTML with the URL http://www.cs.ucl.ac.uk/uk/london/index.html then the client assumes that the protocol (http), the remote computer (www.cs.ucl.ac.uk) and the directory (/uk/london) are all the same as those in the page containing the link, and so it actually requests the data with the absolute URL http://www.cs.ucl.ac.uk/uk/london/tower_bridge.gif
Another possibility is to specify relative URLs with the full directory and filename - the client knows that you mean this because the directory name begins with a slash (``/''). For example, the relative link above could have also been written:
<img src=/uk/london/tower_bridge.gif>
You can even use relative directory names using UNIX style relative pathnames. For example, an HTML page with the URL http://www.cs.ucl.ac.uk/uk/intro.html could use the following link to the same picture of Tower Bridge:
<img src=london/tower_bridge.gif>
and an HTML page with the URL http://www.cs.ucl.ac.uk/uk/london/east_end/docks.html could use a link such as:
<img src=../tower_bridge.gif>
Note that the ``../'' here refers to the parent(13) directory of the current directory in the directory tree.
A WWW Server listens on a TCP port(1) for incoming connections from clients. It expects a connecting client to speak a protocol called HTTP or HyperText Transfer Protocol. The connecting client is usually a browser such as Mosaic, which will request some information from the server, and the server will then return the requested information to the client(2).
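Before poking at the protocol by hand, here is roughly that exchange sketched in Python. This is a toy illustration rather than any real browser's code, and the host and path are simply the ones used in the example below:

# Sketch: the request/response exchange a WWW client performs.
# Open a TCP connection, send a GET, read back the headers and the HTML.
import socket

def http_get(host, path="/", port=80):
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    with socket.create_connection((host, port)) as s:
        s.sendall(request.encode("ascii"))
        reply = b""
        while chunk := s.recv(4096):
            reply += chunk
    header, _, body = reply.partition(b"\r\n\r\n")
    return header.decode("ascii", "replace"), body

headers, html = http_get("macpb1.cs.ucl.ac.uk", "/index.html")
print(headers)   # status line, Server, Content-type and friends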
HTTP is a pretty simple protocol. If you want to see what actually happens, you can telnet to a WWW server and talk to it yourself(3). The simplest HTTP request is GET. An example of telnetting to a server and issuing a GET request is:
telnet> open macpb1.cs.ucl.ac.uk 80
Connected to macpb1.cs.ucl.ac.uk
Escape character is '^]'.
GET /index.html HTTP/1.0

HTTP/1.0 200 OK
MIME-Version: 1.0
Server: MacHTTP
Content-type: text/html

<title>Mark's Powerbook on the Web
</title>
<h1>Welcome to Mark's WWW server</h1>
This temporary server is running on an Apple Macintosh Powerbook 180 using MacHTTP 1.3. There's not much here right now, except for the HTTP documentation. The request I made was ``GET /index.html'' and additionally told the server I spoke ``HTTP/1.0''. The server responded with the document index.html, and with additional information. The first line of the response says that the server is also speaking ``HTTP/1.0'', that the status code my request returned was ``200'', which in human terms means ``OK''. The next line gives information about the version of MIME. Then there's a line that says what type of server this was. And finally there's a line that says the ``Content-Type'' is ``text/html''. This last line is actually giving the MIME content type, which is how the server tells the client what to do with the information that follows. In this case it says that what follows is actually ``text'' (as opposed to an image, video, audio or a whole host of other possibilities), and that this particular text is in ``html'' format. If we'd asked for this information using a WWW client instead of telnet, the client would have read the Content-Type line, and known to feed the data following into it's HTML interpreter. MIME MIME stands for Multipurpose Internet Mail Extensions, and was originally designed for sending multimedia electronic mail. The two main things it does are specify in a standard way what type of media the contents of a message actually are, and what form they've been encoded in for transmission. When Tim Berners-Lee was originally designing what would go on to become the world wide web, he had exactly this same requirement - he needed a server to be able to specify to a client what a response contained and how it had been encoded. The email people had got there first, and had already specified MIME, so there was no need to re-invent the wheel. MIME Content Types MIME Content Types consist of a type (such as ``text'') and a subtype (such as ``html''). The most common MIME types relevant to the WWW are: ``text'' Content-Type, which is used to represent textual information in a number of character sets and formatted text description languages in a standardized manner. The two most likely subtypes are: * text/plain - text with no special formatting require-ments. * text/html - text with embedded HTML commands ``application'' Content-Type, which is used to transmit application data or binary data. Two frequently used subtypes are: * application/binary - the data is in some unknown binary format, such as the results of a file transfer. * application/postscript - the data is in the postscript language, and should be feed to a postscript interpreter. ``image'' Content-Type, for transmitting still image (picture) data. There are many possible subtypes, but the ones used most often on the web are: * image/gif - an image in the GIF format. * image/xbm - an image in the X Bitmap format. * image/jpeg - an image in the JPEG format. ``audio'' Content-Type, for transmitting audio or voice data. * audio/basic - the data consists of 8KHz 8 bit mu-law audio samples. ``video'' Content-Type, for transmitting video or moving image data, possibly with audio as part of the composite video data format. * video/mpeg - the data is MPEG format video * video/quicktime - the data is QuickTime formet video Suffixes, Servers and MIME types Now we know how a server tells a client what type of information is being returned, but how does the server figure out this information? 
In the UNIX and DOS world, files are usually identified using file name suffixes. A file called london_zoo.gif is likely to be an image in the GIF format. Servers typically have a set of built in suffixes that they assume denote particular content types. They also let you specify the content types of your own suffixes in case you have any local oddities, or something new that the server designer hadn't thought of. URLs and Server File Systems WWW servers generally reside on machines with a file system(4) . The server's job is to make part of that file system publicly available by responding to HTTP requests. Its job is also to prevent the private parts of that file system from becoming public.Most file systems can be thought of as a form of tree, and the URLs used in the World Wide Web also use this model. Thus the URL: http://www.cs.ucl.ac.uk/misc/uk/london.html specifies the file called london.html which is in a directory called uk, which in turn is in a directory called misc. misc is also a directory, and it resides in the top level directory of the tree, which is sometimes simply called ``/'' (pronounced ``slash''). When the URL above specifies /misc/uk/london.html, this does not usually mean that the misc directory is really situated in the root directory of the entire file system. Instead it is situated in the root directory of the subtree that the WWW server makes public. Any documents situated in this subtree are accessible to the server, and directories that are not in thissubtree are not accessible . However, most servers also allow you to provide some form of access control to files and subdirectories of the visible subtree. This protection can take the form of restrictions on which machines or networks a client can access a file from, or it may take the form of password protection. Which mechanisms a server provides depend on which server you choose, and we'll discuss a few of the better servers later. Multiuser sites Another issue is raised where a server is running on a Machine in a large multi-user environment such as a university.For instance, each student in a university can write files to their own fluster, but not anywhere else. However, we'd like our students to be able to create their own WWW pages, despite not having access to the WWW server's default public tree. UNIX servers usually make available files placed in a special directory in the user's home directory. On NCSA and CERN servers, this directory is called ``public_html'' by default. Thus accesses to the URL http://www.euphoric-state-uni.edu/"janet/research/index.html would map onto the file: /usr/home/janet/public_html/research/index.html in the filesystem. Once we start to allow the WWW server access to areas of our filesystem which can be modified by users that we don't necessarily trust, a whole set of security issues are raised. For instance, Unix allows symbolic links from one place in the directory tree to another to give the impression that files or directories are someplace else (Mac's call symbolic links ``Aliases''). Letting the server follow links can be useful, but it also can create problems. Just because a file is readable by other users on your own system does not necessarily mean it should be readable by users in other sites or countries! Server Scripts The ability to define new programs to be run in the server when a request is made that really makes the Web flexible and Fun. 
An example is an active map, where a user clicks on a map, and the place they clicked is sent to the server along with their request. The server then runs a program or script which figures out where those coordinates apply to, and, depending on where the user clicked, it sends them the relevant next page of information. Another example is Cambridge University's coffee machine - they have a video camera pointed at the coffee pot, and a server script captures a picture of it using a video framegrabber, and sends the image to you so that you can see whether there's any coffee ready. A standard called CGI or Common Gateway Interface has emerged for the writing of server scripts, and is supported by most servers. This means that scripts written for one server should be easily ported to another server. Available Servers There are many WWW servers available, and more seem to be released each month. At the time of writing CERN's ``list of available servers'' (6) lists the following servers. We don't give the individual URLs here, as some of them would become out of date too quickly - instead we encourage you to look at CERN's list. CERN HTTPD Version 3.0 The CERN HTTPD server is probably the most fully featured WWW server. It supports much the same range of features as NCSA's server, with the addition of acting as a caching proxy server. If you have used a WWW client such as Mosaic, you have probablyalready used a proxy client. Mosaic and other clients built upon LibWWW can contact servers for protocols such as ftp and gopher, and then convert the output of such servers into HTML for formatting and display on your screen. Proxy servers take this one step further - instead of your client contacting remote servers directly, your client makes an HTTP request to a proxy server. The proxy server then contacts the relevant FTP or GOPHER server, and converts the results to HTML, before transferring them back to your client . Proxy Cache Servers A proxy server can also make connections to remote HTTP servers. At first glance, this wouldn't appear to benefit you, as the proxy then performs no conversion functionality, but it provides a way to provide network services to machines on a secure subnet without those machines having to have direct access to the outside world. Thus secure sites can run a proxy server on their firewall machine, or SOCKSify only their proxy server without needing to modify the WWW client programs for all their different architectures. Even if you do not need this level of security, CERN's HTTPD can also provide caching facilities for clients using the server as a proxy. Caching facilities in the World Wide Web are currently in their infancy, as many servers do not return expiry date information with documents, so deciding how long data should be cached before going back to look at the original is not a clear cut issue. However, CERN's server uses whatever information is available to it to make a decision about cache timeouts, and although it doesn't always do the right thing, it does substantially improve performance for frequently accessed pages, and most of the time it gets it right. A Proxy Server on a Firewall * 3.2 CERN HTTPD Configuration The CERN HTTPD requires a single configuration file to function. 
By default, CERN HTTPD looks for this file as ``/etc/httpd.conf'', but it can be held elsewhere and the server told where it is using the -r command line flag.The list of configuration options that CERN HTTPD supports is very extensive, and we encourage you to read the document CERN HTTPD Reference Manual. Most of the default options are fine to get you started. Enabling Security on the CERN server The CERN HTTPD server has a fairly sophisticated set of security features that can be enabled. Basically, they fall into three categories: Restricting hosts that can access areas of the server. Restricting users that can access areas of the server Restricting access to individual files Common Gateway Interface (CGI) Before CGI, each server passed the query information into a script in its own way. Unfortunately this made it difficult to write gateways that would work on more than one type of server, so a few of the server developers got together and CGI was the result. Some servers don't yet support CGI, but most of the popular ones now do. Writing CGI scripts CGI passes the information a script needs into the script in environment variables. The most important two are: * QUERY_STRING The server will put the part of the URL after the first ``?'' in QUERY_STRING * PATH_INFO The server will put the part of the path name after the script name in PATH_INFO For instance, if we sent a request to the server with the URL: http://www.cs.ucl.ac.uk/cgi-bin/htimage/usr/www/img/uk_map?404,451 and we had cgi-bin configured as a scripts directory, then the server would run the script called htimage. It would then pass the remaining path information ``/usr/www/img/uk_map'' to htimage in the PATH_INFO environment variable, and pass ``404,451'' in the QUERY_STRING variable. In this case, htimage is a script for implementing active maps supplied with the CERN HTTPD. The server expects the script program to produce some output on its standard output. It first expects to see a short MIME Header, followed by a blank line, and then any other output the script wants returned to the client. The MIME header must have one or more of the following directives: * Content-Type: type/subtype This specifies the form of any output that follows. * Location: URL This specifies that the client should request the given URL rather than display the output. This is a redirection. Some servers may allow the URL to be a short URL specifying only the file name and path - in this case the server will usually return the relevant file directly to the client, rather than sending a redirection. The short MIME header can optionally contain a number of other MIME header fields, which will also be checked by the server which will add any missing fields before passing the combined reply to the client. Under some circumstances, the script may want to create the entire MIME header itself. For instance, you may want to do this if you want to specify expiry dates or status codes yourself, and don't need the server to parse your header and insert any missing fields. In this case, both the CERN and NCSA servers recognize scripts whose name begins ``nph-'' as having a ``no parse header'', and will not modify the reply at all. Under these circumstances your script will need access to extra information to be able to fill out all the header fields correctly, and so this information is also available via CGI environment variables. 
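As a concrete illustration of that interface, here is a small CGI script written in Python. It is a hypothetical stand-in for a map-lookup script such as htimage, not the real thing: it reads QUERY_STRING and PATH_INFO from the environment the server sets up, and writes either a Content-Type header followed by some HTML, or a Location header asking for a redirection. The region test and the URLs are invented for the example.

#!/usr/bin/env python3
# Hypothetical CGI script: turn a pair of map coordinates into a redirection.
# The hard-wired region and the URLs are made up for the illustration.
import os
import sys

query = os.environ.get("QUERY_STRING", "")   # e.g. "404,451"
path_info = os.environ.get("PATH_INFO", "")  # e.g. "/usr/www/img/uk_map"

try:
    x, y = (int(v) for v in query.split(","))
except ValueError:
    print("Content-Type: text/html")
    print()
    print("<H1>That didn't look like a pair of coordinates</H1>")
    sys.exit(0)

# A real imagemap script would look the point up in the map data named by
# PATH_INFO; a single hard-wired region stands in for that lookup here.
if 350 <= x <= 450 and 400 <= y <= 500:
    # Redirection: the client should fetch this URL instead of seeing output.
    print("Location: http://www.cs.ucl.ac.uk/misc/uk/london.html")
    print()
else:
    print("Content-Type: text/html")
    print()
    print(f"<H1>Nothing found at {x},{y} in {path_info}</H1>")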
Handling Active Maps One nice feature which is now supported by most graphical WWW clients is the ISMAP active map command, which can be associated with an HTML inline image. This tells the WWW client to supply the x and y coordinates of the point the user clicks on within the image. For example, this HTML tells the client this image is an active map: * * * When the user clicks on the map at, say, point (404,451), her client will submit a GET request to the server: GET /cgi-bin/imagemap/uk_map?404,451 For this to do anything interesting, the server must interpret ``/cgi-bin/imagemap/uk_map'' as something special - a command to be executed rather than a file to be retrieved. How the server decides this is a command depends on the type of server, but whichever server you run, the ``404,451'' part will then be passed to the command as parameters. When the command is executed, it could generate output that is to be returned directly to the client - for instance the command could generate HTML directly as output. However the usual way imagemaps are used is to access other existing pages of HTML using HTTP redirection. This is where the server first returns to the client the URL of the place to look for the page corresponding to the place they clicked on the map, and then the client goes and requests this new URL (usually without bothering to ask the user). Handling Forms Forms are one way the World Wide Web allows users to submit information to servers. All the mechanisms described so far allow users to choose from a set of available options. Forms let the user type information into their web browser and then get the server to run a program with their submission as input. Examples of things you might type are keys to search a database (e.g. what films was Zazu Pits in?). Laying Out Forms HTML provides a number of commands for telling the client to do something special. The first command is FORM which tells the client that everything between one
<FORM> command and the next </FORM>
terminator is part of the same form. The form command can take a number of attributes:
* ACTION=http://www.host.name/cgi-bin/query This gives the URL of the script to run when the form is submitted. You must supply an ACTION attribute with the FORM command.
* METHOD=GET This is the default method for submitting a form. The contents of the form will be added to the end of the URL that is sent to the server.
* METHOD=POST The POST method causes the information contained in the form to be sent to the server in the body of the request.
* ENCTYPE=application/x-www-form-urlencoded This specifies how the information the user typed into the form should be encoded. Currently only the default, ``application/x-www-form-urlencoded'', is allowed.
If your server supports the POST method, it is advisable to use it, as if you use the GET method, it's possible that long forms will be truncated when they're passed from the server to the script.
The INPUT command
Now that you have an empty form, you probably want to provide some boxes and buttons that the user can set. These are created using the INPUT tag. This is used in a similar way to the IMG tag for images - there's no need for a terminating tag as it doesn't surround anything. There are several types of INPUT tag, denoted by the TYPE attribute:
TYPE=text, for example <INPUT TYPE=text NAME=users_name>
This is a simple text entry field that we've called ``users_name''. The user never sees this NAME attribute displayed on her client - it is purely so we can keep track of which field is which when we come to process the form. Text entries also allow you to specify:
* VALUE="enter your name here" This lets you specify the default text to appear in the entry box.
* SIZE=60,3 This lets you specify the size of the entry box in characters. For example, the above says the entry box should be 60 characters wide and three lines high.
* MAXLENGTH=8 This lets you specify the maximum number of characters you'll allow to be entered in a single line text entry box. For instance, you might only allow a user to enter eight characters as their user name.
TYPE=password
This is also a text entry field, but the characters the user types are displayed as stars so that other people can't read the password from their screen. Password fields also support the VALUE, SIZE and MAXLENGTH attributes.
TYPE=checkbox
This is a single button which is either on or off. Checkboxes also support the following attributes:
* VALUE="true" This is the value to return if the checkbox is set to ``on''. If it's set to ``off'', no value is returned.
* CHECKED This says that the checkbox is ``on'' by default.
TYPE=radio
These are a collection of buttons. Radio buttons with the same name are grouped together so that selecting one of them turns the others off, like the channel tuning buttons on some radios. Radio buttons also support the VALUE and CHECKED attributes, but only one radio button can be specified as CHECKED.
TYPE=submit
This is a button that submits the contents of the form to the server using the method in the surrounding FORM. Submit buttons don't have a NAME attribute, but you can specify the label for the button using a VALUE attribute.
TYPE=reset
This is a button that causes the various boxes and buttons in the form to reset to their default values. Reset buttons also don't have a NAME attribute, and allow a VALUE attribute to label the button.
The SELECT Command
If you want to provide the user with a long list of items to choose from, it's not very natural to use radio buttons, so HTML provides another command - SELECT. Unlike INPUT, this does have a closing tag. Each option within the list is denoted using the