iWarp: Anatomy of a Parallel Computing System
Table of Contents

Foreword (Gordon Bell)
Preface

1 Introduction

2 iWarp Overview
    2.1 Node Architecture
    2.2 Interprocessor Communication
            2.2.1 Logical Channels
            2.2.2 Pathways
    2.3 iWarp Component Architecture
            2.3.1 Major Functional Units
            2.3.2 Instruction Set
            2.3.3 Coupling between Communication and Computation
    2.4 iWarp Software
            2.4.1 Program Development
            2.4.2 Program Execution
    2.5 Chapter Summary

3 Logical Channels
    3.1 The Basic Idea
    3.2 Hardware Structures for Logical Channels
            3.2.1 Communication Buses
            3.2.2 Queues
            3.2.3 State Variables
    3.3 Defining and Undefining Logical Channels
    3.4 Enabling and Disabling Logical Channels
    3.5 Reading and Writing Logical Channels
            3.5.1 Gates
            3.5.2 Spools
    3.6 Moving Data across Logical Channels
    3.7 Quiescence
    3.8 Shared Input Queues
    3.9 Multiple Logical Channels
            3.9.1 Fairness
    3.10 Chapter Summary

4 Pathways
    4.1 The Basic Idea
    4.2 Hardware Structures for Pathways
            4.2.1 Configuration Sets
            4.2.2 Control Words
    4.3 Pathway Sequences
    4.4 Message Sequences
    4.5 Initializing the Pathway Structures
    4.6 Allocating Input Queues
    4.7 Creating Pathways
            4.7.1 Activity on the Source Node
            4.7.2 Activity on an Intermediate Node
            4.7.3 Activity on the Sink Node
    4.8 Destroying Pathways
    4.9 Routing Messages along Pathways
    4.10 Stop Conditions
    4.11 Pathway Issues
            4.11.1 Node and Bus Congestion
            4.11.2 DQ Conflicts
            4.11.3 Flying Dutchmen
            4.11.4 Routing Deadlock
    4.12 Chapter Summary

5 The Processing Agent
    5.1 Overview
    5.2 Basic Processor Operation
    5.3 Instruction Repertoire
            5.3.1 Instruction Formats
            5.3.2 Instruction Level Parallelism
            5.3.3 Memory Access
            5.3.4 Encoding
            5.3.5 Double Precision Arithmetic
    5.4 Processor Control
    5.5 Events
            5.5.1 Event Hierarchy
            5.5.2 Raising an Event
            5.5.3 Discussion
    5.6 Interface to the Communication Agent
            5.6.1 Explicit Transfers
            5.6.2 Control
            5.6.3 Spooling
    5.7 Chapter Summary

6 The iWarp Parallel System
    6.1 iWarp Node
    6.2 Card Cages and Cabinets
    6.3 External Interfaces
            6.3.1 Interface to a Host
            6.3.2 High-Speed Networks
    6.4 Internode Signaling
    6.5 System Integration
            6.5.1 System Reporting Unit
            6.5.2 Safety Nets
    6.6 Chapter Summary

7 Program Development Tool Chain
    7.1 Overview
    7.2 Array and Node Tools
    7.3 Array Modules
            7.3.1 Ports
            7.3.2 Nodeprograms and Modules
            7.3.3 Naming Connections
            7.3.4 Loading: Connections and Placement
            7.3.5 Operations Involving Ports
            7.3.6 The C Interface
            7.3.7 Hierarchy and Modules
            7.3.8 Extensions
    7.4 Chapter Summary

8 Compilers
    8.1 Choice of Language
    8.2 Compiler Model
    8.3 Extensions to C
    8.4 Gates
    8.5 Other Issues
            8.5.1 Byte Order
    8.6 Code Generation
    8.7 Machine Model
    8.8 Code Selection
    8.9 Scheduling
    8.10 Code Generation Summary
    8.11 Chapter Summary

9 Runtime System
    9.1 Overview
    9.2 iwRTS Communication
            9.2.1 Resources
            9.2.2 Buffer Space Management
            9.2.3 Datamover Services
    9.3 Services
            9.3.1 User Communication: imsg
            9.3.2 Core Services
            9.3.3 File System Services
            9.3.4 System Calls
            9.3.5 Event Handlers
    9.4 Booting
    9.5 Chapter Summary

10 Communication Styles
    10.1 Connections
    10.2 Connection Management
            10.2.1 Managing Connections Collectively
            10.2.2 Managing Connections Individually
            10.2.3 Mixing Connection Management Styles
    10.3 Message Management
            10.3.1 Determining the Endpoints of a Message
            10.3.2 Producing and Consuming Messages at the Endpoints
            10.3.3 Traffic Pattern over a Connection
            10.3.4 Forwarding Messages
    10.4 Chapter Summary

11 Communication Operations
    11.1 Basic Performance Constants
    11.2 Connecting iWarp Nodes
    11.3 Pair-wise Communication Operations
            11.3.1 Remote Copy
            11.3.2 Remote Exchange
    11.4 Collective Communication Operations
            11.4.1 Barrier Synchronization
            11.4.2 Subset Barrier Synchronization
            11.4.3 Broadcast
            11.4.4 Scatter/Gather
            11.4.5 Reduction
            11.4.6 Hypercube Algorithms
            11.4.7 Block Transpose
            11.4.8 All-to-all Personalized Communication
    11.5 Message Passing
            11.5.1 The Basic Idea
            11.5.2 Direct Deposit Message Passing
    11.6 Chapter Summary

12 Applications
    12.1 Phase-Rotation FFT
            12.1.1 Overview
            12.1.2 Parallel Implementation
            12.1.3 Performance
    12.2 Multidimensional FFT
            12.2.1 Overview
            12.2.2 Parallel Implementation
            12.2.3 Performance
    12.3 Finite Element Simulations
            12.3.1 Overview
            12.3.2 Parallel Implementation
            12.3.3 Performance
    12.4 Multibaseline Stereo Imaging
            12.4.1 Overview
            12.4.2 Parallel Implementation
            12.4.3 Performance
    12.5 Airborne Sonar
            12.5.1 Overview
            12.5.2 Parallel Implementation
            12.5.3 Performance
    12.6 Chapter Summary

13 iWarp Project
    13.1 The Warp Project
    13.2 From Warp to iWarp
    13.3 The CMU/Intel Relationship
    13.4 Project Management
            13.4.1 Component Design
            13.4.2 Software Design
            13.4.3 Board Design
    13.5 The Bugs
            13.5.1 Nuisances
            13.5.2 Serious Bugs
            13.5.3 Killer Bugs
    13.6 Commercialization
    13.7 Concluding Remarks

A Evolution of Systolic Computers
    A.1 Introduction
    A.2 Systolic Computing
            A.2.1 Balance
            A.2.2 Scaling
            A.2.3 Balance and Scaling
    A.3 The First Generation: 1977-1982
            A.3.1 Systolic Convolution Chip
            A.3.2 ESL Systolic Processor
    A.4 Programmable Systolic Arrays
            A.4.1 NOSC Systolic Array Testbed
            A.4.2 PSC -- Programmable Systolic Chip
            A.4.3 GAPP
            A.4.4 Warp
    A.5 Integrated Systolic Systems
    A.6 Concluding Remarks

B Instruction Summary

C System Summary
    C.1 Component
    C.2 Node
    C.3 Board
    C.4 Card Cage
    C.5 Cabinet
    C.6 Array

Afterword (H. T. Kung)
References
Index