References
- 1
-
W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie.
Automatic program transformations for virtual memory computers.
Proc. of the 1979 National Computer Conference, pages 969-974,
June 1979.
- 2
-
S. Adve and M. Hill.
Weak ordering - A new definition.
In Proceedings of the 17th Annual International Symposium on
Computer Architecture, pages 2-14, May 1990.
- 3
-
A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz.
APRIL: A processor architecture for multiprocessing.
In Proceedings of the 17th Annual International Symposium on
Computer Architecture, pages 104-114, May 1990.
- 4
-
S. P. Amarasinghe and M. S. Lam.
Communication optimization and code generation for distributed memory
machines.
In Proceedings of the SIGPLAN '93 Conference on Programming
Language Design and Implementation, pages 126-138, June 1993.
- 5
-
J. M. Anderson and M. S. Lam.
Global optimizations for parallelism and locality on scalable
parallel machines.
In Proceedings of the SIGPLAN '93 Conference on Programming
Language Design and Implementation, pages 112-125, June 1993.
- 6
-
J. Archibald and J.-L. Baer.
Cache coherence protocols: Evaluation using a multiprocessor
simulation model.
ACM Transactions on Computer Systems, 4(4):273-298, 1986.
- 7
-
J.-L. Baer and T.-F. Chen.
An effective on-chip preloading scheme to reduce data access penalty.
In Proceedings of Supercomputing '91, 1991.
- 8
-
D. Bailey, J. Barton, T. Lasinski, and H. Simon.
The NAS Parallel Benchmarks.
Technical Report RNR-91-002, NASA Ames Research Center, August 1991.
- 9
-
D. Callahan, K. Kennedy, and A. Porterfield.
Software prefetching.
In Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
40-52, April 1991.
- 10
-
A. Carle, K. Kennedy, U. Kremer, and J. Mellor-Crummey.
Automatic data layout for distributed-memory machines in the D
programming environment.
In Proceedings of AP'93 International Workshop on Automatic
Distributed Memory Parallelization, Automatic Data Distribution and Automatic
Parallel Performance Prediction, Saarbrücken, Germany, March 1993.
- 11
-
B. Chapman, P. Mehrotra, and H. Zima.
Programming in Vienna Fortran.
In Third Workshop on Compilers for Parallel Computers, pages
121-160, July 1992.
- 12
-
S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng.
Automatic array alignment in data-parallel programs.
In Proceedings of the Twentieth Annual ACM Symposium on the
Principles of Programming Languages, January 1993.
- 13
-
W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu.
Data access microarchitectures for superscalar processors with
compiler-assisted data prefetching.
In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO-24), 1991.
- 14
-
R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman.
A VLIW architecture for a trace scheduling compiler.
In Proceedings of the Second International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 180-192, October 1987.
- 15
-
K. D. Cooper, M. W. Hall, and K. Kennedy.
A methodology for procedure cloning.
Computer Languages, 19(2), April 1993.
- 16
-
J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt.
Overlapped loop support in the Cydra 5.
In Third International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS III), pages 26-38,
April 1989.
- 17
-
M. Dubois, L. Barroso, Y.-S. Chen, and K. Oner.
Scalability problems in multiprocessors with private caches.
In Proceedings of Parallel Architecture and Languages Europe
'92, pages 211-230, June 1992.
- 18
-
M. Dubois, C. Scheurich, and F. A. Briggs.
Synchronization, coherence, and event ordering in multiprocessors.
Computer, 21(2):9-21, February 1988.
- 19
-
S. J. Eggers and T. E. Jeremiassen.
Eliminating false sharing.
In Proceedings of the 1991 International Conference on Parallel
Processing, volume I, pages 377-381, August 1991.
- 20
-
M. Berry et al.
The Perfect Club benchmarks: Effective performance evaluation of
supercomputers.
Technical Report CSRD 827, Center for Supercomputing Research and
Development, Illinois, May 1989.
- 21
-
J. Ferrante, V. Sarkar, and W. Thrash.
On estimating and enhancing cache effectiveness.
In Fourth Workshop on Languages and Compilers for Parallel
Computing, August 1991.
- 22
-
K. Gallivan, W. Jalby, U. Meier, and A. Sameh.
The impact of hierarchical memory systems on linear algebra algorithm
design.
Technical Report UIUCSRD 625, University of Illinois, 1987.
- 23
-
D. Gannon and W. Jalby.
The influence of memory hierarchy on algorithm organization:
Programming FFTs on a vector multiprocessor.
In The Characteristics of Parallel Algorithms. MIT Press, 1987.
- 24
-
D. Gannon, W. Jalby, and K. Gallivan.
Strategies for cache and local memory management by global program
transformation.
Journal of Parallel and Distributed Computing, 5:587-616,
1988.
- 25
-
A. George, J. Liu, and E. Ng.
User's guide for SPARSPAK: Waterloo sparse linear equations
package.
Technical Report CS-78-30, Department of Computer Science, University
of Waterloo, 1980.
- 26
-
K. Gharachorloo, A. Gupta, and J. Hennessy.
Performance evaluation of memory consistency models for shared-memory
multiprocessors.
In Fourth International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 245-257, April 1991.
- 27
-
K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy.
Memory consistency and event ordering in scalable shared-memory
multiprocessors.
In Proceedings of the 17th Annual International Symposium on
Computer Architecture, pages 15-26, May 1990.
- 28
-
A. J. Goldberg.
Multiprocessor Performance Debugging and Memory Bottlenecks.
PhD thesis, Stanford University, August 1992.
- 29
-
S. R. Goldschmidt and H. Davis.
Tango introduction and tutorial.
Technical Report CSL-TR-90-410, Stanford University, 1990.
- 30
-
G. H. Golub and C. F. Van Loan.
Matrix Computations.
Johns Hopkins University Press, 1989.
- 31
-
E. Gornish, E. Granston, and A. Veidenbaum.
Compiler-Directed Data Prefetching in Multiprocessors with Memory
Hierarchies.
In International Conference on Supercomputing, 1990.
- 32
-
E. H. Gornish.
Compile time analysis for data prefetching.
Master's thesis, University of Illinois at Urbana-Champaign, December
1989.
- 33
-
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber.
Comparative evaluation of latency reducing and tolerating techniques.
In Proceedings of the 18th Annual International Symposium on
Computer Architecture, pages 254-263, May 1991.
- 34
-
M. Gupta.
Automatic Data Partitioning on Distributed Memory
Multicomputers.
PhD thesis, College of Engineering, University of Illinois at
Urbana-Champaign, September 1992.
- 35
-
M. Gupta and P. Banerjee.
Demonstration of automatic data partitioning techniques for
parallelizing compilers on multicomputers.
IEEE Transactions on Parallel and Distributed Systems,
3(2):179-193, March 1992.
- 36
-
R. H. Halstead, Jr. and T. Fujita.
MASA: A multithreaded processor architecture for parallel symbolic
computing.
In Proceedings of the 15th Annual International Symposium on
Computer Architecture, pages 443-451, June 1988.
- 37
-
L. J. Hendren.
Parallelizing Programs with Recursive Data Structures.
PhD thesis, Cornell University, January 1990.
- 38
-
S. Hiranandani, K. Kennedy, and C. Tseng.
Compiling Fortran D for MIMD distributed-memory machines.
Communications of the ACM, 35(8):66-80, August 1992.
- 39
-
R. A. Iannucci.
Toward a dataflow/von Neumann hybrid architecture.
In Proceedings of the 15th Annual International Symposium on
Computer Architecture, pages 131-140, June 1988.
- 40
-
N. P. Jouppi.
Improving direct-mapped cache performance by the addition of a small
fully-associative cache and prefetch buffers.
In Proceedings of the 17th Annual International Symposium on
Computer Architecture, pages 364-373, May 1990.
- 41
-
Kendall Square Research.
Kendall Square Research 1 (KSR1) Technical Summary, 1992.
- 42
-
A. C. Klaiber and H. M. Levy.
An architecture for software-controlled data prefetching.
In Proceedings of the 18th Annual International Symposium on
Computer Architecture, pages 43-53, May 1991.
- 43
-
C. Koelbel, P. Mehrotra, and J. Van Rosendale.
Supporting shared data structures on distributed memory machines.
In Proceedings of the Second ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, March 1990.
- 44
-
J. S. Kowalik, editor.
Parallel MIMD Computation: The HEP Supercomputer and Its
Applications.
MIT Press, 1985.
- 45
-
D. Kroft.
Lockup-free instruction fetch/prefetch cache organization.
In Proceedings of the 8th Annual International Symposium on
Computer Architecture, pages 81-85, 1981.
- 46
-
J. Kubiatowicz, D. Chaiken, and A. Agarwal.
Closing the window of vulnerability in multiphase memory
transactions.
In Proceedings of the Fifth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
274-284, October 1992.
- 47
-
D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh.
Experimental Parallel Computing Architectures: Volume 1 -
Special Topics in Supercomputing, chapter Parallel Supercomputing Today and
the Cedar Approach, pages 1-23.
North-Holland, New York, 1987.
- 48
-
M. S. Lam.
Software pipelining: An effective scheduling technique for VLIW
machines.
In Proceedings of the ACM SIGPLAN '88 Conference on Programming Language
Design and Implementation, pages 318-328, June 1988.
- 49
-
M. S. Lam, E. E. Rothberg, and M. E. Wolf.
The cache performance and optimizations of blocked algorithms.
In Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
63-74, April 1991.
- 50
-
L. Lamport.
How to make a multiprocessor computer that correctly executes
multiprocess programs.
IEEE Transactions on Computers, C-28(9):690-691, September
1979.
- 51
-
W. Landi, B. G. Ryder, and S. Zhang.
Interprocedural modification side effect analysis with pointer
aliasing.
In Proceedings of the SIGPLAN '93 Conference on Programming
Language Design and Implementation, pages 56-67, June 1993.
- 52
-
J. P. Laudon.
Architectural and Implementation Tradeoffs for Multiple-Context
Processors.
PhD thesis, Stanford University, Stanford, California, 1994.
In preparation.
- 53
-
R. L. Lee.
The Effectiveness of Caches and Data Prefetch Buffers in
Large-Scale Shared Memory Multiprocessors.
PhD thesis, Department of Computer Science, University of Illinois at
Urbana-Champaign, May 1987.
- 54
-
D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and
M. Lam.
The Stanford DASH multiprocessor.
IEEE Computer, 25(3):63-79, March 1992.
- 55
-
D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz,
and M. Lam.
Design of Scalable Shared-Memory Multiprocessors: The DASH
Approach.
In Proceedings of COMPCON'90, pages 62-67, 1990.
- 56
-
J. Li and M. Chen.
The data alignment phase in compiling programs for distributed-memory
machines.
Journal of Parallel and Distributed Computing, 13(2):213-221,
October 1991.
- 57
-
E. Lusk, R. Overbeek, et al.
Portable Programs for Parallel Processors.
Holt, Rinehart and Winston, Inc., 1987.
- 58
-
D. E. Maydan.
Accurate Analysis of Array References.
PhD thesis, Stanford University, September 1992.
- 59
-
J. D. McDonald and D. Baganoff.
Vectorization of a particle simulation method for hypersonic rarefied
flow.
In AIAA Thermodynamics, Plasmadynamics and Lasers Conference,
June 1988.
- 60
-
A. C. McKellar and E. G. Coffman.
The organization of matrices and matrix operations in a paged
multiprogramming environment.
CACM, 12(3):153-165, 1969.
- 61
-
T. Mowry and A. Gupta.
Tolerating latency through software-controlled prefetching in
shared-memory multiprocessors.
Journal of Parallel and Distributed Computing, 12(2):87-106,
1991.
- 62
-
T. C. Mowry, M. S. Lam, and A. Gupta.
Design and evaluation of a compiler algorithm for prefetching.
In Proceedings of the Fifth International Conference on
Architectural Support for Programming Languages and Operating Systems,
volume 27, pages 62-73, October 1992.
- 63
-
G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder,
K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss.
The IBM research parallel processor prototype (RP3): Introduction
and architecture.
In Proceedings of the 1985 International Conference on Parallel
Processing, pages 764-771, 1985.
- 64
-
A. K. Porterfield.
Software Methods for Improvement of Cache Performance on
Supercomputer Applications.
PhD thesis, Department of Computer Science, Rice University, May
1989.
- 65
-
B. R. Rau and C. D. Glaeser.
Some Scheduling Techniques and an Easily Schedulable Horizontal
Architecture for High Performance Scientific Computing.
In Proceedings of the 14th Annual Workshop on Microprogramming,
pages 183-198, October 1981.
- 66
-
A. Rogers and K. Li.
Software support for speculative loads.
In Proceedings of the Fifth International Conference on
Architectural Support for Programming Languages and Operating Systems,
volume 27, pages 38-50, October 1992.
- 67
-
A. Rogers and K. Pingali.
Process decomposition through locality of reference.
In Proceedings of the SIGPLAN '89 Conference on Programming
Language Design and Implementation, June 1989.
- 68
-
J. Rose.
LocusRoute: A parallel global router for standard cells.
In Design Automation Conference, pages 189-195, June 1988.
- 69
-
E. Rothberg and A. Gupta.
Techniques for improving the performance of sparse factorization on
multiprocessor workstations.
In Proceedings of Supercomputing '90, November 1990.
- 70
-
C. Scheurich and M. Dubois.
Lockup-free caches in high-performance multiprocessors.
Journal of Parallel and Distributed Computing, 11(1):25-36,
January 1991.
- 71
-
J. P. Singh and J. L. Hennessy.
Finding and exploiting parallelism in an ocean simulation program:
Experience, results and implications.
Journal of Parallel and Distributed Computing, 15(1):27-48,
1992.
- 72
-
J. P. Singh, W.-D. Weber, and A. Gupta.
SPLASH: Stanford parallel applications for shared memory.
Technical Report CSL-TR-91-469, Stanford University, April 1991.
- 73
-
B. J. Smith.
Architecture and applications of the HEP multiprocessor computer
system.
SPIE, 298:241-248, 1981.
- 74
-
M. D. Smith.
Tracing with pixie.
Technical Report CSL-TR-91-497, Stanford University, November 1991.
- 75
-
M. D. Smith.
Support for Speculative Execution in High-Performance
Processors.
PhD thesis, Stanford University, November 1992.
- 76
-
L. Soule and A. Gupta.
Parallel Distributed-Time Logic Simulation.
IEEE Design and Test of Computers, 6(6):32-48, December 1989.
- 77
-
SPEC.
The SPEC Benchmark Report.
Waterside Associates, Fremont, CA, January 1990.
- 78
-
G. L. Steele.
Proposal for alignment and distribution directives in HPF.
Draft presented at HPF Forum meeting, June 1992.
- 79
-
P. Stenstrom, F. Dahlgren, and L. Lundberg.
A lockup-free multiprocessor cache design.
In Proceedings of the 1991 International Conference on Parallel
Processing, volume I, pages 246-250, 1991.
- 80
-
S. W. K. Tjiang and J. L. Hennessy.
Sharlit: A tool for building optimizers.
In SIGPLAN Conference on Programming Language Design and
Implementation, 1992.
- 81
-
J. Torrellas, M. S. Lam, and J. L. Hennessy.
Shared data placement optimizations to reduce multiprocessor cache
miss rates.
In Proceedings of the 1990 International Conference on Parallel
Processing, volume II, pages 266-270, August 1990.
- 82
-
P.-S. Tseng.
A Parallelizing Compiler for Distributed Memory Parallel
Computers.
PhD thesis, School of Computer Science, Carnegie Mellon University,
May 1989.
- 83
-
D. M. Tullsen and S. J. Eggers.
Limitations of cache prefetching on a bus-based multiprocessor.
In Proceedings of the 20th Annual International Symposium on
Computer Architecture, pages 278-288, May 1993.
- 84
-
W.-D. Weber.
Scalable Directories for Cache-Coherent Shared-Memory
Multiprocessors.
PhD thesis, Stanford University, January 1993.
- 85
-
W.-D. Weber and A. Gupta.
Exploring the benefits of multiple hardware contexts in a
multiprocessor architecture: Preliminary results.
In Proceedings of the 16th Annual International Symposium on
Computer Architecture, pages 273-280, June 1989.
- 86
-
M. E. Wolf.
Improving Locality and Parallelism in Nested Loops.
PhD thesis, Stanford University, August 1992.
- 87
-
M. E. Wolf and M. S. Lam.
A data locality optimizing algorithm.
In Proceedings of the SIGPLAN '91 Conference on Programming
Language Design and Implementation, pages 30-44, June 1991.
- 88
-
H. Zima, H.-J. Bast, and M. Gerndt.
SUPERB: A tool for semi-automatic MIMD/SIMD parallelization.
Parallel Computing, 6:1-18, 1988.