
References

1
W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. Automatic program transformations for virtual memory computers. In Proceedings of the 1979 National Computer Conference, pages 969-974, June 1979.

2
S. Adve and M. Hill. Weak ordering - A new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 2-14, May 1990.

3
A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.

4
S. P. Amarasinghe and M. S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 126-138, June 1993.

5
J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 112-125, June 1993.

6
J. Archibald and J.-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, 1986.

7
J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.

8
D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS Parallel Benchmarks. Technical Report RNR-91-002, NASA Ames Research Center, August 1991.

9
D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52, April 1991.

10
A. Carle, K. Kennedy, U. Kremer, and J. Mellor-Crummey. Automatic data layout for distributed-memory machines in the D programming environment. In Proceedings of AP'93 International Workshop on Automatic Distributed Memory Parallelization, Automatic Data Distribution and Automatic Parallel Performance Prediction, Saarbrücken, Germany, March 1993.

11
B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. In Third Workshop on Compilers for Parallel Computers, pages 121-160, July 1992.

12
S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng. Automatic array alignment in data-parallel programs. In Proceedings of the Twentieth Annual ACM Symposium on the Principles of Programming Languages, January 1993.

13
W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO-24), 1991.

14
R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 180-192, October 1987.

15
K. D. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2), April 1993.

16
J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-38, April 1989.

17
M. Dubois, L. Barroso, Y.-S. Chen, and K. Oner. Scalability problems in multiprocessors with private caches. In Proceedings of Parallel Architecture and Languages Europe '92, pages 211-230, June 1992.

18
M. Dubois, C. Scheurich, and F. A. Briggs. Synchronization, coherence, and event ordering in multiprocessors. Computer, 21(2):9-21, February 1988.

19
S. J. Eggers and T. E. Jeremiassen. Eliminating false sharing. In Proceedings of the 1991 International Conference on Parallel Processing, volume I, pages 377-381, August 1991.

20
M. Berry et al. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Technical Report CSRD 827, Center for Supercomputing Research and Development, University of Illinois, May 1989.

21
J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In Fourth Workshop on Languages and Compilers for Parallel Computing, August 1991.

22
K. Gallivan, W. Jalby, U. Meier, and A. Sameh. The impact of hierarchical memory systems on linear algebra algorithm design. Technical Report UIUCSRD 625, University of Illinois, 1987.

23
D. Gannon and W. Jalby. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithms. MIT Press, 1987.

24
D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5:587-616, 1988.

25
A. George, J. Liu, and E. Ng. User's guide for SPARSPAK: Waterloo sparse linear equations package. Technical Report CS-78-30, Department of Computer Science, University of Waterloo, 1980.

26
K. Gharachorloo, A. Gupta, and J. Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 245-257, April 1991.

27
K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.

28
A. J. Goldberg. Multiprocessor Performance Debugging and Memory Bottlenecks. PhD thesis, Stanford University, August 1992.

29
S. R. Goldschmidt and H. Davis. Tango introduction and tutorial. Technical Report CSL-TR-90-410, Stanford University, 1990.

30
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

31
E. Gornish, E. Granston, and A. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In International Conference on Supercomputing, 1990.

32
E. H. Gornish. Compile time analysis for data prefetching. Master's thesis, University of Illinois at Urbana-Champaign, December 1989.

33
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254-263, May 1991.

34
M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, College of Engineering, University of Illinois at Urbana-Champaign, September 1992.

35
M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March 1992.

36
R. H. Halstead, Jr. and T. Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443-451, June 1988.

37
L. J. Hendren. Parallelizing Programs with Recursive Data Structures. PhD thesis, Cornell University, January 1990.

38
S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.

39
R. A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 131-140, June 1988.

40
N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.

41
Kendall Square Research. Kendall Square Research 1 (KSR1) Technical Summary, 1992.

42
A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-53, May 1991.

43
C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory machines. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 1990.

44
J. S. Kowalik, editor. Parallel MIMD Computation: The HEP Supercomputer and Its Applications. MIT Press, 1985.

45
D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 81-85, 1981.

46
J. Kubiatowicz, D. Chaiken, and A. Agarwal. Closing the window of vulnerability in multiphase memory transactions. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 274-284, October 1992.

47
D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh. Experimental Parallel Computing Architectures: Volume 1 - Special Topics in Supercomputing, chapter Parallel Supercomputing Today and the Cedar Approach, pages 1-23. North-Holland, New York, 1987.

48
M. S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.

49
M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, April 1991.

50
L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.

51
W. Landi, B. G. Ryder, and S. Zhang. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 56-67, June 1993.

52
J. P. Laudon. Architectural and Implementation Tradeoffs for Multiple-Context Processors. PhD thesis, Stanford University, Stanford, California, 1994. In preparation.

53
R. L. Lee. The Effectiveness of Caches and Data Prefetch Buffers in Large-Scale Shared Memory Multiprocessors. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1987.

54
D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, March 1992.

55
D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. Design of scalable shared-memory multiprocessors: The DASH approach. In Proceedings of COMPCON '90, pages 62-67, 1990.

56
J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of Parallel and Distributed Computing, 13(2):213-221, October 1991.

57
E. Lusk, R. Overbeek, et al. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987.

58
D. E. Maydan. Accurate Analysis of Array References. PhD thesis, Stanford University, September 1992.

59
J. D. McDonald and D. Baganoff. Vectorization of a particle simulation method for hypersonic rarefied flow. In AIAA Thermophysics, Plasmadynamics and Lasers Conference, June 1988.

60
A. C. McKellar and E. G. Coffman. The organization of matrices and matrix operations in a paged multiprogramming environment. Communications of the ACM, 12(3):153-165, 1969.

61
T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, 1991.

62
T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, volume 27, pages 62-73, October 1992.

63
G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing, pages 764-771, 1985.

64
A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, May 1989.

65
B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183-198, October 1981.

66
A. Rogers and K. Li. Software support for speculative loads. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, volume 27, pages 38-50, October 1992.

67
A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.

68
J. Rose. LocusRoute: A parallel global router for standard cells. In Design Automation Conference, pages 189-195, June 1988.

69
E. Rothberg and A. Gupta. Techniques for improving the performance of sparse factorization on multiprocessor workstations. In Proceedings of Supercomputing '90, November 1990.

70
C. Scheurich and M. Dubois. Lockup-free caches in high-performance multiprocessors. Journal of Parallel and Distributed Computing, 11(1):25-36, January 1991.

71
J. P. Singh and J. L. Hennessy. Finding and exploiting parallelism in an ocean simulation program: Experience, results and implications. Journal of Parallel and Distributed Computing, 15(1):27-48, 1992.

72
J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. Technical Report CSL-TR-91-469, Stanford University, April 1991.

73
B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. SPIE, 298:241-248, 1981.

74
M. D. Smith. Tracing with pixie. Technical Report CSL-TR-91-497, Stanford University, November 1991.

75
M. D. Smith. Support for Speculative Execution in High-Performance Processors. PhD thesis, Stanford University, November 1992.

76
L. Soule and A. Gupta. Parallel distributed-time logic simulation. IEEE Design and Test of Computers, 6(6):32-48, December 1989.

77
SPEC. The SPEC Benchmark Report. Waterside Associates, Fremont, CA, January 1990.

78
G. L. Steele. Proposal for alignment and distribution directives in HPF. Draft presented at HPF Forum meeting, June 1992.

79
P. Stenstrom, F. Dahlgren, and L. Lundberg. A lockup-free multiprocessor cache design. In Proceedings of the 1991 International Conference on Parallel Processing, volume I, pages 246-250, 1991.

80
S. W. K. Tjiang and J. L. Hennessy. Sharlit: A tool for building optimizers. In SIGPLAN Conference on Programming Language Design and Implementation, 1992.

81
J. Torrellas, M. S. Lam, and J. L. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II, pages 266-270, August 1990.

82
P.-S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, School of Computer Science, Carnegie Mellon University, May 1989.

83
D. M. Tullsen and S. J. Eggers. Limitations of cache prefetching on a bus-based multiprocessor. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 278-288, May 1993.

84
W.-D. Weber. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors. PhD thesis, Stanford University, January 1993.

85
W.-D. Weber and A. Gupta. Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280, June 1989.

86
M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, August 1992.

87
M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.

88
H. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.


tcm@
Sat Jun 25 15:13:04 PDT 1994