References

1: W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. Automatic program transformations for virtual memory computers. Proc. of the 1979 National Computer Conference, pages 969-974, June 1979.
2: S. Adve and M. Hill. Weak ordering - A new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 2-14, May 1990.
3: A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. April: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.
4: S. P. Amarasinghe and M. S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 126-138, June 1993.
5: J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 112-125, June 1993.
6: J. Archibald and J.-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, 1986.
7: J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, 1991.
8: D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS Parallel Benchmarks. Technical Report RNR-91-002, NASA Ames Research Center, August 1991.
9: D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52, April 1991.
10: A. Carle, K. Kennedy, U. Kremer, and J. Mellor-Crummey. Automatic data layout for distributed-memory machines in the D programming environment. In Proceedings of AP'93 International Workshop on Automatic Distributed Memory Parallelization, Automatic Data Distribution and Automatic Parallel Performance Prediction, Saarbrücken, Germany, March 1993.
11: B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. In Third Workshop on Compilers for Parallel Computers, pages 121-160, July 1992.
12: S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng. Automatic array alignment in data-parallel programs. In Proceedings of the Twentieth Annual ACM Symposium on the Principles of Programming Languages, January 1993.
13: W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO-24), 1991.
14: R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. In Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 180-192, Oct. 1987.
15: K. D. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2), April 1993.
16: J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 26-38, April 1989.
17: M. Dubois, L. Barroso, Y.-S. Chen, and K. Oner. Scalability problems in multiprocessors with private caches. In Proceedings of Parallel Architecture and Languages Europe '92, pages 211-230, June 1992.
18: M. Dubois, C. Scheurich, and F. A. Briggs. Synchronization, coherence, and event ordering in multiprocessors. Computer, 21(2):9-21, February 1988.
19: S. J. Eggers and T. E. Jeremiassen. Eliminating false sharing. In Proceedings of the 1991 International Conference on Parallel Processing, volume I, pages 377-381, August 1991.
20: M. Berry et al. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Technical Report CSRD 827, Center for Supercomputing Research and Development, Illinois, May 1989.
21: J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In Fourth Workshop on Languages and Compilers for Parallel Computing, Aug 1991.
22: K. Gallivan, W. Jalby, U. Meier, and A. Sameh. The impact of hierarchical memory systems on linear algebra algorithm design. Technical Report UIUCSRD 625, University of Illinois, 1987.
23: D. Gannon and W. Jalby. The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithms. MIT Press, 1987.
24: D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5:587-616, 1988.
25: A. George, J. Liu, and E. Ng. User's guide for SPARSPAK: Waterloo sparse linear equations package. Technical Report CS-78-30, Department of Computer Science, University of Waterloo, 1980.
26: K. Gharachorloo, A. Gupta, and J. Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 245-257, April 1991.
27: K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
28: A. J. Goldberg. Multiprocessor Performance Debugging and Memory Bottlenecks. PhD thesis, Stanford University, August 1992.
29: S. R. Goldschmidt and H. Davis. Tango introduction and tutorial. Technical Report CSL-TR-90-410, Stanford University, 1990.
30: G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
31: E. Gornish, E. Granston, and A. Veidenbaum. Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies. In International Conference on Supercomputing, 1990.
32: E. H. Gornish. Compile time analysis for data prefetching. Master's thesis, University of Illinois at Urbana-Champaign, December 1989.
33: A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254-263, May 1991.
34: M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, College of Engineering, University of Illinois at Urbana-Champaign, September 1992.
35: M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March 1992.
36: R. H. Halstead, Jr. and T. Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443-451, June 1988.
37: L. J. Hendren. Parallelizing Programs with Recursive Data Structures. PhD thesis, Cornell University, January 1990.
38: S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
39: R. A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proc. Int. Symp. Comput. Arch., pages 131-140, June 1988.
40: N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.
41: Kendall Square Research. Kendall Square Research 1 (KSR1) Technical Summary, 1992.
42: A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-53, May 1991.
43: C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory machines. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 1990.
44: J. S. Kowalik, editor. Parallel MIMD Computation: The HEP Supercomputer and Its Applications. MIT Press, 1985.
45: D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 81-85, 1981.
46: J. Kubiatowicz, D. Chaiken, and A. Agarwal. Closing the window of vulnerability in multiphase memory transactions. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 274-284, October 1992.
47: D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh. Experimental Parallel Computing Architectures: Volume 1 - Special Topics in Supercomputing, chapter Parallel Supercomputing Today and the Cedar Approach, pages 1-23. North-Holland, New York, 1987.
48: M. S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. ACM SIGPLAN 88 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.
49: M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, April 1991.
50: L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
51: W. Landi, B. G. Ryder, and S. Zhang. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 56-67, June 1993.
52: J. P. Laudon. Architectural and Implementation Tradeoffs for Multiple-Context Processors. PhD thesis, Stanford University, Stanford, California, 1994. In preparation.
53: R. L. Lee. The Effectiveness of Caches and Data Prefetch Buffers in Large-Scale Shared Memory Multiprocessors. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1987.
54: D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
55: D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. Design of Scalable Shared-Memory Multiprocessors: The DASH Approach. In Proceedings of COMPCON'90, pages 62-67, 1990.
56: J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of Parallel and Distributed Computing, 13(2):213-221, October 1991.
57: E. Lusk, R. Overbeek, et al. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987.
58: D. E. Maydan. Accurate Analysis of Array References. PhD thesis, Stanford University, September 1992.
59: J. D. McDonald and D. Baganoff. Vectorization of a particle simulation method for hypersonic rarefied flow. In AIAA Thermophysics, Plasmadynamics and Lasers Conference, June 1988.
60: A. C. McKellar and E. G. Coffman. The organization of matrices and matrix operations in a paged multiprogramming environment. CACM, 12(3):153-165, 1969.
61: T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, 1991.
62: T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, volume 27, pages 62-73, October 1992.
63: G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing, pages 764-771, 1985.
64: A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, May 1989.
65: B. R. Rau and C. D. Glaeser. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183-198, October 1981.
66: A. Rogers and K. Li. Software support for speculative loads. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, volume 27, pages 38-50, October 1992.
67: A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.
68: J. Rose. LocusRoute: A parallel global router for standard cells. In Design Automation Conference, pages 189-195, June 1988.
69: E. Rothberg and A. Gupta. Techniques for improving the performance of sparse factorization on multiprocessor workstations. In Proceedings of Supercomputing '90, November 1990.
70: C. Scheurich and M. Dubois. Lockup-free caches in high-performance multiprocessors. Journal of Parallel and Distributed Computing, 11(1):25-36, January 1991.
71: J. P. Singh and J. L. Hennessy. Finding and exploiting parallelism in an ocean simulation program: Experience, results and implications. Journal of Parallel and Distributed Computing, 15(1):27-48, 1992.
72: J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. Technical Report CSL-TR-91-469, Stanford University, April 1991.
73: B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. SPIE, 298:241-248, 1981.
74: M. D. Smith. Tracing with pixie. Technical Report CSL-TR-91-497, Stanford University, November 1991.
75: M. D. Smith. Support for Speculative Execution in High-Performance Processors. PhD thesis, Stanford University, November 1992.
76: L. Soule and A. Gupta. Parallel Distributed-Time Logic Simulation. IEEE Design and Test of Computers, 6(6):32-48, December 1989.
77: SPEC. The SPEC Benchmark Report. Waterside Associates, Fremont, CA, January 1990.
78: G. L. Steele. Proposal for alignment and distribution directives in HPF. Draft presented at HPF Forum meeting, June 1992.
79: P. Stenstrom, F. Dahlgren, and L. Lundberg. A lockup-free multiprocessor cache design. In Proceedings of the 1991 International Conference on Parallel Processing, volume I, pages 246-250, 1991.
80: S. W. K. Tjiang and J. L. Hennessy. Sharlit: A tool for building optimizers. In SIGPLAN Conference on Programming Language Design and Implementation, 1992.
81: J. Torrellas, M. S. Lam, and J. L. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II, pages 266-270, August 1990.
82: P.-S. Tseng. A Parallelizing Compiler for Distributed Memory Parallel Computers. PhD thesis, School of Computer Science, Carnegie Mellon University, May 1989.
83: D. M. Tullsen and S. J. Eggers. Limitations of cache prefetching on a bus-based multiprocessor. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 278-288, May 1993.
84: W.-D. Weber. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors. PhD thesis, Stanford University, January 1993.
85: W.-D. Weber and A. Gupta. Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280, June 1989.
86: M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, August 1992.
87: M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.
88: H. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.

tcm@
Sat Jun 25 15:13:04 PDT 1994