The transpose routines can choose between contiguous loads with strided stores (depicted in red), or strided loads with contiguous stores (depicted in blue). The Alpha processor used in the Cray T3D has good support for strided stores (an efficient write-back queue) and relatively poor support for strided loads (the cache and read-ahead logic do not help much). Therefore our routines prefer contiguous loads.
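To make the tradeoff concrete, here is a minimal sketch (not taken from the actual routines) of the two traversal orders for a block transpose in C, where src and dst are hypothetical R x R blocks in row-major storage; the inner loop index determines which side of the copy is contiguous:

  /* Contiguous loads, strided stores: the inner loop reads a row of
   * src sequentially and scatters the writes across dst (the order
   * preferred on the T3D). */
  for (l=0; l<R; l++)
      for (m=0; m<R; m++)
          dst[m][l] = src[l][m];

  /* Strided loads, contiguous stores: the inner loop writes a row of
   * dst sequentially but reads src with stride R. */
  for (m=0; m<R; m++)
      for (l=0; l<R; l++)
          dst[m][l] = src[l][m];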
Note: The drop to 22 MByte/s per processor for direct deposit at 512 elements is attributed to irregularities in the T3D's current routing table. These routes render our congestion-control scheme useless. So far, Cray Research has seemed very reluctant to give us details on why the routes do not follow a proper e-cube scheme.
The second graph shows how communication performance affects the scalability of the transpose operations. The direct deposit and the CRAFT intrinsic transpose scale well up to 256 or even 512 nodes. The PVM performance falls apart above 32 or 64 processors, while worksharing does not scale much beyond 8 processors.
Note that on the log-log scale, equal vertical distances correspond to equal performance ratios, so the quarter inch between the two curves means a factor of two in performance.
#define P   256
#define NM  (4096+4)
#define N   4096
#define RM  (NM/P)
#define R   (N/P)

#pragma _CRI cache_align a,b
static doublecomplex a[RM][NM];  /* input matrix */
static double fill[4];           /* padding between a and b */
static doublecomplex b[RM][NM];  /* output matrix */
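With these constants, R = N/P = 4096/256 = 16, so each of the 256 processing elements holds a 16-row slab of the 4096 x 4096 complex matrix. The four extra elements per row (NM = 4096+4) and the fill array between a and b are presumably padding: without them, rows of a and b at the same index would map to the same lines of the Alpha's small direct-mapped data cache.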
/* Pseudocode; the real routines contain a number of memory system
 * optimizations such as unrolled loops, folded constants, etc. */
barrier();
for (j=0; j<P; j++) {
    barrier();                        /* congestion control */
    j0 = comm_schedule[j];            /* destination PE for this step */
    i0 = my_cell_id;
    b0 = shmem_init(b, a, 1, j0);     /* remote pointer to b on PE j0 */
    for (l=0; l<R; l++) {
        for (m=0; m<R; m++) {         /* copy loop w. remote stores */
            b0[m][i0*R+l].r = a[l][j0*R+m].r;
            b0[m][i0*R+l].i = a[l][j0*R+m].i;
        }
    }
}
barrier();
shmem_udcflush();                     /* restore cache consistency */
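The communication schedule itself is not shown above; a minimal sketch of one plausible choice (hypothetical, not necessarily the schedule used on the T3D) staggers the destinations so that each step of the outer loop forms a full permutation of the PEs:

  /* Hypothetical schedule: at step j, PE i targets PE (i+j) mod P,
   * so no two PEs deposit into the same destination at the same time. */
  for (j=0; j<P; j++)
      comm_schedule[j] = (my_cell_id + j) % P;

The barrier at the top of each iteration keeps the PEs in lock step, which is presumably what maintains this permutation property at run time.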
/* Pseudocode; the real routines contain a number of memory system
 * optimizations such as unrolled loops, folded constants, etc. */
for (j=0; j<P; j++) {                 /* pack buffers */
    for (l=0; l<R; l++) {
        for (m=0; m<R; m++) {
            af[j][m*R+l].r = a[l][j*R+m].r;
            af[j][m*R+l].i = a[l][j*R+m].i;
        }
    }
}
for (j=0; j<P; j++) {                 /* do sends */
    pvm_initsend(0);
    pvm_pkdouble(af[j], R*R*2, 1);
    pvm_send(j, 0);
}
for (j=0; j<P; j++) {                 /* do receives */
    pvm_recv(j, 0);
    pvm_upkdouble(bf[j], R*R*2, 1);
}
for (j=0; j<P; j++) {                 /* unpack buffers */
    for (m=0; m<R; m++) {
        for (l=0; l<R; l++) {
            a[m][j*R+l].r = bf[j][m*R+l].r;
            a[m][j*R+l].i = bf[j][m*R+l].i;
        }
    }
}
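Note that where the deposit version stores every element exactly once, directly into the remote b, the PVM version copies each element several additional times: into the pack buffer af, into PVM's internal send buffer, out of the receive buffer into bf, and finally back into a. This extra buffering is consistent with PVM falling behind so much earlier in the graphs above.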
      real a(N,N)
      real b(N,N)
      intrinsic transpose
cdir$ cache_align a,b
cdir$ shared a(:,:block)
cdir$ shared b(:,:block)
      do i = 1,N
cdir$ doshared (j) on b(i,j)
        do j = 1,N
          b(i,j) = a(j,i)
        enddo
      enddo
      b = TRANSPOSE(a)
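The doshared loop corresponds to the worksharing variant in the graphs: iterations of the j loop are assigned to the processor that owns b(i,j), but presumably every element of a is still fetched individually through the shared-memory layer. The intrinsic form leaves the entire data movement to the compiler and runtime, which is reflected in the CRAFT intrinsic transpose scaling so much better than worksharing.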
See also the manual page entry and the Fx compiler related papers (fx-papers).
tomstr@cs.cmu.edu. Last updated June 2, 1995.