In this section we provide timings that characterize the performance of our parallel explicit wave propagation code on the Cray T3D. We are currently using the code to study the earthquake-induced dynamics of the San Fernando Valley in Southern California. The San Fernando simulations involve meshes of up to 77 million tetrahedra and 40 million equations. The largest mesh corresponds to the case of a lowest shear wave velocity of 220 m/s and a maximum frequency of 2 Hz; the code requires nearly 16 Gb of memory and takes 6 hours to execute for 40,000 times steps on 256 processors of the Cray T3D at the Pittsburgh Supercomputing Center (PSC). Results of a typical simulation, in which the basin was subjected to a vertically-incident plane wave Ricker pulse with a central frequency of 1 Hz, indicate a factor of amplification of eight in the softer parts of the basin compared to the maximum displacement on rock. This suggests substantially greater damage in these regions. A typical result is shown in Figure 7, which depicts the amplification induced by the soft soil. Simulations of this type are essential to better predict the local site effects within soft basins such as those on which Los Angeles, San Francisco, Mexico City, and Tokyo are situated.
Figure 7:
Surface distribution of ground motion amplification factors
in the San Fernando Valley. The amplification factors have been
calculated by comparing the surface to the bedrock motion.
The relevant scenario for assessing the performance of our earthquake simulations as the number of processors increases is one in which the problem size increases proportionally, because unstructured PDE problems are typically memory-bound rather than compute-bound. Given a certain number of processors, we typically aim at full use of their memory; as the number of processors increases, we take advantage of their additional memory by increasing the problem size. In order to study the performance of our earthquake ground motion simulation code with increasing problem size, we have generated a sequence of increasingly-finer meshes for the San Fernando Basin. These meshes are labeled sf10, sf5, sf2, and sf1, and correspond to earthquake excitation periods of 10, 5, 2, and 1 second, respectively. Additionally, the mesh sf1b corresponds to a geological model that includes much softer soil in the top 30m, and thus necessitates an even finer mesh. Note that mesh resolution varies with the inverse cube of period, so that halving the period results in a factor of eight increase in the number of nodes. Characteristics of the five meshes are given in Table 1.
Table 1: Characteristics of San Fernando Basin meshes.
Our timings include computation and communication but exclude I/O. We exclude I/O time because in our current implementation it is serial and unoptimized, and because the T3D has a slow I/O system. I/O time involves the time at the beginning of the program to input the file produced by Parcel, as well as the time to output results every tenth time step to disk. With the availability of the Cray T3E at PSC, we plan to address parallel I/O in the future.
We begin with a traditional speedup histogram, for which the problem size is fixed and the number of processors is increased. Figure 8 shows the total time, as well as the relative time spent for communication and computation, for an earthquake ground motion simulation, as a function of the number of processors.
Figure 8: Timings in seconds
on a Cray T3D as a function of number of processors (PEs), excluding
I/O. The breakdown of computation and communication is shown. The mesh
is sf2, and 6000 time steps are carried out.
The mesh used for these timings is sf2. On 16 processors, the
time spent for communication is 5% of the time spent for
computation, which is quite good for such a highly irregular
problem. There are about 24,000 nodes per processor, which results in
about half the memory on each processor being used. As the number of
processors doubles, the percentage of time spent communicating
relative to computing increases, as expected. For 128 processors, the
communication time has increased to one-fifth of the total
time. However, we are only utilizing of the local
memory on a processor; practical simulations will generally exhibit
performance more like the left bar of Figure 8.
We can quantify the decrease in computation to communication ratio for
a regular mesh. Suppose there are nodes on a processor,
where P is the number of processors. Suppose further that the
regular grid is partitioned into cubic subdomains of equal size, one
to a processor. Since computation for an explicit method such as
Equation 6 is proportional to the volume of nodes in a
cube (subdomain), and communication is proportional to the number of
nodes on the surface of the cube, the computation to communication
ratio is proportional to
,
i.e. the ratio of total nodes to surface nodes of the cube.
Thus, for fixed N, the ratio is inversely proportional to
, at least for cubically-partitioned regular grids
with large enough numbers of nodes per processor. Clearly, it is in
our interest to keep
as large as possible, if we want to
minimize communication time.
Let us extend this analysis to explicit methods on unstructured
meshes. Suppose remains constant for increasing N and
P, i.e. the number of nodes per processor remains constant. Now suppose
that we have a partitioner that guarantees that the number of
interface nodes remains roughly constant as N and P
increase proportionally.
Then we can expect that the computation to communication ratio will
remain constant as the problem size increases.
In this case, we have a method
that scales linearly: the amount of time required to solve a problem
that is doubled in size is unchanged if we double the number of
processors. How close do we come to this ideal situation? First, we
plot the log of the computation to communication ratio against the log
of the number of processors, using the data in Figure 8. A
least-squares fit yields a line with slope
. For a regular grid
with perfect partitioners, we have seen in the previous paragraph that
this slope should be
. This suggests that the idealized
analysis is roughly applicable here.
Next, let us attempt to hold the number of nodes per processor roughly
constant, and examine the aggregate performance of the machine as the
problem size increases. It is difficult to maintain a constant value
of , since processors are available in powers of two on the
T3D. However, we can still draw conclusions about scalability. Figure
9 shows the aggregate performance of our code
on the T3D
Figure 9: Aggregate performance on Cray T3D
as a function of number of processors (PEs). Rate measured for
matrix-vector (MV) product operations (which account for 80% of the
total running time and all of the communication) during 6000 times
steps.
in megaflops per second, as a function of number of processors (and, implicitly, problem size). Megaflops are those that are sustained by matrix-vector product operations (which account for 80% of the total running time and all of the communication) during a San Fernando simulation, exclusive of I/O. This figure shows nearly ideal scalability, which is defined as the single processor performance multiplied by the number of processors. These results show that excellent performance is achievable, despite the highly heterogeneous mesh. This behavior requires a partitioner that keeps the number of interface nodes relatively constant for problem size that increases concomitantly with number of processors.
An even better measure of scalability is to chart the time taken per
time step per node. If the algorithm/implementation/hardware
combination is scalable, we expect that the time taken will not change
with increasing problem size. Not only must the partitioner produce
``scalable'' partitions for this to happen, but in addition the PDE
solver must scale linearly with N. This happens when the work
per time step is .
This is obvious from the
iteration of Equation
6--vector sums, diagonal matrix inversions, and sparse
matrix-vector multiplies require
operations.
Figure 10 depicts the trend in unit wall clock time as the number of processors is increased. Unit wall clock time is measured as microseconds per time step per average number of node per processor, which includes all computations and communications for all time steps, but excludes disk I/O. As we have said above, for a truly scalable algorithm/implementation/hardware system, this number should remain constant as problem size increases with increasing processors.
Figure: T3D wall-clock time in microseconds per
time step per average number of nodes per processor (PE), as a
function of number of processors. This figure is based on an entire
6000 time step simulation, exclusive of I/O. The sf1b result is
based on a damping scheme in which in Equation
5 so that only one matrix-vector product is
performed at each time step.
The figure demonstrates that we are close to this ideal. Ultimately, wall clock time per node per time step is the most meaningful measure of scalable performance for our application, since it is a direct indicator of the ability to solve our ultimate target problems, which are an order of magnitude larger than the San Fernando Basin problem we have described in this paper.