Line-fit Benchmark
Figure 6 shows the performance (in elements per second) achieved by the four implementations on the line-fit benchmark, with the minimum, average, and maximum times plotted for each point. The error bars are only noticeable for the JIT and native code implementations at large problem sizes, where there appeared to be significant variations in the time taken by garbage collection during a run. We treat the current VCODE interpreter (vinterp) implementation as the base case, and discuss the results for each of the Java implementations (native, JIT, and JDK) in turn, concentrating on the results for the PC platform.
Figure 6: Performance of NESL line-fit benchmark using different intermediate language implementations on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).
The VCODE interpreter is clearly the fastest of the four implementations of VCODE tested here, as we would expect for a special-purpose implementation. Its relative performance compared to the other implementations is best at small vector sizes, and its absolute performance falls off after about 16,000 elements, when the PC's 256 kB L2 cache can no longer hold two double-precision floating-point vectors.
The native methods implementation approaches the performance of the VCODE interpreter for large problem sizes, but does not quite reach it. There are two probable causes for this performance difference. First, the VCODE interpreter is linked to a CVL library whose loops have been unrolled by hand, reducing the loop overhead. Second, the Java requirement that every element of an array be initialized when the array is allocated causes an extra loop over the data that CVL does not need to perform. Note that there is no clearly observable cache effect for the native methods implementation, possibly because it is masked by the additional memory activity of the Java interpreter.
The JIT compiler achieves approximately half the performance of the VCODE interpreter for problems bigger than about 1,000 elements. The additional slowdown compared to the native methods implementation is probably due to the requirement that every array operation in Java must check for valid indices. The JIT compiler must therefore generate extra conditionals in the inner loop of vector code. There are techniques for guaranteeing valid indices without requiring these extra conditionals, such as performing loop-bounds analysis or exploiting virtual memory mechanisms for protection purposes [1], but to our knowledge these optimizations are not performed by any current JIT compiler. Note that we were unable to measure any extra compilation overhead incurred by the JIT compiler; this null result can probably be attributed entirely to the poor resolution of the PC clock.
Finally, the JDK interpreter achieves about one tenth the performance of the VCODE interpreter, and is four to six times slower than the JIT compiler.
As well as graphing the results of the line-fit benchmark, we can also use them to calculate the constant overhead and the asymptotic time per element of each implementation. This is only possible because the benchmark executes a fixed number of VCODE operations--and hence should have a fixed interpretive overhead--for all problem sizes. The resulting figures for the constant overhead and the time per element are shown in Table 3.
Using the overhead and the asymptotic time per element we can now understand why the performance curves for the JIT compiler and the native methods implementation cross. The JDK interpreter used by the native methods implementation has a higher constant overhead than the JIT compiler, and this overhead dominates performance for small problem sizes. However, for large problem sizes the per-element speed advantage of the native vector methods outweighs the higher overhead of the JDK interpreter, and the native methods implementation is faster.
120 MHz 486 PC SPARCstation 5/85 Implementation Overhead Per-elt. Overhead Per-elt. JDK 8200 72 8500 60 JIT 2200 12 N/A N/A Native 5700 8.6 6900 7.1 Vinterp 690 7.1 610 4.7 Table 3: Constant overhead and asymptotic time per element (in microseconds) for NESL line-fit benchmark using different intermediate language implementations on a PC and a SPARCstation. We can also calculate the percentage of the total running time due to the constant overhead, as shown in Figure 7. The disparity between the speed of native methods and the speed of the Java interpreter is reflected in the top curve; because the native methods are much faster than the interpreter, the impact of the fixed interpretive overhead is bigger. The JDK and JIT compiler implementations have similar results to that of the VCODE interpreter, indicating that the ratios of their interpretation and execution costs are comparable, and the problem size at which they achieve half of their peak performance () is similar. Put another way, the performance difference between the specialized VCODE interpreter and the general-purpose Java interpreter and JIT compiler is roughly the same as the performance difference between the machine-specific C code in CVL and the Java vector methods in VcodeEmulation being executed by the Java interpreter and JIT compiler.
Figure 7: Percentage of interpretive overhead in NESL line-fit benchmark for different intermediate language implementations on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).
The shapes of the performance curves on the SPARCstation are generally similar to those on the PC, although the cache effect for the VCODE interpreter is much more pronounced and happens at around 500 elements, due to the SPARCstation's much smaller cache. This general similarity of the results for the two platforms is true for all three benchmarks, suggesting that there are no architecture-dependent effects skewing the results. The two platforms are also comparable in terms of absolute speed.