DL_POLY Benchmarks

DL_POLY

PERFORMANCE of VARIOUS COMPUTERS in COMPUTATIONAL CHEMISTRY

M.F. Guest

ABSTRACT

This report compares the performance of a number of different computer systems using the DL_POLY software. The comparison involves twenty-seven computers, including scientific workstations from IBM, Sun, Hewlett Packard, Digital and Silicon Graphics, and Pentium-based PCs. The benchmark suite consists of a set of six typical MD simulations detailed below.

1. INTRODUCTION

Workstations that have been benchmarked include those from IBM, Sun, Hewlett Packard, Digital and Silicon Graphics, alongside Pentium-based PCs.

We should stress from the outset that our access to much of the hardware evaluated herein has been at best short lived, and has often involved the temporary loan or donation of machines as part of one of the hardware evaluation exercises run at the Daresbury Laboratory. In many cases these machines were not optimally configured in terms of either memory, or high speed disk, and consideration of the results presented here should be viewed in that light.

Following an introductory evaluation of hardware based on the SPEC Benchmarks (Section 2), we present in Section 3 results using the DL_POLY simulation program [1].

Note that the present results are taken from a more detailed report on computational chemistry benchmarks; the associated MS PowerPoint presentation is also available.

2. THE SPEC BENCHMARKS

One of the most useful indicators of CPU performance is provided by the SPEC ("Standard Performance Evaluation Corporation") benchmarks. This benchmark suite contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. While earlier versions of the suite (e.g. SPECmark89) had certain well-advertised flaws, the more recent offerings, SPECfp95 and SPECint95, have become industry standards for measuring primarily the performance of a system's processor, memory architecture, operating system and compiler.

SPECfp95 is derived from the results of ten floating-point benchmarks compiled with aggressive optimization. It is the geometric mean of ten normalized ratios (one for each floating-point benchmark). SPECint95 is derived from the results of eight integer benchmarks compiled with aggressive optimization. It is the geometric mean of eight normalized ratios (one for each integer benchmark). Note that the level of optimization is not mandated. While highly aggressive optimization is permitted, results derived from benchmarks compiled with conservative optimization (as in SPECbase) can be submitted.
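The geometric-mean definition above can be illustrated with a minimal Python sketch. The function name `spec_rating` and the ten example ratios are hypothetical and chosen purely for illustration; they are not taken from any published SPEC submission.

```python
import math

def spec_rating(ratios):
    """Geometric mean of normalized per-benchmark ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    # Average the logarithms and exponentiate; this avoids overflow that
    # could arise from multiplying many ratios together directly.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical normalized ratios for ten floating-point benchmarks.
example_ratios = [20.0, 25.0, 30.0, 18.0, 22.0, 40.0, 35.0, 28.0, 24.0, 26.0]
print(f"illustrative rating: {spec_rating(example_ratios):.1f}")
```

Because a geometric mean is used, one unusually high ratio raises the overall rating less than it would under an arithmetic mean.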

SPECfp95 and SPECint95 results for many of the CPUs discussed in this paper are given in Table 1. It is clear that no single CPU has dominated the SPECfp95 ratings over the recent past, that is, until the arrival of the EV6/21264 from Digital. Thus until July 1998, the P2SC/160 CPU in the IBM RS/6000-397 exhibited the most impressive SPECfp95 rating; with a value of 26.6, the P2SC was marginally faster than the HP PA-9000/C240, 1.2 times faster than the HP PA-9000/C200 and SUN Enterprise HPC4500/336, 1.3 times faster than the DEC Alpha 8400/5-625 and SUN Ultra-2/300, and 1.4 times faster than the R10k-based SGI Origin2000/195. This picture has changed quite drastically with the arrival of the EV6. With SPECfp95 ratings of 47.7 (in the 8400/6-575) and 58.7 (in the Compaq DS20), the EV6 Alpha is seen to be more than twice as fast as the other leading processors of Table 1. The following points should be noted regarding the values specified in this Table:

  1. SPEC ratings for the Compaq XP1000 6/450 have been derived by extrapolating values from the Compaq AlphaServer DS20 (500 MHz); the values quoted have not been confirmed, and should be viewed with this in mind.
  2. Values quoted for the Cray T3E systems are again estimates based on extrapolations from the corresponding Digital EV5 specifications. No SPEC figures have ever been submitted by Cray/SGI for the T3D or T3E series.

Using the Compaq AlphaServer DS20 value of 58.7 to normalise the SPECfp ratings, we would expect the DS20 and ES40 to be somewhat ahead of the other EV6-based machines (the Compaq Alpha GS140 and Alpha 8400/6-575), given the more optimal memory subsystem involved. These four machines appear far superior to the remainder; based on this performance metric, the DS20 is seen to be 1.95 times faster than the Power3-based 200 MHz IBM RS/6000-43P/260 and 2.3 times faster than the HP PA-9000/C240 and 250 MHz R10k-based SGI Origin2000. All other CPUs are projected to be significantly less than half the speed.
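The normalisation described here is a simple division by the DS20 rating. The sketch below reproduces it for a subset of the SPECfp95 values of Table 1; the helper name `relative_to` is illustrative only.

```python
# SPECfp95 ratings copied from Table 1 (a subset).
specfp95 = {
    "Compaq Alpha DS20": 58.70,
    "Compaq Alpha ES40": 57.70,
    "IBM RS/6000-43P": 30.10,
    "HP PA-9000/C240": 25.40,
    "SGI Origin2000/250": 24.50,
    "Pentium II/400": 13.00,
}

def relative_to(reference, ratings):
    """Return each rating as a fraction of the reference machine's rating."""
    ref = ratings[reference]
    return {machine: value / ref for machine, value in ratings.items()}

rel = relative_to("Compaq Alpha DS20", specfp95)
for machine, frac in rel.items():
    # 1/frac is the factor by which the DS20 rating exceeds this machine's.
    print(f"{machine:20s} {100 * frac:4.0f}%   DS20 factor {1 / frac:.2f}")
```

The printed percentages correspond to the "Relative Values" column of Table 1, and the factors to those quoted in the text.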

While the EV6-based machines from Compaq/DEC are also seen to dominate the SPECint95 ratings, a quite different ordering of the processors from IBM (Power3, Power2SC and Power2) and SGI (both R10k and R8k) is seen compared to the SPECfp95 ratings; both are now slower than those from DEC (EV5-based) and SUN. We also note that the SPECint95 ratings suggest that the Pentium II/400 is 1.7 times slower than the Compaq AlphaServer DS20, while the SPECfp95 ratings point to the Pentium being 4.5 times slower.

When considering the present benchmarking results, there are two questions we wish to address in assessing the usefulness of the SPEC ratings:

i. Do the SPECfp95 values provide a reliable metric for evaluating the capabilities of hardware in computational chemistry? If so, we would expect to find a close mapping of the ratios for the various chemistry benchmarks onto the SPECfp ratios;

ii. Does any particular CPU consistently "underperform" based on the SPECfp criteria? This would manifest itself as the ratios from the chemistry benchmarks falling below the SPECfp ratios. In particular we shall look for indicators of the memory problems of the SGI O2-R10k impacting on the benchmarks.
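One way to frame both questions is to compare, machine by machine, the SPECfp95 rating and the DL_POLY performance, each expressed relative to the Compaq Alpha DS20. The sketch below does this with percentages copied from Tables 1 and 2; the function `classify` and its 10% tolerance are hypothetical choices made for illustration and are not part of the original analysis.

```python
# machine -> (SPECfp95 relative %, DL_POLY relative %), from Tables 1 and 2.
# The Pentium II/400 DL_POLY figure is the "(pgi)" entry of Table 2.
relative = {
    "Compaq Alpha GS140": (77, 103),
    "Compaq Alpha ES40": (98, 98),
    "DEC Alpha 8400/5-625": (35, 72),
    "SGI Origin2000/250": (42, 66),
    "IBM RS/6000-43P": (51, 40),
    "HP PA-9000/C240": (43, 40),
    "SUN HPC4500/336": (37, 23),
    "Pentium II/400": (22, 28),
}

def classify(specfp_pct, benchmark_pct, tolerance=0.10):
    """Compare a benchmark percentage against the SPECfp95 percentage."""
    ratio = benchmark_pct / specfp_pct
    if ratio < 1 - tolerance:
        return "below SPECfp95"   # underperforms relative to its rating
    if ratio > 1 + tolerance:
        return "above SPECfp95"   # outperforms relative to its rating
    return "consistent"

for machine, (specfp_pct, dlpoly_pct) in relative.items():
    print(f"{machine:22s} {specfp_pct:3d}% {dlpoly_pct:4d}%  "
          f"{classify(specfp_pct, dlpoly_pct)}")
```

A close mapping of the benchmark onto SPECfp95 would show most machines as "consistent"; a CPU that repeatedly lands in the "below" class across benchmarks would be a candidate for question (ii).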

We will attempt to address these issues below. Finally, we note that a SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at

http://www.specbench.org/spec/faq

An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from ftp.cs.toronto.edu in the file /pub/spectable. More SPEC-related information is available at the SPEC WWW site,

http://www.specbench.org

and at the Performance Database Web site,

http://performance.netlib.org/performance/html/spec.html#specsite.

3. THE DL_POLY BENCHMARK

The benchmark summarised below is designed to reflect the typical range of simulations undertaken by the molecular dynamicist. It includes 6 calculations carried out using the DL_POLY molecular dynamics code, and includes the following functionality;

The data presented in Table 2 were collected under control of the UNIX command time where available, and include CPU time (both user and system), total elapsed time and efficiency, measured as the ratio of CPU time to elapsed time. The total user CPU timings of Table 2 refer to the summed user CPU timings over all 6 calculations of the benchmark. Note that in contrast to the QC benchmark, little I/O is performed by the DL_POLY calculations, so that efficiency should always be high assuming the benchmarks were conducted on a dedicated resource.
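As a minimal sketch of the efficiency measure, the snippet below assumes that efficiency is the summed user and system CPU time divided by the elapsed time; this reading of "CPU versus elapsed" is an assumption. The timings are a subset of the rows of Table 2.

```python
# machine -> (user, system, elapsed) in minutes, copied from Table 2.
timings = {
    "Compaq Alpha GS140": (13.9, 0.0, 13.9),
    "DEC Alpha PW/433AU": (28.1, 0.0, 28.8),
    "Cray T3E/1200": (41.2, 0.6, 42.6),
    "SGI O2/R5k-SC": (80.5, 0.2, 83.5),
}

def efficiency(user_min, system_min, elapsed_min):
    """CPU time (user + system) as a fraction of elapsed time (assumed definition)."""
    if elapsed_min <= 0:
        raise ValueError("elapsed time must be positive")
    return (user_min + system_min) / elapsed_min

for machine, (user, system, elapsed) in timings.items():
    print(f"{machine:20s} {100 * efficiency(user, system, elapsed):5.1f}%")
```

Values close to 100% are what one would expect for runs with little I/O on a dedicated machine.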

The total CPU timings of Table 2 suggest that the Digital/Compaq Alpha CPU and, to a lesser extent, the SGI 250 MHz R10k, are dominant. The Compaq AlphaServer GS140 is the fastest machine (13.9 minutes), slightly faster than the AlphaServer DS20 and ES40 (14.3 and 14.5 minutes respectively) and the Compaq XP1000 6/450 (15.6 minutes). The EV6-based GS140 outperforms the EV5-based DEC Alpha 8400/5-625 (19.8 mins.) and the Alpha PW/600AU (20.2 mins.) by a factor of 1.45, and the SGI Origin2000/250 (21.6 mins.) by a factor of 1.55. These are followed by the SGI Octane/250 (24.5 mins.), the DEC Alpha PW/433AU (28.1 mins.), SGI Origin2000/195 (29.9 mins.) and SGI PChall-R10k/195 (33.1 mins.). The leading 11 CPUs are from either Digital/Compaq or Silicon Graphics. Note that the incorporation of DL_POLY into the benchmark suite came after the availability of the Alpha 8400/6-575.
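The factors quoted above follow directly from ratios of total CPU time. The sketch below recomputes them for a subset of Table 2, together with a performance figure relative to the DS20. Taking relative performance as the DS20 total CPU time divided by the machine's total CPU time is an assumption; it reproduces the Table 2 column for the rows shown here.

```python
# machine -> (user, system) CPU minutes, copied from Table 2 (a subset).
cpu_minutes = {
    "Compaq Alpha GS140": (13.9, 0.0),
    "Compaq Alpha DS20": (14.3, 0.0),
    "DEC Alpha PW/600AU": (20.2, 0.0),
    "SGI Origin2000/250": (21.6, 0.0),
    "Cray T3E/1200": (41.2, 0.6),
}

def total_cpu(machine):
    """Total CPU time (user + system) in minutes."""
    user, system = cpu_minutes[machine]
    return user + system

def speedup(fast, slow):
    """Factor by which `fast` outperforms `slow` on total CPU time."""
    return total_cpu(slow) / total_cpu(fast)

def relative_performance(machine, reference="Compaq Alpha DS20"):
    """Performance as a percentage of the reference machine (assumed formula)."""
    return 100.0 * total_cpu(reference) / total_cpu(machine)

print(f"GS140 vs PW/600AU:      {speedup('Compaq Alpha GS140', 'DEC Alpha PW/600AU'):.2f}")
print(f"GS140 vs Origin2000/250: {speedup('Compaq Alpha GS140', 'SGI Origin2000/250'):.2f}")
for machine in cpu_minutes:
    print(f"{machine:20s} {relative_performance(machine):4.0f}%")
```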

When considering the performance of the CPUs from SUN, IBM and Hewlett Packard, we would note the following:

4. SUMMARY

Based on the published SPECfp95 ratings, and normalising with respect to the Compaq AlphaServer DS20 value of 58.7, we would expect (see Section 2) the Alpha DS20 and ES40 to be somewhat ahead of the other EV6-based machines (the Compaq Alpha GS140 and Alpha 8400/6-575), with these four machines appearing far superior to the remainder. The DS20 is seen to be 1.95 times faster than the Power3-based 200 MHz IBM RS/6000-43P/260 and 2.3 times faster than the HP PA-9000/C240 and 250 MHz R10k-based SGI Origin2000. All other CPUs are projected to be significantly less than half the speed.

While the EV6-based machines from Compaq/DEC are also seen to dominate the SPECint95 ratings, a quite different ordering of the processors from IBM (Power3, Power2SC and Power2) and SGI (both R10k and R8k) is seen compared to the SPECfp95 ratings; both are now slower than those from DEC (EV5-based) and SUN. We also note that the SPECint95 ratings suggest that the Pentium II/400 is 1.7 times slower than the Compaq AlphaServer DS20, while the SPECfp95 ratings point to the Pentium being 4.5 times slower.

When analysing the results of the present evaluation exercise, we wish to consider three questions: (i) do the SPECfp95 values provide a reliable metric for evaluating the capabilities of hardware in computational chemistry? If so, we would expect to find a close mapping of the ratios for the various benchmarks onto the SPECfp95 ratios; (ii) does any particular CPU consistently "underperform" based on the SPECfp criteria? This would manifest itself as the ratios from the benchmarks falling below the SPECfp ratios; and (iii) do the "simple" Matrix and Chemistry Kernel benchmarks lead to the same conclusions as the GAMESS-UK and DL_POLY benchmarks? To these ends an approximate Performance Index (PI) has been devised for each machine, based on an average value of the Matrix-97, Chemistry Kernels and DL_POLY benchmarks. A full discussion is to be found in the paper on computational chemistry benchmarks. Here we present the conclusions only.

5. REFERENCES

[1] DL_POLY is a parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester under the auspices of the Engineering and Physical Sciences Research Council (EPSRC) for the EPSRC's Collaborative Computational Project for the Computer Simulation of Condensed Phases (CCP5) and the Molecular Simulation Group (MSG) at Daresbury Laboratory. The package is the property of the Central Laboratory of the Research Councils.

[2] In theory the O2-R10k should have outperformed the corresponding Indigo2; with better memory bandwidth, superior I/O and more tightly coupled integration it should have done well. However, SGI made a design decision which has seriously impaired the performance of the O2 in some application areas. It took about 3 months for this "flaw" to be fully identified. Until December 1996, SGI were claiming that the O2 R10k would perform in the region of 10, 12 or even 15 SPECfp95. Indeed, on some benchmarks it does achieve performance that matches these figures. However, the O2 has a Unified Memory Architecture which uses main system memory as memory for the graphics display and operations. Despite the impressive bandwidth figures for the O2, it does seem that the O2 memory architecture severely impedes the performance of the R10k processor, particularly when compared with the Octane, Indigo2 and Origin systems. This is shown by the SPEC comparisons of Table 1; we suspect that the two main factors limiting performance in the memory subsystem are the main memory speed and the CRIME chip.

The CRIME chip, which acts as the interface between the memory and the three drains on it (the CPU at 800 MByte/second, the I/O engine at 500 MByte/second and the monitor display at 700 MByte/second), is probably the main bottleneck. This chip was designed to work as a built-in memory controller, but the design was biased toward the R5k; it cannot work directly with the R10k because the R5k expects 32-byte cache refills while the R10k wants 64- or 128-byte refills. SGI therefore supply a custom ASIC with the R10k daughter board, which interfaces the R10k's level 2 cache with the CRIME chip. Performance problems are caused by the ASIC having to break each 128-byte cache refill operation into four 32-byte refills. The net impact of this effect is that the O2 R10k will only work well with problems that fit into the L2 cache (1 MByte). Not surprisingly, the memory-intensive SPECfp95 figures are badly affected, although the impact on less memory-intensive applications is not so severe. It should be noted that this type of incident is very rare; chips often fail to deliver, but system architectures designed for existing chips rarely do.

[3] See the following recently opened web page to obtain Pentium Pro Optimized BLAS and FFTs for Intel Linux: http://www.cs.utk.edu/~ghenry/distrib


Table 1. SPECfp95 and SPECint95. Absolute Values and Values Relative to the Compaq AlphaServer DS20.
Machine  SPECfp95  SPECint95  Relative SPECfp95 (%)  Relative SPECint95 (%)
Compaq Alpha DS20 58.70 27.70 100% 100%
Compaq Alpha ES40 57.70 27.30 98% 99%
Compaq XP1000 6/450 52.80 24.90 90% 90%
DEC Alpha 8400/6-575 47.70 30.30 81% 109%
Compaq Alpha GS140 45.20 27.80 77% 100%
IBM RS/6000-43P 30.10 13.10 51% 47%
IBM RS/6000-397 26.60 8.61 45% 31%
HP PA-9000/C240 25.40 17.30 43% 62%
SGI Onyx2 IR2/250 24.50 14.70 42% 53%
SGI Origin2000/250 24.50 14.70 42% 53%
SUN HPC4500/336 21.90 15.00 37% 54%
HP PA-9000/C200 21.40 14.20 36% 51%
DEC Alpha PW/600AU 21.30 16.30 36% 59%
DEC Alpha 8400/5-625 20.80 18.40 35% 66%
DEC Alpha 500/5-500 20.40 15.00 35% 54%
SGI Octane/250 20.30 13.60 35% 49%
SUN Ultra-2/300 20.20 12.30 34% 44%
SGI Origin2000/195 19.00 9.48 32% 34%
SUN Ultra30/300 18.30 12.10 31% 44%
DEC Alpha PW/433AU 18.10 13.90 31% 50%
IBM RS/6000-595 17.60 6.17 31% 22%
SGI Octane/195 17.40 9.40 30% 34%
HP PA-9000/C160 16.30 10.40 28% 38%
SGI Origin200 15.60 8.59 27% 31%
SGI Octane/175 15.50 8.40 26% 30%
DEC Alpha 500/5-400 14.10 12.30 24% 44%
SGI PChall-R10k/195 13.80 8.85 24% 32%
DEC Alpha 600/5-333 13.20 9.23 22% 33%
Pentium II/400 13.00 16.00 22% 58%
DEC Alpha 8400/5-300 12.40 7.43 21% 27%
DEC Alpha 600/5-266 11.80 7.91 20% 29%
SUN Ultra-2/200 11.10 7.67 19% 28%
IBM RS/6000-590 10.40 3.33 18% 12%
IBM RS/6000-3CT 10.20 3.42 17% 12%
SUN Ultra-1/170 9.06 5.56 15% 20%
DEC Alpha 2100/5-250 8.39 5.96 14% 22%
Pentium II/300 8.15 11.60 14% 42%
SUN Ultra-1/140 7.90 4.66 13% 17%
SGI O2/R10k-SC 7.83 9.02 13% 33%
Pentium II/266 7.68 10.80 13% 39%
IBM RS/6000-3BT 7.50 3.14 13% 11%
Pentium Pro/200 6.75 8.09 11% 29%
HP PA-9000/J200 6.32 3.52 11% 13%
DEC Alpha 250/4-266 6.27 5.18 11% 19%
DEC AXP/3000-700 5.71 3.66 10% 13%
SGI O2/R5k-SC 5.42 4.82 9% 17%
Pentium 233 MMX 5.21 - 9% -
SGI Indy-R5k 4.78 4.32 8% 16%
HP PA-9000/735-125 4.61 3.97 8% 14%
HP PA-9000/735 4.06 3.22 7% 12%
DEC AXP/3000-500 3.65 2.15 6% 8%
HP PA-9000/715-100 3.47 2.89 6% 10%
IBM PowerPC-43P 3.20 3.59 5% 13%
IBM PowerPC-250 2.32 1.82 4% 7%
SUN SPARC 10/41 1.38 1.13 2% 4%
MPP node
IBM SP2/160Thin 25.80 8.61 44% 31%
HP PA-9000/V2200 22.10 13.80 38% 50%
Cray T3E/1200 21.30 18.40 36% 66%
Cray T3E/900 17.25 13.60 29% 49%
IBM SP2/120Thin 16.60 5.61 28% 20%
IBM SP2/66Thin 9.35 3.31 16% 12%

Table 2. The DL_POLY Benchmark: Total CPU and Elapsed Time (minutes) for Calculations 1-6 (see text) and Performance relative to the Compaq Alpha DS20.
Machine  User CPU Time  System CPU Time  Elapsed Time  Relative Performance (%)
Compaq Alpha GS140 13.9 0.0 13.9 103%
Compaq Alpha DS20 14.3 0.0 14.3 100%
Compaq Alpha ES40 14.5 0.0 14.5 98%
Compaq XP1000 6/450 15.6 0.0 15.6 92%
DEC Alpha 8400/5-625 19.7 0.1 19.9 72%
DEC Alpha PW/600AU 20.2 0.0 20.2 71%
SGI Origin2000/250 21.6 0.0 21.7 66%
SGI Octane/250 24.5 0.0 24.5 59%
DEC Alpha PW/433AU 28.1 0.0 28.8 51%
SGI Origin2000/195 29.9 0.0 30.8 48%
SGI PChall-R10k/195 33.0 0.1 33.2 43%
HP PA-9000/V2200 33.5 0.1 33.6 43%
IBM RS/6000-43P 35.8 0.0 35.7 40%
HP PA-9000/C240 36.2 0.0 36.5 40%
DEC Alpha 8400/5-300 37.2 0.1 37.2 38%
SGI Octane/175 40.4 0.1 40.5 35%
Cray T3E/1200 41.2 0.6 42.6 34%
Pentium II/400 (pgi) 50.4 0.0 50.5 28%
Cray T3E/900 51.0 0.7 52.3 28%
SUN HPC4500/336 62.6 0.0 62.7 23%
Pentium II/300 (pgi) 65.6 0.0 65.7 22%
IBM SP2/120Thin 67.5 0.0 68.1 21%
Pentium II/300 (abs) 72.1 0.0 72.1 20%
Pentium II/266 (pgi) 76.4 0.0 76.5 19%
SGI O2/R5k-SC 80.5 0.2 83.5 18%
Pentium II/266 (abs) 83.8 0.0 83.8 17%
IBM RS/6000-59H 107.7 0.0 108.3 13%

(+) Version 2.11 of the DL_POLY Code


M.F. Guest / W. Smith
Jun 6 22:02:03 BST 1999